Many developers learn:

programming;
APIs;
databases;
cloud;
Artificial Intelligence.

But few learn what happens after deployment.

And it's precisely at this moment that engineering really begins.

In this article, you will learn:

What observability is;
How to monitor modern systems;
What logs, metrics, and traces are;
How to identify failures;
What SLOs and SLAs are;
How to monitor AI applications;
How to operate systems in production.

What is Observability?

Imagine your system is in production.

Suddenly, a customer sends a message:

The system is slow.

The first question is:

Why?

Without observability, you don't know.

With observability, you can investigate quickly.

Observability is the ability to understand a system's internal behavior through the data it produces.

This data is usually:

Logs;
Metrics;
Traces.

Why is Observability Important?

When a system grows, problems become inevitable.

Examples:

Slow APIs;
Overloaded databases;
Unavailable integrations;
Unexpected errors;
Infrastructure failures.

Without observability:

Problem 
↓ 
Panic

With observability:

Problem 
↓ 
Diagnosis 
↓ 
Correction

The Three Pillars of Observability

Modern observability is based on three pillars.

Logs 

Metrics 

Tracing

Together, they provide a far more complete view of the system than any one of them alone.

Logs

Logs record events.

Examples:

User authenticated 

Order created 

Payment approved 

Error processing checkout

A well-structured log entry usually looks like this:

{ 
  "timestamp": "2026-01-01T10:00:00Z", 
  "level": "INFO", 
  "service": "orders", 
  "message": "Order created", 
  "order_id": "123" 
}
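As a minimal sketch of how an entry like this could be produced, the example below uses Python's standard logging module with a custom formatter. The JsonFormatter class, the build_logger helper, and the "context" key passed through extra are names chosen for this illustration, not part of any particular framework.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Business fields (for example order_id) arrive via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("orders")
logger.info("Order created", extra={"context": {"order_id": "123"}})
```

Keeping every entry as one JSON object per line makes it easy for log tools to filter by fields such as service or order_id.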

What to Log?

Best practices:

Log:

Important events;
State changes;
Errors;
External integrations.

Avoid:

Passwords;
Tokens;
Sensitive data.
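One way to enforce the "avoid" list is to mask sensitive fields before they reach the logger. The sketch below is only an illustration: the SENSITIVE_KEYS set and the redact function are hypothetical names, and a real system would need a key list that matches its own data.

```python
# Hypothetical list of field names that must never appear in logs.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked, including nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {"user": "ana", "password": "s3cret", "payment": {"token": "tok_1", "amount": 10}}
print(redact(event))
```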

Metrics

Logs explain what happened.

Metrics show trends.

Examples:

Number of users 

Number of orders 

CPU usage 

Memory usage 

Latency

A metric usually answers:

How much?

Essential Metrics

For APIs:

Requests per second 

Error rate 

Latency

For databases:

Connections 

Slow queries 

Locks

For infrastructure:

CPU 

Memory 

Disk
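To make the API metrics above concrete, here is a small sketch that derives requests per second, error rate, and P95 latency from a list of request samples. The RequestSample type and the summarize function are assumptions made for this example, the percentile uses the simple nearest-rank method, and only 5xx responses are counted as errors. In practice a metrics library would usually compute these values for you.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples: list[RequestSample], window_seconds: float) -> dict:
    """Compute the three basic API metrics for one time window."""
    if not samples:
        return {"requests_per_second": 0.0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    latencies = sorted(s.latency_ms for s in samples)
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "requests_per_second": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Ten requests observed in a 5-second window: nine fast successes, one slow failure.
samples = [RequestSample(200, 40 + 10 * i) for i in range(9)] + [RequestSample(500, 900)]
print(summarize(samples, window_seconds=5))
```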

Tracing

Tracing shows the path taken by a request.

Imagine a checkout:

Frontend 
↓ 
API 
↓ 
Orders Service 
↓ 
Payments Service 
↓ 
Database

Which step was slow?

The trace shows exactly where the time was spent.

Example of a Trace

Request 

├── API Gateway (20ms) 
├── Orders Service (40ms) 
├── Payments Service (800ms) 
└── Database (15ms)

Problem identified:

Payments Service
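The same diagnosis can be expressed in a few lines of code. The sketch below models the trace above as a flat list of spans and finds the one that consumed the most time. The Span type and helper functions are illustrative names, and real tracing systems record nested spans with start and end timestamps rather than a single duration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float


# The spans from the example trace above.
trace = [
    Span("API Gateway", 20),
    Span("Orders Service", 40),
    Span("Payments Service", 800),
    Span("Database", 15),
]


def slowest_span(spans: list[Span]) -> Span:
    """Return the span with the longest duration."""
    return max(spans, key=lambda span: span.duration_ms)


def share_of_total(span: Span, spans: list[Span]) -> float:
    """Fraction of the total traced time spent in `span`."""
    return span.duration_ms / sum(s.duration_ms for s in spans)


worst = slowest_span(trace)
print(f"{worst.name}: {share_of_total(worst, trace):.0%} of the request time")
```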

What is an SLI?

SLI stands for:

Service Level Indicator

It's a metric that measures one aspect of the service's health as users experience it.

Examples:

Latency 

Availability 

Error rate

What is an SLO?

SLO stands for:

Service Level Objective

It's the target we want an SLI to meet over a given period.

Example:

99.5% monthly availability

Or:

P95 latency below 500 ms
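To show how an SLI and an SLO relate, the sketch below measures availability as the share of successful requests and compares it with a 99.5% objective. It also computes how much of the error budget (the failures the SLO still tolerates) is left. Measuring availability per request is an assumption of this example, and the function names are illustrative; some teams measure availability by time instead.

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return successful / total if total else 1.0


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator reaches the objective."""
    return sli >= slo


def error_budget_remaining(successful: int, total: int, slo: float) -> float:
    """Fraction of the allowed failures that has not been used yet.

    1.0 means no failures so far; 0.0 or below means the budget is exhausted.
    """
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


# Example month: 1,000,000 requests, 3,000 of them failed.
sli = availability_sli(997_000, 1_000_000)
print(f"SLI: {sli:.3%}, SLO met: {meets_slo(sli, 0.995)}")
print(f"Error budget remaining: {error_budget_remaining(997_000, 1_000_000, 0.995):.0%}")
```

With a 99.5% objective, 5,000 failures per million requests are tolerated, so 3,000 failures leave roughly 40% of the budget.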

What is an SLA?

SLA stands for:

Service Level Agreement

It's a formal commitment to customers.

Example:

99.9% availability

If it is not met, it can result in penalties, such as refunds or service credits.
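An availability percentage is easier to reason about when converted into allowed downtime. The sketch below does that conversion, assuming a 30-day month; the allowed_downtime function is an illustrative helper, not part of any library.

```python
from datetime import timedelta


def allowed_downtime(target: float, period: timedelta) -> timedelta:
    """Maximum downtime within `period` that still satisfies `target` availability."""
    return period * (1 - target)


month = timedelta(days=30)  # assumption: a 30-day month
for target in (0.995, 0.999):
    minutes = allowed_downtime(target, month).total_seconds() / 60
    print(f"{target:.1%} availability allows about {minutes:.1f} minutes of downtime per month")
```

Under this assumption, 99.9% leaves about 43 minutes of downtime per month, while 99.5% leaves about 216 minutes.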

Monitoring AI Applications

AI applications require additional metrics.

In addition to CPU and memory, we need to monitor:

Consumed tokens;
Costs;
LLM latency;
Handoff rate to humans;
Response accuracy.

AI Metrics

Examples:

Questions per day 

Consumed tokens 

Daily cost 

Average latency 

Resolution rate
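The metrics above can be derived from a simple record of each LLM call. The following sketch aggregates one day of calls. The LlmCall type, the daily_summary function, and especially the prices are assumptions for illustration only: real token prices depend on the provider and the model.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1 million tokens (assumed, not real prices).
INPUT_PRICE_PER_MILLION = 0.50
OUTPUT_PRICE_PER_MILLION = 1.50


@dataclass
class LlmCall:
    input_tokens: int
    output_tokens: int
    latency_ms: float
    resolved: bool  # True when the question was answered without a human handoff


def call_cost(call: LlmCall) -> float:
    """Cost of one call in USD, using the placeholder prices above."""
    return (
        call.input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
        + call.output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    )


def daily_summary(calls: list[LlmCall]) -> dict:
    """Aggregate the AI metrics for one day of calls."""
    if not calls:
        return {"questions": 0, "tokens": 0, "cost_usd": 0.0,
                "avg_latency_ms": 0.0, "resolution_rate": 0.0}
    return {
        "questions": len(calls),
        "tokens": sum(c.input_tokens + c.output_tokens for c in calls),
        "cost_usd": sum(call_cost(c) for c in calls),
        "avg_latency_ms": sum(c.latency_ms for c in calls) / len(calls),
        "resolution_rate": sum(1 for c in calls if c.resolved) / len(calls),
    }


calls = [
    LlmCall(1000, 500, 900, True),
    LlmCall(2000, 1000, 1500, False),
    LlmCall(1000, 500, 600, True),
]
print(daily_summary(calls))
```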

Example of Observability for AI

Question 
↓ 
Intent Classifier 
↓ 
Vector Search 
↓ 
RAG 
↓ 
LLM 
↓ 
Response

We need to know:

Which step was slow? 

How much did it cost? 

Which model responded? 

How many tokens were used?
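One possible way to answer these questions is to record every pipeline step with its duration and a few attributes. The sketch below uses a context manager for that. The PipelineTrace class, the step names, the model name, and the token count are all hypothetical, and the step bodies are placeholders for the real work.

```python
import time
from contextlib import contextmanager


class PipelineTrace:
    """Collects one record per pipeline step: name, duration, and custom attributes."""

    def __init__(self) -> None:
        self.steps: list[dict] = []

    @contextmanager
    def step(self, name: str, **attributes):
        record = {"name": name, **attributes}
        start = time.perf_counter()
        try:
            # The caller can add attributes (for example tokens) while the step runs.
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.steps.append(record)

    def slowest(self) -> dict:
        """The step that took the longest."""
        return max(self.steps, key=lambda s: s["duration_ms"])

    def total_tokens(self) -> int:
        """Tokens used across all steps that reported them."""
        return sum(s.get("tokens", 0) for s in self.steps)


trace = PipelineTrace()
with trace.step("intent_classifier"):
    pass  # placeholder for the real classification call
with trace.step("vector_search"):
    pass  # placeholder for the real similarity search
with trace.step("llm", model="example-model") as llm_step:
    llm_step["tokens"] = 420  # hypothetical value reported by the LLM provider

for s in trace.steps:
    print(s)
```

Cost can then be derived from the recorded token counts, and the model attribute shows which model produced the response.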

Alerts

Monitoring is not enough.

We need to be notified when something goes wrong.

Examples:

CPU above 90% 

Error rate above 5% 

Database unavailable 

Latency above SLO
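The examples above are threshold rules, and evaluating them can be sketched in a few lines. The AlertRule type, the metric names, and the 500 ms latency threshold are assumptions for this illustration. Tools such as Prometheus offer their own rule languages for the same purpose.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str       # key to look up in the current metrics
    threshold: float  # the alert fires when the value is above this


# Rules mirroring the examples above; metric names are hypothetical.
RULES = [
    AlertRule("CPU above 90%", "cpu_percent", 90),
    AlertRule("Error rate above 5%", "error_rate_percent", 5),
    AlertRule("Database unavailable", "database_down", 0),  # 1 when down, 0 when up
    AlertRule("Latency above SLO", "p95_latency_ms", 500),
]


def firing_alerts(metrics: dict[str, float], rules: list[AlertRule]) -> list[str]:
    """Names of the rules whose metric is currently above its threshold.

    A missing metric is treated as 0 here; a real system should probably
    alert on missing data as well.
    """
    return [rule.name for rule in rules if metrics.get(rule.metric, 0) > rule.threshold]


current = {"cpu_percent": 95, "error_rate_percent": 1.2, "database_down": 0, "p95_latency_ms": 640}
print(firing_alerts(current, RULES))
```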

Dashboards

Dashboards consolidate important information.

They usually display:

Availability 

Latency 

Errors 

Resource usage 

Costs

A good dashboard lets you identify problems quickly.

Operation in Production

Production is a living environment.

Systems change constantly.

That's why we need to:

Monitor;
Investigate;
Correct;
Evolve.

Operating a system is not an activity separate from engineering.

It's part of engineering.

What to Monitor in Lumina Store?

Our application has:

Frontend 

Backend 

PostgreSQL 

pgvector 

Payment Gateway 

LLM 

RAG 

Cloud

So we need to monitor:

Orders 

Payments 

Conversations 

AI 

Database 

Infrastructure

Example of Lumina Store Dashboard

Orders per Hour 

Conversion Rate 

Failed Payments 

API Latency 

LLM Latency 

AI Costs 

Availability

Best Practices for Operation

Automate Alerts

Don't wait for users to complain.

Monitor Costs

Especially in AI applications.

Log Important Events

Logs are your operational memory.

Create Simple Dashboards

Excessive complexity hinders investigations.

Define SLOs

What is not measured cannot be improved.

Popular Tools

Modern observability usually uses:

Logs

ELK Stack
OpenSearch
Loki

Metrics

Prometheus
Grafana

Tracing

OpenTelemetry
Jaeger
Tempo

Cloud

CloudWatch
Azure Monitor
Google Cloud Monitoring

The Operation Cycle

Every healthy system follows a continuous cycle:

Monitor 
↓ 
Detect 
↓ 
Investigate 
↓ 
Correct 
↓ 
Learn 
↓ 
Improve

This cycle never ends.

Conclusion

Observability is one of the most important disciplines in modern engineering.

Throughout this article, we've seen:

What observability is;
Logs;
Metrics;
Tracing;
SLI;
SLO;
SLA;
Observability for AI;
Dashboards;
Operation in production.

The main lesson is simple:

Systems don't fail because they have bugs. Systems fail because we can't see what's happening.

Building software is important.

Operating software in production is what turns an application into a reliable product.