Fundamentals of Observability and Operation for Beginners
Building a system is only half the work. The other half is ensuring it continues to function in production.

Many developers learn:
- programming;
- APIs;
- databases;
- cloud;
- Artificial Intelligence.
But few learn what happens after deployment.
And it's precisely at this moment that engineering really begins.
In this article, you will learn:
- What observability is;
- How to monitor modern systems;
- What logs, metrics, and traces are;
- How to identify failures;
- What SLOs and SLAs are;
- How to monitor AI applications;
- How to operate systems in production.
What is Observability?
Imagine your system is in production.
Suddenly, a customer sends a message:
The system is slow.
The first question is:
Why?
Without observability, you don't know.
With observability, you can investigate quickly.
Observability is the ability to understand a system's internal behavior through the data it produces.
This data is usually:
- Logs;
- Metrics;
- Traces.
Why is Observability Important?
When a system grows, problems become inevitable.
Examples:
- Slow APIs;
- Overloaded databases;
- Unavailable integrations;
- Unexpected errors;
- Infrastructure failures.
Without observability:
Problem
↓
Panic
With observability:
Problem
↓
Diagnosis
↓
Correction
The Three Pillars of Observability
Modern observability is based on three pillars.
Logs
Metrics
Tracing
Together, they provide a complete view of the system.
Logs
Logs record events.
Examples:
User authenticated
Order created
Payment approved
Error processing checkout
A well-structured log usually contains:
{
"timestamp": "2026-01-01T10:00:00Z",
"level": "INFO",
"service": "orders",
"message": "Order created",
"order_id": "123"
}
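One way to produce log entries in this shape is a custom formatter on top of Python's standard logging module. This is a minimal sketch, not a definitive setup: the JsonFormatter and build_logger names, and the convention of passing extra fields under a "context" key, are assumptions made for this illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the entry.
        # The "context" key is a convention chosen for this sketch.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the console."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("orders")
logger.info("Order created", extra={"context": {"order_id": "123"}})
```

Writing one JSON object per line keeps the logs easy to search and filter by field, for example by service or order_id.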
What to Log?
Best practices:
Log:
- Important events;
- State changes;
- Errors;
- External integrations.
Avoid:
- Passwords;
- Tokens;
- Sensitive data.
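To keep sensitive values out of logs, one option is to mask them before the entry is written. The sketch below assumes a hypothetical list of sensitive field names; a real application would need its own list.

```python
# Hypothetical field names treated as sensitive in this sketch.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked, including nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


print(redact({"user": "ana", "password": "hunter2", "meta": {"token": "abc"}}))
```

Matching on field names only catches values stored under known keys. Secrets embedded in free-text messages would still need separate care.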
Metrics
Logs explain what happened.
Metrics show trends.
Examples:
Number of users
Number of orders
CPU usage
Memory usage
Latency
A metric usually answers:
How much?
Essential Metrics
For APIs:
Requests per second
Error rate
Latency
For databases:
Connections
Slow queries
Locks
For infrastructure:
CPU
Memory
Disk
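As an illustration of how the API metrics above can be derived, the sketch below computes them from a list of request samples. The Request type, the choice of counting 5xx statuses as errors, and the nearest-rank percentile method are assumptions for this example. In practice a metrics library or backend usually does this calculation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute requests per second, error rate, and P95 latency for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": percentile([r.duration_ms for r in requests], 95),
    }


# 20 sample requests over a 10-second window, one of them failing.
samples = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 21)]
samples[0].status = 500
print(summarize(samples, window_seconds=10))
```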
Tracing
Tracing shows the path taken by a request.
Imagine a checkout:
Frontend
↓
API
↓
Orders Service
↓
Payments Service
↓
Database
Which step was slow?
The trace shows exactly where the time was spent.
Example of a Trace
Request
├── API Gateway (20ms)
├── Orders Service (40ms)
├── Payments Service (800ms)
└── Database (15ms)
Problem identified:
Payments Service
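The same reasoning can be sketched in code. The example below reuses the durations from the trace above. The Span type and the slowest_span helper are hypothetical names for illustration, not part of any tracing library.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float


def slowest_span(spans: list[Span]) -> Span:
    """Return the span that took the longest."""
    return max(spans, key=lambda span: span.duration_ms)


trace = [
    Span("API Gateway", 20),
    Span("Orders Service", 40),
    Span("Payments Service", 800),
    Span("Database", 15),
]

total_ms = sum(span.duration_ms for span in trace)
worst = slowest_span(trace)
share = worst.duration_ms / total_ms
print(f"{worst.name}: {worst.duration_ms} ms ({share:.0%} of {total_ms} ms)")
```

Real tracing tools also record parent and child relationships between spans. This flat list only covers the simple case shown above.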
What is an SLI?
SLI stands for:
Service Level Indicator
It's a metric that measures one aspect of the system's health as users experience it.
Examples:
Latency
Availability
Error rate
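As a small worked example, an availability SLI can be calculated as the share of requests that succeeded. The function name and the decision to treat a period with no traffic as fully available are assumptions of this sketch.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Share of requests that succeeded, as a value between 0 and 1."""
    if total_requests == 0:
        # Assumption: a period with no traffic counts as fully available.
        return 1.0
    return (total_requests - failed_requests) / total_requests


sli = availability_sli(total_requests=10_000, failed_requests=25)
print(f"Availability: {sli:.2%}")
```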
What is an SLO?
SLO stands for:
Service Level Objective
It's the goal we want to achieve.
Example:
99.5% monthly availability
Or:
P95 latency below 500 ms
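An availability SLO also implies how much downtime is tolerated, often called the error budget. The sketch below works this out for the 99.5% example, assuming a 30-day month. The function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period without breaking the SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def remaining_budget_minutes(slo_percent: float, downtime_minutes: float, days: int = 30) -> float:
    """Budget left after the downtime already observed (negative means the SLO is broken)."""
    return error_budget_minutes(slo_percent, days) - downtime_minutes


budget = error_budget_minutes(99.5)
print(f"Allowed downtime: {budget:.0f} minutes per 30 days")
print(f"Remaining after a 90-minute outage: {remaining_budget_minutes(99.5, 90):.0f} minutes")
```

With 43,200 minutes in 30 days, a 99.5% objective leaves 216 minutes of allowed downtime.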
What is an SLA?
SLA stands for:
Service Level Agreement
It's a formal commitment to customers.
Example:
99.9% availability
If not met, it can result in penalties.
Monitoring AI Applications
AI applications require additional metrics.
In addition to CPU and memory, we need to monitor:
- Consumed tokens;
- Costs;
- LLM latency;
- Handoff rate to humans;
- Response accuracy.
AI Metrics
Examples:
Questions per day
Consumed tokens
Daily cost
Average latency
Resolution rate
Example of Observability for AI
Question
↓
Intent Classifier
↓
Vector Search
↓
RAG
↓
LLM
↓
Response
We need to know:
Which step was slow?
How much did it cost?
Which model responded?
How many tokens were used?
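These questions can be answered by recording a few fields for every request. The sketch below times each step and estimates cost from token counts. The RequestRecord class, the model name "example-model", and the prices are all hypothetical. Real prices depend on the provider and the model, and the sleep calls only stand in for real work.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class RequestRecord:
    model: str
    step_ms: dict[str, float] = field(default_factory=dict)
    input_tokens: int = 0
    output_tokens: int = 0

    @contextmanager
    def step(self, name: str):
        """Measure how long the enclosed block takes and store it under name."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.step_ms[name] = (time.perf_counter() - start) * 1000

    def cost_usd(self) -> float:
        price = PRICE_PER_1K[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1000

    def slowest_step(self) -> str:
        return max(self.step_ms, key=self.step_ms.get)


record = RequestRecord(model="example-model")
with record.step("vector_search"):
    time.sleep(0.01)  # placeholder for the real search
with record.step("llm"):
    time.sleep(0.03)  # placeholder for the real model call
    record.input_tokens, record.output_tokens = 1200, 300

print(record.model, record.step_ms, record.slowest_step(), record.cost_usd())
```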
Alerts
Monitoring is not enough.
We need to be notified when something goes wrong.
Examples:
CPU above 90%
Error rate above 5%
Database unavailable
Latency above SLO
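Alerts like these are often expressed as threshold rules checked against the latest metric values. The sketch below is a simplified illustration: the Rule type and the metric names are assumptions, and the 500 ms latency threshold reuses the SLO example from earlier in the article. An on/off condition such as "Database unavailable" would need a health check and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    threshold: float


RULES = [
    Rule("CPU above 90%", "cpu_percent", 90),
    Rule("Error rate above 5%", "error_rate_percent", 5),
    Rule("Latency above SLO", "p95_latency_ms", 500),
]


def firing_alerts(metrics: dict[str, float], rules: list[Rule] = RULES) -> list[str]:
    """Return the names of the rules whose threshold is exceeded."""
    firing = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A missing metric is skipped here; a real setup would likely alert on missing data too.
        if value is not None and value > rule.threshold:
            firing.append(rule.name)
    return firing


current = {"cpu_percent": 95, "error_rate_percent": 1, "p95_latency_ms": 620}
print(firing_alerts(current))
```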
Dashboards
Dashboards consolidate important information.
They usually display:
Availability
Latency
Errors
Resource usage
Costs
A good dashboard allows for quick problem identification.
Operation in Production
Production is a living environment.
Systems change constantly.
That's why we need to:
- Monitor;
- Investigate;
- Correct;
- Evolve.
Operation is not a separate activity from engineering.
It's part of engineering.
What to Monitor in Lumina Store?
Our application has:
Frontend
Backend
PostgreSQL
pgvector
Payment Gateway
LLM
RAG
Cloud
So we need to monitor:
Orders
Payments
Conversations
AI
Database
Infrastructure
Example of Lumina Store Dashboard
Orders per Hour
Conversion Rate
Failed Payments
API Latency
LLM Latency
AI Costs
Availability
Best Practices for Operation
Automate Alerts
Don't wait for users to complain.
Monitor Costs
Especially in AI applications.
Log Important Events
Logs are your operational memory.
Create Simple Dashboards
Excessive complexity hinders investigations.
Define SLOs
What is not measured cannot be improved.
Popular Tools
Modern observability usually uses:
Logs
- ELK Stack
- OpenSearch
- Loki
Metrics
- Prometheus
- Grafana (dashboards on top of metric stores such as Prometheus)
Tracing
- OpenTelemetry (instrumentation standard that also covers metrics and logs)
- Jaeger
- Tempo
Cloud
- CloudWatch
- Azure Monitor
- Google Cloud Monitoring
The Operation Cycle
Every healthy system follows a continuous cycle:
Monitor
↓
Detect
↓
Investigate
↓
Correct
↓
Learn
↓
Improve
This cycle never ends.
Conclusion
Observability is one of the most important disciplines in modern engineering.
Throughout this article, we've seen:
- What observability is;
- Logs;
- Metrics;
- Tracing;
- SLI;
- SLO;
- SLA;
- Observability for AI;
- Dashboards;
- Operation in production.
The main lesson is simple:
Systems don't fail because they have bugs. Systems fail because we can't see what's happening.
Building software is important.
Operating software in production is what turns an application into a reliable product.