    Observability

    Fundamentals of Observability and Operation for Beginners

    Building a system is only half the work. The other half is ensuring it continues to function in production.

    June 12, 2026 · 5 min read

    Many developers learn:

    • programming;
    • APIs;
    • databases;
    • cloud;
    • Artificial Intelligence.

    But few learn what happens after deployment.

    And it's precisely at this moment that engineering really begins.

    In this article, you will learn:

    • What observability is;
    • How to monitor modern systems;
    • What logs, metrics, and traces are;
    • How to identify failures;
    • What SLOs and SLAs are;
    • How to monitor AI applications;
    • How to operate systems in production.

    What is Observability?

    Imagine your system is in production.

    Suddenly, a customer sends a message:

    The system is slow. 
    

    The first question is:

    Why? 
    

    Without observability, you don't know.

    With observability, you can investigate quickly.

    Observability is the ability to understand a system's internal behavior through the data it produces.

    This data is usually:

    • Logs;
    • Metrics;
    • Traces.

    Why is Observability Important?

    When a system grows, problems become inevitable.

    Examples:

    • Slow APIs;
    • Overloaded databases;
    • Unavailable integrations;
    • Unexpected errors;
    • Infrastructure failures.

    Without observability:

    Problem 
    ↓ 
    Panic 
    

    With observability:

    Problem 
    ↓ 
    Diagnosis 
    ↓ 
    Correction 
    

    The Three Pillars of Observability

    Modern observability is based on three pillars.

    Logs 
    
    Metrics 
    
    Tracing 
    

    Together, they provide a complete view of the system.


    Logs

    Logs record events.

    Examples:

    User authenticated 
    
    Order created 
    
    Payment approved 
    
    Error processing checkout 
    

    A well-structured log entry usually looks like this:

    { 
      "timestamp": "2026-01-01T10:00:00Z", 
      "level": "INFO", 
      "service": "orders", 
      "message": "Order created", 
      "order_id": "123" 
    } 
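
As a minimal sketch, an entry with this shape could be produced with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `fields` key are assumptions made for this illustration, not part of any particular framework.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("orders")
logger.info("Order created", extra={"fields": {"order_id": "123"}})
```

Writing one JSON object per line keeps the log easy to parse and filter by field, for example by `order_id`, in whichever log tool you adopt.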
    

    What to Log?

    Best practices:

    Log:

    • Important events;
    • State changes;
    • Errors;
    • External integrations.

    Avoid:

    • Passwords;
    • Tokens;
    • Sensitive data.
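
One way to enforce the "avoid" list is to mask sensitive fields before an event reaches the logger. The sketch below assumes a hypothetical `SENSITIVE_KEYS` set and a `redact` helper; the key names are examples and would need to match your own payloads.

```python
# Field names treated as sensitive in this sketch (an assumption; adapt to your data).
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked, including nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "user": "ana",
    "password": "hunter2",
    "payment": {"token": "tok_abc", "amount": 42},
}
print(redact(event))
```

The function returns a new dictionary rather than modifying the input, so the application can keep using the original values while only the masked copy is logged.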

    Metrics

    Logs explain what happened.

    Metrics show trends.

    Examples:

    Number of users 
    
    Number of orders 
    
    CPU usage 
    
    Memory usage 
    
    Latency 
    

    A metric usually answers:

    How much? 
    

    Essential Metrics

    For APIs:

    Requests per second 
    
    Error rate 
    
    Latency 
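
To make these three API metrics concrete, here is a small sketch that derives them from a list of request records. The `Request` record, the `summarize` function, and the choice to count only 5xx responses as errors are assumptions for this example. In practice a metrics library or your monitoring backend would usually compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to answer the request


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Compute requests per second, error rate, and P95 latency for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: only 5xx are errors
    return {
        "requests_per_second": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p95_latency_ms": percentile([r.latency_ms for r in requests], 95) if total else None,
    }


# Ten requests observed in a 5-second window, one of which failed.
sample = [Request(200, 100.0 * i) for i in range(1, 10)] + [Request(503, 1000.0)]
stats = summarize(sample, window_seconds=5)
print(stats)
```

Percentiles such as P95 are often preferred over averages for latency, because a few very slow requests can hide behind a healthy-looking mean.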
    

    For databases:

    Connections 
    
    Slow queries 
    
    Locks 
    

    For infrastructure:

    CPU 
    
    Memory 
    
    Disk 
    

    Tracing

    Tracing shows the path taken by a request.

    Imagine a checkout:

    Frontend 
    ↓ 
    API 
    ↓ 
    Orders Service 
    ↓ 
    Payments Service 
    ↓ 
    Database 
    

    Which step was slow?

    The trace shows exactly where the time was spent.


    Example of a Trace

    Request 
    
    ├── API Gateway (20ms) 
    ├── Orders Service (40ms) 
    ├── Payments Service (800ms) 
    └── Database (15ms) 
    

    Problem identified:

    Payments Service 
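
The same reading of the trace can be sketched in code. The `Span` record and `slowest_span` function are hypothetical, and the sketch assumes the four steps run one after another, so their durations can simply be added up. Real tracing tools also record parent/child relationships and start times.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: int


def slowest_span(spans):
    """Return the span that took the longest."""
    return max(spans, key=lambda s: s.duration_ms)


# Durations taken from the example trace above.
trace = [
    Span("API Gateway", 20),
    Span("Orders Service", 40),
    Span("Payments Service", 800),
    Span("Database", 15),
]

total = sum(s.duration_ms for s in trace)
worst = slowest_span(trace)
share = worst.duration_ms / total
print(f"{worst.name}: {worst.duration_ms}ms ({share:.0%} of {total}ms)")
```

Under that sequential assumption, the Payments Service accounts for roughly nine tenths of the request time, which is why it is the first place to investigate.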
    

    What is an SLI?

    SLI stands for:

    Service Level Indicator 
    

    It's a metric that measures one aspect of the system's health as users experience it.

    Examples:

    Latency 
    
    Availability 
    
    Error rate 
    

    What is an SLO?

    SLO stands for:

    Service Level Objective 
    

    It's the target we set for an SLI over a given period.

    Example:

    99.5% monthly availability 
    

    Or:

    P95 latency below 500 ms
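
The relationship between an SLI and an SLO can be sketched as a simple comparison: the SLI is the measured value, the SLO is the target it is checked against. The function names and the request counts below are invented for illustration.

```python
def availability(successful: int, total: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return successful / total if total else 1.0


def meets_slo(sli: float, target: float, higher_is_better: bool = True) -> bool:
    """Compare a measured SLI with its SLO target."""
    return sli >= target if higher_is_better else sli < target


# Hypothetical monthly numbers.
monthly_availability = availability(successful=996_200, total=1_000_000)
measured_p95_ms = 430.0

availability_ok = meets_slo(monthly_availability, 0.995)                # SLO: 99.5% or more
latency_ok = meets_slo(measured_p95_ms, 500.0, higher_is_better=False)  # SLO: P95 below 500 ms
print(availability_ok, latency_ok)
```

Availability is a "higher is better" indicator and latency is "lower is better", which is why the comparison direction is a parameter.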
    

    What is an SLA?

    SLA stands for:

    Service Level Agreement 
    

    It's a formal commitment to customers.

    Example:

    99.9% availability 
    

    If the commitment is not met, contractual penalties may apply.
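
It helps to translate an availability percentage into the downtime it actually allows. The short calculation below assumes a 30-day month; `allowed_downtime_minutes` is a name chosen for this sketch.

```python
def allowed_downtime_minutes(availability_target: float, days: int = 30) -> float:
    """Minutes of downtime permitted in the period while still meeting the target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_target)


for target in (0.995, 0.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.1%} -> {minutes:.1f} minutes per 30 days")
```

Under this assumption, 99.5% allows about 216 minutes (3.6 hours) of downtime per month, while 99.9% allows only about 43 minutes.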


    Monitoring AI Applications

    AI applications require additional metrics.

    In addition to CPU and memory, we need to monitor:

    • Consumed tokens;
    • Costs;
    • LLM latency;
    • Handoff rate to humans;
    • Response accuracy.

    AI Metrics

    Examples:

    Questions per day 
    
    Consumed tokens 
    
    Daily cost 
    
    Average latency 
    
    Resolution rate 
    

    Example of Observability for AI

    Question 
    ↓ 
    Intent Classifier 
    ↓ 
    Vector Search 
    ↓ 
    RAG 
    ↓ 
    LLM 
    ↓ 
    Response 
    

    We need to know:

    Which step was slow? 
    
    How much did it cost? 
    
    Which model responded? 
    
    How many tokens were used? 
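
A minimal sketch of how these four questions could be answered for a single request: time each pipeline step, record the model name and token counts, and derive the cost. The `RequestTelemetry` class, the model name, the simulated delays, and the prices per 1,000 tokens are all hypothetical. Real prices depend on your provider and model.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class RequestTelemetry:
    model: str
    step_ms: dict = field(default_factory=dict)
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @contextmanager
    def step(self, name: str):
        """Measure how long the wrapped block takes and store it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.step_ms[name] = (time.perf_counter() - start) * 1000

    def slowest_step(self) -> str:
        return max(self.step_ms, key=self.step_ms.get)

    def cost_usd(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        return (self.prompt_tokens / 1000 * input_price_per_1k
                + self.completion_tokens / 1000 * output_price_per_1k)


telemetry = RequestTelemetry(model="example-model")
with telemetry.step("vector_search"):
    time.sleep(0.01)   # stand-in for the real vector search
with telemetry.step("llm"):
    time.sleep(0.03)   # stand-in for the real LLM call
telemetry.prompt_tokens, telemetry.completion_tokens = 1200, 300

print(telemetry.model, telemetry.slowest_step(), telemetry.step_ms)
print(telemetry.cost_usd(input_price_per_1k=0.5, output_price_per_1k=1.5))
```

Emitting this record as a structured log entry or as metrics per request makes it possible to chart latency, token usage, and cost per step over time.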
    

    Alerts

    Monitoring is not enough.

    We need to be notified when something goes wrong.

    Examples:

    CPU above 90% 
    
    Error rate above 5% 
    
    Database unavailable 
    
    Latency above SLO 
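
These rules can be sketched as data plus a small evaluation function. The `AlertRule` structure, the metric names, and the snapshot values are assumptions for this example. The 500 ms latency threshold reuses the P95 SLO example from earlier in the article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    metric: str
    fires: Callable[[float], bool]  # returns True when the value breaches the rule


RULES = [
    AlertRule("CPU above 90%", "cpu_percent", lambda v: v > 90),
    AlertRule("Error rate above 5%", "error_rate", lambda v: v > 0.05),
    AlertRule("Database unavailable", "db_up", lambda v: v == 0),
    AlertRule("Latency above SLO", "p95_latency_ms", lambda v: v > 500),
]


def evaluate(rules, metrics):
    """Return the names of rules whose metric is present and breaches its condition."""
    return [r.name for r in rules if r.metric in metrics and r.fires(metrics[r.metric])]


# Hypothetical snapshot of current metric values.
snapshot = {"cpu_percent": 72.0, "error_rate": 0.08, "db_up": 1, "p95_latency_ms": 640.0}
print(evaluate(RULES, snapshot))
```

Alerting tools typically also require a condition to hold for some minutes before notifying anyone, which reduces noise from short spikes. This sketch omits that for brevity.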
    

    Dashboards

    Dashboards consolidate important information.

    They usually display:

    Availability 
    
    Latency 
    
    Errors 
    
    Resource usage 
    
    Costs 
    

    A good dashboard lets you spot problems quickly.


    Operation in Production

    Production is a living environment.

    Systems change constantly.

    That's why we need to:

    • Monitor;
    • Investigate;
    • Correct;
    • Evolve.

    Operation is not a separate activity from engineering.

    It's part of engineering.


    What to Monitor in Lumina Store?

    Our application has:

    Frontend 
    
    Backend 
    
    PostgreSQL 
    
    pgvector 
    
    Payment Gateway 
    
    LLM 
    
    RAG 
    
    Cloud 
    

    So we need to monitor:

    Orders 
    
    Payments 
    
    Conversations 
    
    AI 
    
    Database 
    
    Infrastructure 
    

    Example of Lumina Store Dashboard

    Orders per Hour 
    
    Conversion Rate 
    
    Failed Payments 
    
    API Latency 
    
    LLM Latency 
    
    AI Costs 
    
    Availability 
    

    Best Practices for Operation

    Automate Alerts

    Don't wait for users to complain.


    Monitor Costs

    Especially in AI applications.


    Log Important Events

    Logs are your operational memory.


    Create Simple Dashboards

    Excessive complexity hinders investigations.


    Define SLOs

    What is not measured cannot be improved.


    Popular Tools

    Modern observability usually uses:

    Logs

    • ELK Stack
    • OpenSearch
    • Loki

    Metrics

    • Prometheus (collection and storage)
    • Grafana (visualization)

    Tracing

    • OpenTelemetry
    • Jaeger
    • Tempo

    Cloud

    • CloudWatch
    • Azure Monitor
    • Google Cloud Monitoring

    The Operation Cycle

    Every healthy system follows a continuous cycle:

    Monitor 
    ↓ 
    Detect 
    ↓ 
    Investigate 
    ↓ 
    Correct 
    ↓ 
    Learn 
    ↓ 
    Improve 
    

    This cycle never ends.


    Conclusion

    Observability is one of the most important disciplines in modern engineering.

    Throughout this article, we've seen:

    • What observability is;
    • Logs;
    • Metrics;
    • Tracing;
    • SLI;
    • SLO;
    • SLA;
    • Observability for AI;
    • Dashboards;
    • Operation in production.

    The main lesson is simple:

    Systems don't fail because they have bugs. Systems fail because we can't see what's happening.

    Building software is important.

    Operating software in production is what turns an application into a reliable product.
