Investigating Performance Issues

Understanding Performance in Distributed Systems

In a distributed application, a single user request might trigger:

  • Frontend rendering
  • API calls over the network
  • Database queries
  • Cache lookups
  • External API calls (payment processing)

Traditional logging makes it hard to see the full picture. Distributed tracing shows you the entire flow with precise timing.

Using Sentry Trace Explorer

Sentry’s Trace Explorer allows you to search, filter, and analyze traces from your OpenTelemetry-instrumented application. With OTLP data, you can query spans using attributes, create visualizations, and identify performance bottlenecks.


Scenario 1: Identifying Slow Database Queries

Your product search endpoint is occasionally slow. Let’s use simple filter strategies to find slow queries.

Basic Search Strategy

Navigate to Explore > Traces and start with this filter to find slow database operations:

span.op is db
span.duration > 200ms

Sort by span.duration in descending order to see the slowest spans first. Adjust the threshold (200ms, 500ms, 1s) based on your needs.

Quick Reference: Common Filter Patterns

Goal | Filter Query
Any slow DB query | span.op is db AND span.duration > 500ms
Slow SELECTs | span.op is db AND db.action is SELECT AND span.duration > 200ms
Slow on products table | span.op is db AND db.collection.name is products AND span.duration > 100ms
Failed DB operations | span.op is db AND span.status contains error

Using the Aggregates Tab

Click the Aggregates tab, then:

  1. Select a metric (e.g., p95(span.duration), avg(span.duration), or count())
  2. Group by a field

Group by span.description to identify which specific queries are consistently slow.

Group by db.collection.name to see which database tables have performance issues.
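
To make the grouping concrete, here is a minimal sketch of what a grouped aggregate computes, run over a small hand-made list of spans. The span dictionaries, the sample durations, and the nearest-rank percentile are assumptions for illustration only; Sentry's own percentile calculation may differ.

```python
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]

def aggregate_by(spans, field):
    """Group spans by `field` and report count, avg, and p95 duration (ms)."""
    groups = defaultdict(list)
    for span in spans:
        groups[span[field]].append(span["span.duration"])
    return {
        key: {
            "count": len(durations),
            "avg": sum(durations) / len(durations),
            "p95": p95(durations),
        }
        for key, durations in groups.items()
    }

# Hypothetical sample data, not real Sentry output.
spans = (
    [{"span.description": "SELECT products", "span.duration": 40}] * 18
    + [{"span.description": "SELECT products", "span.duration": 900}] * 2
    + [{"span.description": "SELECT users", "span.duration": 45}] * 20
)

stats = aggregate_by(spans, "span.description")
# Print the groups with the worst p95 first.
for description, row in sorted(stats.items(), key=lambda kv: kv[1]["p95"], reverse=True):
    print(description, row)
```

In this sample the average for "SELECT products" is 126ms while its p95 is 900ms, which is why the percentile is the more useful column when hunting for occasional slowness.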

Pro Tips

  1. Start with P95: Look at the 95th percentile duration, not just the average
  2. Check frequency: A query that takes 200ms but runs 1000x/min is worse than one that takes 2s but runs 1x/hour
  3. Check the Starred Queries list for quick access to queries you have starred
  4. Save your queries: Click “Save as…” to create saved queries like “Slow DB Queries (>500ms)” so you can rerun them later
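
The frequency tip is easier to see as arithmetic. The sketch below uses the two example queries from the tip and compares the total time each one consumes per hour; the helper name is hypothetical.

```python
def total_ms_per_hour(duration_ms, calls_per_hour):
    """Total time spent in a query per hour, in milliseconds."""
    return duration_ms * calls_per_hour

# 200ms query that runs 1000 times per minute
frequent = total_ms_per_hour(duration_ms=200, calls_per_hour=1000 * 60)
# 2s query that runs once per hour
rare = total_ms_per_hour(duration_ms=2000, calls_per_hour=1)

print(frequent / 1000, "seconds per hour")  # 12000.0 seconds per hour
print(rare / 1000, "seconds per hour")      # 2.0 seconds per hour
```

The frequent query accounts for 6000 times more total database time, even though each individual call is ten times faster.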

Inspecting Slow Spans

Click any Span ID to open the Trace Waterfall View where you’ll see:

  • The complete request flow
  • Exact timing for each operation
  • Span attributes

Scenario 2: Analyzing Cache Performance

Are your cache hits effective? Let’s measure cache hit rates and their performance impact.

  1. Search for cache operations

    In Trace Explorer, search:

    span.op is cache.get
  2. Create a hit rate calculation

    Group by cache.hit in the Aggregates tab:

    • cache.hit IS True - Cache hits
    • cache.hit IS False - Cache misses

    Compare the count of each to calculate your hit rate.

  3. Compare performance

    Create two separate queries to compare:

    Query 1 (Cache Hits):

    span.description is cache.get AND cache.hit IS True

    Add metric: avg(span.duration)

    Query 2 (Cache Misses):

    span.description is cache.get AND cache.hit IS False

    Add metric: avg(span.duration)

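    A cache hit avoids the database round trip, so requests served from cache should complete significantly faster (< 5ms for the cache lookup vs 100ms+ when a miss falls through to a database query).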

  4. Identify frequently missed keys

    In Span Samples, filter for cache misses:

    span.description is cache.get AND cache.hit IS False

    Look at the cache.key attribute to see which keys are missing most often.

  5. View the full request flow

    Click a trace ID where cache.hit IS False. In the waterfall, you’ll see:

    • cache.get span (miss, ~2ms)
    • db span immediately after (query, ~150ms)
    • cache.set span (storing result, ~3ms)

    This shows the cache-aside pattern in action.
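
For reference, the cache-aside flow visible in that waterfall, together with the hit rate calculation from step 2, can be sketched roughly as follows. This is an illustrative model only: the in-memory dictionary stands in for a real cache, fake_db_lookup is a hypothetical stand-in for a database query, and there is no TTL or eviction.

```python
class CacheAside:
    """Cache-aside lookup that counts hits and misses."""

    def __init__(self, load_from_db):
        self._store = {}                  # in-memory dict standing in for a real cache
        self._load_from_db = load_from_db
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:            # cache.get, hit
            self.hits += 1
            return self._store[key]
        self.misses += 1                  # cache.get, miss
        value = self._load_from_db(key)   # db query runs only on a miss
        self._store[key] = value          # cache.set stores the result
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_db_lookup(key):
    """Hypothetical stand-in for a database query."""
    return {"id": key, "name": f"product-{key}"}

cache = CacheAside(fake_db_lookup)
for product_id in [1, 2, 1, 1, 3, 2]:
    cache.get(product_id)

print(cache.hits, cache.misses, cache.hit_rate())  # 3 3 0.5
```

The first request for each key is always a miss, so a low hit rate on a workload with many distinct keys is expected; a low hit rate on repeated keys is the case worth investigating.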


Scenario 3: Order Creation Performance Breakdown

Order creation involves multiple steps. Let’s identify which step is the bottleneck.

  1. Filter for order creation traces

    span.description is POST /api/orders
  2. View the waterfall

    Click any Trace ID to open the waterfall view. You’ll see a complete breakdown, something like:

    └─ POST /api/orders (850ms)
       ├─ order.validate_user (45ms)
       │  └─ db SELECT users (42ms)
       ├─ order.validate_products (220ms)
       │  ├─ db SELECT products (38ms)
       │  ├─ db SELECT products (41ms)
       │  └─ db SELECT products (39ms)
       ├─ inventory.check (180ms)
       │  ├─ db SELECT products (55ms)
       │  ├─ db SELECT products (58ms)
       │  └─ db SELECT products (62ms)
       ├─ order.create_record (95ms)
       │  └─ db INSERT transaction (92ms)
       ├─ inventory.reserve (145ms)
       │  └─ db UPDATE transaction (140ms)
       └─ payment.process (250ms)
          └─ simulated payment gateway (248ms)
  3. Analyze the bottleneck

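    Look for the slowest single operation. In the example above, that is payment.process (250ms), where nearly all of the time (248ms) is spent waiting on the payment gateway. Also note that order.validate_products and inventory.check each run three separate SELECT queries against products, which may be worth combining into one query.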

  4. Examine span attributes

    In the waterfall, click any span to see attributes:

    Parent span (order.create):

    order.user_id: 1
    order.items_count: 3
    order.total_amount: 459.97
    order.payment_method: credit_card
    order.status: confirmed

    Child span (payment.process):

    payment.order_id: 42
    payment.amount: 459.97
    payment.method: credit_card
    payment.status: success
    payment.transaction_id: txn_1234567890_abc123
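
The bottleneck analysis in step 3 can also be done mechanically. The sketch below copies the step names and durations from the example waterfall into a plain list (a hand-typed structure, not a Sentry export format), picks the slowest step, and flags any step that repeats the same child query.

```python
from collections import Counter

# (span name, duration in ms, child spans) copied from the example waterfall.
steps = [
    ("order.validate_user", 45, ["db SELECT users"]),
    ("order.validate_products", 220, ["db SELECT products"] * 3),
    ("inventory.check", 180, ["db SELECT products"] * 3),
    ("order.create_record", 95, ["db INSERT transaction"]),
    ("inventory.reserve", 145, ["db UPDATE transaction"]),
    ("payment.process", 250, ["simulated payment gateway"]),
]

# The slowest direct child of the request is the first place to look.
slowest = max(steps, key=lambda step: step[1])
print("Slowest step:", slowest[0], f"({slowest[1]}ms)")

# Steps that issue the same child query more than once may be
# candidates for batching into a single query.
repeated = []
for name, _, children in steps:
    for query, count in Counter(children).items():
        if count > 1:
            repeated.append((name, query, count))
            print(f"{name} runs '{query}' {count} times")
```

Here the slowest step is payment.process, and both order.validate_products and inventory.check are flagged for running the same SELECT three times.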

Setting Up Performance Alerts

Once you’ve identified performance patterns, create alerts to catch issues before users report them.

How to Create Alerts

From any Trace Explorer query, click Save As > Alert. Configure the alert with:

  • Query: The trace filter that identifies the operation
  • Metric: What to measure (p95, avg, max duration)
  • Threshold: When to trigger the alert
  • Action: How to notify your team

Here are five alerts worth configuring:

  1. Slow Order Creation

    Query: span.description is POST /api/orders
    Metric: p95(span.duration)
    Threshold: > 1000ms

    Why: Order creation is a critical business flow. If it’s consistently over 1 second, users will notice and abandon carts.

  2. Database Query Performance Degradation

    Query: span.op is db AND db.collection.name is products
    Metric: p90(span.duration)
    Threshold: > 200ms

    Why: Product table queries power your core catalog. Slow queries often indicate missing indexes or queries that need optimizing.

  3. Cache Miss Rate

    Query: span.description is cache.get AND cache.hit IS False
    Metric: count()
    Threshold: > 1000 per hour

    Why: A high number of misses means you’re hitting the database unnecessarily. Check whether the cache TTL is too short or the cache is being cleared too frequently.

  4. Payment Processing Latency

    Query: span.description is payment.process
    Metric: p95(span.duration)
    Threshold: > 500ms

    Why: Payment gateway slowness impacts conversion, and it may indicate external API issues.

  5. Transaction Duration

    Query: db.transaction IS True
    Metric: p90(span.duration)
    Threshold: > 300ms

    Why: Long-running transactions can cause lock contention and connection pool exhaustion.
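
To show why these alerts use percentiles rather than a maximum, here is a minimal sketch of how a percentile threshold behaves. The AlertRule class, the nearest-rank percentile, and the sample durations are all assumptions for illustration; this is not how Sentry evaluates alerts internally.

```python
from dataclasses import dataclass

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100), 1-based
    return ordered[rank - 1]

@dataclass
class AlertRule:
    name: str
    pct: int
    threshold_ms: float

    def triggered(self, durations_ms):
        """True when the chosen percentile exceeds the threshold."""
        if not durations_ms:
            return False
        return percentile(durations_ms, self.pct) > self.threshold_ms

rule = AlertRule(name="Slow Order Creation", pct=95, threshold_ms=1000)

healthy = [700] * 99 + [1500]        # a single slow outlier
degraded = [700] * 90 + [1500] * 10  # 10% of requests are slow

print(rule.triggered(healthy))   # False
print(rule.triggered(degraded))  # True
```

A p95 rule stays quiet for one outlier in a hundred requests but fires once a meaningful share of requests crosses the threshold, which keeps the alert from being noisy.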


Building Performance Dashboards

Dashboards give you at-a-glance visibility into application performance trends.

How to Create a Dashboard

  1. Go to Dashboards and click Create Dashboard
  2. Give it a name (e.g., “Performance Monitoring”)
  3. From any Trace Explorer query, click Save As > Dashboard Widget
  4. Choose the dashboard and select visualization type:
    • Time Series: Trends over time (latency, throughput)
    • Big Number: Current state (error rate, cache hit rate)
    • Table: Top N slowest operations

Essential Performance Dashboard Widgets

1. API Endpoint Latency Overview

Widget Type: Time Series

Queries:

Query 1: span.description is GET /api/products
Metric: p90(span.duration)
Name: "Products List"

Query 2: span.description is POST /api/orders
Metric: p90(span.duration)
Name: "Order Creation"

Query 3: span.description is GET /api/products/search
Metric: p90(span.duration)
Name: "Product Search"

Why: See all critical endpoints on one chart. Spot performance regressions after deployments.

2. Operation Type Breakdown

Widget Type: Time Series (stacked)

Configuration:

Query: span.op:*
Group By: span.op
Metric: avg(span.duration)

Why: See relative performance of HTTP requests, DB queries, cache operations, and external calls in one view.