Investigating Performance Issues
Understanding Performance in Distributed Systems
In a distributed application, a single user request might trigger:
- Frontend rendering
- API calls over the network
- Database queries
- Cache lookups
- External API calls (payment processing)
Traditional logging makes it hard to see the full picture. Distributed tracing shows you the entire flow with precise timing.
Using Sentry Trace Explorer
Sentry’s Trace Explorer allows you to search, filter, and analyze traces from your OpenTelemetry-instrumented application. With OTLP data, you can query spans using attributes, create visualizations, and identify performance bottlenecks.
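If you haven't wired up OTLP export yet, a minimal sketch with the OpenTelemetry Python SDK might look like the following. The service name shop-api is a placeholder, and the Sentry OTLP endpoint and auth headers are assumed to be supplied through the standard OTEL_EXPORTER_OTLP_* environment variables; take the actual values from Sentry's documentation.

```python
# Minimal OTLP/HTTP export setup (a sketch, not Sentry's official snippet).
# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
# from the environment; point them at your Sentry project's OTLP ingest endpoint.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "shop-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Spans created with this tracer are batched and exported over OTLP,
# and show up in Trace Explorer once ingested.
tracer = trace.get_tracer(__name__)
```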
Scenario 1: Identifying Slow Database Queries
Your product search endpoint is occasionally slow. Let’s use simple filter strategies to find slow queries.
Basic Search Strategy
Navigate to Explore > Traces and start with this filter to find slow database operations:
```
span.op is db
span.duration > 200ms
```

Sort by `span.duration` descending to see the slowest spans first. Adjust the threshold (200ms, 500ms, 1s) based on your needs.
Quick Reference: Common Filter Patterns
| Goal | Filter Query |
|---|---|
| Any slow DB query | span.op is db AND span.duration > 500ms |
| Slow SELECTs | span.op is db AND db.action is SELECT AND span.duration > 200ms |
| Slow on products table | span.op is db AND db.collection.name is products AND span.duration > 100ms |
| Failed DB operations | span.op is db AND span.status contains error |
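The filters above rely on spans carrying database attributes. Auto-instrumentation libraries usually set these for you; if you create database spans by hand, a minimal sketch with the OpenTelemetry Python SDK could look like the one below. Attribute keys follow the OTel database semantic conventions (newer versions rename some of them, e.g. db.operation.name and db.query.text), and the mapping to Sentry's span.op and db.action fields is an assumption here.

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

def fetch_products_by_category(cursor, category: str):
    """Run a query inside a manually created database client span."""
    sql = "SELECT * FROM products WHERE category = %s"
    # "SELECT products" follows the "<operation> <target>" span-naming convention.
    with tracer.start_as_current_span("SELECT products", kind=SpanKind.CLIENT) as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.operation", "SELECT")          # assumed to surface as db.action
        span.set_attribute("db.collection.name", "products")
        span.set_attribute("db.statement", sql)
        cursor.execute(sql, (category,))                      # cursor: any DB-API cursor
        return cursor.fetchall()
```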
Using the Aggregates Tab
Click the Aggregates tab, then:
- Select a metric (e.g., `p95(span.duration)`, `avg(span.duration)`, or `count()`)
- Group by a field
Group by `span.description` to identify which specific queries are consistently slow.
Group by `db.collection.name` to see which database tables have performance issues.
Pro Tips
- Start with P95: Look at 95th percentile duration, not just average
- Check frequency: A query that’s 200ms but runs 1000x/min is worse than one that’s 2s but runs 1x/hour
- Check the Starred Queries list for quick access to queries you use often
- Save your queries: click “Save as…” to create saved queries like “Slow DB Queries (>500ms)” that you can return to later
Inspecting Slow Spans
Click any Span ID to open the Trace Waterfall View where you’ll see:
- The complete request flow
- Exact timing for each operation
- Span attributes
Scenario 2: Analyzing Cache Performance
Is your cache actually effective? Let’s measure the cache hit rate and its performance impact.
1. Search for cache operations

   In Trace Explorer, search: `span.op is cache.get`

2. Create a hit rate calculation

   Group by `cache.hit` in the Aggregates tab:

   - `cache.hit IS True` - cache hits
   - `cache.hit IS False` - cache misses

   Compare the count of each to calculate your hit rate. For example, 9,000 hits against 1,000 misses works out to a 90% hit rate.

3. Compare performance

   Create two separate queries to compare:

   Query 1 (Cache Hits): `span.description is cache.get AND cache.hit IS True`, with metric `avg(span.duration)`

   Query 2 (Cache Misses): `span.description is cache.get AND cache.hit IS False`, with metric `avg(span.duration)`

   Cache hits should be significantly faster (< 5ms, versus 100ms+ for database queries).

4. Identify frequently missed keys

   In Span Samples, filter for cache misses: `span.description is cache.get AND cache.hit IS False`

   Look at the `cache.key` attribute to see which keys miss most often.

5. View the full request flow

   Click a trace ID where `cache.hit IS False`. In the waterfall, you’ll see:

   - a `cache.get` span (miss, ~2ms)
   - a `db` span immediately after (the query, ~150ms)
   - a `cache.set` span (storing the result, ~3ms)

   This is the cache-aside pattern in action, sketched in code below.
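For context, here is a rough sketch of cache-aside instrumentation that would produce a waterfall like the one above. The cache and db clients and their methods are hypothetical, the span names are illustrative, and the cache.hit / cache.key attribute names match the filters used earlier; how Sentry derives span.op values such as cache.get from these spans is an assumption.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def get_product_cached(cache, db, product_id: int):
    """Cache-aside read: try the cache, fall back to the DB, then backfill the cache."""
    key = f"product:{product_id}"

    # cache.get span (~2ms): record the key and whether we hit.
    with tracer.start_as_current_span("cache.get") as span:
        span.set_attribute("cache.key", key)
        cached = cache.get(key)                      # hypothetical cache client
        span.set_attribute("cache.hit", cached is not None)
    if cached is not None:
        return cached

    # db span (~150ms): the expensive query the cache was meant to avoid.
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.operation", "SELECT")
        span.set_attribute("db.collection.name", "products")
        product = db.fetch_product(product_id)       # hypothetical DB helper

    # cache.set span (~3ms): store the result so the next read is a hit.
    with tracer.start_as_current_span("cache.set") as span:
        span.set_attribute("cache.key", key)
        cache.set(key, product)                      # hypothetical cache client; add a TTL as needed

    return product
```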
Scenario 3: Order Creation Performance Breakdown
Order creation involves multiple steps. Let’s identify which step is the bottleneck.
1. Filter for order creation traces

   `span.description is POST /api/orders`

2. View the waterfall

   Click any Trace ID to open the waterfall view. You’ll see a complete breakdown, something like:

   ```
   └─ POST /api/orders (850ms)
      ├─ order.validate_user (45ms)
      │  └─ db SELECT users (42ms)
      ├─ order.validate_products (220ms)
      │  ├─ db SELECT products (38ms)
      │  ├─ db SELECT products (41ms)
      │  └─ db SELECT products (39ms)
      ├─ inventory.check (180ms)
      │  ├─ db SELECT products (55ms)
      │  ├─ db SELECT products (58ms)
      │  └─ db SELECT products (62ms)
      ├─ order.create_record (95ms)
      │  └─ db INSERT transaction (92ms)
      ├─ inventory.reserve (145ms)
      │  └─ db UPDATE transaction (140ms)
      └─ payment.process (250ms)
         └─ simulated payment gateway (248ms)
   ```

3. Analyze the bottleneck

   Look for the slowest single operation. In the sample trace above, `payment.process` (250ms) is the largest single contributor, and the repeated `db SELECT products` spans under `order.validate_products` and `inventory.check` point to per-item queries that could be batched.

4. Examine span attributes

   In the waterfall, click any span to see its attributes.

   Parent span (`order.create`):

   - `order.user_id: 1`
   - `order.items_count: 3`
   - `order.total_amount: 459.97`
   - `order.payment_method: credit_card`
   - `order.status: confirmed`

   Child span (`payment.process`):

   - `payment.order_id: 42`
   - `payment.amount: 459.97`
   - `payment.method: credit_card`
   - `payment.status: success`
   - `payment.transaction_id: txn_1234567890_abc123`
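Attributes like these are set by the application on its own spans. Below is a rough sketch of how an order endpoint might attach them, with hypothetical db and gateway clients and span/attribute names mirroring the ones shown above.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def create_order(db, gateway, user_id: int, items: list, payment_method: str) -> dict:
    """Create an order and charge it, annotating spans with business attributes."""
    with tracer.start_as_current_span("order.create") as order_span:
        total = sum(item["price"] * item["qty"] for item in items)    # hypothetical item shape
        order_span.set_attribute("order.user_id", user_id)
        order_span.set_attribute("order.items_count", len(items))
        order_span.set_attribute("order.total_amount", total)
        order_span.set_attribute("order.payment_method", payment_method)

        order_id = db.insert_order(user_id, items, total)             # hypothetical persistence call

        # Child span: appears as payment.process in the waterfall.
        with tracer.start_as_current_span("payment.process") as pay_span:
            result = gateway.charge(order_id, total, payment_method)  # hypothetical gateway call
            pay_span.set_attribute("payment.order_id", order_id)
            pay_span.set_attribute("payment.amount", total)
            pay_span.set_attribute("payment.method", payment_method)
            pay_span.set_attribute("payment.status", result["status"])
            pay_span.set_attribute("payment.transaction_id", result["transaction_id"])

        order_span.set_attribute("order.status", "confirmed")
        return {"order_id": order_id, "status": "confirmed"}
```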
Setting Up Performance Alerts
Once you’ve identified performance patterns, create alerts to catch issues before users report them.
How to Create Alerts
From any Trace Explorer query, click Save As > Alert. Configure the alert with:
- Query: The trace filter that identifies the operation
- Metric: What to measure (p95, avg, max duration)
- Threshold: When to trigger the alert
- Action: How to notify your team
Recommended Performance Alerts
1. Slow Order Creation

   - Query: `span.description is POST /api/orders`
   - Metric: `p95(span.duration)`
   - Threshold: > 1000ms
   - Why: Order creation is a critical business flow. If it’s consistently over 1 second, users will notice and abandon carts.

2. Database Query Performance Degradation

   - Query: `span.op is db AND db.collection.name is products`
   - Metric: `p90(span.duration)`
   - Threshold: > 200ms
   - Why: Product table queries power your core catalog. Slow queries indicate missing indexes or query optimization issues.

3. Cache Miss Rate

   - Query: `span.description is cache.get AND cache.hit IS False`
   - Metric: `count()` where `cache.hit IS False`
   - Threshold: > 1000 per hour
   - Why: A high miss rate means you’re hitting the database unnecessarily. Check the cache TTL, or whether the cache is being cleared too frequently.

4. Payment Processing Latency

   - Query: `span.description is payment.process`
   - Metric: `p95(span.duration)`
   - Threshold: > 500ms
   - Why: Payment gateway slowness impacts conversion. It may indicate external API issues.

5. Transaction Duration

   - Query: `db.transaction IS True`
   - Metric: `p90(span.duration)`
   - Threshold: > 300ms
   - Why: Long-running transactions can cause lock contention and connection pool exhaustion.
Building Performance Dashboards
Dashboards give you at-a-glance visibility into application performance trends.
How to Create a Dashboard
- Go to Dashboards and click Create Dashboard
- Give it a name (e.g., “Performance Monitoring”)
- From any Trace Explorer query, click Save As > Dashboard Widget
- Choose the dashboard and select visualization type:
- Time Series: Trends over time (latency, throughput)
- Big Number: Current state (error rate, cache hit rate)
- Table: Top N slowest operations
Essential Performance Dashboard Widgets
1. API Endpoint Latency Overview
Widget Type: Time Series
Queries:
- Query 1: `span.description is GET /api/products`, Metric: `p90(span.duration)`, Name: "Products List"
- Query 2: `span.description is POST /api/orders`, Metric: `p90(span.duration)`, Name: "Order Creation"
- Query 3: `span.description is GET /api/products/search`, Metric: `p90(span.duration)`, Name: "Product Search"

Why: See all critical endpoints on one chart. Spot performance regressions after deployments.
2. Operation Type Breakdown
Widget Type: Time Series (stacked)
Configuration:
- Query: `span.op:*`
- Group By: `span.op`
- Metric: `avg(span.duration)`

Why: See the relative performance of HTTP requests, DB queries, cache operations, and external calls in one view.