Investigating Performance Issues
Understanding Performance in Distributed Systems
In a distributed application, a single user request might trigger:
- Frontend rendering
- API calls over the network
- Database queries
- Cache lookups
- External API calls (payment processing)
Traditional logging makes it hard to see the full picture. Distributed tracing shows you the entire flow with precise timing.
Using Sentry Trace Explorer
Sentry’s Trace Explorer allows you to search, filter, and analyze traces from your OpenTelemetry-instrumented application. With OTLP data, you can query spans using attributes, create visualizations, and identify performance bottlenecks.
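If you haven't wired up OTLP export yet, a minimal sketch with the OpenTelemetry Python SDK might look like the following. The service name shop-api is a placeholder, and the Sentry OTLP endpoint and auth headers are assumed to be supplied through the standard OTEL_EXPORTER_OTLP_* environment variables; take the actual values from Sentry's documentation.

```python
# Minimal OTLP/HTTP export setup (a sketch, not Sentry's official snippet).
# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
# from the environment; point them at your Sentry project's OTLP ingest endpoint.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "shop-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Spans created with this tracer are batched and exported over OTLP,
# and show up in Trace Explorer once ingested.
tracer = trace.get_tracer(__name__)
```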
Scenario 1: Identifying Slow Database Queries
Your product search endpoint is occasionally slow. Let’s use simple filter strategies to find slow queries.
Basic Search Strategy
Navigate to Explore > Traces and start with this filter to find slow database operations:
```
span.op is db
span.duration > 200ms
```

Sort by `span.duration` descending to see the slowest spans first. Adjust the threshold (200ms, 500ms, 1s) based on your needs.
Quick Reference: Common Filter Patterns
| Goal | Filter Query |
|---|---|
| Any slow DB query | span.op is db AND span.duration > 500ms |
| Slow SELECTs | span.op is db AND db.action is SELECT AND span.duration > 200ms |
| Slow on products table | span.op is db AND db.collection.name is products AND span.duration > 100ms |
| Failed DB operations | span.op is db AND span.status contains error |
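The filters above rely on spans carrying database attributes. Auto-instrumentation libraries usually set these for you; if you create database spans by hand, a minimal sketch with the OpenTelemetry Python SDK could look like the one below. Attribute keys follow the OTel database semantic conventions (newer versions rename some of them, e.g. db.operation.name and db.query.text), and the mapping to Sentry's span.op and db.action fields is an assumption here.

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

def fetch_products_by_category(cursor, category: str):
    """Run a query inside a manually created database client span."""
    sql = "SELECT * FROM products WHERE category = %s"
    # "SELECT products" follows the "<operation> <target>" span-naming convention.
    with tracer.start_as_current_span("SELECT products", kind=SpanKind.CLIENT) as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.operation", "SELECT")          # assumed to surface as db.action
        span.set_attribute("db.collection.name", "products")
        span.set_attribute("db.statement", sql)
        cursor.execute(sql, (category,))                      # cursor: any DB-API cursor
        return cursor.fetchall()
```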
Using the Aggregates Tab
Click the Aggregates tab, then:
- Select a metric (e.g., `p95(span.duration)`, `avg(span.duration)`, or `count()`)
- Group by a field
Group by `span.description` to identify which specific queries are consistently slow.
Group by `db.collection.name` to see which database tables have performance issues.
Pro Tips
- Start with P95: Look at 95th percentile duration, not just average
- Check frequency: A query that’s 200ms but runs 1000x/min is worse than one that’s 2s but runs 1x/hour
- Check the Starred Queries list for quick access to queries you use often
- Save your queries: click “Save as…” to create saved queries like “Slow DB Queries (>500ms)” that you can return to later
Inspecting Slow Spans
Click any Span ID to open the Trace Waterfall View where you’ll see:
- The complete request flow
- Exact timing for each operation
- Span attributes
Scenario 2: Analyzing Cache Performance
Is your cache actually effective? Let’s measure the cache hit rate and its performance impact.
1. Search for cache operations

   In Trace Explorer, search: `span.op is cache.get`

2. Create a hit rate calculation

   Group by `cache.hit` in the Aggregates tab:

   - `cache.hit IS True` - cache hits
   - `cache.hit IS False` - cache misses

   Compare the count of each to calculate your hit rate. For example, 9,000 hits against 1,000 misses works out to a 90% hit rate.

3. Compare performance

   Create two separate queries to compare:

   Query 1 (Cache Hits): `span.description is cache.get AND cache.hit IS True`, with metric `avg(span.duration)`

   Query 2 (Cache Misses): `span.description is cache.get AND cache.hit IS False`, with metric `avg(span.duration)`

   Cache hits should be significantly faster (< 5ms, versus 100ms+ for database queries).

4. Identify frequently missed keys

   In Span Samples, filter for cache misses: `span.description is cache.get AND cache.hit IS False`

   Look at the `cache.key` attribute to see which keys miss most often.

5. View the full request flow

   Click a trace ID where `cache.hit IS False`. In the waterfall, you’ll see:

   - a `cache.get` span (miss, ~2ms)
   - a `db` span immediately after (the query, ~150ms)
   - a `cache.set` span (storing the result, ~3ms)

   This is the cache-aside pattern in action, sketched in code below.
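For context, here is a rough sketch of cache-aside instrumentation that would produce a waterfall like the one above. The cache and db clients and their methods are hypothetical, the span names are illustrative, and the cache.hit / cache.key attribute names match the filters used earlier; how Sentry derives span.op values such as cache.get from these spans is an assumption.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def get_product_cached(cache, db, product_id: int):
    """Cache-aside read: try the cache, fall back to the DB, then backfill the cache."""
    key = f"product:{product_id}"

    # cache.get span (~2ms): record the key and whether we hit.
    with tracer.start_as_current_span("cache.get") as span:
        span.set_attribute("cache.key", key)
        cached = cache.get(key)                      # hypothetical cache client
        span.set_attribute("cache.hit", cached is not None)
    if cached is not None:
        return cached

    # db span (~150ms): the expensive query the cache was meant to avoid.
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.operation", "SELECT")
        span.set_attribute("db.collection.name", "products")
        product = db.fetch_product(product_id)       # hypothetical DB helper

    # cache.set span (~3ms): store the result so the next read is a hit.
    with tracer.start_as_current_span("cache.set") as span:
        span.set_attribute("cache.key", key)
        cache.set(key, product)                      # hypothetical cache client; add a TTL as needed

    return product
```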
Scenario 3: Order Creation Performance Breakdown
Order creation involves multiple steps. Let’s identify which step is the bottleneck.
1. Filter for order creation traces

   `span.description is POST /api/orders`

2. View the waterfall

   Click any Trace ID to open the waterfall view. You’ll see a complete breakdown, something like:

   ```
   └─ POST /api/orders (850ms)
      ├─ order.validate_user (45ms)
      │  └─ db SELECT users (42ms)
      ├─ order.validate_products (220ms)
      │  ├─ db SELECT products (38ms)
      │  ├─ db SELECT products (41ms)
      │  └─ db SELECT products (39ms)
      ├─ inventory.check (180ms)
      │  ├─ db SELECT products (55ms)
      │  ├─ db SELECT products (58ms)
      │  └─ db SELECT products (62ms)
      ├─ order.create_record (95ms)
      │  └─ db INSERT transaction (92ms)
      ├─ inventory.reserve (145ms)
      │  └─ db UPDATE transaction (140ms)
      └─ payment.process (250ms)
         └─ simulated payment gateway (248ms)
   ```

3. Analyze the bottleneck

   Look for the slowest single operation. In the sample trace above, `payment.process` (250ms) is the largest single contributor, and the repeated `db SELECT products` spans under `order.validate_products` and `inventory.check` point to per-item queries that could be batched.

4. Examine span attributes

   In the waterfall, click any span to see its attributes.

   Parent span (`order.create`):

   - `order.user_id: 1`
   - `order.items_count: 3`
   - `order.total_amount: 459.97`
   - `order.payment_method: credit_card`
   - `order.status: confirmed`

   Child span (`payment.process`):

   - `payment.order_id: 42`
   - `payment.amount: 459.97`
   - `payment.method: credit_card`
   - `payment.status: success`
   - `payment.transaction_id: txn_1234567890_abc123`
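Attributes like these are set by the application on its own spans. Below is a rough sketch of how an order endpoint might attach them, with hypothetical db and gateway clients and span/attribute names mirroring the ones shown above.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def create_order(db, gateway, user_id: int, items: list, payment_method: str) -> dict:
    """Create an order and charge it, annotating spans with business attributes."""
    with tracer.start_as_current_span("order.create") as order_span:
        total = sum(item["price"] * item["qty"] for item in items)    # hypothetical item shape
        order_span.set_attribute("order.user_id", user_id)
        order_span.set_attribute("order.items_count", len(items))
        order_span.set_attribute("order.total_amount", total)
        order_span.set_attribute("order.payment_method", payment_method)

        order_id = db.insert_order(user_id, items, total)             # hypothetical persistence call

        # Child span: appears as payment.process in the waterfall.
        with tracer.start_as_current_span("payment.process") as pay_span:
            result = gateway.charge(order_id, total, payment_method)  # hypothetical gateway call
            pay_span.set_attribute("payment.order_id", order_id)
            pay_span.set_attribute("payment.amount", total)
            pay_span.set_attribute("payment.method", payment_method)
            pay_span.set_attribute("payment.status", result["status"])
            pay_span.set_attribute("payment.transaction_id", result["transaction_id"])

        order_span.set_attribute("order.status", "confirmed")
        return {"order_id": order_id, "status": "confirmed"}
```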
Setting Up Performance Alerts
Once you’ve identified performance patterns, create alerts to catch issues before users report them.
How to Create Alerts
From any Trace Explorer query, click Save As > Alert. Configure the alert with:
- Query: The trace filter that identifies the operation
- Metric: What to measure (p95, avg, max duration)
- Threshold: When to trigger the alert
- Action: How to notify your team
Recommended Performance Alerts
1. Slow Order Creation

   - Query: `span.description is POST /api/orders`
   - Metric: `p95(span.duration)`
   - Threshold: > 1000ms
   - Why: Order creation is a critical business flow. If it’s consistently over 1 second, users will notice and abandon carts.

2. Database Query Performance Degradation

   - Query: `span.op is db AND db.collection.name is products`
   - Metric: `p90(span.duration)`
   - Threshold: > 200ms
   - Why: Product table queries power your core catalog. Slow queries indicate missing indexes or query optimization issues.

3. Cache Miss Rate

   - Query: `span.description is cache.get AND cache.hit IS False`
   - Metric: `count()` where `cache.hit IS False`
   - Threshold: > 1000 per hour
   - Why: A high miss rate means you’re hitting the database unnecessarily. Check the cache TTL, or whether the cache is being cleared too frequently.

4. Payment Processing Latency

   - Query: `span.description is payment.process`
   - Metric: `p95(span.duration)`
   - Threshold: > 500ms
   - Why: Payment gateway slowness impacts conversion. It may indicate external API issues.

5. Transaction Duration

   - Query: `db.transaction IS True`
   - Metric: `p90(span.duration)`
   - Threshold: > 300ms
   - Why: Long-running transactions can cause lock contention and connection pool exhaustion.
Building Performance Dashboards
Dashboards give you at-a-glance visibility into application performance trends.
How to Create a Dashboard
- Go to Dashboards and click Create Dashboard
- Give it a name (e.g., “Performance Monitoring”)
- From any Trace Explorer query, click Save As > Dashboard Widget
- Choose the dashboard and select visualization type:
- Time Series: Trends over time (latency, throughput)
- Big Number: Current state (error rate, cache hit rate)
- Table: Top N slowest operations
Essential Performance Dashboard Widgets
1. API Endpoint Latency Overview
Widget Type: Time Series
Queries:
- Query 1: `span.description is GET /api/products`, Metric: `p90(span.duration)`, Name: "Products List"
- Query 2: `span.description is POST /api/orders`, Metric: `p90(span.duration)`, Name: "Order Creation"
- Query 3: `span.description is GET /api/products/search`, Metric: `p90(span.duration)`, Name: "Product Search"

Why: See all critical endpoints on one chart. Spot performance regressions after deployments.
2. Operation Type Breakdown
Widget Type: Time Series (stacked)
Configuration:
- Query: `span.op:*`
- Group By: `span.op`
- Metric: `avg(span.duration)`

Why: See the relative performance of HTTP requests, DB queries, cache operations, and external calls in one view.