Beyond Tracing - What do we do with all this data

Overview

Metrics-generator
Parquet
TraceQL

Metrics-generator

Why metrics if you have traces?

Transaction-oriented : Highly structured
Service-oriented : Aggregated, historical
Span metrics:
- Rate, Error, Duration
Service graph metrics:
- Extract service topology
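
To make the two metric families above concrete, here is a minimal sketch of how rate/error/duration counters and service-graph edges could be derived from raw spans. The span fields (`span_id`, `parent_id`, `service`, `duration_ms`, `error`) and both function names are assumptions for illustration, not the metrics-generator's actual data model or API.

```python
from collections import defaultdict

# Hypothetical span records; the field names are assumptions for illustration.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "duration_ms": 120, "error": False},
    {"span_id": "b", "parent_id": "a", "service": "cart", "duration_ms": 80, "error": True},
    {"span_id": "c", "parent_id": "b", "service": "db", "duration_ms": 30, "error": False},
    {"span_id": "d", "parent_id": "a", "service": "cart", "duration_ms": 40, "error": False},
]


def span_metrics(spans):
    """Aggregate request count, error count, and total duration per service."""
    out = defaultdict(lambda: {"count": 0, "errors": 0, "duration_ms_sum": 0})
    for s in spans:
        m = out[s["service"]]
        m["count"] += 1
        m["errors"] += int(s["error"])
        m["duration_ms_sum"] += s["duration_ms"]
    return dict(out)


def service_graph(spans):
    """Count parent -> child calls that cross a service boundary."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s["parent_id"])
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)


red = span_metrics(spans)
graph = service_graph(spans)
```

Dividing `count` by the observation window gives a rate, and `duration_ms_sum / count` gives a mean duration; a real system would more likely use histograms for duration.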
Tempo launched Oct 2020
Tempo 1.0 Jun 2021
Search over recent data Nov 2021
Full backend search Jan 2022
Parquet storage format Dec 2021

What is Parquet?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
What does this mean?:
- Tempo can store and access data more efficiently
- So can you - a large ecosystem of tools
- No new infrastructure - just a new file format
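
As a rough illustration of what "column-oriented" means, the sketch below pivots row records into one list per field, so a query on a single field only touches that field's values. It assumes every row has the same keys; the sample rows and the `to_columns` helper are hypothetical and have nothing to do with Parquet's actual on-disk layout.

```python
# Hypothetical spans stored row by row.
rows = [
    {"name": "GET /api", "service": "frontend", "duration_ms": 120},
    {"name": "SELECT", "service": "db", "duration_ms": 30},
    {"name": "GET /cart", "service": "cart", "duration_ms": 250},
]


def to_columns(rows):
    """Pivot row records into one list per field (a column-oriented layout)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# A filter on duration reads only the duration column, not whole records.
slow_rows = [i for i, d in enumerate(columns["duration_ms"]) if d > 100]
```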

Schema

TraceID
Duration
  Span #1
    Name
    ServiceName
    Tag #1
    Tag #2
    .
    Duration
  Span #2
    Name
    ServiceName
    Tag #1
    Event #1
    .
    Duration

1. Encodings:
- traceID into dictionary
- duration into delta
- tags into dictionary
- events into snappy
1. FindTraceByID
1. Attribute search:
- cluster="foo", namespace="bar"
- The search matches on span tags
1. Flexible schema:
- Easily add a new column (e.g. cluster, http.url)
- Custom columns make it easy to find trace data by those attributes.
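
The dictionary and delta encodings listed above can be sketched in a few lines. This only shows the idea; the function names are made up for illustration, and Parquet's real encodings add further steps such as bit-packing and run-length encoding.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer indexes into a dictionary."""
    dictionary, indexes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(seen[v])
    return dictionary, indexes


def dictionary_decode(dictionary, indexes):
    """Look each index back up in the dictionary."""
    return [dictionary[i] for i in indexes]


def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


tags = ["prod", "prod", "dev", "prod"]
durations = [100, 105, 103, 110]
encoded_tags = dictionary_encode(tags)
encoded_durations = delta_encode(durations)
```

Dictionary encoding pays off when a column has few distinct values (tags such as cluster or namespace); delta encoding pays off when neighbouring values are close together.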

Inside a block

Parquet:
- Open file format - use existing tools
- parquet-tools head data.parquet

TraceQL

Selecting Traces - Basics

{ duration > 2s }
{ name = "GET /:endpoint" }
{ .http.status = 200 }
{ span.http.url =~ "/api/v1/.*" }
{ resource.namespace = "prod" }
{ .http.url="/:endpoint" && .http.status = 200 }
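
To show what these selectors do, here is a minimal sketch that mimics three of them as Python predicates over in-memory spans. The span layout (`name`, `duration_s`, `attrs`) and the `select` helper are assumptions for illustration, and treating `=~` as a full-string match is also an assumption; this is not how Tempo evaluates TraceQL.

```python
import re

# Hypothetical spans; the field names are assumptions for illustration.
spans = [
    {"name": "GET /:endpoint", "duration_s": 2.5,
     "attrs": {"http.url": "/:endpoint", "http.status": 200}},
    {"name": "GET /:endpoint", "duration_s": 0.1,
     "attrs": {"http.url": "/:endpoint", "http.status": 500}},
    {"name": "POST /api/v1/items", "duration_s": 0.3,
     "attrs": {"http.url": "/api/v1/items", "http.status": 200}},
]


def select(spans, predicate):
    """Return the spans for which the predicate holds."""
    return [s for s in spans if predicate(s)]


# { duration > 2s }
slow = select(spans, lambda s: s["duration_s"] > 2)

# { .http.url="/:endpoint" && .http.status = 200 }
ok = select(
    spans,
    lambda s: s["attrs"].get("http.url") == "/:endpoint"
    and s["attrs"].get("http.status") == 200,
)

# { span.http.url =~ "/api/v1/.*" }  (full-match semantics assumed here)
api = select(
    spans,
    lambda s: re.fullmatch(r"/api/v1/.*", s["attrs"].get("http.url", "")) is not None,
)
```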

TraceQL - Aggregates

{ .db.system = "postgres" } | count() > 3
{ name = "dns.lookup" } | avg(duration) > 500ms
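
The two aggregate queries can be read as "filter the spans of each trace, then test a condition on the resulting spanset". The sketch below mimics that in plain Python; the trace data and the `span` and `matching_traces` helpers are hypothetical, and dropping traces with an empty spanset is an assumption of this sketch.

```python
def span(name, duration_ms, attrs=None):
    """Build a hypothetical span record."""
    return {"name": name, "duration_ms": duration_ms, "attrs": attrs or {}}


# Hypothetical traces: trace ID -> list of spans.
traces = {
    "t1": [span("query", 10, {"db.system": "postgres"}) for _ in range(4)]
          + [span("dns.lookup", 600)],
    "t2": [span("query", 10, {"db.system": "postgres"}) for _ in range(2)]
          + [span("dns.lookup", 100)],
    "t3": [span("dns.lookup", 400), span("dns.lookup", 700)],
}


def matching_traces(traces, span_filter, spanset_condition):
    """Keep traces whose filtered spanset is non-empty and meets the condition."""
    result = []
    for trace_id, spans in traces.items():
        spanset = [s for s in spans if span_filter(s)]
        if spanset and spanset_condition(spanset):
            result.append(trace_id)
    return result


# { .db.system = "postgres" } | count() > 3
many_db_calls = matching_traces(
    traces,
    lambda s: s["attrs"].get("db.system") == "postgres",
    lambda spanset: len(spanset) > 3,
)

# { name = "dns.lookup" } | avg(duration) > 500ms
slow_dns = matching_traces(
    traces,
    lambda s: s["name"] == "dns.lookup",
    lambda spanset: sum(s["duration_ms"] for s in spanset) / len(spanset) > 500,
)
```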

TraceQL - Pipelines of Spansets

Selecting Traces - Structural

TraceQL - Structural

{ .service.name = "foo" } >> {.service.name = "bar" }
{ name = "tcp.connect" } ~ { name = "dns.lookup" }
{ .service.name != parent.service.name }
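
A minimal sketch of the three structural queries over a single in-memory trace follows. It assumes that `>>` and `~` return the spans matching the right-hand side, and that spans carry hypothetical `id`, `parent`, `service`, and `name` fields; none of this reflects Tempo's implementation.

```python
# Hypothetical spans of one trace; "parent" is the parent span's id.
spans = [
    {"id": "1", "parent": None, "service": "foo", "name": "handle"},
    {"id": "2", "parent": "1", "service": "foo", "name": "dns.lookup"},
    {"id": "3", "parent": "1", "service": "foo", "name": "tcp.connect"},
    {"id": "4", "parent": "3", "service": "bar", "name": "query"},
]
by_id = {s["id"]: s for s in spans}


def ancestors(span):
    """Yield every ancestor of a span, nearest first."""
    current = by_id.get(span["parent"])
    while current is not None:
        yield current
        current = by_id.get(current["parent"])


# { .service.name = "foo" } >> { .service.name = "bar" }
# "bar" spans that have a "foo" span somewhere above them.
descendants = [
    s for s in spans
    if s["service"] == "bar" and any(a["service"] == "foo" for a in ancestors(s))
]

# { name = "tcp.connect" } ~ { name = "dns.lookup" }
# dns.lookup spans that share a parent with a tcp.connect span.
siblings = [
    s for s in spans
    if s["name"] == "dns.lookup" and s["parent"] is not None
    and any(
        o["name"] == "tcp.connect" and o["parent"] == s["parent"] and o["id"] != s["id"]
        for o in spans
    )
]

# { .service.name != parent.service.name }
# Spans whose service differs from their parent's service.
boundary = [
    s for s in spans
    if s["parent"] in by_id and s["service"] != by_id[s["parent"]]["service"]
]
```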

Personal Notes

How do we connect metrics and the tracing graph? Is the trace ID alone enough?:
- Prometheus is known to be troublesome to operate at scale because its architecture cannot be scaled horizontally.
- To work around this, many companies run Thanos and keep low-resolution data.
- In that setup, tracing information is an additional kind of data that has to be stored.
- In my opinion, this needs to be examined from a storage perspective, especially retention and resolution, because traces are linked to metrics that may be stored with short retention and low resolution.


created : Mon, 06 Mar 2023 02:28:31 +0900

modified : Mon, 06 Mar 2023 02:56:19 +0900
