Beyond Tracing - What do we do with all this data
created : Mon, 06 Mar 2023 02:28:31 +0900
modified : Mon, 06 Mar 2023 02:56:19 +0900
Links
Overview
- Metrics-generator
- Parquet
- TraceQL
Metrics-generator
Why metrics if you have traces?
Traces are transaction-oriented : Highly structured
Metrics are service-oriented : Aggregated, historical
Span metrics:
- Rate, Error, Duration
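To make the Rate / Error / Duration idea concrete, here is a minimal sketch of aggregating spans into per-operation metrics. The `Span` fields and the `red_metrics` helper are hypothetical names for illustration and are not Tempo's schema or API. The real metrics-generator exports Prometheus series rather than in-process averages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical field names, chosen for this sketch only.
    service: str
    name: str
    duration_ms: float
    error: bool


def red_metrics(spans, window_seconds):
    """Aggregate spans into per-(service, name) rate, error ratio, and mean duration."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span.service, span.name)].append(span)

    metrics = {}
    for key, group in groups.items():
        count = len(group)
        metrics[key] = {
            "rate_per_s": count / window_seconds,
            "error_ratio": sum(s.error for s in group) / count,
            "avg_duration_ms": sum(s.duration_ms for s in group) / count,
        }
    return metrics


spans = [
    Span("api", "GET /users", 120.0, False),
    Span("api", "GET /users", 80.0, True),
    Span("db", "SELECT", 10.0, False),
]
metrics = red_metrics(spans, window_seconds=60)
```

The individual spans are gone after aggregation, which is the trade-off noted above: metrics are cheap and historical, traces keep the per-transaction structure.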
Service graph metrics:
- Extract service topology
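A rough sketch of how service topology could be extracted from spans: every parent/child pair that crosses a service boundary becomes an edge in the graph. The tuple layout and the `service_edges` helper are assumptions made for this example, not how Tempo implements it.

```python
from collections import Counter

# Each span: (span_id, parent_id, service). The layout is illustrative only.
spans = [
    ("a", None, "frontend"),
    ("b", "a", "frontend"),
    ("c", "b", "cart"),
    ("d", "c", "postgres"),
    ("e", "b", "cart"),
]


def service_edges(spans):
    """Count calls between services by following child -> parent links."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges = Counter()
    for _, parent_id, service in spans:
        parent_service = service_of.get(parent_id)
        # Only count links whose parent is known and lives in another service.
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


edges = service_edges(spans)
```

Here `edges` holds two calls from `frontend` to `cart` and one from `cart` to `postgres`.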
Tempo launched Oct 2020
Tempo 1.0 Jun 2021
Search over recent data Nov 2021
Full backend search Jan 2022
Parquet storage format Dec 2021
What is Parquet?
- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
- What does this mean?
- Tempo can store and access data more efficiently
- So can you - large ecosystem of tools
- No new infrastructure - just a new file format
Schema
TraceID
Duration
Span #1
    Name
    ServiceName
    Tag #1
    Tag #2
    .
    Duration
Span #2
    Name
    ServiceName
    Tag #1
    Event #1
    .
    Duration
- Encodings:
- traceID into dictionary
- duration into delta
- tags into dictionary
- events into snappy
- FindTraceByID
- Attribute search:
cluster="foo", namespace="bar"
- The search matches against the spans' tags
- Flexible schema:
- easily add new columns (e.g. cluster, http.url)
- This makes it easy to find trace data using custom columns.
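The encodings and the column-based attribute search above can be sketched in plain Python. This is a toy model under stated assumptions: the column values are made up, the helper names are hypothetical, and real Parquet encodings are more elaborate than these versions.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    table, codes, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        codes.append(index[value])
    return table, codes


def delta_encode(values):
    """Store the first value, then the differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical columns: one entry per row, stored column by column.
cluster = ["foo", "foo", "bar", "foo"]
namespace = ["bar", "baz", "bar", "bar"]
duration_ms = [1000, 1010, 1015, 1040]

cluster_table, cluster_codes = dictionary_encode(cluster)
namespace_table, namespace_codes = dictionary_encode(namespace)
duration_deltas = delta_encode(duration_ms)


def search(cluster_value, namespace_value):
    """Row indices matching both values, reading only the two relevant columns."""
    if cluster_value not in cluster_table or namespace_value not in namespace_table:
        return []
    c = cluster_table.index(cluster_value)
    n = namespace_table.index(namespace_value)
    return [
        i for i, (cc, nc) in enumerate(zip(cluster_codes, namespace_codes))
        if cc == c and nc == n
    ]


rows = search("foo", "bar")  # cluster="foo", namespace="bar"
```

The search never touches `duration_ms` or any other column, which is the point of a column-oriented layout with dedicated columns for frequently searched tags.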
Inside a block
- Parquet:
- Open file format - use existing tools
parquet-tools head data.parquet
TraceQL
Selecting Traces - Basics
{ duration > 2s }
{ name = "GET /:endpoint" }
{ .http.status = 200 }
{ span.http.url =~ "/api/v1/.*" }
{ resource.namespace = "prod" }
{ .http.url="/:endpoint" && .http.status = 200 }
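One way to read these basic selectors is as predicates applied to each span. The sketch below models that reading in Python. The `Span` class, the `select` helper, and the sample data are hypothetical. Treating an unscoped `.attr` as "span scope first, then resource scope" and treating `=~` as a full match are assumptions of this sketch, not statements about Tempo's engine.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_s: float
    span_attrs: dict = field(default_factory=dict)      # "span." scope
    resource_attrs: dict = field(default_factory=dict)  # "resource." scope

    def attr(self, key):
        """Unscoped lookup (".key"): span attributes first, then resource attributes."""
        if key in self.span_attrs:
            return self.span_attrs[key]
        return self.resource_attrs.get(key)


def select(spans, predicate):
    """A spanset filter: keep the spans for which the predicate holds."""
    return [s for s in spans if predicate(s)]


spans = [
    Span("GET /:endpoint", 2.5, {"http.url": "/api/v1/users", "http.status": 200}, {"namespace": "prod"}),
    Span("GET /:endpoint", 0.1, {"http.url": "/:endpoint", "http.status": 200}, {"namespace": "dev"}),
    Span("dns.lookup", 0.7, {}, {"namespace": "prod"}),
]

# { duration > 2s }
slow = select(spans, lambda s: s.duration_s > 2)
# { span.http.url =~ "/api/v1/.*" }
api = select(spans, lambda s: re.fullmatch("/api/v1/.*", s.span_attrs.get("http.url", "")) is not None)
# { resource.namespace = "prod" }
prod = select(spans, lambda s: s.resource_attrs.get("namespace") == "prod")
# { .http.url = "/:endpoint" && .http.status = 200 }
both = select(spans, lambda s: s.attr("http.url") == "/:endpoint" and s.attr("http.status") == 200)
```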
TraceQL - Aggregates
{ .db.system = "postgres" } | count() > 3
{ name = "dns.lookup" } | avg(duration) > 500ms
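The aggregate queries can be read as "filter spans, group the matches per trace, then keep the traces whose aggregate passes the threshold". A minimal sketch of that reading, with a made-up tuple layout and a hypothetical `spansets` helper:

```python
from collections import defaultdict

# (trace_id, span name, db.system or None, duration in ms). The layout is illustrative only.
spans = [
    ("t1", "query", "postgres", 5),
    ("t1", "query", "postgres", 7),
    ("t1", "query", "postgres", 6),
    ("t1", "query", "postgres", 9),
    ("t1", "dns.lookup", None, 700),
    ("t2", "query", "postgres", 4),
    ("t2", "dns.lookup", None, 200),
    ("t2", "dns.lookup", None, 400),
]


def spansets(spans, predicate):
    """Group the spans that match the filter by trace: one spanset per trace."""
    sets = defaultdict(list)
    for span in spans:
        if predicate(span):
            sets[span[0]].append(span)
    return sets


# { .db.system = "postgres" } | count() > 3
postgres = spansets(spans, lambda s: s[2] == "postgres")
many_queries = sorted(t for t, ss in postgres.items() if len(ss) > 3)

# { name = "dns.lookup" } | avg(duration) > 500ms
dns = spansets(spans, lambda s: s[1] == "dns.lookup")
slow_dns = sorted(t for t, ss in dns.items() if sum(s[3] for s in ss) / len(ss) > 500)
```

In this sample only `t1` passes both queries: it has four postgres spans, and its single DNS lookup takes 700 ms, while `t2` averages 300 ms.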
TraceQL - Pipelines of Spansets
Selecting Traces - Structural
TraceQL - Structural
{ .service.name = "foo" } >> { .service.name = "bar" }
{ name = "tcp.connect" } ~ { name = "dns.lookup" }
{ .service.name != parent.service.name }
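The structural operators depend on parent/child links between spans. As a sketch of my understanding: `>>` selects right-hand spans that have a left-hand span among their ancestors, `~` selects right-hand spans that share a parent with a left-hand span, and `parent.` refers to the direct parent. The `Span` class, helper names, and sample trace below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    service: str


# One made-up trace: foo calls bar, which does a DNS lookup and a TCP connect.
spans = [
    Span("1", None, "GET /checkout", "foo"),
    Span("2", "1", "call-cart", "foo"),
    Span("3", "2", "add-item", "bar"),
    Span("4", "3", "dns.lookup", "bar"),
    Span("5", "3", "tcp.connect", "bar"),
]
by_id = {s.span_id: s for s in spans}


def ancestors(span):
    """Yield the parent, grandparent, and so on of a span within this trace."""
    current = by_id.get(span.parent_id)
    while current is not None:
        yield current
        current = by_id.get(current.parent_id)


def descendant(lhs, rhs):
    """{lhs} >> {rhs}: rhs spans that have some lhs span among their ancestors."""
    return [s for s in spans if rhs(s) and any(lhs(a) for a in ancestors(s))]


def sibling(lhs, rhs):
    """{lhs} ~ {rhs}: rhs spans that share a parent with some other lhs span."""
    return [
        s for s in spans
        if rhs(s) and s.parent_id is not None and any(
            lhs(o) and o is not s and o.parent_id == s.parent_id for o in spans
        )
    ]


# { .service.name = "foo" } >> { .service.name = "bar" }
foo_to_bar = descendant(lambda s: s.service == "foo", lambda s: s.service == "bar")
# { name = "tcp.connect" } ~ { name = "dns.lookup" }
dns_next_to_tcp = sibling(lambda s: s.name == "tcp.connect", lambda s: s.name == "dns.lookup")
# { .service.name != parent.service.name }
crossings = [s for s in spans if s.parent_id in by_id and by_id[s.parent_id].service != s.service]
```

The last query is a compact way to find service boundaries: in this trace only span `3` has a parent in a different service.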
Personal Notes
- How do we connect metrics to the tracing graph? Is the trace ID alone enough?
- Prometheus is known to be troublesome to operate because its architecture cannot be scaled horizontally.
- To work around this, many companies use Thanos and keep data at low resolution.
- In that setup, trace data is a new kind of data that also has to be stored.
- In my opinion, this needs to be examined from a storage standpoint, especially retention and resolution, because traces are linked to metrics that may be stored with short retention and low resolution.