prometheus-native-histograms-in-production
created : 2023-09-03T16:05:34+00:00
modified : 2023-09-03T16:32:44+00:00
Links
- https://www.youtube.com/watch?v=TgINvIK9SYc&list=PLj6h78yzYM2ORxwcjTn4RLAOQOYjvQ2A3&index=6
Disclaimer
- Native Histograms are an experimental feature!
- Everything described here can still change!
- Things might break or behave weirdly!
prometheus --enable-feature=native-histograms
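On the instrumentation side, a classic histogram turns into a native one by setting a bucket factor in client_golang (v1.15+). A minimal sketch, assuming the metric name from the experiment; the values, port, and simulated observations are illustrative, not the talk's actual code:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Setting NativeHistogramBucketFactor opts the histogram into the native
	// (sparse, exponential-bucket) representation; no manual bucket boundaries
	// are configured. 1.1 (at most ~10% growth per bucket) is an illustrative choice.
	reqDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:                        "cortex_request_duration_seconds",
		Help:                        "Request duration in seconds.",
		NativeHistogramBucketFactor: 1.1,
	})
	prometheus.MustRegister(reqDuration)

	// Simulated observations standing in for real request latencies.
	go func() {
		for {
			reqDuration.Observe(rand.ExpFloat64() / 10)
			time.Sleep(100 * time.Millisecond)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```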
Wishlist
- Everything that works well now should continue to work well.
- I never want to configure buckets again.
- All histograms should always be aggregatable with each other, across time and space.
- I want accurate quantile and percentage estimations across the whole range of observations.
- I want all of that at a lower cost than current histograms so that I can finally partition histograms at will.
1. Resource consumption of the instrumented binary
2. Frequency of resets and resolution reduction
- Scraping 15 instances of the cloud-backend-gateway.
- Drop everything but the cortex_request_duration_seconds histograms.
- Scraping classic histograms:
  - 964 histograms (peak)
  - 16388 series (964 * 17: 15 bucket series + _sum + _count per histogram)
  - 14460 buckets (964 * 15)
- Scraping native histograms:
  - 964 histograms (peak)
  - 964 series (one series per native histogram)
Frequency of resets to reduce bucket count
- Top 10 reset histograms are all:
{route=~"api_prom_api_v1_query(_range)", status_code="200"}
- Even among those, typically just a handful of resets per day.
- Worst offender during the 15d of the experiment: 8 resets per day. (Please check the original video)
- Rarely touching the configured 1h limit.
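For context, a sketch of the client_golang options that govern this behaviour, with assumed values rather than the experiment's exact configuration (the "1h limit" above presumably maps to NativeHistogramMinResetDuration):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Sketch with assumed values. If the number of live buckets would exceed
	// the configured maximum, client_golang reduces the histogram's cost
	// (widening the zero bucket, reducing resolution, or resetting it; see the
	// library docs for the exact order). A reset is only permitted once the
	// minimum reset duration has elapsed.
	reqDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:                            "cortex_request_duration_seconds",
		Help:                            "Request duration in seconds.",
		NativeHistogramBucketFactor:     1.1,
		NativeHistogramMaxBucketNumber:  160,       // cap on live buckets (assumed value)
		NativeHistogramMinResetDuration: time.Hour, // at most one reset per hour
	})
	prometheus.MustRegister(reqDuration)
}
```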
Frequency of resolution reduction
- Only ever happened one step (from growth factor 1.0905… to 1.1892…).
- Happens “occasionally”…
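For reference (not from the talk): those two factors are the standard exponential schemas 3 and 2. With schema s the bucket growth factor is 2^(2^-s), so one resolution-reduction step (s → s-1) squares the growth factor. A small sketch:

```go
package main

import (
	"fmt"
	"math"
)

// Native histograms use standard exponential bucketing: with schema s, each
// bucket boundary is the previous one multiplied by 2^(2^-s).
func growthFactor(schema int) float64 {
	return math.Pow(2, math.Pow(2, -float64(schema)))
}

func main() {
	fmt.Printf("schema 3: %.4f\n", growthFactor(3)) // 1.0905 = 2^(1/8)
	fmt.Printf("schema 2: %.4f\n", growthFactor(2)) // 1.1892 = 2^(1/4), i.e. 1.0905 squared
}
```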