Mastering-OpenTelemetry-And-Observability

created : Sat, 12 Apr 2025 22:55:34 +0900
modified : Wed, 23 Apr 2025 02:28:09 +0900

Mastering OpenTelemetry and Observability

Chapter 1. What Is Observability?

Definition

Cloud Native Era

Monitoring Compared to Observability

Types of Monitoring


Metadata

Dimensionality

Cardinality

Semantic Conventions

Data Sensitivity

Signals

Metrics

Logs

Traces

Other Signals

Collecting Signals

Instrumentation

Push Versus Pull Collection

Data Collection

Sampling Signals

Observability

Application Performance Monitoring

The Bottom Line

Chapter 2. Introducing OpenTelemetry

Background

Observability Pain Points

The Rise of Open Source Software

Specification

Data Collection

Instrumentation

OpenTelemetry Concepts

Distributions

Pipelines

Resources

Registry

Roadmap

The Bottom Line

Chapter 3. Getting Started with the Astronomy Shop

Chapter 4. Understanding the OpenTelemetry Specification

API Specification

API Definition

API Context

API Signals

API Implementation

SDK Specification

SDK Definition

SDK Signals

SDK Implementation

Data Specification

Data Models

| Instrument | Properties | Type | Default Aggregation |
|---|---|---|---|
| Counter | Monotonic | Synchronous | Sum |
| UpDownCounter | Additive | Synchronous | Sum |
| ObservableCounter | Monotonic | Asynchronous | Sum |
| ObservableUpDownCounter | Additive | Asynchronous | Sum |
| Gauge | Nonadditive | Synchronous | Last Value |
| ObservableGauge | Nonadditive | Asynchronous | Last Value |
| Histogram | Grouped | Synchronous | Histogram |

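
The default aggregations in the instrument table can be illustrated with a small toy model. This is only a sketch in plain Python, not the OpenTelemetry SDK: the `aggregate` function and the `EXAMPLE_BOUNDARIES` bucket boundaries are hypothetical names and values made up for this example. Only the synchronous instruments are modeled, because asynchronous (observable) instruments report current values from callbacks rather than individual increments.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Union

# Default aggregation per synchronous instrument, copied from the table.
DEFAULT_AGGREGATION: Dict[str, str] = {
    "Counter": "Sum",
    "UpDownCounter": "Sum",
    "Gauge": "Last Value",
    "Histogram": "Histogram",
}

# Hypothetical bucket boundaries, chosen for this example only.
EXAMPLE_BOUNDARIES: Sequence[float] = (10.0, 100.0, 1000.0)


def aggregate(instrument: str, measurements: List[float]) -> Union[float, List[int]]:
    """Collapse raw measurements using the instrument's default aggregation."""
    kind = DEFAULT_AGGREGATION[instrument]
    if instrument == "Counter" and any(m < 0 for m in measurements):
        # Monotonic: a Counter only ever goes up.
        raise ValueError("Counter measurements must be non-negative")
    if kind == "Sum":
        return sum(measurements)
    if kind == "Last Value":
        # Nonadditive: only the most recent measurement is kept.
        return measurements[-1]
    # Histogram: count measurements per bucket. A value equal to a boundary
    # lands in the bucket that the boundary closes (upper-inclusive).
    counts = [0] * (len(EXAMPLE_BOUNDARIES) + 1)
    for value in measurements:
        counts[bisect_left(EXAMPLE_BOUNDARIES, value)] += 1
    return counts
```

For example, `aggregate("UpDownCounter", [5, -2])` adds the increments, while `aggregate("Gauge", [20.5, 21.0])` keeps only the final reading.
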
| Field Name | Description | Notes |
|---|---|---|
| Timestamp | When the event occurred | Common syslog concepts |
| ObservedTimestamp | When the event was observed | |
| SeverityText | Log level | |
| SeverityNumber | Numeric value of log level | |
| Body | The message of the log record | |
| Resource | Source information | OTel concept; metadata |
| Attributes | Additional information | |
| InstrumentationScope | Scope that emitted the log record | |
| TraceId | Request trace ID | Used to enable trace correlation |
| SpanId | Request span ID | |
| TraceFlags | W3C trace flags | |
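
To make the log record fields concrete, here is a minimal sketch of a record type and a trace-correlation lookup. The `LogRecord` dataclass and `logs_for_trace` helper are hypothetical names for illustration, not the OpenTelemetry SDK types. The nanosecond timestamps and hex-string IDs are assumptions about representation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LogRecord:
    """Illustrative container mirroring the field names in the table."""

    timestamp: int              # Timestamp (assumed: nanoseconds since epoch)
    observed_timestamp: int     # ObservedTimestamp (same assumed unit)
    severity_text: str          # SeverityText, e.g. "INFO"
    severity_number: int        # SeverityNumber, e.g. 9 for INFO, 17 for ERROR
    body: Any                   # Body: the message of the log record
    resource: Dict[str, Any] = field(default_factory=dict)    # Resource
    attributes: Dict[str, Any] = field(default_factory=dict)  # Attributes
    instrumentation_scope: str = ""                           # InstrumentationScope
    trace_id: Optional[str] = None   # TraceId (assumed: hex string)
    span_id: Optional[str] = None    # SpanId (assumed: hex string)
    trace_flags: int = 0             # TraceFlags (W3C trace flags)


def logs_for_trace(records: List[LogRecord], trace_id: str) -> List[LogRecord]:
    """Return the log records emitted while handling the given trace."""
    return [r for r in records if r.trace_id == trace_id]
```

Because each record carries the trace and span IDs, filtering logs by `trace_id` is enough to line them up with the spans of the same request.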

Data Protocols

Data Semantic Conventions

Data Compatibility

General Specification

The Bottom Line

Chapter 5. Managing the OpenTelemetry Collector

Deployment Modes

Agent Mode

Gateway Mode

Flow: Instrumentation to observability platform
Pros:
- Quickest time to value; simplicity.
- Lowest latency.
Cons:
- Less data processing flexibility and requires language-specific components, such as resource detection and configuration.
- Operational complexity as each language and possibly each application needs to be independently configured.
- Added resource requirements to handle processing, buffering, and retry logic.
- Decentralized security controls.

Flow: Instrumentation to agent to observability platform
Pros:
- Quick time to value, especially given that instrumentation sends data to a local OTLP destination by default.
- Separates telemetry generation from transmission, reducing application load.
- Enhanced data processing capabilities and dynamic configuration without redeploying applications.
Cons:
- Agent is a single point of failure and must be sized and monitored properly.

Flow: Instrumentation to gateway to observability platform
Pros:
- When deployed as a gateway cluster, separates telemetry generation from transmission without a single point of failure.
- Supports advanced data processing capabilities, including metric aggregation and tail-based sampling.
- Useful in certain environments, such as serverless, where an agent deployment may not be possible.
Cons:
- Cannot offload all application processing capabilities, including resource detection.
- Requires thought when configuring pull-based receivers to ensure proper load balancing and no data duplication.
- May introduce unacceptable latency, impacting applications.

Flow: Instrumentation to agent to gateway to observability platform
Pros:
- The pros of agent and gateway mode.
- Supports the most use cases and requirements while providing the most data flexibility and portability.
Cons:
- Complex configuration and highest management costs.

Sizing

Components

Configuration

| Category | Examples |
|---|---|
| Metadata processing | k8sattributesprocessor, resourceprocessor |
| Filtering, routing, and sampling | filterprocessor, routingprocessor (deprecated in favor of the routing connector), tailsamplingprocessor |
| Enriching | k8sattributesprocessor, resourcedetectionprocessor |
| Generating (primarily metrics) | metricsgenerationprocessor, spanmetricsprocessor |
| Grouping (helpful in batching and processing) | groupbyattrsprocessor, groupbytraceprocessor (valid for tail-based sampling) |
| Transforming (primarily metrics) | cumulativetodeltaprocessor, deltatorateprocessor, schemaprocessor |
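
The two metric transformations named in the last row can be sketched in a few lines. This is a conceptual illustration only, not the Collector processors' code. The function names, the tuple layout, and the reset handling (a drop in the running total is treated as a restart from zero, and the first point is used only as a baseline) are all assumptions made for this example.

```python
from typing import List, Tuple

CumulativePoint = Tuple[float, float]       # (timestamp in seconds, running total)
DeltaPoint = Tuple[float, float, float]     # (timestamp, increase, elapsed seconds)
RatePoint = Tuple[float, float]             # (timestamp, increase per second)


def cumulative_to_delta(points: List[CumulativePoint]) -> List[DeltaPoint]:
    """Turn running totals into per-interval increases."""
    deltas: List[DeltaPoint] = []
    for (prev_ts, prev), (ts, cur) in zip(points, points[1:]):
        # Assumption: a lower total means the counter restarted from zero,
        # so the new total is itself the increase for this interval.
        delta = cur - prev if cur >= prev else cur
        deltas.append((ts, delta, ts - prev_ts))
    return deltas


def delta_to_rate(deltas: List[DeltaPoint]) -> List[RatePoint]:
    """Divide each increase by the interval length to get a per-second rate."""
    return [(ts, delta / elapsed) for ts, delta, elapsed in deltas if elapsed > 0]
```

Chaining the two, as in `delta_to_rate(cumulative_to_delta(points))`, mirrors placing both transformations one after the other in a metrics pipeline.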

Extensions

| Category | Examples |
|---|---|
| Authentication (used by receivers and exporters) | basicauthextension, bearertokenauthextension, oidcauthextension |
| Health and troubleshooting | healthcheckextension, pprofextension, remotetapextension, zpagesextension |
| Observers (used by receivers to discover and collect data dynamically) | dockerobserver, hostobserver, k8sobserver |
| Persistence (via a database or filesystem) | storage/dbstorage, storage/filestorage |

Connectors

Observing

Relevant Metrics

Troubleshooting

Out of Memory crashes

Data Not Being Received or Exported

Performance Issues

Beyond the Basics

Distributions

The Bottom Line

Chapter 6. Leveraging OpenTelemetry Instrumentation

Distributions

The Bottom Line

Chapter 7. Adopting OpenTelemetry

The Basics

Why OTel and Why Now?

Instrumentation

Production Readiness

Maturity Framework

Brownfield Deployment

Data Collection

Instrumentation

Dashboards and Alerts

Greenfield Deployment

Data Collection

Instrumentation

Other Considerations

Administration and Maintenance

Environments

Semantic Conventions

The Future

The Bottom Line

Chapter 8. The Power of Context and Correlation

Background

Context

OTel Context

Trace Context

Challenges

Resource Context

Logic Context

Correlation

Time Correlation

Context Correlation

Trace Correlation

Metric Correlation

The Bottom Line

Chapter 9. Choosing an Observability Platform

Primary Considerations


| Requirements | Examples | Priority |
|---|---|---|
| Data collection for specific environments | K8s, OpenShift, Cloud Foundry | P0 |
| Platform integration to specific applications | ServiceNow, Slack, PagerDuty | P0 |
| Platform compliance | SOC 2 Type II, PCI, FedRAMP | P0 |
| Important platform capabilities | SLIs/SLOs, Session replay (RUM) | P1 |
| Nice to have platform capabilities | Auto-discovery, built-in content | P2 |

Platform Capabilities

Marketing Versus Reality

| Scenario | Goals |
|---|---|
| Monitor the performance of a distributed application under heavy load. | Instrument and collect data from applications, and identify performance and latency issues. |
| Respond to and diagnose an incident affecting application availability. | Receive alerts, troubleshoot with real-time dashboards, and determine the root cause. |
| Plan and execute scaling operations based on observed workload patterns. | Monitor resource utilization and set automatic-scaling policies. |
| Validate the performance and stability of a new application release. | Correlate deployment events with other signals, monitor KPIs, and use anomaly detection. |
| Monitor and maintain service levels defined by SLOs for critical services. | Define SLIs, alert against SLOs, and analyze historical data. |

Price, Cost, and Value

Observability Fragmentation

Primary Factors

Build, Buy, or Manage

| Approach | Pros | Cons |
|---|---|---|
| Build | Flexibility and choice | Distraction from the business value proposition |
| Buy | Reduced OpEx | Vendor lock-in and increased CapEx over time |
| Manage | Infrastructure CapEx only | Increased CapEx and OpEx over time |
| Combination | Flexibility and choice | Multiple systems to manage; inconsistency |

Licensing, Operations, and Deployment

| Decision | Options | Notes |
|---|---|---|
| Licensing | Open source or proprietary | Capabilities are usually more important than the licensing model. |
| Operations | Self or vendor-managed | Open source vendor-managed may not be the same as open source self-managed. |
| Deployment | On-premises or SaaS | Most vendor-managed observability platforms are SaaS-based today. |

| Factor | Notes |
|---|---|
| OTLP ingestion | Beyond being supported, is it the default setting, and if not, why? |
| Distribution | If one is offered, does it contain proprietary components? |
| API and SDK | What versions and features are supported? |
| Instrumentation | Which languages are supported? |
| Collector | Which components are supported? |
| Semantic conventions | Does the platform support semantic conventions? |
| General support | If vendor-managed, is OTel support provided? |
| Contributions | What commitment, influence, and ability to support is provided? |

Stakeholders and Company Culture

| Stakeholder | Example | Influenced by |
|---|---|---|
| Buyer | CTO or VP of Engineering | Legal, Finance, and Security |
| Administrator | SRE Team | Development Team |
| User | Development Team | SRE Team |

Implementation Basics

Administration

Usage

Maturity Framework

The Bottom Line

Chapter 10. Observability Antipatterns and Pitfalls

Telemetry Data Missteps

Mixing Instrumentation Libraries Scenario

Automatic Instrumentation Scenario

Custom Instrumentation Scenario

Component Configuration Scenario

Performance Overhead Scenario

Resource Allocation Scenario

Security Considerations Scenario

Monitoring and Maintenance Scenario

Observability Platform Missteps

Vendor Lock-in Scenario

Fragmented Tooling Scenario

Tool Fatigue Scenario

Inadequate Scalability Scenario

Data Overload Scenario

Company Culture Implications

Lack of Leadership Support Scenario

Resistance to Change Scenario

Collaboration and Alignment Scenario

Goals and Success Criteria Scenario

Standardization and Consistency Scenario

Incentives and Recognition Scenario

Feedback and Improvement Scenario

Prioritization Framework

| Scenario | Unweighted Score | Effort to Fix | Priority |
|---|---|---|---|
| Automatic Instrumentation | 19 | Low | P0 |
| Component Configuration | 18 | Low | P0 |
| Resource Allocation | 18 | Low | P0 |
| Performance Overhead | 18 | Medium | P0 |
| Monitoring and Maintenance | 18 | Medium | P0 |
| Inadequate Scalability | 18 | High | P1 |
| Tool Fatigue | 17 | Medium | P1 |
| Fragmented Tooling | 17 | High | P1 |
| Lack of Leadership Support | 17 | High | P1 |
| Resistance to Change | 17 | High | P1 |
| Custom Instrumentation | 16 | High | P1 |
| Vendor Lock-in | 15 | High | P1 |
| Mixing Instrumentation Libraries | 14 | High | P1 |
| Standardization and Consistency | 13 | Medium | P2 |
| Incentives and Recognition | 12 | Low | P2 |
| Goals and Success Criteria | 11 | Low | P2 |
| Feedback and Improvement | 11 | Medium | P2 |
| Security Considerations | 10 | Medium | P2 |
| Collaboration and Alignment | 10 | Medium | P2 |
| Data Overload | 8 | Medium | P2 |
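
The ordering of the prioritization table (highest score first, then lowest effort) can be reproduced with a short sketch. The `Scenario` type and the `rank` and `priority` functions are hypothetical names. The thresholds in `priority` are an assumption inferred from the rows of the table, not a rule stated in these notes.

```python
from typing import List, NamedTuple

# Lower rank means less effort to fix.
EFFORT_RANK = {"Low": 0, "Medium": 1, "High": 2}


class Scenario(NamedTuple):
    name: str
    score: int    # unweighted score from the table
    effort: str   # "Low", "Medium", or "High"


def rank(scenarios: List[Scenario]) -> List[Scenario]:
    """Sort by score (descending), then by effort to fix (ascending)."""
    return sorted(scenarios, key=lambda s: (-s.score, EFFORT_RANK[s.effort]))


def priority(scenario: Scenario) -> str:
    """Assign a priority bucket using thresholds inferred from the table."""
    # Assumption: P0 needs a score of 18 or more and at most medium effort.
    if scenario.score >= 18 and scenario.effort != "High":
        return "P0"
    # Assumption: scores from 14 up that are not P0 fall into P1.
    if scenario.score >= 14:
        return "P1"
    return "P2"
```

With these assumed thresholds, "Inadequate Scalability" (18, High) lands in P1 while "Performance Overhead" (18, Medium) stays in P0, which matches the table.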

The Bottom Line

Chapter 11. Observability at Scale

Understanding the Challenges

Volume and Velocity of Telemetry Data

Distributed System Complexity

Observability Platform Complexity

Infrastructure and Resource Constraints

Strategies for Scaling Observability

Elasticity, Elasticity, Elasticity!

Leverage Cloud Native Technologies