Mastering OpenTelemetry and Observability
- Enhancing Application and Infrastructure Performance and Avoiding Outages
- Terms:
- Backend: The data access layer of an application, which often includes processing and persistence of data.
- Framework: A structure on which other things are built.
- OTel is a telemetry framework that can be extended to support various use cases.
- Frontend: The presentation layer of an application, which is often a user interface or other user-facing way to interact with an application.
- Instrumentation: Software added to an application to generate telemetry data. Various forms of instrumentation are available, including automatic, which is injected at runtime; manual, which is added within the existing code; and programmatic, which is a particular form of manual instrumentation where specific libraries or frameworks have already been instrumented.
- Platform: An environment in which software is executed. An observability or monitoring platform typically consists of one or more backend and frontend components.
- Telemetry: Data used to determine the health, performance, and usage of applications.
Chapter 1. What Is Observability?
Definition
- Observability is a system property that defines the degree to which the system can generate actionable insights. It allows users to understand a system’s state from these external outputs and take (corrective) action.
- The goal of observability should be where a system’s state can be fully understood from its external output without the need to ship code.
- Observability is not just about collecting data but about collecting meaningful data.
- A system is truly observable when you can troubleshoot without prior knowledge of the system.
Cloud Native Era
Monitoring Compared to Observability
- The data collection needs to be added before issues happen; otherwise, you cannot proactively determine nor quickly resolve problems as they arise.
Types of Monitoring
- White-box monitoring
- Black-box monitoring
- Due to the difficulty in troubleshooting microservice-based architectures, a popular meme was shared throughout the community:
- We replaced our monolith with microservices so that every outage could be more like a murder mystery.
Metadata
Dimensionality
- Dimensionality refers to the number of unique keys within a set.
Cardinality
- Cardinality refers to the number of unique values for a given key within a set.
Semantic Conventions
- Semantic conventions (semconvs) are standardized dimensions, or keys, for metadata; they ensure consistency in how data is recorded, labeled, and interpreted across different systems and services.
Data Sensitivity
- Metadata can contain sensitive information.
Signals
- MELT: metrics, events, logs, and traces
Metrics
- RED: requests, errors, and duration
- USE: utilization, saturation, and errors
- Four golden signals: latency, traffic, errors, saturation
Logs
- Structured
- Unstructured
Traces
- A trace is a recording of a time-based transaction or end-to-end request with metadata
- Head-based
- Tail-based
Other Signals
- Baggage: metadata passed between spans that must be explicitly added to signals.
- Sessions: used in Real User Monitoring (RUM) to analyze user experience (UX)
- Profiles
Collecting Signals
- Metrics emit small payloads at very frequent intervals. As a result, performance is critical. Metrics are usually neither collected nor sent anywhere by default. While it is easy to add metrics and many frameworks exist, most metrics do not contain context or correlation information.
- Logs can contain richer information than metrics but, as a result, have larger payloads, and parsing requires proper formatting. Logs are usually written to a destination like a disk or a remote solution. While frameworks exist to add logs to applications, developers add most logging manually. Like metrics, most logs do not contain context or correlation.
- Traces are similar to logs. While they are easier to parse, realizing their full potential requires assembling an entire trace. Traces require passing a context (typically a header) between requests. In addition, adding trace instrumentation is often significantly more challenging than metrics or logs. Trace payloads are as big as, if not bigger than, logs and are frequently sampled.
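Because traces depend on passing context between requests, here is a minimal Python sketch (service and span names are illustrative) showing how the W3C `traceparent` header gets injected into outgoing request headers:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout")            # hypothetical instrumentation scope
propagator = TraceContextTextMapPropagator()

with tracer.start_as_current_span("charge-card"):
    headers = {}                                 # outgoing HTTP request headers
    propagator.inject(headers)                   # adds a `traceparent` entry
    print(headers)                               # e.g. {'traceparent': '00-<trace-id>-<span-id>-01'}
```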
Instrumentation
- Manual, Automatic, Via an instrumentation library
Push Versus Pull Collection
- For cloud native workloads, pull-based collection is most common for metrics.
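A Collector-side sketch of the two models, assuming a hypothetical application scraped on `app:9464`: the OTLP receiver accepts pushed data, while the Prometheus receiver pulls by scraping:

```yaml
receivers:
  otlp:                  # push: applications send OTLP to the Collector
    protocols:
      grpc:
      http:
  prometheus:            # pull: the Collector scrapes Prometheus endpoints
    config:
      scrape_configs:
        - job_name: example-app          # hypothetical job and target
          scrape_interval: 30s
          static_configs:
            - targets: ["app:9464"]
```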
Data Collection
- Simplicity, Efficiency, Middle ground, Consistency
Sampling Signals
- Filtering
- Sampling
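A minimal head-based sampling sketch using the Python SDK; the 10% ratio is an illustrative value:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of new traces; respect the parent's decision for child spans.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```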
Observability
- Scalability
- Reliability
- User experience
- Ease of use
- Performance
- Security
- Cost
- Lock-in
Application Performance Monitoring
The Bottom Line
- Differentiate between monitoring and observability
- What is the difference between a “known known” and an “unknown unknown”?
- Explain the importance of metadata
- What are the differences between dimensionality, cardinality, and semantic conventions?
- Identify the differences between telemetry signals.
- Why are there at least three separate ways to collect telemetry data from applications?
- Distinguish between instrumentation and data collection.
- Given instrumentation, why is data collection necessary?
- Analyze the requirements for choosing an observability platform
- How are observability platforms different from APM?
Chapter 2. Introducing OpenTelemetry
Background
Observability Pain Points
- Inconsistent features
- vendor lock-in
- Supportability and security concerns
- Duplication among vendors
- Lack of standardization, including context and correlation across signals and semantic conventions (semconvs)
The Rise of Open Source Software
Specification
- Data specification: Used to define implementation guidance, data models, semconvs, and protocols
- API Specification: Used to define the instrumentation interface standard for applications
- SDK Specification: Used to define the standard for processing and exporting signals provided by the API specification
- Versioning and Stability: Used to define the versioning scheme and support guarantees
Data Collection
- The OTel Collector is a significant component and the primary data collection mechanism in the OTel project.
- The OTel Collector supports:
- A robust and extensible architecture to receive, process, and export traces, metrics, and logs
- A variety of form factors, including agent and edge processing as well as push and pull collection mechanisms
- A variety of integrations, including Prometheus, Fluent Bit, Apache Arrow, and eBPF, to name a few
Instrumentation
- Supports:
- A robust and extensive architecture to generate, process, and export signals
- A single instrumentation solution for more than ten languages that supports traces, metrics, and logs.
- Manual and automatic instrumentation
OpenTelemetry Concepts
Distributions
Pipelines
- Instrumentation: Providers, Generators, Processors, Exporters
- Collector: Receivers or Connectors, Processors, Exporters or Connectors
Resources
- Cloud Provider: In development
- Compute Unit: In development
- Compute Instance: In development
- Environment: In development
- Service: Mixed (service.name and service.version stable)
- Telemetry: Mixed (telemetry.sdk.* stable)
Registry
- Automatic instrumentation for specific frameworks and languages
- Instrumentation and Collector components, including generators, receivers, processors, exporters, and extensions. Vendor components are listed here, as well as components for other open source projects.
- OTel components for instrumentation languages not hosted by OTel, including Crystal, Dart, Haskell, Kotlin, OCaml, Perl, and Scala.
Roadmap
- Signal order on the roadmap: (1) traces, (2) metrics, (3) logs
- No plans for creating a major version of OpenTelemetry (API) past v1.0
The Bottom Line
- Recognize observability problems and the need for open standards.:
- What is an open standard, and why does it matter?
- Explain the history and goals of the OpenTelemetry project:
- What does the OTel project provide, and what does it intentionally not provide?
- Identify the OpenTelemetry components and project status:
- Is OTel generally available (GA) and production-ready?
Chapter 3. Getting Started with the Astronomy Shop
Chapter 4. Understanding the OpenTelemetry Specification
API Specification
API Definition
API Context
API Signals
- Traces support TracerProvider, Tracer, and Span.
- Metrics support MeterProvider, Meter, and Instrument.
- Logs support LoggerProvider and Logger.
- A Signal Provider is a stateful object that holds all the Signal Recorders.
- An application should use a single Signal Provider and may have one or more Signal Recorders.
- A Signal Provider must provide a way to get its respective Signal Recorder.
- A Signal Provider must be the only way to create a Signal Recorder.
- The interfaces supported by signals are vast and signal-specific.
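A minimal Python sketch of the provider/recorder relationship described above; the `payment` scope and instrument names are illustrative, and without an SDK configured these API calls are no-ops:

```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("payment")       # Tracer obtained from the global TracerProvider
meter = metrics.get_meter("payment")       # Meter obtained from the global MeterProvider

requests_counter = meter.create_counter("payment.requests")   # Instrument

with tracer.start_as_current_span("charge") as span:          # Span
    span.set_attribute("payment.amount", 42)
    requests_counter.add(1)
```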
API Implementation
SDK Specification
- Signal-specific processors, exporters, and samplers
- Configuration
- Resources
SDK Definition
- Categories:
- Constructors are used by application owners and include configuration objects, environment variables, and SDK builders.
- Plug-in interfaces are used by plug-in authors and include processors, exporters, and samplers.
- OTel SDK spec covers
- Configuration: OTel requires that SDK configuration be possible programmatically and via a file, while environment variables (envvars) are optional.
- Resources: OTel requires that the SDK provide access to a resource with at least semantic attributes with a default value. A resource is an “immutable representation of the entity producing telemetry.” Beyond the options to create and merge resources, resource detectors may be implemented to add metadata automatically.
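A sketch of creating and merging resources with the Python SDK; the attribute values are examples only:

```python
from opentelemetry.sdk.resources import Resource

base = Resource.create({"service.name": "checkout", "service.version": "1.4.2"})
extra = Resource.create({"deployment.environment": "staging"})

merged = base.merge(extra)    # the argument's attributes win on conflict
print(merged.attributes)
```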
SDK Signals
- Signal Provider: Shutdown, Force Flush
- Traces: OTLP, standard out, Jaeger, Zipkin
- Metrics: OTLP, standard out, in-memory, and Prometheus
- Logs: OTLP and standard out
SDK Implementation
Data Specification
- Data models: https://opentelemetry.io/docs/specs/otel/metrics/data-model
- Protocols: https://opentelemetry.io/docs/specs/otel/protocol/
- Semantic conventions: https://opentelemetry.io/docs/specs/otel/semantic-conventions
Data Models
- OTel instruments have multiple properties
- Additive, nonadditive, monotonic, or grouped
- Synchronous or asynchronous
- Temporality is not applicable to the Last Value aggregation. Temporality is configured when creating instrumentation. Beyond temporality, metrics can also be reaggregated in the following ways:
- Spatial: Used to reduce the number of attributes on a metric
- Transformative: Used to change the temporality of metrics that are sum aggregated (cumulative to delta or vice versa)
- The metric instruments available in OTel, along with their associated properties, type, and default aggregation
Instrument | Properties | Type | Default Aggregation |
---|---|---|---|
Counter | Monotonic | Synchronous | Sum |
UpDownCounter | Additive | Synchronous | Sum |
ObservableCounter | Monotonic | Asynchronous | Sum |
ObservableUpDownCounter | Additive | Asynchronous | Sum |
Gauge | Nonadditive | Synchronous | Last Value |
ObservableGauge | Nonadditive | Asynchronous | Last Value |
Histogram | Grouped | Synchronous | Histogram |
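A sketch of creating a few of these instruments with the Python SDK and exporting to the console; instrument names and recorded values are illustrative:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("example")

requests = meter.create_counter("http.requests")             # monotonic, sum-aggregated
queue_depth = meter.create_up_down_counter("queue.depth")    # additive, can decrease
latency = meter.create_histogram("http.duration")            # grouped into buckets

requests.add(1, {"http.route": "/checkout"})
queue_depth.add(-3)
latency.record(0.25)
```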
- Field names available to log records and what they mean
Field Name | Description | Notes |
---|---|---|
Timestamp | When the event occurred | Common syslog concepts |
ObservedTimestamp | When the event was observed | |
SeverityText | Log level | |
SeverityNumber | Numeric value of log level | |
Body | The message of the log record | |
Resource | Source information | OTel concept; metadata |
Attributes | Additional information | |
InstrumentationScope | Scope that emitted the log record | |
TraceID | Request trace ID | Used to enable trace correlation |
SpanId | Request span ID | |
TraceFlags | W3C trace flags |
Data Protocols
- OTLP
Data Semantic Conventions
Data Compatibility
- Trace: W3C, B3, and Jaeger context propagation
- Metrics: Prometheus and OpenMetrics
- Logs: Trace context in non-OTLP formats
General Specification
The Bottom Line
- Distinguish between OpenTelemetry versioning and stability, including support guarantees.
- What are the long-term support guarantees for OTel?
- Understand the OpenTelemetry data model, including protocol support and OTLP
- How is OTLP leveraged in OTel, and what value does it provide?
- Differentiate between the OpenTelemetry API and SDK
- Who or what typically implements the OTel API and SDK?
Chapter 5. Managing the OpenTelemetry Collector
- All data received by a receiver is converted to OpenTelemetry Protocol (OTLP) protobuf structs.
- By default, the Collector is stateless and keeps all data in memory.
- Collector pros:
- Open source and vendor-agnostic: The Collector can transform data from any available receiver format to any available exporter format.
- Extensible: The Collector supports observability data formats, including Jaeger, Prometheus, and OpenSearch, and can be extended to support future formats.
- One agent for all signals: The Collector is among the first to support traces, metrics, and logs in both agent and gateway modes.
- Processing capabilities: The Collector offers a rich set of processing capabilities, which can be leveraged by any data that the Collector can receive.
- Multiple destinations: One use case that most agents do not handle well is the ability to export the same data to two different platforms in parallel. The Collector fully supports this capability.
- Fully OTel compliant: Given the Collector exists in the OTel project, it fully supports all OTel concepts, including signals, resources, and schemas.
Deployment Modes
Agent Mode
- Stand-alone instance
- Unavailable:
- During an upgrade
- During a restart
- Improperly sized
- Improperly configured
Gateway Mode
- Scenarios:
- You want to collect data from specific APIs, like the k8s API, or scrape a Prometheus server configured for federation
- You want to leverage tail-based sampling.
- You want to receive data from applications that do not support sending to an agent.
- Your network configuration prevents agents from accessing the Internet and you leverage a cloud-based observability platform.
- Your security team requires that API tokens be managed centrally and/or separated from agents.
Flow | Pros | Cons |
---|---|---|
Instrumentation to observability platform | - Quickest time to value; simplicity.- Lowest latency. | - Less data processing flexibility and requires language-specific components, such as resource detection and configuration.- Operational complexity as each language and possibly each application needs to be independently configured.- Added resource requirements to handle processing, and buffer and retry logic.- Decentralized security controls. |
Instrumentation to agent to observability platform | - Quick time to value, especially given that instrumentation sends data to a local OTLP destination by default.- Separates telemetry generation from transmission, reducing application load.- Enhanced data processing capabilities and dynamic configuration without redeploying applications. | - Agent is a single point of failure and must be sized and monitored properly. |
Instrumentation to gateway to observability platform | - A gateway cluster separates telemetry generation from transmission without a single point of failure.- Supports advanced data processing capabilities, including metric aggregation and tail-based sampling.- Useful in certain environments, such as serverless, where an agent deployment may not be possible. | - Cannot offload all application processing capabilities, including resource detection.- Requires thought when configuring pull-based receivers to ensure proper load balancing and no data duplication.- May introduce unacceptable latency, impacting applications. |
Instrumentation to agent to gateway to observability platform | - The pros of agent and gateway mode. Supports the most use cases and requirements while providing the most data flexibility and portability. | - Complex configuration and highest management costs. |
Sizing
- Running complex regular expressions via configured processors against a large volume of data may result in an excessive amount of CPU being consumed. Optimizing the configuration to be more efficient or load balancing the data across a larger pool of smaller Collectors could help offset this issue.
- Significant spikes in traffic, custom buffer and retry configurations, and the tail-based sampling processor may result in excessive amounts of memory being consumed, leading to the Collector restarting. Testing failure scenarios to understand Collector behavior and validating Collector configuration is vital.
- Excessive logging or configuring the storage extension may consume all disk space. Monitoring and alerting against disk space can help mitigate this issue.
Components
- Receivers
- Processors
- Exporters
- Connectors
- Extensions
Configuration
GOGC: default 100
GOMEMLIMIT: no default
Ballast extension: deprecated; removal is recommended (superseded by GOMEMLIMIT)
Important:
- The components available depend on the Collector distribution being run.
- Every component has a GitHub README that details status, supported signals, defaults, and configuration options.
- Within service::pipelines, one or more receivers and exporters and zero or more processors or connectors must be defined per signal.
- Not everything defined outside of the service section needs to be used in the service::pipelines section.
- The components specified in a `service::pipelines::<signal>` must support the signal type.
- More than one configuration can be passed to the Collector, in which case the configurations are merged.
- You can use something defined outside of the service section in multiple service::pipelines.
- The order in which `service::pipelines::<signal>::processors` are defined determines the order in which processors are executed.
- The same component, and the same `service::pipelines::<signal>`, can be defined multiple times by adding a forward slash followed by one or more characters, like `<component>[/<name>]`.
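A minimal Collector configuration sketch illustrating these rules, including a named component (`otlphttp/primary`); the exporter endpoint is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:

exporters:
  debug:
  otlphttp/primary:
    endpoint: https://otlp.example.com    # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # executed in this order
      exporters: [debug, otlphttp/primary]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/primary]
```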
otelbin: a tool for validating and visualizing Collector configurations
Processors:
- Memory limiter: highly recommended
Category | Examples |
---|---|
Metadata processing | - k8sattributesprocessor- resourceprocessor |
Filtering, routing, and sampling | - filterprocessor- routingprocessor (deprecated in favor of the routing connector)- tailsamplingprocessor |
Enriching | - k8sattributesprocessor- resourcedetectionprocessor |
Generating (primarily metrics) | - metricsgenerationprocessor- spanmetricsprocessor |
Grouping (helpful in batching and processing) | - groupbyattrprocessor- groupbytraceprocessor (valid for tail-based sampling) |
Transforming (primarily metrics) | - cumulativetodeltaprocessor- deltatorateprocessor- schemaprocessor |
- Order
- Memory limiter
- Any filtering or sampling
- Any processor relying on sending-source context (e.g., k8sattributes)
- batch
- Any other processors, including those performing CRUD operations on metadata
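A configuration fragment sketching that order; the filter condition and component choices are illustrative, and the referenced receiver and exporter would need to be defined elsewhere in the file:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  filter/drop-health:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
  k8sattributes:
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Processors execute in the order listed below.
      processors: [memory_limiter, filter/drop-health, k8sattributes, batch]
      exporters: [otlphttp]
```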
Extensions
Category | Examples |
---|---|
Authentication - Used by receivers and exporters | - basicauthextension- bearertokenauthextension- oidcauthextension |
Health and Troubleshooting | - healthcheckextension- pprofextension- remotetapextension- zpagesextension |
Observers - Used by receivers to discover and collect data dynamically | - dockerobserver- hostobserver- k8sobserver |
Persistence - Via a database or filesystem | - storage/dbstorage- storage/filestorage |
Connectors
- Exceptions: Generating metrics or logs from span exceptions
- Failover: Allows for health-based routing between trace, metric, and log pipelines, depending on the health of target downstream exporters
- Service graph: Building a map representing the interrelationships between various services in a system
Observing
- Metrics
- Health check extension
- zPages extension
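A sketch of enabling the health check and zPages extensions and raising the detail level of the Collector's own metrics; the endpoints shown are the conventional defaults, and self-telemetry configuration details vary by Collector version:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  telemetry:
    metrics:
      level: detailed    # Collector self-metrics, conventionally scraped on port 8888
```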
Relevant Metrics
- Dropped data:
- otelcol_processor_dropped_spans
- otelcol_processor_dropped_metric_points
- otelcol_processor_dropped_log_records
- Queue length:
- otelcol_exporter_queue_capacity
- otelcol_exporter_queue_size
- Enqueue failed:
- otelcol_exporter_enqueue_failed_spans
- otelcol_exporter_enqueue_failed_metric_points
- otelcol_exporter_enqueue_failed_log_records
- Receiver refused:
- otelcol_receiver_refused_spans
- otelcol_receiver_refused_metric_points
- otelcol_receiver_refused_log_records
- Exporter send failed:
- otelcol_exporter_send_failed_spans
- otelcol_exporter_send_failed_metric_points
- otelcol_exporter_send_failed_log_records
- CPU cores against a known rate:
- otelcol_receiver_accepted_spans
- otelcol_receiver_accepted_metric_points
- otelcol_receiver_accepted_log_records
Troubleshooting
Out-of-Memory Crashes
- Reasons:
- A misconfiguration
- GOGC, GOMEMLIMIT
- An improperly sized Collector
- Using an alpha component
Data Not Being Received or Exported
- Debug exporter
- zPages extension
Performance Issues
- Memory -> Go environment variables, memory limiter
- CPU:
- An alpha component that is not yet optimized
- Memory-related: zPages extension
- pprof extension
Beyond the Basics
Distributions
- Core, Contrib, K8s
- Beyond the provided distributions, you may want to create your own.
- Remove unused or unneeded components to reduce the security surface of the Collector, including the required dependencies
- Extend the Collector with additional capabilities
- Create custom packaging beyond what OTel provides
- Securing
- Configuration
- SHOULD only enable the minimum required components. As covered earlier, everything enabled is a potential attack vector.
- SHOULD ensure sensitive configuration information is stored securely.
- Permissions
- SHOULD NOT run Collector as root/admin user. If the Collector is compromised and run as root/admin, then other systems may be at risk of being compromised.
- MAY require privileged access for some components. Care should be taken in these circumstances.
- Receivers/exporters
- SHOULD use encryption and authentication.
- SHOULD limit exposure of servers to authorized users.
- MAY pose a security risk if configuration parameters are modified improperly.
- Processors
- SHOULD configure obfuscation/scrubbing of sensitive metadata. Security also applies to the telemetry data being processed.
- SHOULD configure recommended processors. The recommended processors can help mitigate security concerns such as a distributed denial-of-service (DDoS) attack.
- Extensions
- SHOULD NOT expose sensitive health or telemetry data. Any information made available could be used to compromise the system.
- Management
- OpAMP:
- Remote configuration
- Status reporting
- Collector telemetry reporting
- Management of downloadable Collector-specific packages
- Secure auto-updating capabilities
- Connection credentials management
The Bottom Line
- Distinguish between agent and gateway mode.
- What is the difference between agent and gateway mode?
- Identify Collector components.
- When getting started, what are the most essential components to configure?
- Configure and run the Collector.
- How are Collector components configured?
- Size, secure, observe, and troubleshoot the Collector
- Which components can be used to observe and troubleshoot the Collector?
Chapter 6. Leveraging OpenTelemetry Instrumentation
Get started:
- Download the required dependencies.
- Update the configuration
- Automatic instrumentation: set env variables or runtime parameters
- Manual instrumentation: add code interfaces
- Update the runtime parameters and start the application.
Caution:
- Setting the `service.name` is critical and highly recommended; otherwise, it will be hard to understand which service was impacted when analyzing data.
- Updating the exporter settings may be necessary.
- Changing at least the OTLP endpoint address will be necessary for most containerized environments.
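A minimal Python sketch of setting both in code; the service name and Collector address are placeholders, and the same settings are typically available via the OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT environment variables:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")  # placeholder Collector address
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```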
Auto Instrumentation
Manual Instrumentation
Programmatic Instrumentation
Mixing Automatic and Manual Trace Instrumentation
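A sketch of the mixed approach in Python: when the application is launched under auto instrumentation (for example, via the `opentelemetry-instrument` wrapper), the manual span below nests inside the automatically created spans. Function, span, and attribute names are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def apply_discount(order):
    # Manual child span; auto instrumentation supplies the surrounding request span.
    with tracer.start_as_current_span("apply-discount") as span:
        span.set_attribute("order.id", order["id"])
        # ... business logic ...
        return order
```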
Distributions
The Bottom Line
- Instrument an application in various ways.
- What is the difference between the automatic, manual, programmatic, and mixed methods of instrumentation?
- Add production-ready instrumentation.
- After the basics of generating telemetry data that is exported to the console, what are some additional capabilities you should add in preparation for production?
- Enrich instrumentation with metadata.
- What are some ways you can enrich telemetry data with metadata?
Chapter 7. Adopting OpenTelemetry
The Basics
- Data portability and sovereignty, with a goal of gaining deeper insights into application availability and performance.
- Reduced complexity, with a goal of ensuring compliance with industry standards.
- Improved observability, with a goal of reducing mean time to detection (MTTD) and mean time to recovery (MTTR).
Why OTel and Why Now?
de facto standard
support various programming languages and frameworks.
reducing vendor lock-in
Consistency:
- Semantic conventions
- Processors and telemetry pipelines
- Context and correlation, which help end users reduce MTTR by enabling problem isolation and root cause analysis
gRPC vs HTTP:
- Instrumentation leveraged
- Protocols leveraged: For example, if gRPC is not used anywhere in your environment, you may not be comfortable using it.
- The amount of data expected: For environments generating a large amount of telemetry data, gRPC might perform better, but to date, there are no accurate benchmarks in OTel to confirm this. Most instrumentation libraries now default to HTTP/protobuf, but the question remains: what is the best option for the Collector?
- The number of dependencies: gRPC usually has many more dependencies than HTTP. The net result is that gRPC and its dependencies may need to be upgraded more often to address vulnerabilities, and the package size is often larger.
- The HTTP version used: gRPC uses HTTP 2.0, whereas HTTP defaults to 2.0 and can fall back to 1.1 if needed. It can also be explicitly configured to use only 1.1, though this is not recommended.
Instrumentation
- Extensive performance tests to understand the impacts on the application, including resource utilization and startup time.
- Configuration validation to ensure that items such as context propagation and tagging are correctly set. Also, note that changes in the data received may result in the need to create or update charts and alerts to ensure that they work correctly.
- Metadata enrichment to help provide observability and ensure parity with any previous instrumentation. Note that enrichment may also be possible from the Collector instead of, or in addition to, the instrumentation.
- Keep in mind when configuring OTel instrumentation:
- The exporter endpoint may need to be updated, especially for containerized environments or when the Collector is not deployed in agent mode.
- The service name should be set to distinguish applications from one another. This is necessary to understand the behavior, health, and performance of different services.
- You should determine how you want additional resource information to be added. Options include instrumentation, Collector, or both, and implementation depends on requirements. Leveraging Collector instances running in agent mode is the recommended default option.
- You should consider abstracting configuration and instrumentation where it makes sense.
Production Readiness
- Performance, Reliability, Security
Maturity Framework
- Level 1 (Initial Implementation):
- Deployment of the Collector and basic configuration to start collecting and forwarding data to an observability platform
- Installation of OTel SDKs in select applications
- Level 2 (Basic Instrumentation)
- Identifying and instrumenting critical services and application components using OTel SDKs
- Utilizing OTel APIs to generate custom telemetry data specific to your application’s needs
- Ensuring that essential signals are being collected throughout an environment
- Level 3 (Advanced Instrumentation)
- Full-stack instrumentation, including external dependencies and third-party services
- Signals enriched with contextual metadata such as service names, environment tags, and user identifiers (IDs)
- Enhanced use of OTel APIs to capture fine-grained telemetry data for detailed analysis
- Level 4 (Integrated Observability)
- Complete integration with existing monitoring and observability tools
- Centralized dashboards that aggregate and visualize telemetry data from multiple sources for unified visibility
- Advanced querying and visualization techniques to gain deeper insights into system performance and health
- Level 5 (Proactive Observability)
- Predictive analytics and machine learning (ML) for anomaly detection and potential issue forecasting
- Incident response with automated remediation workflows to address incidents and performance bottlenecks in real time
- Continuous improvement through feedback loops and performance tuning
Brownfield Deployment
Data Collection
- In addition, several considerations must be taken into account, including:
- Resource usage and constraints
- Port conflicts
- Network configuration
- Effects of migration
Instrumentation
- When adopting OTel in a brownfield deployment, you should be aware of the following:
- Replacing manual instrumentation is not required for any instrumentation formats supported by the Collector but may be considered to reduce the number of technologies used and maintained in an environment. Other reasons to consider switching to OTel instrumentation include it being the most adopted open standard and that it can handle multiple signals with the same SDK. At the very least, proprietary instrumentation should be removed over time. One way to approach this is to ensure every new service introduced into the environment leverages the new instrumentation standard.
- While automatic instrumentation is most commonly used to generate trace data, in OTel, it supports more, including OTel SDK configuration and possibly other signals. Even if automatic instrumentation is not used in the environment, it should be considered in brownfield deployments.
- The same context propagation mechanism must be used when replacing trace instrumentation, regardless of whether it is via automatic or manual instrumentation.
Dashboards and Alerts
- Send the new name and make it a platform problem
- Send the original name and wait for the migration to complete before cutting over
Greenfield Deployment
Data Collection
- Setting the GOGC and GOMEMLIMIT environment variables for the Collector is highly recommended.
Instrumentation
- With instrumentation, it is common for people to want to start by instrumenting metrics. The reasons people often start with metrics are severalfold, including:
- Developers are often more comfortable troubleshooting with metrics.
- Service owners care more about their services than those they are dependent on or that depend on them.
- Tracing requires context propagation, meaning services must be instrumented end-to-end to provide value. Instrumenting all services in a transaction can take time, depending on the number of services, the types of languages and frameworks used, and the teams involved.
Other Considerations
Administration and Maintenance
- Collector - SRE
- Automatic instrumentation - SRE
- Manual instrumentation - Developer
Environments
- Development, staging, and production
- Kubernetes
- Functions-as-a-service (FaaS)
- Internet of Things (IoT)
- Air-gapped
Semantic Conventions
The Future
- Profiling to provide code-level telemetry data helpful in investigating performance issues in an application.
- Client instrumentation, which is required to support Real User Monitoring (RUM)
- eBPF telemetry for observability.
The Bottom Line
- Prepare to adopt OpenTelemetry:
- Preparation will ensure that your OTel adoption is strategic, well-supported, and aligned with organizational goals. What are some reasons why you might want to adopt OTel?
- Approach adopting OpenTelemetry in brownfield deployments.:
- What are the recommended steps to adopt OTel in a brownfield deployment?
- Approach adopting OpenTelemetry in greenfield deployments.:
- What are the recommended steps to adopt OTel in a greenfield deployment?
Chapter 8 The Power of Context and Correlation
Background
- Context is “the interrelated conditions in which something exists or occurs.” For the tracing signal, a span contains a trace ID and associated metadata. This additional information on the span would be an example of context.
- Correlation is “a phenomenon that accompanies another phenomenon, is usually parallel to it, and is related in some way to it.” A single HTTP request, which would be represented by a metric, being associated with a transaction or trace ID would be an example of correlation.
Context
- Database semantic conventions are in development and include the ability to define context, including:
- `db.name`: A name for the database being accessed. Required and similar to `service.name` used for applications.
- `db.operation`: What operation was executed, such as `SELECT`. Required unless `db.statement` is specified.
- `db.statement`: The entire command executed. Note that this may contain sensitive information. In addition, unless normalized, indexing this value would result in high cardinality.
- `db.system`: The type of database called, such as `CASSANDRA`, `REDIS`, or `SQL`; required.
- `db.user`: The user who executed the command.
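A sketch of attaching these attributes to a manually created client span via the Python API; the database, span name, and values are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("orders-db-client")

with tracer.start_as_current_span("SELECT orders", kind=trace.SpanKind.CLIENT) as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.name", "orders")
    span.set_attribute("db.operation", "SELECT")
    span.set_attribute("db.user", "report_reader")
    # db.statement may contain sensitive data; normalize or omit it if needed.
    span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")
```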
OTel Context
- OTel context is essentially key-value pairs, or metadata, used to propagate information across the lifecycle of a request.
- Baggage is a specific type of OTel context that is designed to be shared within and propagated across service boundaries.
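A minimal baggage sketch with the Python API; the key and value are illustrative. Note that baggage is propagated across services but is not automatically copied onto spans, so it must be explicitly added to signals:

```python
from opentelemetry import baggage, context

ctx = baggage.set_baggage("tenant.id", "acme")   # returns a new Context
token = context.attach(ctx)                      # make it the current context
try:
    print(baggage.get_baggage("tenant.id"))      # "acme", available downstream as well
finally:
    context.detach(token)
```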
Trace Context
- Golden signals or RED metrics: At a minimum, requests, errors, and duration information can be extracted from traces.
- Metrics: Beyond golden signals or RED metrics, additional metrics may also be available that are library specific or developer added.
- Logs: Typically in the form of span events, but log correlation can also bring traces and logs together.
- Metadata: While additional signal information may be valuable, especially if it can be extracted from a trace, metadata is one of the most important pieces of context.
Challenges
- Metrics and logs have been around for a long time, are easier to add, and are more commonly used to monitor environments.
- Tracing is more challenging to add than metrics and logs. Adding the appropriate metrics, logs, and metadata is nontrivial, even if context can be passed.
- Most tracing platforms cannot fully leverage metrics and logs attached to spans. Problems here include dimensionality and cardinality of the data.
- Traces are often heavily sampled, meaning metrics and logs must be appropriately handled at sample time to provide proper observability.
- Additional context and correlation, including business, user, and application logic, is needed and may only be available from other signals.
Resource Context
- End users
- Applications or services
- Orchestration or platform
- Infrastructure
Logic Context
- Business logic: organization ID and environment
- User logic: customer ID and geolocation
- Application logic: version number and feature flag status
Correlation
- In addition to context, it is critical to be able to correlate telemetry data to achieve observability
Time Correlation
- One fundamental way to correlate information is by viewing different data sources together and visually correlating on the dimension of time.
- Limitations of time-based correlation:
- Missing data: It is difficult to determine whether all data is being collected and visualized.
- Noise: Variable amounts of error and latency are possible even in healthy environments. Quickly distinguishing between noise and signal can decrease MTTR.
- Assumptions: Multiple investigations are often needed to test hypotheses and learn more about the system’s behavior.
Context Correlation
- A significant improvement to time correlation is context correlation. Context correlation is when metadata is used to perform problem isolation.
Trace Correlation
Metric Correlation
The Bottom Line
- Differentiate between context and correlation:
- Understanding the distinction between context and correlation is fundamental to observing systems and leveraging OTel effectively. What is the difference between context and correlation?
- Identify the types of context and the value each provides.:
- In OTel, context is available through concepts including attributes, resources, and baggage. What is the difference between these concepts?
- Explain the value proposition of correlation.:
- Correlation transforms disparate telemetry data into a coherent narrative, empowering teams to make informed decisions and maintain robust, high-performing systems. What are some examples of correlation?
Chapter 9 Choosing an Observability Platform
Primary Considerations
- Functional requirements, or what you need the system to support, including signal support, anomaly detection, data retention, real-time analysis, historical data analysis, and machine learning (ML) and artificial intelligence (AI) capabilities.
- Non-functional requirements, or how you need a system to perform, including cost, scalability, performance, reliability, security, compliance, usability, documentation, and support.
- Requirements:
- Data residency
- Heterogeneous versus unified observability
- Security and compliance, including for regulated industries and government agencies
Requirements | Examples | Priority |
---|---|---|
Data collection for specific environments | K8s, OpenShift, Cloud Foundry | P0 |
Platform integration to specific applications | ServiceNow, Slack, PagerDuty | P0 |
Platform compliance | SOC 2 Type II, PCI, FedRAMP | P0 |
Important platform capabilities | SLIs/SLOs, Session replay (RUM) | P1 |
Nice-to-have platform capabilities | Auto-discovery, built-in content | P2 |
Platform Capabilities
- OTel support
- Integrations
- Ease of setup, migration, and use
- Troubleshooting ease
- Scalability and performance
- Support and community
- Security and compliance
- Cost and licensing model
- Platform differentiating features
Marketing Versus Reality
- Startup companies claiming to support massive scale.
- Enterprise companies claiming to support new generation trends or solutions
- Any companies offering significant cost savings for similar capabilities.
Scenario | Goals |
---|---|
Monitor the performance of a distributed application under heavy load | Instrument and collect data from applications, and identify performance and latency issues. |
Respond to and diagnose an incident affecting application availability. | Receive alerts, troubleshoot with real-time dashboards, and determine the root cause. |
Plan and execute scaling operations based on observed workload patterns. | Monitor resource utilization and set automatic-scaling policies. |
Validate the performance and stability of a new application release. | Correlate deployment events with other signals, monitor KPIs, and use anomaly detection. |
Monitor and maintain service levels defined by SLOs for critical services. | Define SLIs, alert against SLOs, and analyze historical data. |
Price, Cost, and Value
Integrations
Where the processing occurs.
Network egress
Price Models:
- Ingestion or usage
- Host or service
- User or seat
- Feature or query
Observability Fragmentation
Primary Factors
Build, Buy, or Manage
Approach | Pros | Cons |
---|---|---|
Build | Flexibility and choice | Distraction from the business value proposition |
Buy | Reduced OpEx | Vendor lock-in and increased CapEx over time |
Manage | Infrastructure CapEx only | Increased CapEx and OpEx over time |
Combination | Flexibility and choice | Multiple systems to manage; inconsistency |
Licensing, Operations, and Deployment
Decision | Options | Notes |
---|---|---|
Licensing | Open source or proprietary | Capabilities are usually more important than the licensing model. |
Operations | Self or vendor-managed | Open source vendor-managed may not be the same as open source self-managed. |
Deployment | On-premises or SaaS | Most vendor-managed observability platforms are SaaS-based today. |
- Some OTel decision factors that may influence the observability platform selected
Factor | Notes |
---|---|
OTLP ingestion | In addition to yes/no, is it the default setting, and if not, why? |
Distribution | If one is offered, does it contain proprietary components? |
API and SDK | What versions and features are supported? |
Instrumentation | Which languages are supported? |
Collector | Which components are supported? |
Semantic conventions | Does the platform support semantic conventions? |
General support | If vendor-managed, is OTel support provided? |
Contributions | What commitment, influence, and ability to support is provided? |
Stakeholders and Company Culture
Stakeholder | Example | Influenced by |
---|---|---|
Buyer | CTO or VP of Engineering | Legal, Finance, and Security |
Administrator | SRE Team | Development Team |
User | Development Team | SRE Team |
Implementation Basics
- Day 0: Establishing Observability: The initial phase of setting up observability practices and tools within a software system.:
- Choosing and setting up observability tools such as instrumentation frameworks, data collectors, and observability platforms.
- Instrumenting the application code to emit relevant signals (usually one at first)
- Configuring dashboards, alerts, and other monitoring settings based on initial requirements and expectations
- Providing training to the team on how to use observability tools effectively
- Day 1: Initial Monitoring and Analysis: Focuses on the initial implementation and use of observability data to monitor and analyze one or more environments.:
- Monitoring key signals to ensure that the system is performing as expected
- Using observability data to diagnose issues and understand system behavior during normal operations
- Establishing baseline metrics and performance indicators to measure against future changes
- Gathering feedback from initial observations to refine monitoring configurations and improve understanding of the system’s behavior
- Day 2: Continuous Improvement and Optimization: Involves the ongoing enhancement and optimization of observability practices to better understand and manage the software system.:
- Implementing automation for monitoring setup, alerting, and response
- Performing deeper analysis using aggregated metrics, anomaly detection, and correlation across different observability data sources
- Scaling observability solutions to handle larger volumes of data and more complex systems
- Integrating observability into the development lifecycle (DevOps practices) to ensure continuous feedback and improvement
Administration
Usage
- Training and onboarding for team members.
- Establishing and maintaining best practices and governance processes
- Monitoring the observability platform.
- Minimizing disruption to operations during a transition to a new observability platform.
- Measuring and monitoring adoption and migration progress.
- Continuously identifying areas for optimization and improvement to provide and improve observability.
Maturity Framework
- Level 1: Basic Monitoring:
- Basic system metrics collection (CPU, memory, disk)
- Basic monitoring capabilities, including initial dashboards, with manual telemetry review
- Simple alerting for critical system metrics and failures
- Level 2: Enhanced Monitoring
- Application-level metrics and log collection, including the initial usage of a methodology, such as requests, errors, and duration (RED) or golden signals.
- Automated alerting and notifications based on predefined thresholds
- Basic aggregation, filtering and search capabilities
- Level 3: Proactive Observability:
- Comprehensive metrics collection (infrastructure and applications)
- Centralized logging with advanced search and analysis
- Distributed tracing for understanding request flows
- Dashboards and visualization for real-time insights and correlation across signals
- Level 4: Predictive Observability:
- ML Models for anomaly detection
- Predictive analytics for identifying potential failures
- Automated remediation and response using tools like runbook automation
- Use of AI/ML to dynamically adjust monitoring and alerting thresholds
- Fully integrating observability practices with continuous integration and continuous delivery (CI/CD) pipelines
The Bottom Line
- Distinguish between observability platform capabilities.
- What are some of the primary differences among observability platforms?
- Decide which observability platform is right for you.
- What are some of the key considerations and decision factors that need to be decided to choose the right observability platform?
- Get a quick return on your observability platform investment:
- How can you get a quick return on your observability platform investment?
Chapter 10 Observability Antipatterns and Pitfalls
Telemetry Data Missteps
- Data antipatterns:
- Incomplete instrumentation and blind spots
- Over-instrumentation, or Big Bang instrumentation
- Ignoring sampling or sampling bias
- Inconsistent naming conventions
- Data pitfalls:
- High-cardinality data
- Lack of data validation
- Misconfigured aggregation
- Failure to evolve
Mixing Instrumentation Libraries Scenario
- Standardization
- Migration plan
- Compatibility layers
Automatic Instrumentation Scenario
- Manual instrumentation
- Code reviews and testing
- Continuous improvement
Custom Instrumentation Scenario
- Custom instrumentation frameworks
- Community contributions
- Vendor collaboration
Component Configuration Scenario
- Configuration management
- Validation and testing
- Monitoring and alerting
Performance Overhead Scenario
- Performance profiling
- Fine-tuning parameters
- Scalability testing
Resource Allocation Scenario
- Capacity planning
- Resource scaling
- Monitoring and alerting
Security Considerations Scenario
- Redaction
- Data encryption
- Access controls
- Audit logging
Monitoring and Maintenance Scenario
- Dashboard and alerts
- Health checks
- Regular maintenance
Observability Platform Missteps
- Antipatterns:
- Vendor lock-in
- Non-OTel-native
- Poor integration support
- Underestimating scalability requirements
- Tool sprawl
- Alert storms
- Static dashboards
- Ignoring latency
- Ignoring context
- Observability platform pitfalls:
- Complex deployment
- Data silos
- Inadequate security measures
- Insufficient customization options
- High total cost of ownership
- Alert fatigue
- Failure to monitor key business metrics
- How to solve?:
- Understanding present and predicting future observability requirements
- Defining and maintaining observability processes and best practices
- Following the “keep it simple, stupid” (KISS) principle
Vendor Lock-in Scenario
- Standardize data formats
- Prioritize automation
- Evaluate alternatives
Fragmented Tooling Scenario
- Centralize data sources
- Standardize tooling
- Invest in integration
Tool Fatigue Scenario
- Streamline alerting
- Customize dashboards
- Invest in integration
Inadequate Scalability Scenario
- Self-managed solutions
- Software-as-a-service
- Performance testing
Data Overload Scenario
- Sampling strategies
- Filtering techniques
- Data retention policies
Company Culture Implications
- Antipatterns:
- Silos and lack of collaboration
- Lack of ownership and accountability
- Short-term thinking
- Pitfalls:
- Underestimating the importance of observability culture
- Misalignment of incentives
- Complexity of distributed systems
- Lack of training and education
- Lack of continuous improvement
Lack of Leadership Support Scenario
- Educate leadership
- Quantify impact
- Engage stakeholders
Resistance to Change Scenario
- Communicate benefits
- Provide training and support
- Address concerns
Collaboration and Alignment Scenario
- Establish cross-functional teams
- Define shared goals and objectives
- Align incentives and recognition
Goals and Success Criteria Scenario
- Recommendations to mitigate:
- Define SMART goals: Establish Specific, Measurable, Achievable, Relevant, and Time-bound goals for observability initiatives, outlining what success looks like and how it will be measured.
- Identify KPIs
- Track and monitor progress
Standardization and Consistency Scenario
- Recommendations to mitigate:
- Standardization guidelines
- Tool rationalization
- Centralized platforms
Incentives and Recognition Scenario
- Recommendations to mitigate:
- Recognition programs
- Performance metrics
- Team incentives
Feedback and Improvement Scenario
- Recommendations to mitigate:
- Post-incident reviews
- Continuous improvement culture
- Feedback loop closure
Prioritization Framework
- Standardization
- Reliability
- Performance
- Security
- Troubleshooting Efficiency
- Developer Productivity
Scenario | Unweighted Score | Effort to fix | Priority |
---|---|---|---|
Automatic Instrumentation | 19 | Low | P0 |
Component Configuration | 18 | Low | P0 |
Resource Allocation | 18 | Low | P0 |
Performance Overhead | 18 | Medium | P0 |
Monitoring and Maintenance | 18 | Medium | P0 |
Inadequate Scalability | 18 | High | P1 |
Tool Fatigue | 17 | Medium | P1 |
Fragmented Tooling | 17 | High | P1 |
Lack of Leadership Support | 17 | High | P1 |
Resistance to Change | 17 | High | P1 |
Custom Instrumentation | 16 | High | P1 |
Vendor Lock-in | 15 | High | P1 |
Mixing Instrumentation Libraries | 14 | High | P1 |
Standardization and Consistency | 13 | Medium | P2 |
Incentives and Recognition | 12 | Low | P2 |
Goals and Success Criteria | 11 | Low | P2 |
Feedback and improvement | 11 | Medium | P2 |
Security Considerations | 10 | Medium | P2 |
Collaboration and Alignment | 10 | Medium | P2 |
Data Overload | 8 | Medium | P2 |
The Bottom Line
- Distinguish between observability antipatterns and pitfalls.
- Understanding the difference between observability antipatterns and pitfalls is crucial for building effective observability practices. What is the difference between an antipattern and a pitfall?
- Recognize and overcome common observability antipatterns and pitfalls.:
- What is an observability antipattern that the OTel project is helping to mitigate?
- Describe the impacts of company culture on observability goals.:
- How can you use company culture to help achieve your observability goals?
Chapter 11 Observability at Scale
Understanding the Challenges
- The sheer volume and velocity of telemetry data generated by distributed applications and infrastructure components. As systems grow in size and complexity, the amount of data produced can overburden traditional observability tools and platforms, making it challenging to extract meaningful insights in real time and cost-effectively.
- The complexity of distributed systems introduces challenges related to understanding and tracing the flow of data and requests across various components and services. This distributed system complexity can hinder visibility and make it challenging to identify and diagnose issues effectively.
- Infrastructure and resource constraints, such as limited storage capacity and processing power, pose significant challenges to scaling observability, as organizations must balance the need for comprehensive data collection with resource limitations and cost constraints.
Volume and Velocity of Telemetry Data
Processing and storage costs
Data management
Performance impact
Real-time processing
Scalability
Observing and alerting
Distributed System Complexity
Considerations:
- Heterogeneity of components
- Service interdependencies
- Latency, reliability, and error handling
- Data consistency and synchronization
- Scalability and performance
- Security and compliance
Diverse data sources
Inconsistent instrumentation
Dependency mapping
Cascading failures
Latency monitoring
Network reliability
Buffer and retry logic
Eventual consistency
Synchronization mechanisms
Horizontal scaling
Performance bottlenecks
Access control
Data Privacy
Summary:
- Changing how data is sent or ingested
- Upgrading, especially to new major versions
- Using a different technology
Observability Platform Complexity
Infrastructure and Resource Constraints
Hardware and resource limitations
Cost considerations
Elasticity and scalability
Efficient data processing pipelines
Data retention policies
Compression and serialization
Throttles and limits
Strategies for Scaling Observability
Elasticity, Elasticity, Elasticity!
- HPA
- For stateful services, auto-scaling can be more challenging.
- OTel Collector:
- All spans of a trace must be processed by the same Collector instance:
- Load balancing policies
- Advanced configuration: keepalive, timeout
- All spans must be processed in the same collector:
Leverage Cloud Native Technologies
- Containerization
- Orchestration