Observability Engineering

created : Mon, 20 Feb 2023 22:40:32 +0900
modified : Fri, 05 May 2023 22:42:41 +0900

Part I. The Path to Observability

Chapter 1. What Is Observability?

The Mathematical Definition of Observability

Applying Observability to Software Systems

Why Observability Matters Now

Is This Really the Best Way?

Why Are Metrics and Monitoring Not Enough?

Debugging with Metrics Versus Observability

The Role of Cardinality

The Role of Dimensionality
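
A concrete way to see cardinality and dimensionality together is a single wide event. In the sketch below (field names are illustrative, not taken from the book), every key is a dimension, and fields such as user_id or request_id are high-cardinality because they can hold millions of distinct values:

package main

import (
  "encoding/json"
  "fmt"
)

func main() {
  // One wide event: many dimensions, several of them high-cardinality.
  event := map[string]any{
    "timestamp":   "2023-02-20T22:40:32+09:00",
    "service":     "shopping-cart",
    "endpoint":    "/cart/checkout",
    "status_code": 500,
    "duration_ms": 238,
    "user_id":     "u_82451937",   // high cardinality: unique per user
    "request_id":  "req_7c2d9e41", // high cardinality: unique per request
    "build_id":    "2023.05.05-42",
    "region":      "ap-northeast-1",
  }
  out, _ := json.Marshal(event)
  fmt.Println(string(out))
}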

Debugging with Observability

Observability Is for Modern Systems

Chapter 2. How Debugging Practices Differ Between Observability and Monitoring

How Monitoring Data Is Used for Debugging

Troubleshooting Behaviors When Using Dashboards

The Limitations of Troubleshooting by Intuition

Traditional Monitoring Is Fundamentally Reactive

How Observability Enables Better Debugging

Conclusion

Chapter 3. Lessons from Scaling Without Observability

The Evolution Toward Modern Practices

Chapter 4. How Observability Relates to DevOps, SRE, and Cloud Native

Cloud Native, DevOps, and SRE in a Nutshell

Observability: Debugging Then Versus Now

Observability Empowers DevOps and SRE Practices

Conclusion

Part II. Fundamentals of Observability

Chapter 5. Structured Events Are the Building Blocks of Observability

Debugging with Structured Events

The Limitations of Metrics as a Building Block

The Limitations of Traditional Logs as a Building Block

Unstructured Logs

Structured Logs
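
To make the contrast concrete, here is a small sketch (the log line and field names are illustrative, not from the book). The unstructured line

  2023-02-20 22:40:32 ERROR payment failed for user 82451937 after 3 retries

has to be picked apart with regexes, while the structured version below emits the same facts as fields you can filter and group on:

package main

import (
  "encoding/json"
  "os"
  "time"
)

// The same information as the unstructured line, as one structured event.
type PaymentFailure struct {
  Timestamp time.Time `json:"timestamp"`
  Level     string    `json:"level"`
  Message   string    `json:"message"`
  UserID    string    `json:"user_id"`
  Retries   int       `json:"retries"`
}

func main() {
  evt := PaymentFailure{
    Timestamp: time.Now(),
    Level:     "error",
    Message:   "payment failed",
    UserID:    "82451937",
    Retries:   3,
  }
  json.NewEncoder(os.Stdout).Encode(evt)
}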

Properties of Events That Are Useful in Debugging

Conclusion

Chapter 6. Stitching Events into Traces

Distributed Tracing and Why It Matters Now

The Components of Tracing
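
The chapter reduces tracing to a handful of required fields on every span. A minimal sketch of those components (field names are mine, not any particular library's API):

package tracing

import "time"

// Span holds the core components of tracing: an ID shared by the whole
// trace, an ID for this span, a pointer to its parent, when the work
// started, how long it took, and arbitrary extra context.
type Span struct {
  TraceID    string         // shared by every span in the request
  SpanID     string         // unique to this unit of work
  ParentID   string         // empty for the root span
  Name       string         // what this span did, e.g. "authenticate-user"
  Timestamp  time.Time      // when the work started
  Duration   time.Duration  // how long it took
  Attributes map[string]any // custom fields: hostname, user ID, and so on
}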

Instrumenting a Trace the Hard Way

func rootHandler(r *http.Request, w http.ResponseWriter) {
  traceData := make(map[string]any)
  traceData["trace_id"] = uuid.String()
  traceData["span_id"] = uuid.String()

  startTime := time.Now()
  traceData["timestamp"] = startTime.Unix()

  authorized := callAuthService(r)
  name := callNameService(r)

  if authorized {
    w.Write([]byte(fmt.Sprintf(`{"message": "Waddup %s"}`, name)))
  } else {
    w.Write([]byte(`{"message": "Not cool dawg"}`))
  }

  traceData["duration_ms"] = time.Now().Sub(startTime)
  sendSpan(traceData)
}

Adding Custom Fields into Trace Spans

func rootHandler(r *http.Request, w http.ResponseWriter) {
  traceData := make(map[string]any)
  tags := make(map[string]any)
  traceData["tags"] = tags

  hostname, _ := os.Hostname()
  tags["hostname"] = hostname
  //
}

Stitching Events into Traces
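
Continuing the hand-rolled example above, a sketch of the stitching step (the header names and the uuid helper are assumptions, not the book's code): the caller forwards its trace ID and span ID in HTTP headers, and the callee records them as trace_id and parent_id on its own span so the two events can be joined into one trace.

package tracing

import (
  "context"
  "net/http"

  "github.com/google/uuid"
)

// callDownstream propagates the trace context to a child service.
func callDownstream(ctx context.Context, traceID, parentSpanID, url string) (*http.Response, error) {
  req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  if err != nil {
    return nil, err
  }
  req.Header.Set("X-Trace-ID", traceID)
  req.Header.Set("X-Parent-Span-ID", parentSpanID)
  return http.DefaultClient.Do(req)
}

// childSpanData starts the child's span from the propagated IDs.
func childSpanData(r *http.Request) map[string]any {
  return map[string]any{
    "trace_id":  r.Header.Get("X-Trace-ID"),
    "parent_id": r.Header.Get("X-Parent-Span-ID"),
    "span_id":   uuid.NewString(), // fresh ID for this span
  }
}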

Conclusion

Chapter 7. Instrumentation with OpenTelemetry

A Brief Introduction to Instrumentation

Open Instrumentation Standards

Instrumentation Using Code-Based Examples

Start with Automatic Instrumentation
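
In Go, "automatic" instrumentation mostly means wrapping frameworks with OpenTelemetry's contrib libraries rather than true zero-code agents. A minimal sketch using the otelhttp wrapper (it assumes a tracer provider is already registered, as in the exporter example later in this chapter):

package main

import (
  "io"
  "net/http"

  "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
  // Wrapping the handler starts a span per request and records the method,
  // route, and status code without changing the handler itself.
  hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    io.WriteString(w, "hello\n")
  })
  http.Handle("/hello", otelhttp.NewHandler(hello, "hello-handler"))
  http.ListenAndServe(":8080", nil)
}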

Add Custom Instrumentation

import "go.opentelemetry.io/otel"

var tr = otel.Tracer("module_name")

func funcName(ctx context.Context) {
  sp := tr.Start(ctx, "span_name")
  defer sp.End()

  // do work here
}
import "go.opentelemetry.io/otel/attribute"

sp.SetAttributes(attribute.Int("http.code", resp.ResponseCode))
sp.SetAttributes(attribute.String("app.user", username))

Send Instrumentation Data to a Backend System

import (
  x "github.com/my/backend/exporter"
  y "github.com/my/other/exporter" // hypothetical second exporter package
  "go.opentelemetry.io/otel"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
  exporterX := x.NewExporter(...)
  exporterY := y.NewExporter(...)
  tp := sdktrace.NewTracerProvider(
    sdktrace.WithSampler(sdktrace.AlwaysSample()),
    sdktrace.WithSyncer(exporterX),
    sdktrace.WithBatcher(exporterY),
  )
  otel.SetTracerProvider(tp)
}

Conclusion

Chapter 8. Analyzing Events to Achieve Observability

Debugging from Known Conditions

Debugging from First Principles

Using the Core Analysis Loop
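
The brute-force step of the core analysis loop can be sketched directly: for every attribute in the events, compare how often each value appears in the anomalous set versus the baseline set, and surface the values that are overrepresented. The Event type and scoring below are illustrative, not the book's implementation:

package main

import (
  "fmt"
  "sort"
)

type Event map[string]string

type diff struct {
  key, value string
  score      float64 // anomalous share minus baseline share
}

// rankAttributes scores every attribute value by how much more often it
// appears in the anomalous events than in the baseline events.
func rankAttributes(anomalous, baseline []Event) []diff {
  share := func(events []Event) map[[2]string]float64 {
    counts := map[[2]string]float64{}
    for _, e := range events {
      for k, v := range e {
        counts[[2]string{k, v}]++
      }
    }
    for kv := range counts {
      counts[kv] /= float64(len(events))
    }
    return counts
  }
  a, b := share(anomalous), share(baseline)

  var out []diff
  for kv, s := range a {
    out = append(out, diff{kv[0], kv[1], s - b[kv]})
  }
  sort.Slice(out, func(i, j int) bool { return out[i].score > out[j].score })
  return out
}

func main() {
  anomalous := []Event{
    {"region": "us-east-1", "build_id": "4921"},
    {"region": "us-east-1", "build_id": "4921"},
  }
  baseline := []Event{
    {"region": "eu-west-1", "build_id": "4920"},
    {"region": "us-east-1", "build_id": "4920"},
  }
  for _, d := range rankAttributes(anomalous, baseline) {
    fmt.Printf("%s=%s stands out by %.2f\n", d.key, d.value, d.score)
  }
}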

Automating the Brute-Force Portion of the Core Analysis Loop

The Misleading Promise of AIOps

Conclusion

Chapter 9. How Observability and Monitoring Come Together

Where Monitoring Fits

Where Observability Fits

System Versus Software Considerations

| Factor | Your systems | Your software |
| --- | --- | --- |
| Rate of change | Package updates (monthly) | Repo commits (daily) |
| Predictability | High (stable) | Low (many new features) |
| Value to your business | Low (cost center) | High (revenue generator) |
| Number of users | Few (internal teams) | Many (your customers) |
| Core concern | Is the system or service healthy? | Can each request acquire the resources it needs for end-to-end execution in a timely and reliable manner? |
| Evaluation perspective | The system | Your customers |
| Evaluation criteria | Low-level kernel and hardware device drivers | Variables and API endpoints |
| Functional responsibility | Infrastructure operations | Software development |
| Method for understanding | Monitoring | Observability |

Assessing Your Organizational Needs

Exceptions: Infrastructure Monitoring That Can’t Be Ignored

Real-World Examples

Conclusion

Part III. Observability for Teams

Chapter 10. Applying Observability Practices in Your Team

Join a Community Group

Start with the Biggest Pain Points

Buy Instead of Build

Flesh Out Your Instrumentation Iteratively

Look for Opportunities to Leverage Existing Efforts

Prepare for the Hardest Last Push

Conclusion

Chapter 11. Observability-Driven Development

Test-Driven Development

Observability in the Development Cycle

Determining Where to Debug

Debugging in the Time of Microservices

How Instrumentation Drives Observability

Shifting Observability Left

Using Observability to Speed Up Software Delivery

Conclusion

Chapter 12. Using Service-Level Objectives for Reliability

Traditional Monitoring Approaches Create Dangerous Alert Fatigue

Threshold Alerting Is for Known-Unknowns Only

User Experience Is a North Star

What Is a Service-Level Objective?
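
The arithmetic behind an SLO is small enough to show directly. As a worked example (the numbers are illustrative): a 99.9% success target over a 30-day window leaves a 0.1% error budget, so with 10 million eligible requests in the window the service can fail 10,000 of them before the SLO is missed.

package main

import "fmt"

func main() {
  const sloTarget = 0.999             // 99.9% of requests must succeed
  const requestsInWindow = 10_000_000 // eligible requests in the 30-day window

  // The error budget is everything the target allows to fail.
  errorBudget := (1 - sloTarget) * requestsInWindow
  fmt.Printf("error budget: %.0f failed requests\n", errorBudget) // 10000
}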

Reliable Alerting with SLOs

Conclusion

Chapter 13. Acting on and Debugging SLO-Based Alerts

Alerting Before Your Error Budget Is Empty

Framing Time as a Sliding Window

Forecasting to Create a Predictive Burn Alert
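
A predictive burn alert can be sketched as a small forecast (names and thresholds below are illustrative, not the book's implementation): measure how much error budget the baseline window consumed, project that burn rate forward, and alert if the remaining budget would run out inside the lookahead window.

package main

import (
  "fmt"
  "time"
)

// wouldExhaust reports whether, at the burn rate observed over the baseline
// window, the remaining error budget empties within the lookahead window.
func wouldExhaust(remainingBudget, burnedInBaseline float64, baseline, lookahead time.Duration) bool {
  if burnedInBaseline <= 0 {
    return false // nothing burned recently, so no forecasted exhaustion
  }
  burnPerHour := burnedInBaseline / baseline.Hours()
  hoursUntilEmpty := remainingBudget / burnPerHour
  return hoursUntilEmpty <= lookahead.Hours()
}

func main() {
  // 60% of the budget remains and the last hour burned 3% of it: at that
  // rate the budget empties in 20 hours, inside a 24-hour lookahead.
  fmt.Println(wouldExhaust(0.60, 0.03, time.Hour, 24*time.Hour)) // true
}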

The Baseline Window

Acting on SLO Burn Alerts

Using Observability Data for SLOs Versus Time-Series Data

Conclusion

Chapter 14. Observability and the Software Supply Chain

Part IV. Observability at Scale

Chapter 15. Build Versus Buy and Return on Investment

How to Analyze the ROI of Observability

The Real Costs of Building Your Own

The Real Costs of Buying Software

Buy Versus Build Is Not a Binary Choice

Chapter 16. Efficient Data Storage

The Functional Requirements for Observability

Chapter 17. Cheap and Accurate Enough: Sampling

Sampling to Refine Your Data Collection

Using Different Approaches to Sampling

Constant-Probability Sampling

Sampling on Recent Traffic Volume

Sampling Based on Event Content (Keys)

Combining per Key and Historical Methods

Choosing Dynamic Sampling Options

When to Make a Sampling Decision for Traces

Translating Sampling Strategies into Code

func handler(resp http.ResponseWriter, req *http.Request) {
  start := time.Now()
  i, err := callAnotherService()
  resp.Write(i)
  RecordEvent(req, start, err)
}
var sampleRate = flag.Int("sampleRate", 1000, "Static sample rate")
func handler(resp http.ResponseWriter, req *http.Request) {
  start := time.Now()
  i, err := callAnotherService()
  resp.Write(i)

  r := rand.Float64()
  if r < 1.0 / *sampleRate {
    RecordEvent(req, start, err)
  }
}

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
  start := time.Now()
  i, err := callAnotherService()
  resp.Write(i)

  r := rand.Float64()
  if r < 1.0/float64(*sampleRate) {
    RecordEvent(req, *sampleRate, start, err)
  }
}

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
  // Use an upstream-generated random sampling ID if it exists;
  // otherwise we're a root span, so generate (and pass down) a random ID.
  r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
  if err != nil {
    r = rand.Float64()
  }

  start := time.Now()
  // Propagate the Sampling-ID when creating a child span.
  i, err := callAnotherService(r)
  resp.Write(i)

  if r < 1.0/float64(*sampleRate) {
    RecordEvent(req, *sampleRate, start, err)
  }
}

var targetEventsPerSec = flag.Int("targetEventsPerSec", 5,
  "The target number of requests per second to sample from this service.")

var sampleRate float64 = 1.0
var requestsInPastMinute *int

func main() {
  // Initialize counters.
  rc := 0
  requestsInPastMinute = &rc

  go func() {
    for {
      time.Sleep(time.Minute)
      newSampleRate := float64(*requestsInPastMinute) / float64(60 * *targetEventsPerSec)
      if newSampleRate < 1 {
        sampleRate = 1.0
      } else {
        sampleRate = newSampleRate
      }
      newRequestCounter := 0
      requestsInPastMinute = &newRequestCounter
    }
  }()
  http.HandleFunc("/", handler)
}

func handler(resp http.ResponseWriter, req *http.Request) {
  // Reuse an upstream sampling ID if present; otherwise generate one.
  r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
  if err != nil {
    r = rand.Float64()
  }

  start := time.Now()
  *requestsInPastMinute++
  i, err := callAnotherService(r)
  resp.Write(i)

  if r < 1.0/sampleRate {
    RecordEvent(req, sampleRate, start, err)
  }
}

var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")
var outlierSampleRate = flag.Int("outlierSampleRate", 5,
  "Sample rate for errors and slow requests") // default is illustrative

func handler(resp http.ResponseWriter, req *http.Request) {
  start := time.Now()
  r := rand.Float64()
  i, err := callAnotherService(r)
  resp.Write(i)

  if err != nil || time.Since(start) > 500*time.Millisecond {
    // Errors and slow requests are kept at a higher rate.
    if r < 1.0/float64(*outlierSampleRate) {
      RecordEvent(req, *outlierSampleRate, start, err)
    }
  } else {
    if r < 1.0/float64(*sampleRate) {
      RecordEvent(req, *sampleRate, start, err)
    }
  }
}

Chapter 18. Telemetry Management with Pipelines

Attributes of Telemetry Pipelines

Managing a Telemetry Pipeline: Anatomy

Challenges When Managing a Telemetry Pipeline

Use Case: Telemetry Management at Slack

Open Source Alternatives

Part V. Spreading Observability Culture

Chapter 19. The Business Case for Observability

The Reactive Approach to Introducing Change

The Return on Investment of Observability

The Proactive Approach to Introducing Change

Introducing Observability as a Practice

Using the Appropriate Tools

Instrumentation

Data Storage and Analytics

Rolling Out Tools to Your Teams

Knowing When You Have Enough Observability

Conclusion

Chapter 20. Observability’s Stakeholders and Allies

Recognizing Nonengineering Observability Needs

Creating Observability Allies in Practice

Using Observability Versus Business Intelligence Tools

Chapter 21. An Observability Maturity Model

A Note About Maturity Models

Why Observability Needs a Maturity Model

About the Observability Maturity Model

Capabilities Referenced in the OMM

Respond to System Failure with Resilience

Deliver High-Quality Code

Manage Complexity and Technical Debt

Release on a Predictable Cadence

Understand User Behavior

Using the OMM for Your Organization

Conclusion

Chapter 22. Where to Go from Here

Observability, Then Versus Now

Additional Resources

Predictions for Where Observability Is Going