minuk.dev 블로그 : Make Cloud Native Chaos Engineering Easier Deep Dive into Chaos Mesh

Make Cloud Native Chaos Engineering Easier Deep Dive into Chaos Mesh

created : 2022-09-03T17:42:20+00:00
modified : 2022-09-03T18:17:07+00:00

chaos-mesh chaos-engineering kubecon k8s kubernetes

원본

Testing a distributed system is difficult

Distributed systems are more and more complex nowadays:
- Faults can happen anytime, anywhere, in any ways
Writing tests and debugging is hard:
- Deterministic test is very hard and impossible to cover all faults
But, No crash, No data loss, No wrong results

Chaos Engieering to the rescue

Chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.
Chaos engineering is NOT about breaking things randomly without a perpose.
Program Cycle:
- Improve -> Steady State -> Hypothesis -> Run Experiment -> Verify -> Improve…

Why Chaos Mesh

On Kubernetes:
- More application clusters (40+)
- More nodes on each cluster
- More target objects may fail, e.g. Container / Pod / Network / Disk / System Clock / Kernel / etc.
We need more Chaos experiments. However, managing and scheduling many chaos experiments is a hug pain!
In Docker:
- The environment is different from the physical nodes
- Tools like tc / iptables / fuse / bcc can’t be used directly
- Containers on the same node cannot affect each other
Chaos scope must be customizable and manageable for containers.

What is Chaos Mesh

A Cloud-Native Chaos Engineering platform on Kubernetes environments
Started out as the internal platform to test TiDB
Provides fault injection methods into the container, Pod, network, system I/O, kernel, etc.
Chaos Mesh’s Mission:
- Make Chaos Engineering easy

Deep into Chaos Mesh

Architecture

Chaos Dashboard:
- Manage and monitor chaos experiment
Chaos Controller Manager:
- Schedule and controle component
- Workflow engine
Chaos Daemon:
- Executive component on kubernetes node
Chaosd:
- Executive component on non-kubernetes node

CustomResourceDeifnitions

PodChaos, NetworkChaos, …
Examples

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"

apiVersion: chaos-mesh.org/v1alph1
kind: Schedule
metadata:
  name: schedule-pod-kill-example
spec:
  schedule: "@every 5m"
  type: "PodChaos"
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: "pod-kill"
    mode: one
    selector:
      labelSelectors:
        "app.kubernetes.io/component": "tikv"

Workflow Engine

Three parts of the a workflow:
- Workflow Name
- Entry, the entry of the whole workflow
- Template array
Five different types of templates:
- Serial
- Parallel
- Chaos
- Suspend
- Task
Serial, Parallel, Task allow other nodes to be referenced as child nodes

Selectors

Namespace, Label, Expression, Annotation, Field, PodPhase, Node Selector
Node, Pod list

Authorization

Authorization mechanism based on Kubernetes RBAC permission policies

개인 의견

강연 자체가 깔끔했다. 어떻게 동작하는지 아키텍쳐 그림만 봐도 이해가 가능할정도로 설명하고 있다.
기능 자체가 kubernetes 틱 하다. 그러면서도 web ui를 잘 제공한다고 생각한다. 개인적으로는 argo cd 같은 느낌이 들었다. dashboard 가 좋아서 시각화가 잘되고, 시각화가 잘되니까 머리속으로 어떻게 해야하는 구나가 잘 떠오르는 것 같다.
demo가 엄청 좋다. 공식 홈페이지에서 직접 해볼수 있도록 제공하는데, 어떻게 설치하는 지, 어떻게 설정하는지를 잘 보여준다.
아직 documentation 을 읽어보지는 않았지만, 이 정도만 해도 어떻게 사용하는지 감을 잡고 자잘한것 정도만 찾아보는 수준일것 같다.
생각한게 거의 다 있어서 좋았다. 강연 흐름이 들으면 궁금해할만한 순서로 잘 구성되어 있다.
예전에 Litmus를 찍먹 해본적이 있었는데 튜토리얼 경험은 이게 더 좋은것 같다. 물론 litmus chaos 가 지금은 더 널리쓰여서 써야한다면 litmus를 쓰겠지만 말이다. 개인적으로 기대가 되는 프로젝트다 2022 2월에서야 인큐베이팅 됬으니까 아직은 발전을 더 지켜봐야할듯 하다.