用 Kyverno 让 Argo Workflow 单步执行

AWS 的 SSM Automation 中，有个有趣的特性就是单步执行，在编写自动化脚本的时候，这个功能对调试非常有帮助。Argo Workflow 也有个暂停特性，官网给出的例子是这样的：

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pause-after-
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: argoproj/argosay:v2
        env:
          - name: ARGO_DEBUG_PAUSE_AFTER
            value: 'true'

把他提交到 Argo 会看到暂停的情况：

$ argo submit --watch debug.yml
Name:                pause-after-hpvg9                                                                                                                                          [0/1455]
Namespace:           default
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          True
Created:             Thu Jul 18 23:18:46 +0800 (18 seconds ago)
Started:             Thu Jul 18 23:18:46 +0800 (18 seconds ago)
Duration:            18 seconds
Progress:            0/1

STEP                  TEMPLATE  PODNAME            DURATION  MESSAGE
 ● pause-after-hpvg9  whalesay  pause-after-hpvg9  18s

你会发现，这个 Workflow 会一直冻结在这个状态，

$ argo list
NAME                STATUS      AGE   DURATION   PRIORITY   MESSAGE
pause-after-hpvg9   Running     11m   11m        0
...

这时候只要进入 Pod，执行一个命令，工作流就会完成：

$ kubectl exec -it pause-after-hpvg9 -- bash
root@pause-after-hpvg9:/# touch /proc/1/root/var/run/argo/ctr/main/after
root@pause-after-hpvg9:/# command terminated with exit code 137

可以看到 Argo 的 Watch 也发生了变化：

STEP                  TEMPLATE  PODNAME            DURATION  MESSAGE
 ✔ pause-after-hpvg9  whalesay  pause-after-hpvg9  21m

问题来了，正常的工作流不会只有一个步骤，要实现单步执行的效果，就需要给每个步骤加入环境变量，是不是有点麻烦？我想到一个办法——用 Kyverno 做个自动补丁。只要 Workflow 加上一个 debug 标签，就给所有步骤加入暂停标志。

废话不多说，上策略代码：

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-argo-debug-env
spec:
  rules:
    - name: add-debug-env-var
      match:
        resources:
          kinds:
            - argoproj.io/v1alpha1/Workflow
          selector:
            matchLabels:
              debug: "true"
          operations:
          - CREATE
      mutate:
        foreach:
          - list: request.object.spec.templates[]
            patchesJson6902: |-
              - path: /spec/templates/{{elementIndex}}/container/env/-
                op: add
                value:
                  name: ARGO_DEBUG_PAUSE_AFTER
                  value: "true"

这段策略有几个要点：

selector 指定，只处理带有 Debug 标签，并且操作为 CREATE 的
使用 foreach 语法，处理工作流中出现的每一个步骤
用 patchesJson6902 方式，给每个步骤的容器加入 ARGO_DEBUG_PAUSE_AFTER 环境变量。

提交策略之后，用如下任务脚本测试一下：

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: debug314159-
  labels:
    debug: "true"
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: argoproj/argosay:v2
    - name: whalesayagain
      container:
        image: argoproj/argosay:v2

提交工作流：

$ argo submit debug.yml
Name:                debug314159-dvqmw
Namespace:           default
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Pending
Created:             Fri Jul 19 00:11:15 +0800 (now)
Progress:

查看生成的工作流：

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
...
  labels:
    debug: "true"
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/phase: Running
  name: debug314159-dvqmw
  namespace: default
...
spec:
...
  - container:
      env:
      - name: ARGO_DEBUG_PAUSE_AFTER
        value: "true"
      image: argoproj/argosay:v2
...
  - container:
      env:
      - name: ARGO_DEBUG_PAUSE_AFTER
        value: "true"
      image: argoproj/argosay:v2
      name: ""
...

可以看到，Kyverno 给每个步骤都加入了环境变量，这样一来，就实现了单步执行的效果。

后记

这个办法还有个问题，就是恢复太麻烦了，我打算接下来用 Shell Operator 来解决。

不明白为什么 Argo Workflow 没有给这种步骤设置一个暂停状态。

用 Kyverno 让 Argo Workflow 单步执行

后记

Comments

More from this blog

FDE 是传统运维产品厂商的出路吗？

绵里藏针才是 AIOps 的本质？

龙虾恐慌：AIOps 又要改名了？

再见 2025

辅助编程？dora 说：我知道你很急可是请你别急

Command Palette

后记

Comments

More from this blog