I'm building a workflow that downloads a CSV from S3, performs an operation in a Docker container (transforming it into another file), and then uploads the transformed file back to an S3 bucket.
AWS EKS Kubernetes version: 1.29
Helm chart:
chart: argo-workflows
targetRevision: 0.40.14 # Version of Argo Workflows
The problem: the template appears to download the file, but the container/script can't find it. I also noticed something odd: the init container logs show the file being downloaded to a location that is not defined anywhere in my input artifacts. Why? Additionally, I can't find that directory referenced anywhere (Helm chart values, ConfigMaps, or the codebase). Where does it come from?
Here's what I've tried.
Note: all three debug options are commented/uncommented in the workflow template provided below.
Running ls -l /data && cd /data && ls -l && cat data.csv lists the CSV file, but I can't cd into the directory. Output from the main container:
│ /usr/bin/sh: 1: cd: can't cd to /data
│ time="2024-03-26T19:41:59.188Z" level=info msg="sub-process exited" argo=true error="<nil>"
│ Error: exit status 2
I observed that the "dir" size is 228 bytes, which is accurate because the file in S3 is also 228 bytes. That led me to believe the file had been downloaded into the directory. But then why can't I cd into it?
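For reference, the symptom above can be reproduced locally without any S3 involvement (hypothetical temp paths, not the workflow's actual ones): if the path is a regular file rather than a directory, ls reports the file's size, but cd refuses it.

```shell
# Hypothetical local reproduction: a regular file at the expected path
# shows the CSV's size under ls, but cd fails on it.
tmp=$(mktemp -d)
printf 'col1,col2\n1,2\n' > "$tmp/data"   # stand-in for the downloaded artifact
ls -ld "$tmp/data"                        # the size shown is the file's size
stat -c '%F' "$tmp/data"                  # prints: regular file
cd "$tmp/data" 2>/dev/null || echo "cd failed: not a directory"
```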
Output from the main container:
│ hello
│ Traceback (most recent call last):
│ File "/argo/staging/script", line 5, in <module>
│ print(os.listdir(path='/data'))
│ ^^^^^^^^^^^^^^^^^^^^^^^^
│ NotADirectoryError: [Errno 20] Not a directory: '/data'
│ time="2024-03-26T19:44:08.838Z" level=info msg="sub-process exited" argo=true error="<nil>"
│ Error: exit status 1
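The same NotADirectoryError can be reproduced in plain Python when the path points at a regular file (hypothetical temp paths, standing in for /data):

```python
import os
import tempfile

# Hypothetical reproduction: os.listdir on a regular file raises the same
# NotADirectoryError (errno 20) seen in the traceback above.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data")          # stand-in for /data
    with open(path, "w") as f:
        f.write("col1,col2\n1,2\n")
    print(os.path.isdir(path))              # False: it's a plain file
    print(os.path.isfile(path))             # True
    try:
        os.listdir(path)
    except NotADirectoryError as e:
        print("errno:", e.errno)            # 20 on Linux
```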
With the PVC mounted at the same path (DEBUG 3), the workflow fails validation with:
templates.download.inputs.artifacts[0].path '/data' already mounted in container.volumeMounts.workdir
So I tried splitting it into two templates that reference a PVC: the first downloads the S3 file and the second performs the operation. Same result as above. Init container logs:
│ time="2024-03-26T19:32:33.499Z" level=info msg="Starting Workflow Executor" version=v3.5.5 │
│ time="2024-03-26T19:32:33.508Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5 │
│ time="2024-03-26T19:32:33.508Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=kvs-csv-to-delta-download-2650783132 templateName=download version="&Vers │
│ ion{Version:v3.5.5,BuildDate:2024-02-29T20:59:20Z,GitCommit:c80b2e91ebd7e7f604e8442f45ec630380ffa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}" │
│ time="2024-03-26T19:32:33.628Z" level=info msg="Start loading input artifacts..." │
│ time="2024-03-26T19:32:33.628Z" level=info msg="Downloading artifact: storage" │
│ time="2024-03-26T19:32:33.628Z" level=info msg="S3 Load path: /argo/inputs/artifacts/storage.tmp, key: data.csv" │
│ time="2024-03-26T19:32:33.650Z" level=info msg="Creating minio client using AWS SDK credentials" │
│ time="2024-03-26T19:32:33.655Z" level=info msg="Getting file from s3" bucket=<REMOVED> endpoint=s3.amazonaws.com key=data.csv path=/argo/inputs/artifacts/storage.tmp │
│ time="2024-03-26T19:32:33.743Z" level=info msg="Load artifact" artifactName=storage duration=115.271126ms error="<nil>" key=data.csv │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Detecting if /argo/inputs/artifacts/storage.tmp is a tarball" │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Successfully download file: /argo/inputs/artifacts/storage" │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Alloc=10853 TotalAlloc=16524 Sys=23141 NumGC=4 Goroutines=7" │
│ Stream closed EOF for argo/kvs-csv-to-delta-download-2650783132 (init) │
Workflow template:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: kvs-csv-to-delta
spec:
  entrypoint: diamond
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: s3-pv-claim
  templates:
    - name: download
      inputs:
        artifacts:
          - name: storage
            path: /data
            mode: 0777
            s3:
              endpoint: s3.amazonaws.com
              bucket: <BUCKET_NAME>
              key: data.csv
              region: us-east-1
              useSDKCreds: true
      # DEBUG 1 ========================
      container:
        image: debian:latest
        command: [sh, -c]
        args: ["ls -l /data && cd /data && ls -l && cat data.csv"]
      # DEBUG 2 ========================
      # script:
      #   image: python:alpine
      #   imagePullPolicy: IfNotPresent
      #   command: [ python ]
      #   source: |
      #     import os
      #     import time
      #     print("hello")
      #     print(os.listdir(path='/data'))
      #     print("\n listing files for /data: \n")
      # DEBUG 3 ========================
      # container:
      #   image: <ECR_IMAGE>
      #   command: ["/tools/data_cli/data_cli"]
      #   args: ["format_data", "--input_file=/data/data.csv", "--input_format=CSV", "--output_file=/data/DELTA_0000000000000001", "--output_format=DELTA"]
      #   volumeMounts:
      #     - name: workdir
      #       mountPath: /data
    - name: diamond
      dag:
        tasks:
          - name: A
            template: download
I feel like this is a trivial and very common use case. Am I missing something simple in my workflow? Is it a configuration issue? Any help would be greatly appreciated. Thank you.
I figured it out. The CSV was being downloaded from S3 and everything was working. The value '/data' defined for the path attribute is in fact a reference to the data.csv file itself, not to a directory containing it. So if I pass the argument cat /data to the container, it prints the contents of the downloaded CSV file. (The /argo/inputs/artifacts/... location in the init container logs is the workflow executor's internal staging area for loading artifacts before they're placed at the requested path, which is why it isn't referenced in the Helm values or anywhere else.)
artifacts:
  - name: storage
    path: /data
    mode: 0777
    s3:
      endpoint: s3.amazonaws.com
      bucket: <BUCKET_NAME>
      key: data.csv
      region: us-east-1
      useSDKCreds: true
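For anyone wanting /data to be an actual directory, a minimal sketch of the adjustment (same bucket and key as above): point the artifact path at a file inside the directory instead of at the directory itself.

```yaml
inputs:
  artifacts:
    - name: storage
      path: /data/data.csv   # a file path; /data is now created as a directory
      mode: 0777
      s3:
        endpoint: s3.amazonaws.com
        bucket: <BUCKET_NAME>
        key: data.csv
        region: us-east-1
        useSDKCreds: true
```

With this, cd /data && cat data.csv behaves as originally expected, and DEBUG 3's --input_file=/data/data.csv argument resolves correctly.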