Azure Synapse pipeline - converting xlsx to csv


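I'm really new to Azure. I have started setting up a pipeline, but I'm stuck.

I have xlsx files stored in an Azure storage account, each containing one or more worksheets.

I want to set up a pipeline that walks through all folders and subfolders under a given path and converts the xlsx files to csv files.

Two cases are possible. In the first case, if the file "AAA" contains three tabs, "Sheet1", "Sheet2" and "Sheet3", the pipeline must generate 3 csv files, which will be stored in a specific subfolder:

  • 'AAA_Sheet1'
  • 'AAA_Sheet2'
  • 'AAA_Sheet3'

In the second case, the file "AAA" contains only one tab, "Sheet1", so a single csv file is generated: 'AAA_Sheet1'.

I would appreciate any help solving this. Thank you.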

azure storage pipeline azure-synapse account
1 answer

Sheet names cannot be retrieved directly in ADF. You can run the Python code below in a Synapse notebook to get the sheet names from the Excel file:

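import json

from azure.storage.blob import BlobServiceClient
from openpyxl import load_workbook

# Storage account details (placeholders to fill in).
account_name = '<storageAccountName>'
account_key = '<accesskey>'
container_name = '<containerName>'
blob_name = '<directory>/AAA.xlsx'

# Connect to the storage account and point at the workbook blob.
blob_service_client = BlobServiceClient(account_url=f"https://{account_name}.blob.core.windows.net", credential=account_key)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Download the workbook to a local temporary file.
local_temp_file = 'temp_excel_file.xlsx'
with open(local_temp_file, 'wb') as file:
    data = blob_client.download_blob()
    data.readinto(file)

# Only the sheet names are needed, so open the workbook in read-only mode.
workbook = load_workbook(filename=local_temp_file, read_only=True)
sheet_names = workbook.sheetnames
workbook.close()

# Return the sheet names to the pipeline as a JSON string,
# so that @json(...) can parse the exit value into an array.
mssparkutils.notebook.exit(json.dumps(sheet_names))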
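
The notebook code reads a single workbook, while the question also asks to cover every xlsx file under a path, including subfolders. A minimal sketch of that selection step follows. It assumes the blob names come from `ContainerClient.list_blobs(name_starts_with=...)` in azure-storage-blob; the helper name `select_xlsx_blobs` and the sample paths are hypothetical and not part of the original answer.

```python
from typing import Iterable, List


def select_xlsx_blobs(blob_names: Iterable[str], prefix: str = "") -> List[str]:
    """Return the .xlsx blob names under `prefix`, at any folder depth."""
    selected = []
    for name in blob_names:
        # Blob names are '/'-separated paths, so a prefix match
        # already covers every subfolder below the prefix.
        if name.startswith(prefix) and name.lower().endswith(".xlsx"):
            selected.append(name)
    return sorted(selected)


# Hypothetical listing; in the notebook this could instead come from
#   container = blob_service_client.get_container_client(container_name)
#   names = (b.name for b in container.list_blobs(name_starts_with=prefix))
sample = [
    "input/AAA.xlsx",
    "input/2024/BBB.XLSX",
    "input/notes.txt",
    "other/CCC.xlsx",
]
print(select_xlsx_blobs(sample, prefix="input/"))
```

Each selected name could then be processed the same way as the single `blob_name` in the notebook.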

The notebook returns the sheet names of the workbook as its exit value.

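Run the notebook with a Notebook activity. Add a ForEach activity after the Notebook activity and use the following expression for its Items setting: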

@json(activity('Notebook1').output.status.Output.result.exitValue)
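
The notebook's exit value reaches the pipeline as a string, and json() turns it back into an array that the ForEach can iterate. The round trip can be sketched in plain Python. This only illustrates the idea and is not what the pipeline runs internally; the sheet names are sample values.

```python
import json

sheet_names = ["Sheet1", "Sheet2", "Sheet3"]  # sample values

# A JSON string such as a notebook could pass to mssparkutils.notebook.exit(...).
exit_value = json.dumps(sheet_names)

# Conceptually what json(...) does with that string in the pipeline.
items = json.loads(exit_value)

for item in items:
    # Each element plays the role of @item() inside the ForEach.
    print(item)
```

Serializing with `json.dumps` keeps the string strictly valid JSON. Python's default `str()` of a list uses single quotes, which a strict parser such as `json.loads` rejects.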

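Inside the ForEach, add a Copy activity. Use an Excel dataset as the source, with a sheetName parameter, and a DelimitedText dataset as the sink, with a fileName parameter. Set both parameters to the @item() expression. Then debug the pipeline; the run completes successfully.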

Each worksheet is copied to a separate file in CSV format.
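
In the pipeline as configured, the sink fileName is the bare sheet name. The question asks for names of the form AAA_Sheet1, so a minimal sketch of that naming rule follows. The function name `csv_name` and the `.csv` suffix are assumptions for illustration; in the pipeline the same composition could presumably be done with a concat expression on the sink parameter.

```python
import posixpath


def csv_name(blob_name: str, sheet_name: str) -> str:
    """Build '<workbook>_<sheet>.csv' from a blob path and a sheet name."""
    # Drop the folder part and the .xlsx extension of the workbook.
    workbook = posixpath.splitext(posixpath.basename(blob_name))[0]
    return f"{workbook}_{sheet_name}.csv"


for sheet in ["Sheet1", "Sheet2", "Sheet3"]:
    print(csv_name("<directory>/AAA.xlsx", sheet))
```

A workbook with a single tab follows the same rule and yields one file name.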

Here is the pipeline JSON for your reference:

{
    "name": "Pipeline 1",
    "properties": {
        "activities": [
            {
                "name": "Notebook1",
                "type": "SynapseNotebook",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "notebook": {
                        "referenceName": "Notebook 2",
                        "type": "NotebookReference"
                    },
                    "snapshot": true,
                    "sparkPool": {
                        "referenceName": "spark",
                        "type": "BigDataPoolReference"
                    },
                    "executorSize": "Small",
                    "driverSize": "Small"
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "Notebook1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@json(activity('Notebook1').output.status.Output.result.exitValue\n)",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Copy data1",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "ExcelSource",
                                    "storeSettings": {
                                        "type": "AzureBlobFSReadSettings",
                                        "recursive": true,
                                        "enablePartitionDiscovery": false
                                    }
                                },
                                "sink": {
                                    "type": "DelimitedTextSink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSettings"
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextWriteSettings",
                                        "quoteAllText": true,
                                        "fileExtension": ".txt"
                                    }
                                },
                                "enableStaging": false,
                                "translator": {
                                    "type": "TabularTranslator",
                                    "typeConversion": true,
                                    "typeConversionSettings": {
                                        "allowDataTruncation": true,
                                        "treatBooleanAsNumber": false
                                    }
                                }
                            },
                            "inputs": [
                                {
                                    "referenceName": "Excel1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "sheetName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DelimitedText1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "fileName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "annotations": [],
        "lastPublishTime": "2024-01-12T07:14:07Z"
    },
    "type": "Microsoft.Synapse/workspaces/pipelines"
}
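
As an alternative to the Copy activity, the conversion could also be done inside the notebook itself. The sketch below is not the pipeline above. It assumes the rows of each sheet are already available as plain Python sequences (with openpyxl they could come from `worksheet.iter_rows(values_only=True)`); the function name `write_sheets_as_csv`, the sample data, and the output folder are hypothetical.

```python
import csv
import os
import tempfile
from typing import Dict, Iterable, List, Sequence


def write_sheets_as_csv(
    workbook_name: str,
    sheets: Dict[str, Iterable[Sequence[object]]],
    out_dir: str,
) -> List[str]:
    """Write one CSV per sheet, named '<workbook>_<sheet>.csv'."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for sheet_name, rows in sheets.items():
        path = os.path.join(out_dir, f"{workbook_name}_{sheet_name}.csv")
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written


# Sample data standing in for the sheets of AAA.xlsx.
sample_sheets = {
    "Sheet1": [("id", "name"), (1, "a")],
    "Sheet2": [("id", "value"), (2, 3.5)],
}
out_dir = tempfile.mkdtemp()
paths = write_sheets_as_csv("AAA", sample_sheets, out_dir)
print([os.path.basename(p) for p in paths])
```

The resulting files would still need to be uploaded back to the storage account, which this sketch leaves out.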