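I'm very new to Azure. I've started setting up a pipeline, but I'm stuck.
I have xlsx files stored in an Azure storage account, each containing one or more worksheets.
I want to set up a pipeline that walks through all folders and subfolders under a specified path and converts the xlsx files to csv files.
Two cases are possible. In the first, the file "AAA" contains three tabs, "Sheet1", "Sheet2" and "Sheet3", so the pipeline must generate 3 csv files.
In the second, the file "AAA" contains only one tab, "Sheet1", so a single csv file is generated: "AAA_Sheet1".
Thanks in advance for any help with this.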
ADF cannot get the sheet names of an Excel file directly. Instead, you can run the Python code below in a Synapse notebook to read the sheet names from the Excel file:
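from azure.storage.blob import BlobServiceClient
from openpyxl import load_workbook

# Placeholders: replace with your own storage account details.
account_name = '<storageAccountName>'
account_key = '<accesskey>'
container_name = '<containerName>'
blob_name = '<directory>/AAA.xlsx'

# Connect to the storage account and point at the workbook blob.
blob_service_client = BlobServiceClient(account_url=f"https://{account_name}.blob.core.windows.net", credential=account_key)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Download the workbook to a temporary local file.
local_temp_file = 'temp_excel_file.xlsx'
with open(local_temp_file, 'wb') as file:
    data = blob_client.download_blob()
    data.readinto(file)

# Only the sheet names are needed, so open the workbook in read-only mode.
workbook = load_workbook(filename=local_temp_file, read_only=True)
sheet_names = workbook.sheetnames
workbook.close()

# Hand the sheet names back to the pipeline as the notebook exit value.
# mssparkutils is available by default inside Synapse notebooks.
mssparkutils.notebook.exit(sheet_names)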
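The notebook above reads a single hard-coded blob. As a minimal sketch of how it might be extended to many files, and to output names such as "AAA_Sheet1", the helpers below build one item per worksheet. The function names (`is_xlsx`, `csv_name`, `build_items`) and the item keys are hypothetical and not part of any Azure SDK. In a real notebook the blob names could come from `ContainerClient.list_blobs` and the sheet names from `load_workbook`; here they are passed in as a plain dictionary so the logic can run on its own.

```python
import json
import posixpath


def is_xlsx(blob_name):
    """Return True for blob names that end in .xlsx (case-insensitive)."""
    return blob_name.lower().endswith(".xlsx")


def csv_name(blob_name, sheet_name):
    """Build '<folder>/<workbook>_<sheet>.csv' from a blob path and a sheet name."""
    folder, file_name = posixpath.split(blob_name)
    stem = posixpath.splitext(file_name)[0]
    return posixpath.join(folder, f"{stem}_{sheet_name}.csv")


def build_items(sheets_by_blob):
    """Flatten {blob_name: [sheet names]} into one dict per worksheet."""
    items = []
    for blob_name, sheet_names in sheets_by_blob.items():
        if not is_xlsx(blob_name):
            continue  # skip anything that is not an Excel workbook
        for sheet_name in sheet_names:
            items.append({
                "filePath": blob_name,
                "sheetName": sheet_name,
                "fileName": csv_name(blob_name, sheet_name),
            })
    return items


# Hypothetical input: what a recursive listing plus load_workbook might yield.
items = build_items({
    "dir/AAA.xlsx": ["Sheet1", "Sheet2", "Sheet3"],
    "dir/sub/BBB.xlsx": ["Sheet1"],
    "dir/readme.txt": [],
})
print(json.dumps(items, indent=2))
```

If the notebook exited with `json.dumps(items)`, a ForEach could read `@item().sheetName` and `@item().fileName` instead of `@item()`, and the Excel dataset would need an extra parameter for the file path. That wiring is an assumption and is not part of the pipeline in this answer.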
The notebook returns the sheet names as its exit value.
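Run the notebook with a Notebook activity. Add a ForEach activity after the Notebook activity and use the following expression for its Items setting: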
@json(activity('Notebook1').output.status.Output.result.exitValue)
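As a rough illustration of what this expression does, the Python sketch below parses an exit value into a list and loops over it, which is roughly what ForEach does with the result of `@json(...)`. The sample exit value is hypothetical and assumes the notebook returned a JSON array string.

```python
import json

# Hypothetical exit value for a workbook with three sheets.
exit_value = '["Sheet1", "Sheet2", "Sheet3"]'

# Equivalent of @json(...): turn the string into an array.
sheet_items = json.loads(exit_value)

# Equivalent of ForEach: each element is what @item() refers to in one iteration.
for item in sheet_items:
    print(f"sheetName={item}, fileName={item}")
```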
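Inside the ForEach, add a Copy activity. Use an Excel dataset as the source and a delimited-text dataset as the sink, parameterized with sheetName and fileName respectively, and pass the @item() expression to both parameters. Debug the pipeline; in my test the run succeeded.
The worksheets were copied in .csv format.
Here is the pipeline JSON for your reference: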
{
    "name": "Pipeline 1",
    "properties": {
        "activities": [
            {
                "name": "Notebook1",
                "type": "SynapseNotebook",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "notebook": {
                        "referenceName": "Notebook 2",
                        "type": "NotebookReference"
                    },
                    "snapshot": true,
                    "sparkPool": {
                        "referenceName": "spark",
                        "type": "BigDataPoolReference"
                    },
                    "executorSize": "Small",
                    "driverSize": "Small"
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "Notebook1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@json(activity('Notebook1').output.status.Output.result.exitValue\n)",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Copy data1",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "ExcelSource",
                                    "storeSettings": {
                                        "type": "AzureBlobFSReadSettings",
                                        "recursive": true,
                                        "enablePartitionDiscovery": false
                                    }
                                },
                                "sink": {
                                    "type": "DelimitedTextSink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSettings"
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextWriteSettings",
                                        "quoteAllText": true,
                                        "fileExtension": ".txt"
                                    }
                                },
                                "enableStaging": false,
                                "translator": {
                                    "type": "TabularTranslator",
                                    "typeConversion": true,
                                    "typeConversionSettings": {
                                        "allowDataTruncation": true,
                                        "treatBooleanAsNumber": false
                                    }
                                }
                            },
                            "inputs": [
                                {
                                    "referenceName": "Excel1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "sheetName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DelimitedText1",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "fileName": {
                                            "value": "@item()",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "annotations": [],
        "lastPublishTime": "2024-01-12T07:14:07Z"
    },
    "type": "Microsoft.Synapse/workspaces/pipelines"
}
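In this pipeline the Copy activity does the conversion. If the notebook already opens the workbook, a possible alternative is to write the CSV text in Python as well. The sketch below is an assumption, not what the pipeline above does: `rows_to_csv` is a hypothetical helper, and the sample rows stand in for what openpyxl's `worksheet.iter_rows(values_only=True)` would yield. The resulting text could then be uploaded with `blob_client.upload_blob(csv_text, overwrite=True)`.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise an iterable of row tuples into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # The csv module writes None as an empty field.
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical rows, shaped like worksheet.iter_rows(values_only=True) output.
sample_rows = [("id", "name"), (1, "alpha"), (2, None)]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Keeping the conversion in the Copy activity avoids loading sheet contents into the Spark driver, so this variant is probably only worth considering for small workbooks.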