Why does the Pandas library not read an updated Excel file from Blob Storage in an Azure Synapse Spark pool?

Question

Subject: pd.ExcelFile() does not read the updated Excel file from Blob Storage in a Synapse Spark pool while the Synapse notebook kernel is running.

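I am having trouble reading an updated Excel file in an Azure Synapse Spark pool (version 3.3) with Python 3.11. My setup:

Data source: an Excel file stored in Azure Blob Storage.
Update process: Microsoft Power Automate periodically updates the file with new data from SharePoint.
Reading the data: I use pandas.ExcelFile() in a Synapse notebook to read the Excel file.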

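Problem:

The first time the notebook runs, it reads the data successfully: I pass the abfss path of the Excel file (xls or xlsx) to an initial ExcelFile() call, load the workbook, and populate the DataFrame.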

xlsx = pd.ExcelFile(file_path)

wb = pd.read_excel(xlsx,sheet_name='discount')

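However, if the notebook is already running and the Excel file is updated in SharePoint (which triggers the Power Automate flow that updates Blob Storage), I get an error: HttpResponseError: The range specified is invalid for the current size of the resource.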

RequestId:8a416b5b-401e-0070-6581-6e06f0000000
Time:2024-03-04T22:16:21.5453793Z
ErrorCode:InvalidRange
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidRange</Code><Message>The range specified is invalid for the current size of the resource.

If I stop the Spark pool and restart the notebook kernel, the updated Excel file is read correctly again.

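Questions:

Why does pd.ExcelFile() fail to read the updated Excel file while the notebook is running, resulting in the HttpResponseError?
Is there a way to read the updated file without restarting the kernel, possibly by making use of the ExcelFile mechanism?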

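I have searched for solutions, but the only thing that works is to force a refresh and run the script with pd.ExcelFile() again. That is not sustainable: I need the DataFrame to pick up the updates as they arrive. Any insights or suggestions would be greatly appreciated.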

Answer 1

When I tried the following:

import pandas as pd
import adlfs
spark.conf.set("fs.azure.account.key.stgsynp.dfs.core.windows.net", "TzXfEEHLnOojJcE1lqfV4HV9rK0BDpZaHq/Yqo6eH3ZKclogX5zDGwby2EMov8xmdZezCDdGWlv5+AStNeictA==")
file_path = "abfss://[email protected]/sheet1.xlsx"
def read_excel_file(file_path, sheet_name='Sheet1'):
    fs = adlfs.AzureBlobFileSystem(account_name='stgsynp', account_key='TzXfEEHLnOojJcE1lqfV4HV9rK0BDpZaHq/Yqo6eH3ZKclogX5zDGwby2EMov8xmdZezCDdGWlv5+AStNeictA==')
    with fs.open(file_path, 'rb') as f:
        df = pd.read_excel(f, sheet_name=sheet_name)
    return df
df = read_excel_file(file_path)
print("Initial data:")
print(df)

I got this error:

HttpResponseError: The specifed resource name contains invalid characters.
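
The answer does not say what caused this error. One thing that may be worth checking is the path form: adlfs addresses blobs as "container/path/to/blob", so passing a full abfss:// URL to fs.open() might be rejected, depending on the adlfs version. The sketch below only shows how such a URL could be split into its parts. The function name split_abfss_url and the example URL are hypothetical, and whether this resolves the error above is an assumption, not something confirmed here.

```python
from urllib.parse import urlparse


def split_abfss_url(url):
    """Split abfss://<container>@<account>.dfs.core.windows.net/<path>.

    Returns (account, container, blob_path). Hypothetical helper for
    illustration only; it is not part of pandas or adlfs.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an abfs(s) URL: {url!r}")
    # netloc looks like "<container>@<account>.dfs.core.windows.net"
    container, sep, host = parsed.netloc.partition("@")
    if not sep or not container or not host:
        raise ValueError(f"expected <container>@<account host> in {url!r}")
    account = host.split(".")[0]
    blob_path = parsed.path.lstrip("/")
    return account, container, blob_path


# Example URL with placeholder names (assumed, not taken from the post)
account, container, blob_path = split_abfss_url(
    "abfss://mycontainer@myaccount.dfs.core.windows.net/dir/sheet1.xlsx"
)
fs_path = f"{container}/{blob_path}"  # the "container/path" form
print(account, fs_path)
```

The resulting fs_path could then be passed to fs.open(fs_path, 'rb') in place of the full URL.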

I then tried the following approach:

import pandas as pd
def read_excel_file(file_path, sheet_name='Sheet1'):
    df = pd.read_excel(file_path, sheet_name=sheet_name)
    return df
file_path = 'https://stgsynp.dfs.core.windows.net/folder02/sheet1.xlsx?sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2024-03-05T17:52:01Z&st=2024-03-05T09:52:01Z&spr=https&sig=uMKiqL4JXL%2FT0sZcjargX1GjDEL7eaWtzRIPyVxYn6U%3D'
df = read_excel_file(file_path)
print("Initial data:")
print(df)

Result:

updated data:
      Name  Age         City
0   Thomas   26  small heath
1     jhon   22  small heath
2      Ada   16  small heath
3  aurthur   30  small heath
4   fredie   30  small heath
5      sai   99   shamshabad

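The code above defines a function, read_excel_file, with two parameters: file_path (the path to the Excel file) and sheet_name (the name of the worksheet to read, 'Sheet1' by default). It uses the pandas read_excel function to read the specified Excel file and worksheet into a DataFrame (df).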

The pandas.read_excel method does not support accessing Excel files through wasbs or abfss scheme URLs.

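This example uses a SAS token instead.

To create a SAS token for an Azure storage account through the Azure portal:

Navigate to the Azure portal and select your Azure storage account. Go to the Settings section of the storage account. Click "Shared access signature" to generate a SAS token.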

ReadExcel=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')  
print(ReadExcel)
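
To pick up updates without restarting the kernel, one option is to fetch the whole file again on every call rather than keeping a pd.ExcelFile object from an earlier run. The following is a minimal sketch under that assumption. The helper names build_sas_url and read_excel_fresh are hypothetical, the account, file system, and token values are placeholders, and whether this avoids the InvalidRange error in a given environment has not been verified here.

```python
import io
from urllib.parse import quote
from urllib.request import urlopen


def build_sas_url(account, file_system, path, sas_token):
    """Assemble https://<account>.dfs.core.windows.net/<fs>/<path>?<sas token>."""
    token = sas_token.lstrip("?")  # accept tokens with or without a leading "?"
    return (
        f"https://{account}.dfs.core.windows.net/"
        f"{file_system}/{quote(path)}?{token}"
    )


def read_excel_fresh(url, sheet_name="Sheet1"):
    """Download the current file contents and parse them with pandas.

    The bytes are fetched in a single request on every call, so nothing
    from a previous read is reused.
    """
    import pandas as pd  # imported here so the URL helper works without pandas

    with urlopen(url) as response:
        data = response.read()
    return pd.read_excel(io.BytesIO(data), sheet_name=sheet_name)


# Placeholder values for illustration only
url = build_sas_url("myaccount", "myfilesystem", "dir/my file.xlsx", "?sv=1&sig=abc")
print(url)
```

Calling read_excel_fresh(url, sheet_name='discount') each time the data is needed would then return a DataFrame built from whatever the file contains at that moment.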

Reference: Azure Synapse Workspace - How to read an Excel file from Data Lake Gen2 using Pandas or PySpark?
