Writing two pandas DataFrames to two different sheets of an Excel file in an ADLS directory from a Databricks notebook

Question

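First I tried writing directly to the blob, but that did not work. So I tried writing to a temporary directory and then moving the file to the target directory. That did not work either. I am looking for a way to write an Excel file with multiple sheets to Azure Blob storage.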

import os
import shutil

import pandas as pd
from openpyxl.styles import PatternFill

# Swap the last '-'-separated token of the old file name for the new attribute token
old_token = old_attribute_file_path.split('/')[-1].split('-')[-1].split('.')[0]
new_token = attribute_files[0].split('-')[1]
new_file_path = old_attribute_file_path.replace(old_token, new_token)

filename = os.path.join(arg_dict['out_dir'], new_file_path)
temp_file_name = os.path.join(TMP_PATH, new_file_path)

fill_color = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')

# Write DataFrames to Excel
with pd.ExcelWriter(temp_file_name, engine='openpyxl') as writer:
    df1.to_excel(writer, index=False, sheet_name='Sheet1')
    df2.to_excel(writer, index=False, sheet_name='Sheet2')

    # Get the underlying openpyxl workbook
    workbook = writer.book

    # Save the workbook (this is the call that raises the error below)
    workbook.save(temp_file_name)

shutil.move(temp_file_name, filename)

The error I get:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6a81eac8-0226-477b-9715-070566214b43/lib/python3.10/site-packages/openpyxl/writer/excel.py:294, in save_workbook(workbook, filename)
    292 workbook.properties.modified = datetime.datetime.utcnow()
    293 writer = ExcelWriter(workbook, archive)
--> 294 writer.save()
    295 return True

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6a81eac8-0226-477b-9715-070566214b43/lib/python3.10/site-packages/openpyxl/writer/excel.py:275, in ExcelWriter.save(self)
    273 def save(self):
    274     """Write data into the archive."""
--> 275     self.write_data()
    276     self._archive.close()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6a81eac8-0226-477b-9715-070566214b43/lib/python3.10/site-packages/openpyxl/writer/excel.py:60, in ExcelWriter.write_data(self)
     57 archive = self._archive
     59 props = ExtendedProperties()
---> 60 archive.writestr(ARC_APP, tostring(props.to_tree()))
     62 archive.writestr(ARC_CORE, tostring(self.workbook.properties.to_tree()))
     63 if self.workbook.loaded_theme:

File /usr/lib/python3.10/zipfile.py:1816, in ZipFile.writestr(self, zinfo_or_arcname, data, compress_type, compresslevel)
   1814 zinfo.file_size = len(data)            # Uncompressed size
   1815 with self._lock:
-> 1816     with self.open(zinfo, mode='w') as dest:
   1817         dest.write(data)

File /usr/lib/python3.10/zipfile.py:1182, in _ZipWriteFile.close(self)
   1180     self._fileobj.seek(self._zinfo.header_offset)
   1181     self._fileobj.write(self._zinfo.FileHeader(self._zip64))
-> 1182     self._fileobj.seek(self._zipfile.start_dir)
   1184 # Successfully written: Add file to our caches
   1185 self._zipfile.filelist.append(self._zinfo)

OSError: [Errno 95] Operation not supported
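
The traceback ends in a seek() call inside zipfile. An .xlsx file is a zip archive, and the zip writer jumps back in the output file to rewrite headers, so this error suggests that the target path does not support random writes (Databricks documents this limitation for the /dbfs mount). One possible workaround, sketched below, is to build the workbook in an in-memory buffer and then write the finished bytes in a single sequential pass. The function names excel_bytes and write_excel and the dest_path argument are hypothetical and not part of the original post.

```python
import io

import pandas as pd


def excel_bytes(sheets):
    """Build an .xlsx workbook in memory; `sheets` maps sheet name -> DataFrame."""
    buffer = io.BytesIO()
    # BytesIO supports seek(), so the zip writer used by openpyxl works here.
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        for sheet_name, frame in sheets.items():
            frame.to_excel(writer, index=False, sheet_name=sheet_name)
    # Leaving the with-block saves the workbook into the buffer.
    return buffer.getvalue()


def write_excel(sheets, dest_path):
    """Write the finished workbook to dest_path front to back, without seeking."""
    data = excel_bytes(sheets)
    with open(dest_path, "wb") as out:
        out.write(data)
    return len(data)
```

With the frames from the question this could be called as write_excel({"Sheet1": df1, "Sheet2": df2}, filename). Whether a plain open() works for a given destination depends on how the storage is mounted, so treat this as a starting point only.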

Answer 1

PySpark DataFrames have no to_excel method, and Databricks cannot convert a PySpark DataFrame to an Excel file directly.

The solution is to save the file in /databricks/driver first, then move it to its destination, which also removes it from the driver.

import pandas as pd

with pd.ExcelWriter('export2.xlsx', engine="openpyxl") as writer:
    # A relative path resolves to /databricks/driver/, i.e. the local file system
    data.to_excel(writer, index=False, sheet_name='Sheet1')
    data2.to_excel(writer, index=False, sheet_name='Sheet2')
# Leaving the with block saves the workbook, so no explicit workbook.save() is needed

The file is now stored in the driver folder.

Then move the file from the driver to a DBFS folder:

from shutil import move
move('/databricks/driver/export2.xlsx','/dbfs/export2.xlsx')
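
The two steps above (write locally, then transfer) can be wrapped in one helper. This is a minimal sketch, not the answerer's code: the function name export_via_local_disk and its arguments are hypothetical, and it uses a temporary directory instead of /databricks/driver so the local copy is cleaned up automatically.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd


def export_via_local_disk(sheets, dest_path):
    """Write the workbook on local disk first, then copy it to dest_path.

    `sheets` maps sheet name -> pandas DataFrame.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_file = Path(tmp_dir) / Path(dest_path).name
        # Local disk supports random writes, which the xlsx (zip) writer needs.
        with pd.ExcelWriter(local_file, engine="openpyxl") as writer:
            for sheet_name, frame in sheets.items():
                frame.to_excel(writer, index=False, sheet_name=sheet_name)
        # copyfile copies only the file contents, from start to end.
        shutil.copyfile(local_file, dest_path)
    # The temporary directory and the local copy are deleted at this point.
    return dest_path
```

For example, export_via_local_disk({"Sheet1": data, "Sheet2": data2}, "/dbfs/export2.xlsx") would mirror the answer above. In a Databricks notebook, dbutils.fs.cp with a file:/ source path is another way to do the transfer step.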

Answer 2

openpyxl and xlsxwriter both work with pandas DataFrames. For Spark DataFrames, you can use a Spark plugin to write Excel directly to Blob storage.

You can install it as a library at the cluster level.

Your code would look like the following. First, get your access key and set it in the Spark configuration.

# Replace the <...> placeholders with your own storage account, secret scope and key name
spark.conf.set(
    "fs.azure.account.key.<storage_name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope_name>", key="<access_key>"))

Next, set the path and write the Spark DataFrames to the same path but to different sheets.

path = "abfss://<container_name>@<storage_name>.dfs.core.windows.net/testdir/test.xlsx"

spark_Df1.write.format("com.crealytics.spark.excel")\
  .option("header", "true")\
  .option("dataAddress", "'My Sheet1'!A1")\
  .mode("append")\
  .save(path)

spark_Df2.write.format("com.crealytics.spark.excel")\
  .option("header", "true")\
  .option("dataAddress", "'My Sheet2'!A1")\
  .mode("append")\
  .save(path)
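
The two writes above differ only in the sheet name, so they can be driven from a mapping. The sketch below assumes the same com.crealytics.spark.excel options as the answer; the helper names abfss_path, data_address and write_sheets are hypothetical. Because the question starts from pandas DataFrames, each one would first need converting, for instance with spark.createDataFrame(pandas_df).

```python
def abfss_path(container, account, relative_path):
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


def data_address(sheet_name, cell="A1"):
    """Build a dataAddress value such as 'My Sheet1'!A1 (sheet name in single quotes)."""
    return f"'{sheet_name}'!{cell}"


def write_sheets(frames_by_sheet, path):
    """Append each Spark DataFrame to the same workbook under its own sheet.

    `frames_by_sheet` maps sheet name -> Spark DataFrame. Requires the
    spark-excel library to be installed on the cluster.
    """
    for sheet_name, spark_df in frames_by_sheet.items():
        (
            spark_df.write.format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("dataAddress", data_address(sheet_name))
            .mode("append")
            .save(path)
        )
```

A call could look like write_sheets({"My Sheet1": spark_Df1, "My Sheet2": spark_Df2}, abfss_path("<container_name>", "<storage_name>", "testdir/test.xlsx")), with the placeholders replaced by real names.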
