How do I read an Oracle HCM HDL (.dat) file into pandas?


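I have an Oracle HCM HDL (.dat) file containing data laid out in a specific format. I want to read it into pandas, anonymize the values in certain columns, save each section to its own CSV file, and also convert the result back into the same HDL format.

Here is a sample: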

METADATA|Worker|SourceSystemOwner|SourceSystemId|EffectiveStartDate|PersonNumber|StartDate|DateOfBirth|ActionCode
MERGE|Worker|EMP|HDL001|2001/09/08|HDL-1001|2001/09/08|1952/05/21|HIRE
MERGE|Worker|EMP|HDL002|2005/02/08|HDL-1002|2005/02/08|1966/04/21|HIRE

METADATA|PersonName|SourceSystemOwner|SourceSystemId|EffectiveStartDate|PersonId(SourceSystemId)|NameType|LegislationCode|Title|LastName|FirstName
MERGE|PersonName|EMP|HDL001_NME|2001/09/08|HDL001|GLOBAL|US|MR.|Wells|Christopher
MERGE|PersonName|EMP|HDL002_NME|2005/02/08|HDL002|GLOBAL|US|MRS.|Hugh|Lorraine

Documentation on the Oracle HCM HDL file structure: https://docs.oracle.com/en/cloud/saas/tutorial-hdl-load-files/index.html#task_one

Can someone provide a Python code example or guidance on how to do this efficiently with pandas read_csv and to_csv?

python python-3.x pandas oracle
1 Answer
import pandas as pd

# Read the Oracle HCM HDL file, dropping blank lines between sections
with open('data.hdl', 'r') as file:
    data = [line.strip() for line in file if line.strip()]

# Column names and data rows for each section, keyed by section name
columns = {}
rows = {}

# Process each line: the first field is the instruction, the second the section
for line in data:
    instruction, section, *values = line.split('|')
    if instruction == 'METADATA':
        # A METADATA line lists the column names for its section
        columns[section] = values
        rows[section] = []
    elif instruction == 'MERGE':
        # A MERGE line holds one record for its section
        rows[section].append(values)

# Build one dataframe per section (all values stay as strings)
frames = {section: pd.DataFrame(rows[section], columns=columns[section]) for section in columns}
worker_df = frames['Worker']
person_name_df = frames['PersonName']

# Anonymize specific columns in the Worker section
worker_df['PersonNumber'] = worker_df['PersonNumber'].apply(lambda x: 'Anon' + str(x))

# Save each section into its own CSV file
worker_df.to_csv('worker_data.csv', index=False)
person_name_df.to_csv('person_name_data.csv', index=False)

# Convert back into HDL format: one METADATA line, then one MERGE line per row
with open('output.hdl', 'w') as file:
    for section, df in frames.items():
        file.write('|'.join(['METADATA', section, *df.columns]) + '\n')
        for row in df.itertuples(index=False):
            file.write('|'.join(['MERGE', section, *row]) + '\n')
        file.write('\n')

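Make sure to replace 'data.hdl' with the path to your Oracle HCM HDL file. This code assumes the data in the HDL file is structured exactly as in the sample data provided. Depending on the actual structure of your data, adjustments may be needed.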
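
Since the question asks specifically about read_csv and to_csv, here is a rough sketch of one way to use them, not a definitive implementation. It groups the lines by section and hands each group to read_csv with the METADATA line as the header row, so the first two columns end up named METADATA and the section name. The helper names read_hdl, write_hdl and pseudonymize, the salt value, and the choice of a truncated SHA-256 token are all assumptions made for illustration. It also assumes every non-blank line has at least an instruction and a section field, and that no value contains a pipe character.

```python
import csv
import hashlib
import io

import pandas as pd


def read_hdl(text):
    """Hypothetical helper: split HDL text into one DataFrame per section."""
    blocks = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The second pipe-delimited field is the section name (e.g. Worker)
        section = line.split('|', 2)[1]
        blocks.setdefault(section, []).append(line)
    frames = {}
    for section, lines in blocks.items():
        # The METADATA line becomes the header row; dtype=str and
        # keep_default_na=False leave dates, IDs and empty fields untouched
        frames[section] = pd.read_csv(
            io.StringIO('\n'.join(lines)), sep='|', dtype=str,
            keep_default_na=False, quoting=csv.QUOTE_NONE,
        )
    return frames


def pseudonymize(value, salt='example-salt'):
    """Hypothetical helper: map a value to a short, stable SHA-256 based token."""
    digest = hashlib.sha256((salt + value).encode('utf-8')).hexdigest()
    return 'ANON-' + digest[:10]


def write_hdl(frames):
    """Hypothetical helper: turn the per-section DataFrames back into HDL text."""
    parts = [
        df.to_csv(sep='|', index=False, quoting=csv.QUOTE_NONE)
        for df in frames.values()
    ]
    # A blank line separates sections, as in the sample
    return '\n'.join(parts)


sample = '\n'.join([
    'METADATA|Worker|SourceSystemOwner|SourceSystemId|EffectiveStartDate|PersonNumber|StartDate|DateOfBirth|ActionCode',
    'MERGE|Worker|EMP|HDL001|2001/09/08|HDL-1001|2001/09/08|1952/05/21|HIRE',
    'MERGE|Worker|EMP|HDL002|2005/02/08|HDL-1002|2005/02/08|1966/04/21|HIRE',
    '',
    'METADATA|PersonName|SourceSystemOwner|SourceSystemId|EffectiveStartDate|PersonId(SourceSystemId)|NameType|LegislationCode|Title|LastName|FirstName',
    'MERGE|PersonName|EMP|HDL001_NME|2001/09/08|HDL001|GLOBAL|US|MR.|Wells|Christopher',
    'MERGE|PersonName|EMP|HDL002_NME|2005/02/08|HDL002|GLOBAL|US|MRS.|Hugh|Lorraine',
    '',
])

frames = read_hdl(sample)
frames['Worker']['PersonNumber'] = frames['Worker']['PersonNumber'].map(pseudonymize)
frames['PersonName']['LastName'] = frames['PersonName']['LastName'].map(pseudonymize)
output = write_hdl(frames)
```

Because the METADATA and MERGE markers travel along as ordinary columns, writing a frame back with to_csv and the same separator restores the HDL layout without extra bookkeeping. For the per-section CSV files, something like frames['Worker'].iloc[:, 2:].to_csv('worker_data.csv', index=False) would drop those two marker columns first.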
