Unable to predefine dtype when reading data

Question

I am reading headerless, pipe-delimited files into pandas, and I am using pandas version 0.24.2. This is public data, so there are no confidentiality concerns.

The data looks like this:

999778247820|R|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|7.375|113000|360|02/2001|04/2001|95|95|1|52|665|Y|P|SF|1|P|IL|601|30|FRM||1|N
999783196683|R|OTHER|7.25|59000|360|01/2001|04/2001|97|97|2|43|682|Y|P|PU|1|P|HI|967|30|FRM|676|1|N
999783470376|C|BANK OF AMERICA, N.A.|7.875|110000|360|12/2000|02/2001|74|74|2|26|700|N|P|SF|1|P|NY|125||FRM|698||N
999786911479|C|BANK OF AMERICA, N.A.|7.5|57000|360|12/2000|02/2001|90|90|1|28|699|N|P|SF|1|P|TX|781|25|FRM||1|N
999786913710|R|JPMORGAN CHASE BANK, NA|7.125|114000|360|01/2001|04/2001|73|73|2|16|745|N|C|SF|1|P|WA|992||FRM|||N
999788833695|B|OTHER|9|50000|360|10/2000|12/2000|90|90|2|40|674|N|P|SF|2|I|WI|535|25|FRM|737|1|N

This is the code I am using:

import glob

import pandas as pd

orig_files_fnma = glob.glob("/...1/Acquisition*.txt")

col_names = ["loan_id", "origination_channel","seller_name","original_interest_rate","original_upb","original_loan_term","origination_date","first_payment_date","original_ltv","original_cltv","number_of_borrowers","original_dti",
            "borrower_fico_at_origination","first_time_home_buyer_indicator", "loan_purpose","property_type","number_of_units","occupancy_type","property_state","zip_code_short","primary_mortgage_insurance_percent",
            "product_type","coborrower_fico_at_origination","mortgage_insurance_type","relocation_mortgage_indicator"]

col_type = {"loan_id": "object","origination_channel": "object","seller_name": "object","original_interest_rate": "float","original_upb": "float","original_loan_term": "int","origination_date": "object",
            "first_payment_date": "object","original_ltv": "object","original_cltv": "object","number_of_borrowers": "int","original_dti": "float","borrower_fico_at_origination": "int",
            "first_time_home_buyer_indicator": "object", "loan_purpose": "object","property_type": "object","number_of_units": "int","occupancy_type": "object","property_state": "object",
            "zip_code_short": "object","primary_mortgage_insurance_percent": "float",
            "product_type": "object","coborrower_fico_at_origination": "int","mortgage_insurance_type": "object","relocation_mortgage_indicator": "object"}

dfs = []

# Read every acquisition file with the predefined column names and dtypes.
for path in orig_files_fnma:
    temp_df = pd.read_csv(path, sep='|', header=None, names=col_names, dtype=col_type, index_col=None, parse_dates=True, verbose=True, engine='python')
    dfs.append(temp_df)

I always get the following error:

Filled 1 NA values in column original_ltv
Filled 52 NA values in column original_cltv
ValueError: Unable to convert column number_of_borrowers to type int

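I did find that it works if I do not predefine dtype and instead change the data types with .astype after loading. However, is it possible to predefine the data types up front, as in the code above?

Also, I would like to limit the object (string) columns to a length of 20. What is the correct code for that?

Many thanks!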

Answer 1

I got a different error:

ValueError: Unable to convert column coborrower_fico_at_origination to type int

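If you import the data into Excel, you will see that three rows in this column are blank. The int type cannot represent blank (missing) values. You should change the column to float, in which case the blanks become NaN: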

col_type = {..., "coborrower_fico_at_origination": "float", ...}
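The effect of that change can be checked on a small excerpt. The following is a minimal sketch, assuming a two-column extract (loan_id and coborrower_fico_at_origination) taken from the first three sample rows and read from an in-memory string rather than the real files:

```python
import io

import pandas as pd

# loan_id and coborrower_fico_at_origination from the first three sample rows;
# the first row has a blank FICO field.
sample = "999778247820|\n999783196683|676\n999783470376|698\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="|",
    header=None,
    names=["loan_id", "coborrower_fico_at_origination"],
    dtype={"loan_id": "object", "coborrower_fico_at_origination": "float"},
)

# The blank field is read as NaN, which a float column can hold.
print(df.dtypes)
print(df["coborrower_fico_at_origination"].isna().sum())
```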

After that, the command runs successfully.
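Two points from the question remain: keeping the columns as integers, and capping strings at 20 characters. The sketch below is one possible approach, not a tested fix for the original files. It assumes a pandas release newer than 0.24.2, since read_csv support for the nullable "Int64" dtype appears to have arrived in a later version. Because the full files evidently contain blanks in other integer columns as well (the original error names number_of_borrowers), every integer column is declared as "Int64" here. As far as is known, read_csv has no option for a maximum string length, so the truncation is done after loading with .str.slice; seller_name is used only as an example column.

```python
import io

import pandas as pd

# First three sample rows from the question, read from memory for illustration.
sample = (
    "999778247820|R|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|7.375|113000|360|02/2001|04/2001|95|95|1|52|665|Y|P|SF|1|P|IL|601|30|FRM||1|N\n"
    "999783196683|R|OTHER|7.25|59000|360|01/2001|04/2001|97|97|2|43|682|Y|P|PU|1|P|HI|967|30|FRM|676|1|N\n"
    "999783470376|C|BANK OF AMERICA, N.A.|7.875|110000|360|12/2000|02/2001|74|74|2|26|700|N|P|SF|1|P|NY|125||FRM|698||N\n"
)

col_names = [
    "loan_id", "origination_channel", "seller_name", "original_interest_rate",
    "original_upb", "original_loan_term", "origination_date", "first_payment_date",
    "original_ltv", "original_cltv", "number_of_borrowers", "original_dti",
    "borrower_fico_at_origination", "first_time_home_buyer_indicator", "loan_purpose",
    "property_type", "number_of_units", "occupancy_type", "property_state",
    "zip_code_short", "primary_mortgage_insurance_percent", "product_type",
    "coborrower_fico_at_origination", "mortgage_insurance_type",
    "relocation_mortgage_indicator",
]

float_cols = ["original_interest_rate", "original_upb", "original_dti",
              "primary_mortgage_insurance_percent"]
int_cols = ["original_loan_term", "number_of_borrowers", "borrower_fico_at_origination",
            "number_of_units", "coborrower_fico_at_origination"]

# Default every column to text, then override the numeric ones.
col_type = {name: "object" for name in col_names}
col_type.update({name: "float" for name in float_cols})
# "Int64" (capital I) is the nullable integer dtype: blanks become <NA>.
col_type.update({name: "Int64" for name in int_cols})

df = pd.read_csv(io.StringIO(sample), sep="|", header=None,
                 names=col_names, dtype=col_type)

# Truncate a text column to at most 20 characters after loading.
df["seller_name"] = df["seller_name"].str.slice(0, 20)

print(df[["seller_name", "number_of_borrowers", "coborrower_fico_at_origination"]])
```

The same dtype dictionary could then be passed to read_csv inside the loop over the real files. If upgrading pandas is not an option, declaring the affected columns as float, as above, remains the simpler route.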
