I have a dataset in the form of a txt file that looks like this:
beer_name: Legbiter
beer_id: 19827
brewery_name: Strangford Lough Brewing Company Ltd
brewery_id: 10093
style: English Pale Ale
abv: 4.8
date: 1357729200
user_name: AgentMunky
user_id: agentmunky.409755
appearance: 4.0
aroma: 3.75
palate: 3.5
taste: 3.5
overall: 3.75
rating: 3.64
text: Poured from a 12 ounce bottle into a pilsner glass.A: A finger of creamy head with clear-dark amber body.S: Rich brown sugar. Malty...T: Slight sugars, dry malt, vague hops. Big malty-brown with sugar.M: Dry and slightly astringent before a boring endtaste.O: Solid beer. Drinkable and interesting. Still vaguely bland.
review: True
I am using the following call to try to turn it into a proper df (more processing happens afterwards, but this is where the error is thrown):
rb_file_data = pd.read_csv(os.path.join(MATCHED_BEER_DIR, 'ratings_with_text_rb.txt'), sep=":", header=None, names=["Key", "Value"])
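The problem I am running into is that some reviews use ":" inside the text field (I deliberately picked a review that contains a few to show you), which raises the following error: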
ParserError: Error tokenizing data. C error: Expected 2 fields in line 34, saw 7
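If necessary, I have enough data to drop those reviews entirely, but I would be happy to keep them if possible.
Is there a way to apply the separator only at its first occurrence in a line, or any other way around this?
You can try the following code, which reads the file line by line and splits each line on the first colon only: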
import pandas as pd
import os
MATCHED_BEER_DIR = "give your directory path"
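with open(os.path.join(MATCHED_BEER_DIR, 'ratings_with_text_rb.txt'), 'r') as file:
    # maxsplit=1 splits on the first ':' only, so colons inside the review text are kept;
    # blank lines are skipped so they do not produce empty rows
    data = [line.strip().split(':', 1) for line in file if line.strip()]
rb_file_data = pd.DataFrame(data, columns=["Key", "Value"])
# remove the space that follows the colon
rb_file_data['Value'] = rb_file_data['Value'].str.strip()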
print(rb_file_data)
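The frame above keeps every key/value pair as its own row. If one row per review is wanted instead, the same first-colon split can be grouped into records. This is only a minimal sketch, assuming every review starts with a beer_name line; the function name parse_reviews, its first_key parameter, and the second sample record are made up for illustration and do not come from the dataset.

```python
import io


def parse_reviews(lines, first_key="beer_name"):
    """Collect 'key: value' lines into one dict per review.

    Assumption: a new review begins whenever `first_key` is seen again.
    """
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        # partition splits on the first ':' only, so colons in the text survive
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon at all: not a key/value line
        key = key.strip()
        if key == first_key and current:
            records.append(current)  # previous review is complete
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)  # do not lose the last review
    return records


# Hypothetical two-review sample; the second record is invented for the demo.
sample = io.StringIO(
    "beer_name: Legbiter\n"
    "rating: 3.64\n"
    "text: A: A finger of creamy head.S: Rich brown sugar.\n"
    "\n"
    "beer_name: Example Beer\n"
    "rating: 4.0\n"
    "text: No colons here\n"
)
records = parse_reviews(sample)
print(records)
```

Passing the resulting list to pd.DataFrame(records) should give one row per review with the keys as columns. All values stay strings at this point, so numeric columns such as rating would still need converting afterwards.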
You can try this:
df = (pd.read_csv("file.txt", header=None, engine="python",
                  sep=r"(.+?):\s*(.+)")  # the lazy first group stops at the first colon
        .dropna(how="all", axis=1).set_index(1).T)
Output:
print(df)
1  beer_name  beer_id         brewery_name  ...  rating                 text  review
2   Legbiter    19827  Strangford Lough...  ...    3.64  Poured from a 12...    True

[1 rows x 17 columns]
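To see why a separator with capture groups yields usable columns, it may help to look at what such a split does to a single line. As far as I can tell, the python engine applies the regex to each line much like re.split does, so the snippet below is a rough illustration of that behaviour rather than a description of pandas internals; the sample line is shortened from the review above.

```python
import re

# Same pattern as the `sep` argument: a lazy key, the first colon,
# optional whitespace, then everything that is left on the line.
pattern = r"(.+?):\s*(.+)"

line = "text: A: A finger of creamy head.S: Rich brown sugar."
parts = re.split(pattern, line)

# The pattern matches the whole line, so the pieces before and after the
# match are empty strings and the two captured groups sit in between.
print(parts)
```

The empty first and last pieces are what end up as the all-NaN columns that dropna(how="all", axis=1) removes, leaving column 1 (the key) and column 2 (the value). Note that this answer transposes the whole file into a single row, so a file holding several reviews would produce repeated column names and would need to be split into records first.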