I have a .csv file with 893 columns that I need to read into pandas (or R) for analysis. When the spreadsheet was generated, it created duplicate columns that need to be merged into a single column.
The problem I'm running into is that when I read the .csv into pandas or R to create a data frame, each extra duplicate column is automatically given a numeric suffix, which means the duplicates can't easily be grouped.
The original data is formatted like this:
****** PYTHON ******
!!!EXAMPLE!!!
import pandas as pd
d = {'Name':["Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"],
"Dates":["2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}
d = pd.DataFrame(d)
d
****** R ******
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
FYI - this is just an example. The real data will be read from a .csv file with
df = pd.read_csv("dummy.csv")
or the equivalent.
Is there any way to either keep the duplicate column names on import, or to merge the duplicates into one column afterwards?
N.B: Interestingly, while putting the example together I noticed that it wouldn't even let me create a data frame with identically named columns.
The problem is that you are creating a dictionary whose keys are not unique, so it cannot exist in the form you intend (each repeated value simply overwrites the previous one). The already-collapsed dictionary is then passed to pandas as-is and used to create the DataFrame.
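A minimal sketch of that collapse, before pandas is even involved:

```python
# Duplicate keys in a dict literal silently overwrite each other;
# only the last value survives
d = {"Event": [1, 2], "Event": [3, 4], "Event": [5, 6]}
print(d)  # {'Event': [5, 6]}
```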
You can add the extra columns with a different method instead, one that lets you explicitly allow duplicates.
import pandas as pd
d = {'Name': ["Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim",
"Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue"],
"Dates": ["2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28"],
"Event": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
d = pd.DataFrame(d)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
This gives you:
> Name Dates Event Event Event
0 Jim 2010-1-1 1 1 1
1 Jim 2010-1-2 1 1 1
2 Jim 2010-01-5 1 1 1
3 Jim 2010-01-17 1 1 1
4 Jim 2010-01-20 1 1 1
5 Jim 2010-01-29 1 1 1
6 Jim 2010-02-6 1 1 1
7 Jim 2010-02-9 1 1 1
8 Jim 2010-02-16 1 1 1
9 Jim 2010-02-28 1 1 1
10 Sue 2010-1-1 1 1 1
11 Sue 2010-1-2 1 1 1
12 Sue 2010-01-5 1 1 1
13 Sue 2010-01-17 1 1 1
14 Sue 2010-01-20 1 1 1
15 Sue 2010-01-29 1 1 1
16 Sue 2010-02-6 1 1 1
17 Sue 2010-02-9 1 1 1
18 Sue 2010-02-16 1 1 1
19 Sue 2010-02-28 1 1 1
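Since the original goal was to merge the duplicates into one column, here is a sketch of one way to do that (assuming the duplicated values should be summed): transpose so the repeated labels become an index, group on that index, aggregate, and transpose back.

```python
import pandas as pd

# Hypothetical frame with three duplicate "Event" columns
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Event", "Event", "Event"])

# Duplicate column labels become duplicate index labels after .T,
# so a level-0 groupby collapses them into one row per unique name
merged = df.T.groupby(level=0).sum().T
print(merged)
#    Event
# 0      6
# 1     15
```

Swap `.sum()` for `.max()`, `.first()`, etc., depending on how the duplicates should be combined.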
If the columns are already there but carry unwanted suffixes like [.1, .2, ...], you can strip them with the re module:
import pandas as pd
import re
df = pd.read_csv("example.csv", sep=";")
> a b b.1 b.2
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
df.columns = [re.sub(r"(.*?)(\.\d+)", r"\1", c) for c in df.columns]
> a b b b
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
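The same renaming can be done without the re module, using pandas' vectorised string replace directly on the column index (the frame below is a hypothetical stand-in for the read_csv result):

```python
import pandas as pd

# Hypothetical frame as read_csv would produce it with duplicate headers
df = pd.DataFrame({"a": ["a", "b"], "b": [1, 2], "b.1": [5, 5], "b.2": [7, 7]})

# Strip a trailing ".<digits>" suffix from every column label
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)
print(df.columns.tolist())  # ['a', 'b', 'b', 'b']
```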
I would like to suggest this solution:
import pandas as pd
from io import StringIO
data = StringIO("""Name;Dates;Event;Event;Event
Jim;2010-1-1;1;1;1
Jim;2010-1-2;1;1;1
Jim;2010-01-5;1;1;1
Jim;2010-01-17;1;1;1
Jim;2010-01-20;1;1;1
Jim;2010-01-29;1;1;1
Jim;2010-02-6;1;1;1
Jim;2010-02-9;1;1;1
Jim;2010-02-16;1;1;1
Jim;2010-02-28;1;1;1
Sue;2010-1-1;1;1;1
Sue;2010-1-2;1;1;1
Sue;2010-01-5;1;1;1
Sue;2010-01-17;1;1;1
Sue;2010-01-20;1;1;1
Sue;2010-01-29;1;1;1
Sue;2010-02-6;1;1;1
Sue;2010-02-9;1;1;1
Sue;2010-02-16;1;1;1
Sue;2010-02-28;1;1;1
""")
df = pd.read_csv(data, sep=';')
df.rename(columns={col:col.split('.')[0] for col in df.columns}, inplace=True)
print(df)
Output:
Name Dates Event Event Event
0 Jim 2010-1-1 1 1 1
1 Jim 2010-1-2 1 1 1
2 Jim 2010-01-5 1 1 1
3 Jim 2010-01-17 1 1 1
4 Jim 2010-01-20 1 1 1
5 Jim 2010-01-29 1 1 1
6 Jim 2010-02-6 1 1 1
7 Jim 2010-02-9 1 1 1
8 Jim 2010-02-16 1 1 1
9 Jim 2010-02-28 1 1 1
10 Sue 2010-1-1 1 1 1
11 Sue 2010-1-2 1 1 1
12 Sue 2010-01-5 1 1 1
13 Sue 2010-01-17 1 1 1
14 Sue 2010-01-20 1 1 1
15 Sue 2010-01-29 1 1 1
16 Sue 2010-02-6 1 1 1
17 Sue 2010-02-9 1 1 1
18 Sue 2010-02-16 1 1 1
19 Sue 2010-02-28 1 1 1
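One thing worth knowing about the result above: once the columns share a name, selecting that label returns every matching column as a DataFrame, which is what lets you operate on them as a group. A minimal sketch:

```python
import pandas as pd

# After the rename, the three duplicate columns share the label "Event"
df = pd.DataFrame([[1, 1, 1], [2, 2, 2]], columns=["Event", "Event", "Event"])

# Indexing by a duplicated label returns all matching columns at once
sub = df["Event"]
print(sub.shape)  # (2, 3)
```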
I know this is an old post, but I ran into a similar problem and spent hours searching for a solution with no luck. I think your original question also asked for the duplicate columns to be "merged into one", which the earlier answers don't address. So I wanted to share what I did! After loading the df, calling aggregate_labels(df) should get you what you want.
import pandas as pd

df = pd.read_csv('your_csv_file.csv')

def list_destroyer(df):
    # Unwrap single-element lists back into plain scalars
    for column in df.axes[1]:
        for element in range(len(df[column])):
            if isinstance(df[column][element], list):
                if len(df[column][element]) == 1:
                    df[column][element] = df[column][element][0]

def nan_destroyer(df):
    # Strip NaN entries out of list-valued cells
    for column in df.axes[1]:
        for element in range(len(df[column])):
            if isinstance(df[column][element], list):
                nan_indexes = []
                for i in range(len(df[column][element])):
                    if pd.isnull(df[column][element][i]):
                        nan_indexes.append(i)
                for j in sorted(nan_indexes, reverse=True):
                    del df[column][element][j]

def add_nan_rows(df, num_rows):
    # Pad the frame with empty rows until it has num_rows rows
    while len(df) < num_rows:
        df.loc[len(df)] = pd.Series(dtype='float64')

def clean_nan(df):
    df.dropna(axis=0, how='all', inplace=True)

def aggregate_labels(df):
    # Drop the ".1", ".2", ... suffixes, then collect same-named columns
    # into lists (note: axis=1 grouping is deprecated in newer pandas)
    df.columns = df.columns.str.split('.').str[0]
    df = df.groupby(by=df.columns.values, axis=1).apply(lambda x: pd.Series(x.values.tolist()))
    nan_destroyer(df)
    list_destroyer(df)
    for column in df.axes[1]:
        exploded = df[column].explode()
        add_nan_rows(df, len(exploded))
        df[column] = exploded
    clean_nan(df)
    return df

df_clean = aggregate_labels(df)