Automatic header extractor for messy CSV files

Question

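I have a bunch (100+) of CSV files. Each of them can contain blank rows, or rows I don't need (junk text such as "Congratulations, you are all bla bla"). When reading a file with Pandas, I need to specify which row is the header row. Doing that by hand for many files is a lot of work. Keep in mind that the files all have different formats.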

Currently, I iterate over all the rows, check only whether every cell in a row is a string, and pick that row as the header.
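The check described above can be sketched roughly as follows. This is an illustration only: parsing with the standard `csv` module and the names `looks_like_text` and `first_all_text_row` are assumptions, and "is a string" is interpreted here as "non-empty and not parseable as a number".

```python
import csv
import io


def looks_like_text(cell):
    """Return True if the cell is non-empty and cannot be parsed as a number."""
    cell = cell.strip()
    if not cell:
        return False
    try:
        float(cell)
    except ValueError:
        return True
    return False


def first_all_text_row(lines):
    """Return the index of the first row whose cells are all text, or None."""
    for i, row in enumerate(csv.reader(lines)):
        # Blank lines come back as empty lists and are skipped.
        if row and all(looks_like_text(cell) for cell in row):
            return i
    return None


sample = io.StringIO(
    "Congratulations, you are all bla bla\n"
    "\n"
    "name,age,city\n"
    "Ann,31,Oslo\n"
)
print(first_all_text_row(sample))  # prints 0
```

On this sample the junk line at index 0 is selected rather than the real header at index 2, because the junk line also consists only of text cells. That is the weakness of a plain yes/no check.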

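I need a better function that condenses a list of strings into a single confidence score (so that I can see which row is most likely to be the header).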

Can you help me out?

Answer 1

I dealt with the same requirement some time ago; my approach was as follows.

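This function receives an already loaded DataFrame and builds the header from the leading rows, provided they contain no numeric values (otherwise the table is left as it is):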

def reset_headers(df):
    """Promote the leading all-non-numeric rows of df to column headers."""
    header_rows = []
    for _, row in df.iterrows():
        # A row counts as part of the header only if none of its cells is numeric.
        if not row.apply(represents_number).any():
            header_rows.append(row)
        else:
            break
    # Keep only the data rows, dropping the rows that were consumed as header.
    # Slicing by count (not by index label) also works for a non-default index.
    df = df.iloc[len(header_rows):, :].copy()
    # If more than one row is considered header, join them column by column.
    if header_rows:
        df.columns = [
            " ".join(str(cell) for cell in cells).strip()
            for cells in zip(*header_rows)
        ]
    # Finally, make duplicated column names unique with a numeric suffix.
    duplicated = set(df.columns[df.columns.duplicated()])
    rename_cols = []
    i = 1
    for col in df.columns:
        if col in duplicated:
            rename_cols.append(str(col) + "_" + str(i))
            i += 1
        else:
            rename_cols.append(col)
    df.columns = rename_cols

    return df

To run the function above, you also need a helper that checks whether a value represents a number, for example:

def represents_number(s):
    """Return True if s can be converted to a float, False otherwise."""
    try:
        float(s)
    except (ValueError, TypeError):
        return False
    return True
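Since the question asks for a single confidence score per row, here is one possible sketch that uses only the standard library. The three heuristics (share of non-numeric cells, share of filled columns, share of distinct values), their multiplication into one number, and the names `header_score` and `guess_header_row` are all assumptions chosen for illustration, not an established method; tune or replace them for your files.

```python
import csv
import io
from collections import Counter


def is_number(cell):
    """Return True if the string can be parsed as a float."""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def header_score(row, width):
    """Score in [0, 1] for how much a row of strings looks like a header.

    width is the number of columns the table is expected to have.
    """
    filled = [c.strip() for c in row if c.strip()]
    if not filled or width == 0:
        return 0.0
    text_ratio = sum(not is_number(c) for c in filled) / len(filled)
    fill_ratio = min(len(filled), width) / width      # headers rarely have gaps
    unique_ratio = len(set(filled)) / len(filled)     # header names rarely repeat
    return text_ratio * fill_ratio * unique_ratio


def guess_header_row(lines):
    """Return (row_index, score) of the most header-like row, or (None, 0.0)."""
    rows = list(csv.reader(lines))
    lengths = [len(r) for r in rows if r]
    if not lengths:
        return None, 0.0
    # Assumption: the most common row length is the real table width.
    width = Counter(lengths).most_common(1)[0][0]
    scores = [header_score(r, width) for r in rows]
    best = max(range(len(rows)), key=scores.__getitem__)
    return best, scores[best]


sample = io.StringIO(
    "Congratulations, you are all bla bla\n"
    "\n"
    "name,age,city\n"
    "Ann,31,Oslo\n"
    "Bob,45,Rome\n"
)
print(guess_header_row(sample))  # prints (2, 1.0)
```

The returned index counts every line of the file, including blank ones. If you pass it to `pandas.read_csv` as `header`, keep in mind that pandas ignores blank lines when counting unless `skip_blank_lines=False` is set.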

Hope this helps you get started...
