Compare tables to create a presence/absence matrix, filling empty cells and avoiding decimals

The files can be found on GitHub.

File 1:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file1.txt

File 2:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file2.txt

Command line: python teste2.py file1.txt file2.txt test

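When converting the tabular files into a presence/absence matrix, I end up losing some data. Genomes that have no matching accession do not appear in the output.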

My previous result looked like this (based on the script and example in the post "Convert tables to presence/absence matrix python - Solved"):

genome  accession1  accession2  accession3  accession4  accession5
genome1           1           1           1           0           0
genome2           1           0           0           1           1

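But for my downstream analysis I need the other genomes as well. I tried to fix this by moving the block that defines df2 before the one that defines df1: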

import sys

import pandas as pd

asmbly_dict = sys.argv[1]    # file1: gene -> genome table
blast_result = sys.argv[2]   # file2: gene -> accession table
outName = sys.argv[3] + '.txt'

# Both inputs are tab-separated and have no header row
df2 = pd.read_csv(blast_result, sep='\t', header=None, names=['gene', 'accession'])
print(df2)

df1 = pd.read_csv(asmbly_dict, sep='\t', header=None, names=['gene', 'genome'])
# Genes that are missing from df2 get NaN as their accession
df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])
#print(df1)
g = df1.groupby('genome')['accession'].apply(list).reset_index()
# pandas >= 2.0: .sum(level=0) is written .groupby(level=0).sum(), and drop() takes columns=
testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).groupby(level=0).sum()).drop(columns='accession')
#print(testdf.to_string(index=False))
testdf.to_csv(outName, sep='\t', header=True, index=False)

Printing df2:

    gene   accession
0  gene1  accession1
1  gene2  accession2
2  gene3  accession3
3  gene4  accession1
4  gene5  accession4
5  gene6  accession5

Printing df1:

    gene   genome   accession
0  gene1  genome1  accession1
1  gene2  genome1  accession2
2  gene3  genome1  accession3
3  gene4  genome2  accession1
4  gene5  genome2  accession4
5  gene6  genome2  accession5
6  gene7  genome3         NaN
7  gene8  genome3         NaN
8  gene9  genome4         NaN

Printing testdf:

genome  accession1  accession2  accession3  accession4  accession5
genome1         1.0         1.0         1.0         0.0         0.0
genome2         1.0         0.0         0.0         1.0         1.0
genome3         NaN         NaN         NaN         NaN         NaN
genome4         NaN         NaN         NaN         NaN         NaN

And the file written by to_csv:

genome  accession1  accession2  accession3  accession4  accession5
genome1         1.0         1.0         1.0         0.0         0.0
genome2         1.0         0.0         0.0         1.0         1.0
genome3
genome4

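The question is:

How can I output the numbers without decimals (1.0 -> 1), and how can I fill the empty values with zeros, both when printing and when writing to the file?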

python pandas pandas-groupby
1 answer (2 votes)

If you want to keep your original solution, add fillna and cast to int:

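testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).groupby(level=0).sum()).drop(columns='accession')

# Keep the string column 'genome' out of the integer cast
testdf = testdf.set_index('genome').fillna(0).astype(int).reset_index()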
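The decimals come from the NaN values: an integer column that has to hold NaN is stored as float64, so 1 prints as 1.0. The following is a minimal sketch of the fix on a hand-built stand-in for testdf (the frame below is an assumption that copies part of the table in the question, not the real data). The genome column is moved to the index first so that the integer cast only touches numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for testdf, copied from the table in the question
testdf = pd.DataFrame({
    'genome': ['genome1', 'genome2', 'genome3', 'genome4'],
    'accession1': [1.0, 1.0, np.nan, np.nan],
    'accession2': [1.0, 0.0, np.nan, np.nan],
})

# NaN forces the numeric columns to float64
print(testdf.dtypes)

# Fill the gaps with 0, then cast back to integers; 'genome' sits in the
# index during the cast, so its strings are left alone
fixed = testdf.set_index('genome').fillna(0).astype(int).reset_index()
print(fixed.to_string(index=False))
```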

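A better solution, though, is to use get_dummies and then take the max per index label and per column (the per-column step may not be needed with real data):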

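df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])

# pandas >= 2.0: .max(level=0) is written .groupby(level=0).max()
dummies = pd.get_dummies(df1.set_index('genome')['accession'], dtype=int)
df1 = dummies.groupby(level=0).max()      # one row per genome
df1 = df1.T.groupby(level=0).max().T      # one column per accession (usually a no-op)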
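To try the get_dummies approach without the input files, here is a self-contained sketch that rebuilds df1 and df2 in memory from the tables printed in the question (the in-memory frames and the name matrix are assumptions for illustration). It relies on get_dummies producing an all-zero row for a NaN value by default, which is why genome3 and genome4 survive with zeros.

```python
import pandas as pd

# In-memory copies of the two tables shown in the question
df2 = pd.DataFrame({
    'gene': ['gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6'],
    'accession': ['accession1', 'accession2', 'accession3',
                  'accession1', 'accession4', 'accession5'],
})
df1 = pd.DataFrame({
    'gene': ['gene%d' % i for i in range(1, 10)],
    'genome': ['genome1'] * 3 + ['genome2'] * 3 + ['genome3'] * 2 + ['genome4'],
})

# Genes absent from df2 (gene7..gene9) map to NaN
df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])

# One indicator row per gene, then collapse to one row per genome with max
dummies = pd.get_dummies(df1.set_index('genome')['accession'], dtype=int)
matrix = dummies.groupby(level=0).max()
print(matrix)
```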

Or use crosstab, clip the counts at 1, and reindex to add the missing genomes:

# clip_upper(1) was removed in pandas 1.0; clip(upper=1) is the equivalent
df1 = (pd.crosstab(df1['genome'], df1['accession'])
        .clip(upper=1)
        .reindex(df1['genome'].unique(), fill_value=0))

Or:

# clip_upper(1) was removed in pandas 1.0; clip(upper=1) is the equivalent
df1 = (df1.groupby(['genome', 'accession'])
         .size()
         .clip(upper=1)
         .unstack(fill_value=0)
         .reindex(df1['genome'].unique(), fill_value=0))

print(df1)
         accession1  accession2  accession3  accession4  accession5
genome                                                             
genome1           1           1           1           0           0
genome2           1           0           0           1           1
genome3           0           0           0           0           0
genome4           0           0           0           0           0
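To see what the clipping and reindexing steps do in the crosstab variant, the sketch below uses a small made-up table (the frame hits and its values are assumptions, chosen so that one genome hits the same accession twice and another has no accession at all). crosstab counts occurrences and drops rows whose accession is NaN, so the count has to be capped at 1 and the dropped genome has to be added back.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table: genome1 hits accession1 twice,
# genome3 has no matching accession
hits = pd.DataFrame({
    'genome': ['genome1', 'genome1', 'genome1', 'genome2', 'genome3'],
    'accession': ['accession1', 'accession1', 'accession2', 'accession1', np.nan],
})

# Raw counts: genome1/accession1 is 2 and genome3 is missing entirely
counts = pd.crosstab(hits['genome'], hits['accession'])
print(counts)

# Cap the counts at 1 (presence/absence), then restore genomes without hits
matrix = counts.clip(upper=1).reindex(hits['genome'].unique(), fill_value=0)
print(matrix)
```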

Finally, write to the file:

df1.to_csv(outName, sep='\t')
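Because the genomes are now the index, to_csv is called without index=False so that the genome names are written as the first column. As a quick check of what ends up in the file, this sketch writes a small hand-built matrix (an assumption, not the real result) to an in-memory buffer instead of outName.

```python
import io

import pandas as pd

# Hypothetical integer matrix with the genomes as a named index
matrix = pd.DataFrame(
    {'accession1': [1, 1, 0], 'accession2': [1, 0, 0]},
    index=pd.Index(['genome1', 'genome2', 'genome3'], name='genome'),
)

# Write tab-separated text to a buffer; a file path works the same way
buffer = io.StringIO()
matrix.to_csv(buffer, sep='\t')
text = buffer.getvalue()
print(text)
```

Integer columns are written without a decimal point, and the index name becomes the header of the first column.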