Making a presence/absence matrix (y-axis is file names, x-axis is the reads in the files)

Question

I have multiple files (file names), each containing multiple sequence reads (each read name in a file starts with >):

Filename1

>Readname1

>Readname2

Filename2

>Readname1

>Readname3

Given a dictionary containing all possible read names:

g={}

g['Readname1']=[]

g['Readname2']=[]

g['Readname3']=[]

how can I write code that iterates over each file and generates the following matrix:

          Filename1 Filename2

Readname1  1        1

Readname2  1        0

Readname3  0        1

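The code should scan the contents of every file in a directory. Ideally, I could read the dictionary from an input file rather than hard-coding it, so that I can generate matrices for different dictionaries. The content of each read (e.g. its gene sequence) is irrelevant; all that matters is whether the read name is present in the file.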

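I am just learning Python, so a colleague shared their code to get me started. Here they build a presence/absence matrix for their dictionary (Readnames) over the files listed in a specified file (files.txt). I would like to input the dictionary from a second file (so that it is not static in the code) and iterate over multiple files.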

from Bio import SeqIO
import os

dir_path = ""  # directory containing the FASTA files and files.txt

# files.txt lists the FASTA file names to scan, one per line
with open(os.path.join(dir_path, 'files.txt')) as f:
    files = [x.strip() for x in f if x.strip()]

# read name -> list of files in which that read occurs
g = {}
g['Readname1'] = []
g['Readname2'] = []
g['Readname3'] = []

for i in files:
    for j in SeqIO.parse(os.path.join(dir_path, i), 'fasta'):
        if j.id in g:  # skip reads that are not in the dictionary
            g[j.id].append(i)

print('generating counts...')
# read name -> list of 1/0 flags, one per file, in the order of `files`
counts = {}
for i in g:
    counts[i] = []

for i in files:
    for j in g:
        if i in g[j]:
            counts[j].append(1)
        else:
            counts[j].append(0)

print('writing out...')
with open(os.path.join(dir_path, 'core_withLabels.csv'), 'w') as outfile, \
     open(os.path.join(dir_path, 'core_noLabels.csv'), 'w') as outfile2:
    # header row: file names (the labelled file starts with an empty cell)
    outfile.write(',' + ','.join(files) + '\n')
    outfile2.write(','.join(files) + '\n')
    # one row per read name
    for i in counts:
        row = ','.join(str(j) for j in counts[i])
        outfile.write(i + ',' + row + '\n')
        outfile2.write(row + '\n')

Answer 1

By matrix, do you mean a numpy matrix or a List[List[int]]?

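If you know the total number of read names, using a numpy matrix is easy: create a zero matrix of the appropriate size, matrix = np.zeros((n_filenames, n_readnames), dtype=int). Alternatively, define matrix = [[] for _ in range(n_filenames)]. Either way there is one row per file and one column per read name, which is the transpose of the table shown in the question.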

Additionally, define a mapping from each readname to its column index (idx) in the matrix:

mapping = dict()
next_available_idx = 0
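
The counter can also be folded into the dictionary itself, since the next free column index always equals the number of names seen so far. A minimal sketch of that variation, where `column_index` is a hypothetical helper name and not part of the original answer:

```python
def column_index(mapping, readname):
    """Return the column for `readname`, assigning the next free one if it is new."""
    # len(mapping) is evaluated before the insertion, so a new name
    # receives the first unused index.
    return mapping.setdefault(readname, len(mapping))


mapping = {}
print(column_index(mapping, "Readname1"))  # 0
print(column_index(mapping, "Readname2"))  # 1
print(column_index(mapping, "Readname1"))  # 0 again, the name is already known
```

This only works as long as entries are never removed from the mapping; otherwise an explicit counter is the safer choice.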

Then iterate over all files and fill in the corresponding entries.

for i, filename in enumerate(filenames):
    with open(filename) as f:
        for line in f:
            # only header lines (starting with '>') carry a read name
            if not line.startswith('>'):
                continue
            readname = line[1:].strip()  # drop '>' and surrounding whitespace
            # find the corresponding column
            if readname in mapping:
                col_idx = mapping[readname]
            else:
                col_idx = next_available_idx
                next_available_idx += 1
                mapping[readname] = col_idx
            matrix[i, col_idx] = 1  # for numpy matrix
            # if you use a list of lists instead, grow the row as needed:
            #   row = matrix[i]
            #   row.extend([0] * (col_idx + 1 - len(row)))
            #   row[col_idx] = 1

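Finally, if you use a list of lists, make sure all the lists have the same length. You will need to iterate over the rows of the matrix once more.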
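
For the FASTA files in the question, one way to avoid the padding step altogether is to fix the list of read names up front, so that every row has the same length by construction. The following is a minimal standard-library sketch under some assumptions that are not in the original post: the read names are listed one per line in a separate text file, a read name is the first word after `>` on a header line, and the helper names (`read_ids`, `load_readnames`, `build_matrix`, `write_matrix`) are made up for illustration. It produces the layout of the table in the question (one row per read name, one column per file).

```python
import csv
import os


def read_ids(fasta_path):
    """Return the set of read names (first word after '>') in one FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def load_readnames(list_path):
    """Read the names to look for, one per line, skipping blank lines."""
    with open(list_path) as handle:
        return [line.strip().lstrip(">") for line in handle if line.strip()]


def build_matrix(readnames, fasta_paths):
    """Return one row per read name with a 1/0 entry per file."""
    ids_per_file = [read_ids(path) for path in fasta_paths]
    return [[int(name in ids) for ids in ids_per_file] for name in readnames]


def write_matrix(out_path, readnames, fasta_paths, matrix):
    """Write a CSV: a header row of file names, then one row per read name."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([""] + [os.path.basename(p) for p in fasta_paths])
        for name, row in zip(readnames, matrix):
            writer.writerow([name] + row)
```

To run it on a directory `d`, one could collect the paths with something like `sorted(os.path.join(d, f) for f in os.listdir(d))`, load the names from the list file, and pass both to `build_matrix` and `write_matrix`. Biopython's `SeqIO.parse`, as used in the question, could be used in place of `read_ids` if the files need real FASTA parsing.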
