I have a file.txt that looks like this
(I removed several lines to simplify the example):
PLXNA3 ### <- filename1
Missense/nonsense : 13 mutations # <- header spaces
accession codon_change amino_acid_change # <- column names tsv
ID73 CAT-TAT His66Tyr # <- line tsv
ID63 GAC-AAC Asp127Asn # <- line tsv
ID31 GCC-GTC Ala307Val # <- line tsv
NEDD4L ### <- filename2
Splicing : 1 mutation # <- header spaces
accession splicing_mutation # <- column names tsv
ID51 IVS1 as G-A -16229 # <- line tsv
Gross deletions : 1 mutation # <- header spaces
accession DNA_level description HGVS_(nucleotide) HGVS_(protein) # <- column names tsv
ID853 gDNA 4.5 Mb incl. entire gene Not yet available Not yet available # <- line tsv
OPHN1 ### <- filename3
Small insertions : 3 mutations # <- header spaces
accession insertion HGVS_(nucleotide) # <- column names tsv
ID96 TTATGTT(^183)TATtCAAATCCAGG c.549dupT p.(Gln184Serfs*23) # <- line tsv
ID25 GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT c.931_932dupCA # <- line tsv
I want to split this file to get 3 different files:
PLXNA3.txt
PLXNA3 ### <- filename1
Missense/nonsense : 13 mutations # <- header spaces
accession codon_change amino_acid_change # <- column names tsv
ID73 CAT-TAT His66Tyr # <- line tsv
ID63 GAC-AAC Asp127Asn # <- line tsv
ID31 GCC-GTC Ala307Val # <- line tsv
NEDD4L.txt
NEDD4L ### <- filename2
Splicing : 1 mutation # <- header spaces
accession splicing_mutation # <- column names tsv
ID51 IVS1 as G-A -16229 # <- line tsv
Gross deletions : 1 mutation # <- header spaces
accession DNA_level description HGVS_(nucleotide) HGVS_(protein) # <- column names tsv
ID853 gDNA 4.5 Mb incl. entire gene Not yet available Not yet available # <- line tsv
OPHN1.txt
OPHN1 ### <- filename3
Small insertions : 3 mutations # <- header spaces
accession insertion HGVS_(nucleotide) # <- column names tsv
ID96 TTATGTT(^183)TATtCAAATCCAGG c.549dupT p.(Gln184Serfs*23) # <- line tsv
ID25 GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT c.931_932dupCA # <- line tsv
How can I achieve the desired output using awk, Python, or any other Linux command?
Thanks in advance.
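So here is the solution I came up with. It first opens the file you want to split. It then reads the first line, which is the name of the first output file. Skip over the outer while loop for a moment: the code opens a new file whose name is the line just read (strip() is needed to remove the trailing newline character), then reads lines and writes them to the new file until it reaches a line that contains no spaces or tabs, which is the name of the next file. The outer while loop repeats this process until the input file has no more lines to read.
Hope that helps :)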
with open("file.txt", "r") as file:
    # the first line is the name of the first output file
    new_filename = file.readline()
    while new_filename:
        with open(new_filename.strip() + ".txt", "w") as new_file:
            new_file.write(new_filename)
            line = file.readline()
            while " " in line or "\t" in line:
                # still the same new file
                new_file.write(line)
                line = file.readline()
        # file ended so read in line was the filename of the next file
        new_filename = line
awk 'NF==1{filename=$0 ".txt"};{print > filename}' file.txt
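For comparison, here is a rough Python sketch of the same idea as the awk one-liner: a line with exactly one whitespace-separated field (awk's NF==1) is treated as a new file name, and every line is written to whichever file is currently selected. The function name split_by_header and the out_dir parameter are made up for illustration and do not come from either answer. It also assumes the real file has only the gene name on each header line, without the "### <- filename" annotations shown in the question.

```python
import os


def split_by_header(path, out_dir="."):
    """Split `path` into one file per single-field header line.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the list of file paths written, in order.
    """
    written = []
    out = None
    try:
        with open(path, "r") as src:
            for line in src:
                # Like awk's NF==1: exactly one whitespace-separated field.
                if len(line.split()) == 1:
                    if out is not None:
                        out.close()
                    target = os.path.join(out_dir, line.strip() + ".txt")
                    out = open(target, "w")
                    written.append(target)
                # Lines before the first header have no target file; skip them.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_by_header("file.txt") would write PLXNA3.txt, NEDD4L.txt and OPHN1.txt into the current directory under the assumption above. Unlike the awk version, it closes each output file before opening the next one, so the number of simultaneously open files stays at one.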