How can I split a file that contains filenames and their data into separate files?


I have a file.txt that looks like this (I removed several lines to simplify the example):

PLXNA3                                                                                     ### <- filename1
Missense/nonsense : 13 mutations                                                           # <- header spaces
accession   codon_change    amino_acid_change                                              # <- column names tsv
ID73        CAT-TAT         His66Tyr                                                       # <- line tsv
ID63        GAC-AAC         Asp127Asn                                                      # <- line tsv
ID31        GCC-GTC         Ala307Val                                                      # <- line tsv
NEDD4L                                                                                     ### <- filename2
Splicing : 1 mutation                                                                      # <- header spaces
accession      splicing_mutation                                                           # <- column names tsv
ID51           IVS1 as G-A -16229                                                          # <-  line tsv
Gross deletions : 1 mutation                                                               # <- header spaces
accession   DNA_level   description                 HGVS_(nucleotide)   HGVS_(protein)     # <- column names tsv
ID853       gDNA        4.5 Mb incl. entire gene    Not yet available   Not yet available  # <- line tsv
OPHN1                                                                                      ### <- filename3
Small insertions : 3 mutations                                                             # <- header spaces
accession         insertion                            HGVS_(nucleotide)                   # <- column names tsv
ID96          TTATGTT(^183)TATtCAAATCCAGG c.549dupT    p.(Gln184Serfs*23)                  # <- line tsv
ID25          GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT         c.931_932dupCA                      # <- line tsv

I would like to split this file into 3 separate files:

PLXNA3.txt

PLXNA3                                                                                     ### <- filename1
Missense/nonsense : 13 mutations                                                           # <- header spaces
accession   codon_change    amino_acid_change                                              # <- column names tsv
ID73        CAT-TAT         His66Tyr                                                       # <- line tsv
ID63        GAC-AAC         Asp127Asn                                                      # <- line tsv
ID31        GCC-GTC         Ala307Val                                                      # <- line tsv

NEDD4L.txt

NEDD4L                                                                                     ### <- filename2
Splicing : 1 mutation                                                                      # <- header spaces
accession      splicing_mutation                                                           # <- column names tsv
ID51           IVS1 as G-A -16229                                                          # <-  line tsv
Gross deletions : 1 mutation                                                               # <- header spaces
accession   DNA_level   description                 HGVS_(nucleotide)   HGVS_(protein)     # <- column names tsv
ID853       gDNA        4.5 Mb incl. entire gene    Not yet available   Not yet available  # <- line tsv

OPHN1.txt

OPHN1                                                                                      ### <- filename3
Small insertions : 3 mutations                                                             # <- header spaces
accession         insertion                            HGVS_(nucleotide)                   # <- column names tsv
ID96          TTATGTT(^183)TATtCAAATCCAGG c.549dupT    p.(Gln184Serfs*23)                  # <- line tsv
ID25          GTGCT(^310)AAGCAcaG_EI_GTCAGTTCT         c.931_932dupCA                      # <- line tsv

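How can I get the desired output using awk, Python, or any other Linux command?

Notes:

  • Filenames contain no spaces or tabs, but they can contain -
  • Headers contain spaces.
  • Data lines are tab-separated.
  • The real delimiter should be the filename, because one file can have several headers.

Thanks in advance.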

python regex file awk split
1 answer
0 votes

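Here is the solution I came up with. It first opens the file you want to split and reads the first line, which is the name of the first output file. Leaving the outer while loop aside for a moment: it opens a new file named after the line it just read (strip() is needed to remove the trailing newline character). It then reads lines and writes them to the new file until it reaches a line with no spaces or tabs, which is the name of the next output file. The outer while loop repeats this process until there are no more lines to read.

Hope that helps :)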

# Open the input with a context manager so it is closed automatically.
with open("file.txt", "r") as file:
    # The first line is the name of the first output file.
    new_filename = file.readline()
    while new_filename:
        with open(new_filename.strip() + ".txt", "w") as new_file:
            new_file.write(new_filename)
            line = file.readline()
            # Lines containing a space or tab still belong to the current file.
            while " " in line or "\t" in line:
                new_file.write(line)
                line = file.readline()
        # The inner loop stopped at the next filename, or at "" (end of input).
        new_filename = line
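
The same idea can also be written as a single pass over the lines. The following is only a sketch, not a drop-in replacement: the function name `split_by_name` and the `out_dir` parameter are made up for illustration, and it assumes that every non-blank line without a space or tab (after stripping surrounding whitespace) is a filename. Unlike the loop above, it skips blank lines instead of treating them as filenames.

```python
from pathlib import Path


def split_by_name(source, out_dir="."):
    """Split `source` into one .txt file per name line; return the paths written.

    Hypothetical helper: a "name line" is assumed to be any non-blank line
    that contains no space or tab once surrounding whitespace is stripped.
    """
    out_dir = Path(out_dir)
    written = []
    current = None  # handle of the output file currently being filled
    try:
        with open(source, "r") as src:
            for line in src:
                stripped = line.strip()
                if not stripped:
                    continue  # ignore blank lines
                if " " not in stripped and "\t" not in stripped:
                    # Name line: close the previous output and start a new one.
                    if current is not None:
                        current.close()
                    path = out_dir / (stripped + ".txt")
                    current = open(path, "w")
                    written.append(path)
                if current is not None:
                    # The name line itself is also copied, as in the question.
                    current.write(line)
    finally:
        if current is not None:
            current.close()
    return written
```

Calling `split_by_name("file.txt")` would write PLXNA3.txt, NEDD4L.txt and OPHN1.txt into the current directory, provided the name lines in the real file carry no trailing annotations.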

0 votes
awk 'NF==1{filename=$0 ".txt"};{print > filename}' file.txt
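
In the awk one-liner above, NF==1 matches every line with exactly one whitespace-separated field, which sets the output filename; each line is then printed to the current file. For readers who prefer Python, here is a rough rendering of the same logic. It is a sketch under assumptions: the function name `split_like_awk` is invented for illustration, and input lines that appear before the first single-field line are skipped, whereas awk would typically fail on them because no filename is set yet.

```python
def split_like_awk(source):
    """Mimic: awk 'NF==1{filename=$0 ".txt"};{print > filename}' source

    Hypothetical helper; output files are written to the current directory,
    as the awk command does. Returns the filenames in order of first use.
    """
    handles = {}  # like awk, keep every output file open until the end
    filename = None
    try:
        with open(source, "r") as src:
            for line in src:
                record = line.rstrip("\n")  # awk's $0: the line without newline
                if len(record.split()) == 1:  # NF == 1
                    filename = record + ".txt"
                if filename is None:
                    continue  # no filename seen yet; skip the line
                if filename not in handles:
                    handles[filename] = open(filename, "w")
                handles[filename].write(record + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return list(handles)
```

As with the awk version, a data line that happens to contain a single field would be mistaken for a filename, so this only works when every data line has at least two fields.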