sed, awk, perl)

Question  Votes: 0  Answers: 1

Thanks for viewing this post. I will try to be clear and comprehensive in return!

Here is the situation:

  • Hundreds of .gz archives, each roughly a GB in size

  • A list of wanted data, given as identifiers. Each identifier is associated with the name of the single archive in which its data can be found.

Data structure of a .gz archive:

zcat archive.gz

    ...
    identifier_nth
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    ...
    END_BLOCK
    ...
    ...
    identifier_nth
    ...
    END_BLOCK
    ...
    ...
    ...
    identifier_1
    ...
    END_BLOCK
    ...
    identifier_nth
    ...
    END_BLOCK

I currently do:

start=$(echo "$wanted_identifier_of_list") # I cat | while read through a list of thousands identifiers for the process (here $wanted_identifier_of_list = identifier_1)
end=$(echo "END_BLOCK")

zcat nth_archive.gz | sed -n "/${start}/,/${end}/p" > ${start}.dat

It works, but it is slow and it extracts too many blocks for each identifier. I only need a fraction of them, from the first to the Nth occurrence.

So I would like to:

1) limit the number of blocks I retrieve to an arbitrary number (here N = 2, for example)

2) quit both zcat and sed once that number is reached

Any help would be greatly appreciated.

Many thanks.

Florian

awk sed gzip extract large-files
1 Answer
0
votes

Something like this should work, with an early exit. Untested, though.

$ zcat ... | awk -v start="identifier_1" -v end="END_BLOCK" -v n=2 '
                     !f && $0~start{f=1} f; f && $0~end{f=0; if(++c>=n) exit}'
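
The same idea can be wrapped in a small function; this is only a sketch, not a tested solution for the real archives. The name `first_blocks` and its argument order are made up for illustration. It assumes, as in the schematic data in the question, that a block opens on a line containing the identifier and closes on the next `END_BLOCK` line. It uses `index()` for a literal substring test rather than `$0 ~ start`, so regex metacharacters in an identifier are not interpreted. When awk exits, zcat normally receives SIGPIPE on its next write and stops too, so the rest of the archive is not decompressed.

```shell
# Hypothetical helper: print the first N blocks of one identifier.
# Usage sketch: zcat nth_archive.gz | first_blocks identifier_1 2 > identifier_1.dat
first_blocks() {
    # $1 = identifier, $2 = max number of blocks, $3 = end marker (default END_BLOCK)
    awk -v start="$1" -v n="$2" -v end="${3:-END_BLOCK}" '
        !inblk && index($0, start) { inblk = 1 }   # block opens on the identifier line
        inblk                                      # print every line of an open block
        inblk && index($0, end) {                  # block closes on the end marker
            inblk = 0
            if (++count >= n) exit                 # stop reading once N blocks are printed
        }
    '
}
```

Note that `index()` is a substring test, so `identifier_1` would also match a line containing `identifier_10`; comparing a whole field (for example `$1 == start`) is stricter if the real identifier lines allow it.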

-1
votes

Some more input below: I am using "##########                 Name:     ZINC000005215379" as the start and "##########                 Name:" as the current stop.

...
##########                 Name:     ZINC000005215379
...

@<TRIPOS>MOLECULE
 ZINC000005215379      none
   58    62     1     0     0
...
@<TRIPOS>ATOM
      1 C1         -1.3168    -6.3293    -6.1200 C.3        1  LIG1  -0.1600
      2 C2         -0.1404    -5.3624    -5.9715 C.3        1  LIG1   0.0700
...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
##########                 Name:     ZINC000005215379
...

@<TRIPOS>MOLECULE
 ZINC000005215379      none
   58    62     1     0     0
...
@<TRIPOS>ATOM
      1 C1         -1.3168    -6.3293    -6.1200 C.3        1  LIG1  -0.1600
      2 C2         -0.1404    -5.3624    -5.9715 C.3        1  LIG1   0.0700
...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
##########                 Name:     ZINC000004473749
...

@<TRIPOS>MOLECULE
 ZINC000004473749      none
...
@<TRIPOS>ATOM

...
@<TRIPOS>BOND
     1    1    2 1
     2    1   41 1
...
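
With this input the start line itself matches the stop pattern, so a start/end range cannot tell the two apart. One possible sketch for that case: treat every `Name:` header as the start of a new record, compare its last field with the wanted ID, and exit at the first header seen after N records have been printed. `first_records` is a hypothetical name, and the code assumes that headers begin with `#` characters followed by `Name:` and that the ID is the last whitespace-separated field on that line.

```shell
# Hypothetical helper: print the first N records whose header names the given ID.
# Usage sketch: zcat nth_archive.gz | first_records ZINC000005215379 2 > ZINC000005215379.dat
first_records() {
    # $1 = wanted ID, $2 = max number of records
    awk -v id="$1" -v n="$2" '
        /^#+[ \t]+Name:/ {               # every header line starts a new record
            if (count >= n) exit         # N records already printed: stop reading
            keep = ($NF == id)           # exact match on the last field of the header
            if (keep) count++
        }
        keep                             # print lines of wanted records only
    '
}
```

Because the comparison is on the whole last field, `ZINC000005215379` does not match a longer ID that merely contains it. The function stops at the first header after the Nth record, so the Nth record is printed in full before awk exits.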