I am trying to replace certain patterns with other patterns across multiple files. For example, my infile looks like this:
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC
I want the output to be:
>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC
I have two list files. My old patterns:
Genus_species_SRR13259292
and my new patterns:
Genus_species_Something_something
I tried to do this with sed. Here is my command:
while IFS= read -r line1 && IFS= read -r line2 <&3; do
    for f in *.fasta; do
        sed -e "s/${line1}/${line2}/g" "$f" > "${f%.fasta}_NewName.fasta"
    done
done < "List_oldpattern.txt" 3<"List_newpatterns.txt"
But this does not work. Maybe because > and | delimit the pattern?
If sed cannot do this, could awk be used instead?
Thanks for any suggestions.
Since the question is tagged awk, I suggest replacing all of OP's current code with a single awk script...
My sample .fasta files:
$ head f?.fasta
==> f1.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
==> f2.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
We will use the paste command to join each of OP's old/new pattern pairs onto a single line, with | as the delimiter:
$ paste -d'|' List_oldpattern.txt List_newpatterns.txt
Genus_species_SRR13259292|Genus_species_Something_something
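One thing worth checking before relying on the paste output: paste does not complain when the two lists differ in length, it simply leaves the missing side empty. A minimal sketch of a guard, assuming one name per line in each list; check_pairs is a hypothetical helper name, not something from OP's setup:

```shell
# check_pairs OLD NEW  (hypothetical helper, not part of OP's code)
# Returns 0 only when both list files have the same number of lines,
# so that paste pairs every old name with exactly one new name.
check_pairs() {
    # awk counts records, so a last line without a trailing newline is included
    old_count=$(awk 'END { print NR }' "$1") || return 2
    new_count=$(awk 'END { print NR }' "$2") || return 2
    if [ "$old_count" -ne "$new_count" ]; then
        echo "mismatch: $1 has $old_count lines, $2 has $new_count" >&2
        return 1
    fi
}
```

It could be called as check_pairs List_oldpattern.txt List_newpatterns.txt before running anything that consumes the pairs.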
Now the awk script:
awk '
BEGIN   { FS = OFS = "|" }                   # input/output field delimiter
FNR==NR { map[">" $1] = ">" $2; next }       # 1st file (paste output): populate map[]; $1==old, $2==new; then skip to next input line
FNR==1  { close(outf)                        # 2nd-nth files: 1st record; close previous output file
          outf = FILENAME                    # make a copy of the input FILENAME
          sub(/\.fasta$/, "", outf)          # strip trailing ".fasta" (dot escaped, match anchored at end of name)
          outf = outf "_NewName.fasta"       # append new suffix to the output filename
        }
$1 in map { $1 = map[$1] }                   # if 1st field (">some_string") is an index in map[] then replace 1st field with the array contents
        { print > outf }                     # print current line to output file
' <(paste -d'|' List_oldpattern.txt List_newpatterns.txt) *.fasta
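NOTE: assuming OP has multiple old/new pattern pairs, this script has the added benefit of scanning each *.fasta file only once, whereas OP's current while/read/for/sed loop scans each .fasta file N times, where N is the number of old/new pattern pairs.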
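If staying with sed is preferred, the same scan-each-file-once idea can probably be kept by generating one sed script from the pairs. This is only a sketch and not part of the awk approach: rename_with_sed is a made-up helper name, and it assumes the names contain no tabs and no characters that are special to sed (such as /, &, \, ., [ or *):

```shell
# rename_with_sed OLD NEW FILE...  (hypothetical helper)
# Builds one sed script holding every old/new pair, then runs sed once per file.
rename_with_sed() {
    old_list=$1
    new_list=$2
    shift 2
    # paste joins the lists with a tab; awk turns each pair into a substitution
    # that only fires on a header line where the old name is followed by "|"
    # ("|" is a literal character in a basic regular expression).
    script=$(paste "$old_list" "$new_list" |
             awk -F'\t' '{ printf "s/^>%s|/>%s|/\n", $1, $2 }')
    for f in "$@"; do
        sed -e "$script" "$f" > "${f%.fasta}_NewName.fasta"
    done
}
```

A call would look like rename_with_sed List_oldpattern.txt List_newpatterns.txt *.fasta. Note that on a second run the *.fasta glob would also pick up the _NewName.fasta files from the first run.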
This generates:
$ head *_NewName.fasta
==> f1_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
==> f2_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....