How to count the number of letters in each section

Question   Votes: 0   Answers: 3

I have data like this

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL

I want to get the number of K (and R) in each section, so the output I want looks like this

         K    R
Q96A73   7    11   
P13674   17   13
Q7Z4N8   11   11
P04637   2    4  

I have been trying to use this

cat mydata.txt | grep -v '^>' | grep -i -e [k] |wc -l

For example, if we also look at KK and RR, the output would be:

         K    R    KK   RR
Q96A73   7    11   0    0
P13674   17   13   1    2
Q7Z4N8   11   11   1    0
P04637   2    4    0    0

awk sed grep
3 Answers
1
vote

Could you please try the following?

awk -F'|' '/^>/{val=$2;next} {print val,gsub(/[kK]/,""),gsub(/[rR]/,"")}' Input_file


If you want a header line in the output, try the following.

awk -F'|' 'BEGIN{print "       K R"}/^>/{val=$2;next} {print val,gsub(/[kK]/,""),gsub(/[rR]/,"")}'  Input_file
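
A short note on why this works: awk's gsub returns the number of substitutions it made, so it can double as a counter. The sketch below is a variation rather than the answer's exact command: it replaces each match with "&" (the matched text itself), so the line is left unchanged, which matters if several patterns are counted on the same line and removing one match could otherwise create a spurious match for the next. The function name count_kr and the TEST01 record are made up for illustration.

```shell
# Hypothetical helper; the awk body follows the same idea as the answer above.
count_kr() {
  awk -F'|' '
    /^>/ { id = $2; next }     # header line: remember the accession (2nd |-field)
    {
      k = gsub(/[kK]/, "&")    # "&" puts the match back, so $0 is not altered
      r = gsub(/[rR]/, "&")
      print id, k, r
    }' "$@"
}

# Made-up record, not taken from the question's data
printf '>sp|TEST01|DEMO\nAKRKKR\n' | count_kr
```

With no file argument the function reads standard input, so it can be used as count_kr Input_file or at the end of a pipeline.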


EDIT1: As per the OP's comment, if we want to count 2 consecutive occurrences of K or R (KK or RR), try the following.

awk -F'|' '/^>/{val=$2;next} {print val,gsub(/kk|KK/,""),gsub(/rr|RR/,"")}' Input_file

EDIT2: To get the k, kk, r and rr counts together, use the following.

awk -F'|' '/^>/{val=$2;next} {line=$0;print val,gsub(/[kK]/,""),gsub(/[rR]/,""),gsub(/kk|KK/,"",line),gsub(/rr|RR/,"",line)}' Input_file

With headers:

awk -F'|' '
BEGIN{
  print "       k/K\tr/R\tkk/KK\trr/RR"
}
/^>/{
  val=$2
  next
}
{
  line=$0
  print val,gsub(/[kK]/,""),gsub(/[rR]/,""),gsub(/kk|KK/,"",line),gsub(/rr|RR/,"",line)
}' OFS="\t"   Input_file

The output will be as follows.

       k/K      r/R     kk/KK   rr/RR
Q96A73  7       11      0       0
P13674  17      13      1       2
Q7Z4N8  11      11      0       1
P04637  2       4       0       0

2
votes

Using Perl:

perl -F"\|" -lane ' BEGIN{print "ID   K R"} if(/^>/) { $x=$F[1]; next } %kv=(K=>0,R=>0); s/(K|R)/$kv{$1}++/ge; print "$x $kv{K} $kv{R}" '

Header lines are skipped before counting, so letters in the description (for example the K in GN=KIAA1191) are not added to the sequence totals.

With the input:

$ cat KR.txt
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL

$ perl -F"\|" -lane ' BEGIN{print "ID   K R"} if(/^>/) { $x=$F[1]; next } %kv=(K=>0,R=>0); s/(K|R)/$kv{$1}++/ge; print "$x $kv{K} $kv{R}" ' KR.txt
ID   K R
Q96A73 7 11
P13674 17 13
Q7Z4N8 11 11
P04637 2 4

The OP has updated the question; please check now.
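
The one-liners in this thread assume that each sequence sits on a single line, as it does in the sample data. FASTA files are often wrapped across several lines per record, in which case per-line printing would emit one row per line instead of one per protein. A minimal sketch of one way to handle that, assuming wrapped input (which the question does not show), is to accumulate the counts and print them when the next header or the end of the file is reached. The names count_kr_wrapped and emit, and the TEST01/TEST02 records, are hypothetical.

```shell
# Hypothetical helper: sums K and R over all lines of a record.
count_kr_wrapped() {
  awk -F'|' '
    function emit() { if (id != "") print id, k, r }   # print the finished record
    /^>/ { emit(); id = $2; k = 0; r = 0; next }        # new header: flush, then reset
    { k += gsub(/[kK]/, "&"); r += gsub(/[rR]/, "&") }  # count without altering the line
    END { emit() }                                      # flush the last record
  ' "$@"
}

# Made-up wrapped records, not taken from the question's data
printf '>sp|TEST01|DEMO\nAKRK\nKR\n>sp|TEST02|DEMO\nRRR\n' | count_kr_wrapped
```

This only covers single letters. A pair such as KK that straddles a line break would not be seen by a per-line count, so for KK and RR the lines of a record would need to be joined before counting.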