将重复的序列删除到fasta文件(大文件)中

问题描述 投票:0回答:1

您好,我有一个巨大的fasta文件,例如:

  >sequence1_CP [seq  virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

    >sequence2 [virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

    >sequence3
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

    >sequence4_CP hypothetical protein [another virus]
    MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
    IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

    >sequence5 hypothetical protein [another virus]
    MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
    IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

    >sequence6 |hypothetical protein[virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
    ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

    >sequence7 |hypothetical protein[virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
    ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence8 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL

>sequence9 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL

并且我正在寻找一种方法来删除重复的序列:

例如,sequence1_CP,sequence2,sequence3,sequence6和sequence7具有完全相同的序列,那么我只想保留一个。序列4_CP和序列5或序列6和序列7或序列8和9相同。

文件中的序列号为:2196136所以我需要一种快速的方法...

这里我应该参加示例:

>sequence1_CP [seq  virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK

>sequence8 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL

您好,我有一个巨大的Fasta文件,例如:> sequence1_CP [seq病毒] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE> ...

bash awk fasta
1个回答
0
投票
© www.soinside.com 2019 - 2024. All rights reserved.