fastq文件中具有超过8个相同连续核苷酸的过滤序列?

问题描述 投票:3回答:3

我想过滤我的[[fastq文件中具有超过8个相同连续核苷酸的序列,例如"GGGGGGGG""CCCCCCCC"等。

我应该怎么做?
bioinformatics fastq
3个回答
1
投票
    快速而错误的方法,可能足够接近:grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq。这会错过将8-mer分为两行的块(例如,第一行以AAAA结尾,第二行以AAAA开始)。它还假定输出具有每个4行的块。
  1. [正确的方法:编写一个小程序(用Python或您选择的语言),该程序可以缓冲一个FASTQ块(例如4行),并检查前一个(缓冲的)块序列和当前块的序列是否串联在一起没有上述的8聚体。如果是这种情况,则输出缓冲的块。

1
投票
我最终在R中使用了以下代码并解决了我的问题。

library(ShortRead) fq <- FastqFile("/Users/path/to/file") reads_fq <- readFastq(fq) trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC", sread(reads_fq), invert = TRUE)] writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)


0
投票
您可以为其使用Python包

biotite(https://www.biotite-python.org)。

假设您具有以下FASTQ文件:

@Read:01 CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC + .+.+:'-FEAC-4'4CA-3-5#/4+?*G@?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7@+6!FDC& +4=4+,@2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2 @Read:02 AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC + 6%9,@'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?@2A"F.$1)%'"CB?5$<.F/$ 7055E>#+/650B6H<8+A%$!A=0>?'@",8:#5%18&+3>'8:28+:5F0);E9<=,+

这是一个脚本,应该可以完成工作:

import biotite.sequence.io.fastq as fastq import biotite.sequence as seq # 'GGGGGGGG', 'CCCCCCCC', etc. consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"] fastq_file = fastq.FastqFile("Sanger") fastq_file.read("example.fastq") # Iterate over sequence entries in file for header in fastq_file: sequence = fastq_file.get_sequence(header) # Iterative over each of the consecutive sequences for consecutive_nuc in consecutive_nucs: # Find all indices, where a match was found matches = seq.find_subsequence(sequence, consecutive_nuc) if len(matches) > 0: # If any match was found report it print( f"Found '{consecutive_nuc}' " f"in sequence '{header}' at position {matches[0]}" )

这是输出:

Found 'CCCCCCCC' in sequence 'Read:01' at pos 8

© www.soinside.com 2019 - 2024. All rights reserved.