我想过滤我的[[fastq文件中具有超过8个相同连续核苷酸的序列,例如"GGGGGGGG"
,"CCCCCCCC"
等。
grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq
。这会错过将8-mer分为两行的块(例如,第一行以AAAA结尾,第二行以AAAA开始)。它还假定输出具有每个4行的块。library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC",
sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
biotite(https://www.biotite-python.org)。
假设您具有以下FASTQ文件:@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G@?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7@+6!FDC&
+4=4+,@2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,@'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?@2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'@",8:#5%18&+3>'8:28+:5F0);E9<=,+
这是一个脚本,应该可以完成工作:
import biotite.sequence.io.fastq as fastq import biotite.sequence as seq # 'GGGGGGGG', 'CCCCCCCC', etc. consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"] fastq_file = fastq.FastqFile("Sanger") fastq_file.read("example.fastq") # Iterate over sequence entries in file for header in fastq_file: sequence = fastq_file.get_sequence(header) # Iterative over each of the consecutive sequences for consecutive_nuc in consecutive_nucs: # Find all indices, where a match was found matches = seq.find_subsequence(sequence, consecutive_nuc) if len(matches) > 0: # If any match was found report it print( f"Found '{consecutive_nuc}' " f"in sequence '{header}' at position {matches[0]}" )
这是输出:
Found 'CCCCCCCC' in sequence 'Read:01' at pos 8