Searching for specific repeats with Python

Question  Votes: 1  Answers: 2

Example of the input file:

1  AAcgGGGGGGtacctgt    yes
2  TTcccccctgtAAcgta   no
3  tcgAAAAaatacgacc     no
4  AAcgtataatacctgt   no
...

I want to write a program that scans each sequence and checks it for mononucleotide repeats (MNRs).

Example output:

1,AAcgGGGGGGtacctgt,yes
2,TTcccccctgtAAcgta,no

Definition: a mononucleotide repeat is a run of a single base (A, T, C, or G) repeated consecutively, ignoring case.

I am looking for things like: AAAAaaAAgtcgtAAAAAAAAAAcaaaaaaAAAaaaaaaaaaacccccccccccCCCCCcccCCC or ...

I tried this regular expression, but it does not work:

import csv
import re
list=[]
with open('sequences.txt', 'r') as f:
    reader = csv.reader(f,delimiter="\t")
    seq=re.findall(r'[Aa]{6, }','sequences.txt')
    for line in reader:
        if line.__contains__(seq):
            print(list.append(line))

Any help is appreciated.

python regex bioinformatics dna-sequence
2 Answers
0
votes

Here is a compact solution that does what you want:

import csv
with open('sequences.txt', 'r') as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        seq_lower = line[1].lower()
        if 'aaaaaa' in seq_lower or 'cccccc' in seq_lower or 'tttttt' in seq_lower or 'gggggg' in seq_lower:
            print(line)

Here I assume you are only interested in MNRs of a, c, g, and t, since you are working with DNA sequences.
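
For reference, the regex route from the question can also be made to work. In Python's `re` module a quantifier written with a space, `{6, }`, is treated as literal text rather than as "six or more", and the question's `re.findall` is applied to the file name string instead of the file contents. Below is a minimal sketch, assuming the whitespace-separated layout of the question's sample. `MNR_PATTERN`, `find_mnrs` and `scan_lines` are illustrative names, and the threshold of six is taken from the question's `{6, }`:

```python
import re

# One captured base followed by at least five more copies of that same base.
# The threshold (6 in total) is an assumption taken from the question's regex.
MNR_PATTERN = re.compile(r"([acgt])\1{5,}", re.IGNORECASE)


def find_mnrs(sequence, pattern=MNR_PATTERN):
    """Return every mononucleotide run matched in `sequence`, as strings."""
    return [match.group(0) for match in pattern.finditer(sequence)]


def scan_lines(lines):
    """Yield (number, sequence, status) for lines whose sequence has an MNR.

    Assumes three whitespace-separated fields per line; shorter lines
    (such as a literal "...") are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        number, sequence, status = fields[:3]
        if MNR_PATTERN.search(sequence):
            yield number, sequence, status


if __name__ == "__main__":
    # Sample lines copied from the question.
    sample = [
        "1  AAcgGGGGGGtacctgt    yes",
        "2  TTcccccctgtAAcgta   no",
        "3  tcgAAAAaatacgacc     no",
        "4  AAcgtataatacctgt   no",
        "...",
    ]
    for row in scan_lines(sample):
        print(",".join(row))
```

The backreference `\1` requires the run to consist of one and the same base, and `re.IGNORECASE` lets a mixed-case run such as `AAAAaa` count as a single run.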


0
votes

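Update: a partial solution using regular expressions has since been proposed. Note that the solution below does not use regular expressions; instead it looks for any run of any single character that is six or more characters long.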

Test data:

number,sequence,status
1,kjhfklashfkldflkhasdfl,0
2,aaaaaljgkldfkjgldkfjgfldj,0
3,bbbbbbjigdfsjgjg,0
4,ccCccCCcjjfijsdfjsdf,0
5,klsjdflsjdfhdddddjnjlkhngjk,0
6,kjkljfhnlasjkdfheeeeeeejjjeeeeeeeeeekjdkljfleeef,0
7,jhfshffFffFFFFffkljjjj908u89,0

Code to find MNRs of length 6 or more:

import csv

def contains_mnr(sequence):
    start_char = "$" # choose a character that is sure not to be in the sequence
    count = 0
    seq_lower = sequence.lower()

    for pos in range(0, len(seq_lower)):
        if seq_lower[pos] == start_char:
            count += 1
        else:
            start_char = seq_lower[pos]
            count = 1
        if count >= 6:
            return True

    return False

# newline="" lets the csv module handle line endings itself
with open("input.csv", "r", newline="") as input_file:
    with open("output.csv", "w", newline="") as output_file:
        reader = csv.DictReader(input_file, dialect=csv.unix_dialect())
        writer = csv.writer(output_file, dialect=csv.unix_dialect())
        writer.writerow(reader.fieldnames)

        for row in reader:
            if contains_mnr(row["sequence"]):
                writer.writerow([
                    row["number"],
                    row["sequence"],
                    row["status"]
                ])
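
The same run-length idea can also be written with `itertools.groupby`, which groups consecutive equal characters. This is only a sketch of an alternative, not a replacement for the loop above. `longest_runs` and `contains_run` are hypothetical helper names, and the default threshold of 6 mirrors the code above:

```python
from itertools import groupby


def longest_runs(sequence):
    """Map each (lowercased) character to the length of its longest run."""
    longest = {}
    for char, group in groupby(sequence.lower()):
        run_length = sum(1 for _ in group)
        if run_length > longest.get(char, 0):
            longest[char] = run_length
    return longest


def contains_run(sequence, min_length=6):
    """Return True if any single character repeats min_length times in a row."""
    return any(n >= min_length for n in longest_runs(sequence).values())


if __name__ == "__main__":
    for seq in ("bbbbbbjigdfsjgjg", "aaaaaljgkldfkjgldkfjgfldj"):
        print(seq, contains_run(seq), longest_runs(seq))
```

Compared with the early-return loop, this version always walks the whole sequence, but it also reports how long each run is, which can help when choosing a threshold.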

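Note that the CSV dialect may have to be adjusted to match the system on which the code runs and the system that produced the data file.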
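
If the delimiter is not known in advance, one option is to let the standard library's `csv.Sniffer` guess it from a sample of the data. The sketch below assumes the file has a header row and uses one of a few common delimiters. `open_reader` and its `candidates` parameter are hypothetical names, and the sniffer is a heuristic that can guess wrong on unusual data:

```python
import csv
import io


def open_reader(text, candidates=",\t;"):
    """Return a DictReader over `text`, guessing the dialect from a sample.

    Falls back to the default "excel" (comma-separated) dialect if the
    sniffer cannot decide.
    """
    sample = text[:1024]
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        dialect = csv.excel
    return csv.DictReader(io.StringIO(text), dialect=dialect)


if __name__ == "__main__":
    # Tab-separated variant of the test data, made up for illustration.
    data = "number\tsequence\tstatus\n1\tacgtacgt\t0\n2\tgggggggt\t0\n"
    for row in open_reader(data):
        print(row["number"], row["sequence"], row["status"])
```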

Output of the contains_mnr script for the test data above:

"number","sequence","status"
"3","bbbbbbjigdfsjgjg","0"
"4","ccCccCCcjjfijsdfjsdf","0"
"6","kjkljfhnlasjkdfheeeeeeejjjeeeeeeeeeekjdkljfleeef","0"
"7","jhfshffFffFFFFffkljjjj908u89","0"