对具有 DNA 序列的 fasta 文件进行聚类,以找到最不匹配的克隆

问题描述 投票:0回答:1

我正在尝试创建一个集群来分析 DNA 序列并找到其中不太匹配的模式(例如 <25% match). Is it possible to perform cluster analysis (k-means or any other approach) for the DNA in a fasta file.

例如,我想为给定的数据生成一个聚类图(图像)(下面是 fasta 序列)并找到不太相似的(<25%) sequence (encircled dots)

>1
 GATGTACTTCGTTCAGTTACATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCCTT
 >2
 CAGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAATTATTATTCGCAATTCCTTTAGTTTGTTC
 >3
 CGGTATTACTTCGTTCAGTTACGTATTATGCTCGAAAGGAATTTCTATTGAAAGGTATTGCAATTCCTTTAGTTGTTCCTTTCTAT
 >4
 CGGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTC
 >5
 GATGTACTTCGTTCCAGTTGTGTGTGCTGCTCAAGGAGATTTTTCAACGTGAAAAAATTATTATTCGCAATTCAACTTTGAATTTG
 >6
 CAATGTACTTCGTTCGGTTACGTATTGCTGCTCAAGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC
 >7
 CAAACACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC 

python data-science cluster-analysis biopython
1个回答
0
投票
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Function to convert DNA sequence to one-hot encoding
def one_hot_encode(sequence):
    mapping = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1]}
    return [mapping[base] for base in sequence]

# DNA sequences in fasta format
sequences = {
    ">1": "GATGTACTTCGTTCAGTTACATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCCTT",
    ">2": "CAGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAATTATTATTCGCAATTCCTTTAGTTTGTTC",
    ">3": "CGGTATTACTTCGTTCAGTTACGTATTATGCTCGAAAGGAATTTCTATTGAAAGGTATTGCAATTCCTTTAGTTGTTCCTTTCTAT",
    ">4": "CGGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTC",
    ">5": "GATGTACTTCGTTCCAGTTGTGTGTGCTGCTCAAGGAGATTTTTCAACGTGAAAAAATTATTATTCGCAATTCAACTTTGAATTTG",
    ">6": "CAATGTACTTCGTTCGGTTACGTATTGCTGCTCAAGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC",
    ">7": "CAAACACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC"
}

# Assign different colors to sequences
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown', 'pink']

# One-hot encode sequences
sequences_matrix = np.array([np.concatenate(one_hot_encode(seq)) for seq in sequences.values()])

# Perform PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(sequences_matrix)

# Plot PCA results with different colors
plt.figure(figsize=(10, 6))
for i, seq_key in enumerate(sequences.keys()):
    plt.scatter(pca_result[i, 0], pca_result[i, 1], color=colors[i], label=seq_key)

plt.title('PCA of DNA Sequences')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

© www.soinside.com 2019 - 2024. All rights reserved.