我正在尝试创建一个集群来分析 DNA 序列并找到其中不太匹配的模式(例如 <25% match). Is it possible to perform cluster analysis (k-means or any other approach) for the DNA in a fasta file.
例如,我想为给定的数据生成一个聚类图(图像)(下面是 fasta 序列)并找到不太相似的(<25%) sequence (encircled dots)
>1
GATGTACTTCGTTCAGTTACATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCCTT
>2
CAGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAATTATTATTCGCAATTCCTTTAGTTTGTTC
>3
CGGTATTACTTCGTTCAGTTACGTATTATGCTCGAAAGGAATTTCTATTGAAAGGTATTGCAATTCCTTTAGTTGTTCCTTTCTAT
>4
CGGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTC
>5
GATGTACTTCGTTCCAGTTGTGTGTGCTGCTCAAGGAGATTTTTCAACGTGAAAAAATTATTATTCGCAATTCAACTTTGAATTTG
>6
CAATGTACTTCGTTCGGTTACGTATTGCTGCTCAAGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC
>7
CAAACACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Function to convert DNA sequence to one-hot encoding
def one_hot_encode(sequence):
mapping = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1]}
return [mapping[base] for base in sequence]
# DNA sequences in fasta format
sequences = {
">1": "GATGTACTTCGTTCAGTTACATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCCTT",
">2": "CAGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAATTATTATTCGCAATTCCTTTAGTTTGTTC",
">3": "CGGTATTACTTCGTTCAGTTACGTATTATGCTCGAAAGGAATTTCTATTGAAAGGTATTGCAATTCCTTTAGTTGTTCCTTTCTAT",
">4": "CGGTGTACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTC",
">5": "GATGTACTTCGTTCCAGTTGTGTGTGCTGCTCAAGGAGATTTTTCAACGTGAAAAAATTATTATTCGCAATTCAACTTTGAATTTG",
">6": "CAATGTACTTCGTTCGGTTACGTATTGCTGCTCAAGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC",
">7": "CAAACACTTCGTTCAGTTACGTATTGCTGCTCAAGGAGATTTTCAACGTGAAAAAAATTATTATTCGCAATTCCTTTAGTTGTTCC"
}
# Assign different colors to sequences
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown', 'pink']
# One-hot encode sequences
sequences_matrix = np.array([np.concatenate(one_hot_encode(seq)) for seq in sequences.values()])
# Perform PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(sequences_matrix)
# Plot PCA results with different colors
plt.figure(figsize=(10, 6))
for i, seq_key in enumerate(sequences.keys()):
plt.scatter(pca_result[i, 0], pca_result[i, 1], color=colors[i], label=seq_key)
plt.title('PCA of DNA Sequences')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()