我有一个包含四倍体基因型调用的 vcf 文件,并且想根据每个 snp 和样本的缺失基因型调用的比例来过滤 snps 和样本。我无法使用我尝试过的任何工具(vcftools、bcftools、plink)实现四倍体调用。 Bcftools 应该能够做到这一点,但无法正确计算丢失站点的比例。问题似乎是 vcf 文件中存在两种类型的丢失调用:一种没有信息 .:.:.:.:.:.:。一个被正确计数为缺失,另一个指示缺失基因型 ./././。其次是覆盖范围信息,该信息不计为缺失。我编写了一个自定义脚本来计算 ./././ 的出现次数。并更新每个 snp 的“NS”字段,以便可用于计算位点缺失。但是,我无法计算样本缺失情况。
我提供了一个我一直在测试故障排除的 snp 示例,因为它有很多缺失的基因型。
Scaffold_1 43629806 . CTCTC CTCTG 16620.9 . AB=0.196078;ABP=43.9277;AC=25;AF=0.0932836;AN=268;AO=54;CIGAR=4M1X;DP=728;DPB=728;DPRA=0.996137;EPP=80.8616;EPPR=18.2106;GTI=8;LEN=1;MEANALT=1.55556;MQM=57.7593;MQMR=60;NS=67;NUMALT=2;ODDS=0.391415;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=1948;QR=259;RO=7;RPL=5;RPP=80.8616;RPPR=18.2106;RPR=49;RUN=1;SAF=54;SAP=120.27;SAR=0;SRF=7;SRP=18.2106;SRR=0;TYPE=snp;technology.illumina=1 GT:DP:AD:RO:QR:AO:QA ./././.:15:0,15,0:0:0:0:0 ./././.:9:0,9,0:0:0:0:0 ./././.:4:0,4,0:0:0:0:0 ./././.:5:0,5,0:0:0:0:0 .:.:.:.:.:.:. ./././.:14:0,14,0:0:0:0:0 ./././.:7:0,7,0:0:0:0:0 ./././.:11:0,11,0:0:0:0:0 ./././1:23:0,19,3:0:0:3:111 ./././.:12:0,12,0:0:0:0:0 .:.:.:.:.:.:. ./././.:23:0,23,0:0:0:0:0 .:.:.:.:.:.:. ./././.:5:0,5,0:0:0:0:0 ./././.:10:0,10,0:0:0:0:0 ./././.:21:0,21,0:0:0:0:0 .:.:.:.:.:.:. ./././.:10:0,10,0:0:0:0:0 .:.:.:.:.:.:. ./././.:34:0,34,0:0:0:0:0 .:.:.:.:.:.:. ./././.:17:0,17,0:0:0:0:0 .:.:.:.:.:.:. ./././.:10:0,10,0:0:0:0:0 ./././.:22:0,22,0:0:0:0:0 ./././.:20:0,20,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0 ./././.:16:0,15,0:0:0:0:0 .:.:.:.:.:.:. .:.:.:.:.:.:. .:.:.:.:.:.:. .:.:.:.:.:.:. ./././.:18:0,18,0:0:0:0:0 .:.:.:.:.:.:. ./././1:4:0,3,1:0:0:1:37 ./././.:8:0,8,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0 ./././.:15:0,15,0:0:0:0:0 ./././.:11:0,11,0:0:0:0:0 .:.:.:.:.:.:. .:.:.:.:.:.:. .:.:.:.:.:.:. ./././.:9:0,9,0:0:0:0:0 ./././.:7:0,7,0:0:0:0:0 ./././.:12:0,12,0:0:0:0:0 .:.:.:.:.:.:. 1/1/1/1:8:0,0,8:0:0:8:296 ./././.:2:0,2,0:0:0:0:0 ./././.:4:0,4,0:0:0:0:0 ././1/1:11:0,6,5:0:0:5:185 ./././.:1:0,1,0:0:0:0:0 .:.:.:.:.:.:. ./././.:17:0,17,0:0:0:0:0 .:.:.:.:.:.:. .:.:.:.:.:.:. .:.:.:.:.:.:. ./././.:13:0,13,0:0:0:0:0 ./././.:15:0,15,0:0:0:0:0 .:.:.:.:.:.:. ./././.:14:0,14,0:0:0:0:0 .:.:.:.:.:.:. ./././0:29:7,22,0:7:259:0:0 .:.:.:.:.:.:. ./././.:8:0,8,0:0:0:0:0 ./././.:12:0,12,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0 ./././.:3:0,3,0:0:0:0:0 ./././.:15:0,15,0:0:0:0:0 ./././.:9:0,9,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 .:.:.:.:.:.:. ./././.:8:0,8,0:0:0:0:0 ./././.:9:0,9,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 ./././.:6:0,6,0:0:0:0:0 1/1/1/1:7:0,0,7:0:0:7:259 ./././.:14:0,14,0:0:0:0:0 .:.:.:.:.:.:. ./././.:1:0,1,0:0:0:0:0 ./././.:6:0,6,0:0:0:0:0 .:.:.:.:.:.:. ./././.:2:0,2,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 1/1/1/1:13:0,0,13:0:0:13:443 .:.:.:.:.:.:. ./././1:13:0,12,1:0:0:1:37 ./././.:19:0,19,0:0:0:0:0 1/1/1/1:1:0,0,1:0:0:1:37 ./././.:9:0,9,0:0:0:0:0 .:.:.:.:.:.:. .:.:.:.:.:.:. 1/1/1/1:15:0,0,15:0:0:15:543 ./././.:29:0,29,0:0:0:0:0 ./././.:4:0,4,0:0:0:0:0 ./././.:11:0,11,0:0:0:0:0
我已确认“NS”字段仅计算 .:.:.:.:.:.:.缺失,而不是 ./././ 的基因型调用。在 GPT 的帮助下,我编写了一个自定义脚本来根据 .:.:.:.:.:.:. 的数量更新“NS”字段。每个 snp 的出现次数(在此示例中,运行后 NS 为 10,脚本如下)。
awk -F'\t' 'BEGIN {OFS = FS}
!/^#/ { # Only process non-header lines
missing = 0
for (i = 10; i <= NF; i++) {
# Check if the genotype column starts with "./././."
if (substr($i, 1, 7) == "./././.") {
missing++
}
}
# Split INFO field into key-value pairs
split($8, info_array, ";")
for (j in info_array) {
# Split each info field on "=" and check for NS key
split(info_array[j], kv, "=")
if (kv[1] == "NS") {
# Adjust the NS value based on missing genotype count
kv[2] -= missing
info_array[j] = kv[1] "=" kv[2] # Reconstruct NS key-value pair
break
}
}
# Reassemble INFO field from the modified info_array
$8 = info_array[1]
for (j=2; j <= length(info_array); j++) {
$8 = $8 ";" info_array[j]
}
# Reconstruct and print the entire line with updated INFO
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
for (k = 11; k <= NF; k++) {
printf("%s%s", OFS, $k)
}
printf("\n") # Print newline at the end of the record
}
/^#/ {print} # Print header lines as they are
' freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf
我还无法使用类似的方法正确计算每个样本的缺失情况——有人对如何做到这一点有想法或建议吗?是否有我不知道的工具可以实现这一目标?预先感谢您的帮助!
首先,我根据尚未反映在 NS 字段中的任何部分或完整未接来电更新了 NS 字段,这些未接来电必须以
./
开头(./././.
或 ././0/0
或 ./1/1/1
等)。
awk -F'\t' '
BEGIN { OFS=FS; }
!/^#/ {
missing = 0;
for (i = 10; i <= NF; i++) {
if ($i ~ /^\.\//) {
missing++;
}
}
if (missing > 0) {
# Split INFO field into key-value pairs
nsplit = split($8, info_array, ";")
# Iterate through the INFO fields and update NS if found
for (j = 1; j <= nsplit; j++) {
split(info_array[j], kv, "=");
if (kv[1] == "NS") {
kv[2] -= missing;
info_array[j] = kv[1] "=" kv[2];
break;
}
}
# Reassemble the INFO field from the modified info_array
$8 = info_array[1];
for (j = 2; j <= nsplit; j++) {
$8 = $8 ";" info_array[j];
}
}
print $0;
}
/^#/ { print $0; }
' freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf
然后我将调用次数 > 90% 的网站比例和调用次数 > 25% 的样本打印到终端。这会将任何出现
.:.:.:.:.:.:.
或任何以 ./
开头的调用视为缺失。
INPUT_VCF="freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf"
awk -v smiss_threshold=0.90 -v imiss_threshold=0.25 '
BEGIN {
FS = OFS = "\t";
high_quality_sites = 0;
total_sites = 0;
total_samples_above_threshold = 0;
for (i = 1; i <= 96; i++) {
sample_nonmissing[i] = 0;
sample_sites[i] = 0;
}
}
/^#CHROM/ {
print;
next;
}
!/^#/ {
site_nonmissing = 0;
for (i = 10; i <= NF; i++) {
# Adjusted condition to check for both fully and partially missing genotypes
if ($i != "./././." && $i != ".:.:.:.:.:.:." && !match($i, /\./)) {
sample_nonmissing[i-9]++;
site_nonmissing++;
}
sample_sites[i-9]++;
}
if ((site_nonmissing / 96) >= smiss_threshold) {
high_quality_sites++;
}
total_sites++;
}
END {
for (i = 1; i <= 96; i++) {
sample_proportion = sample_nonmissing[i] / sample_sites[i];
if (sample_proportion >= imiss_threshold) {
total_samples_above_threshold++;
}
print "Sample " i ": non-missing proportion=" sample_proportion;
}
print "Total number of high-quality sites: " high_quality_sites "/" total_sites " (" high_quality_sites*100/total_sites "%)";
print "Total number of samples above the threshold: " total_samples_above_threshold "/96 (" total_samples_above_threshold*100/96 "%)";
}
' "$INPUT_VCF"
#Total number of high-quality sites: 1402/1694 (82.7627%)
#Total number of samples above the threshold: 96 out of 96 (100%)
然后我可以按站点缺失进行过滤,我通过执行以下脚本来运行该站点
./filter_ns.awk freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS_FMISS0.1.vcf
脚本:
#!/usr/bin/awk -f
BEGIN {
FS = "\t";
OFS = FS;
required_ns = 87; # At least 87 non-missing samples required (more than 90% of 96)
}
# Print header lines as they are
/^#/ {
print;
next;
}
# Process non-header lines
{
info_field = $8;
split(info_field, info_array, ";");
for (i in info_array) {
split(info_array[i], kv, "=");
if (kv[1] == "NS") {
ns_value = kv[2];
break;
}
}
if (ns_value >= required_ns) {
print;
}
}