使用 vcf 中的四倍体基因型调用计算每个位点和样本缺失

Question

我有一个包含四倍体基因型调用的 vcf 文件，并且想根据每个 snp 和样本的缺失基因型调用的比例来过滤 snps 和样本。我无法使用我尝试过的任何工具（vcftools、bcftools、plink）实现四倍体调用。 Bcftools 应该能够做到这一点，但无法正确计算丢失站点的比例。问题似乎是 vcf 文件中存在两种类型的丢失调用：一种没有信息 .:.:.:.:.:.:。一个被正确计数为缺失，另一个指示缺失基因型 ./././。其次是覆盖范围信息，该信息不计为缺失。我编写了一个自定义脚本来计算 ./././ 的出现次数。并更新每个 snp 的“NS”字段，以便可用于计算位点缺失。但是，我无法计算样本缺失情况。

我提供了一个我一直在测试故障排除的 snp 示例，因为它有很多缺失的基因型。

Scaffold_1      43629806        .       CTCTC   CTCTG   16620.9 .       AB=0.196078;ABP=43.9277;AC=25;AF=0.0932836;AN=268;AO=54;CIGAR=4M1X;DP=728;DPB=728;DPRA=0.996137;EPP=80.8616;EPPR=18.2106;GTI=8;LEN=1;MEANALT=1.55556;MQM=57.7593;MQMR=60;NS=67;NUMALT=2;ODDS=0.391415;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=1948;QR=259;RO=7;RPL=5;RPP=80.8616;RPPR=18.2106;RPR=49;RUN=1;SAF=54;SAP=120.27;SAR=0;SRF=7;SRP=18.2106;SRR=0;TYPE=snp;technology.illumina=1      GT:DP:AD:RO:QR:AO:QA ./././.:15:0,15,0:0:0:0:0       ./././.:9:0,9,0:0:0:0:0 ./././.:4:0,4,0:0:0:0:0 ./././.:5:0,5,0:0:0:0:0 .:.:.:.:.:.:.   ./././.:14:0,14,0:0:0:0:0       ./././.:7:0,7,0:0:0:0:0 ./././.:11:0,11,0:0:0:0:0       ./././1:23:0,19,3:0:0:3:111  ./././.:12:0,12,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:23:0,23,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:5:0,5,0:0:0:0:0 ./././.:10:0,10,0:0:0:0:0       ./././.:21:0,21,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:10:0,10,0:0:0:0:0   .:.:.:.:.:.:.    ./././.:34:0,34,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:17:0,17,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:10:0,10,0:0:0:0:0       ./././.:22:0,22,0:0:0:0:0       ./././.:20:0,20,0:0:0:0:0       ./././.:1:0,1,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0      ./././.:16:0,15,0:0:0:0:0       .:.:.:.:.:.:.   .:.:.:.:.:.:.   .:.:.:.:.:.:.   .:.:.:.:.:.:.   ./././.:18:0,18,0:0:0:0:0       .:.:.:.:.:.:.   ./././1:4:0,3,1:0:0:1:37        ./././.:8:0,8,0:0:0:0:0 ./././.:1:0,1,0:0:0:0:0      ./././.:15:0,15,0:0:0:0:0       ./././.:11:0,11,0:0:0:0:0       .:.:.:.:.:.:.   .:.:.:.:.:.:.   .:.:.:.:.:.:.   ./././.:9:0,9,0:0:0:0:0 ./././.:7:0,7,0:0:0:0:0 ./././.:12:0,12,0:0:0:0:0       .:.:.:.:.:.:.   1/1/1/1:8:0,0,8:0:0:8:296    ./././.:2:0,2,0:0:0:0:0 ./././.:4:0,4,0:0:0:0:0 ././1/1:11:0,6,5:0:0:5:185      ./././.:1:0,1,0:0:0:0:0 .:.:.:.:.:.:.   ./././.:17:0,17,0:0:0:0:0       .:.:.:.:.:.:.   .:.:.:.:.:.:.   .:.:.:.:.:.:.   ./././.:13:0,13,0:0:0:0:0    ./././.:15:0,15,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:14:0,14,0:0:0:0:0       .:.:.:.:.:.:.   ./././0:29:7,22,0:7:259:0:0     .:.:.:.:.:.:.   ./././.:8:0,8,0:0:0:0:0 ./././.:12:0,12,0:0:0:0:0       ./././.:1:0,1,0:0:0:0:0 ./././.:3:0,3,0:0:0:0:0      ./././.:15:0,15,0:0:0:0:0       ./././.:9:0,9,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 .:.:.:.:.:.:.   ./././.:8:0,8,0:0:0:0:0 ./././.:9:0,9,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 ./././.:6:0,6,0:0:0:0:0 1/1/1/1:7:0,0,7:0:0:7:259    ./././.:14:0,14,0:0:0:0:0       .:.:.:.:.:.:.   ./././.:1:0,1,0:0:0:0:0 ./././.:6:0,6,0:0:0:0:0 .:.:.:.:.:.:.   ./././.:2:0,2,0:0:0:0:0 ./././.:8:0,8,0:0:0:0:0 1/1/1/1:13:0,0,13:0:0:13:443    .:.:.:.:.:.:.   ./././1:13:0,12,1:0:0:1:37   ./././.:19:0,19,0:0:0:0:0       1/1/1/1:1:0,0,1:0:0:1:37        ./././.:9:0,9,0:0:0:0:0 .:.:.:.:.:.:.   .:.:.:.:.:.:.   1/1/1/1:15:0,0,15:0:0:15:543    ./././.:29:0,29,0:0:0:0:0       ./././.:4:0,4,0:0:0:0:0 ./././.:11:0,11,0:0:0:0:0

我已确认“NS”字段仅计算 .:.:.:.:.:.:.缺失，而不是 ./././ 的基因型调用。在 GPT 的帮助下，我编写了一个自定义脚本来根据 .:.:.:.:.:.:. 的数量更新“NS”字段。每个 snp 的出现次数（在此示例中，运行后 NS 为 10，脚本如下）。

awk -F'\t' 'BEGIN {OFS = FS}
    !/^#/ {  # Only process non-header lines
        missing = 0
        for (i = 10; i <= NF; i++) {
            # Check if the genotype column starts with "./././."
            if (substr($i, 1, 7) == "./././.") {
                missing++
            }
        }

        # Split INFO field into key-value pairs
        split($8, info_array, ";")
        for (j in info_array) {
            # Split each info field on "=" and check for NS key
            split(info_array[j], kv, "=")
            if (kv[1] == "NS") {
                # Adjust the NS value based on missing genotype count
                kv[2] -= missing
                info_array[j] = kv[1] "=" kv[2]  # Reconstruct NS key-value pair
                break
            }
        }

        # Reassemble INFO field from the modified info_array
        $8 = info_array[1]
        for (j=2; j <= length(info_array); j++) {
            $8 = $8 ";" info_array[j]
        }

        # Reconstruct and print the entire line with updated INFO
        print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
        for (k = 11; k <= NF; k++) {
            printf("%s%s", OFS, $k)
        }
        printf("\n")  # Print newline at the end of the record
    }
    /^#/ {print}  # Print header lines as they are
' freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf

我还无法使用类似的方法正确计算每个样本的缺失情况——有人对如何做到这一点有想法或建议吗？是否有我不知道的工具可以实现这一目标？预先感谢您的帮助！

Answer 1

首先，我根据尚未反映在 NS 字段中的任何部分或完整未接来电更新了 NS 字段，这些未接来电必须以

./

开头（

./././.

或

././0/0

或

./1/1/1

等）。

awk -F'\t' '
    BEGIN { OFS=FS; }
    !/^#/ {
        missing = 0;
        for (i = 10; i <= NF; i++) {
            if ($i ~ /^\.\//) {
                missing++;
            }
        }
        if (missing > 0) {
            # Split INFO field into key-value pairs
            nsplit = split($8, info_array, ";")
            # Iterate through the INFO fields and update NS if found
            for (j = 1; j <= nsplit; j++) {
                split(info_array[j], kv, "=");
                if (kv[1] == "NS") {
                    kv[2] -= missing;
                    info_array[j] = kv[1] "=" kv[2];
                    break;
                }
            }
            # Reassemble the INFO field from the modified info_array
            $8 = info_array[1];
            for (j = 2; j <= nsplit; j++) {
                $8 = $8 ";" info_array[j];
            }
        }
        print $0;
    }
    /^#/ { print $0; }
' freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf

然后我将调用次数 > 90% 的网站比例和调用次数 > 25% 的样本打印到终端。这会将任何出现

.:.:.:.:.:.:.

或任何以

./

开头的调用视为缺失。

INPUT_VCF="freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf"
    awk -v smiss_threshold=0.90 -v imiss_threshold=0.25 '
    BEGIN {
        FS = OFS = "\t";
        high_quality_sites = 0;
        total_sites = 0; 
        total_samples_above_threshold = 0;
    
        for (i = 1; i <= 96; i++) {
            sample_nonmissing[i] = 0;
            sample_sites[i] = 0;
        }
    }
    /^#CHROM/ {
        print;
        next;
    }
    !/^#/ {
        site_nonmissing = 0;
        for (i = 10; i <= NF; i++) {
            # Adjusted condition to check for both fully and partially missing genotypes
            if ($i != "./././." && $i != ".:.:.:.:.:.:." && !match($i, /\./)) {
                sample_nonmissing[i-9]++;
                site_nonmissing++;
            }
            sample_sites[i-9]++;
        }
        
        if ((site_nonmissing / 96) >= smiss_threshold) {
            high_quality_sites++;
        }
        total_sites++;
    }
    END {
        for (i = 1; i <= 96; i++) {
            sample_proportion = sample_nonmissing[i] / sample_sites[i];
            if (sample_proportion >= imiss_threshold) {
                total_samples_above_threshold++;
            }
            print "Sample " i ": non-missing proportion=" sample_proportion;
        }
        
        print "Total number of high-quality sites: " high_quality_sites "/" total_sites " (" high_quality_sites*100/total_sites "%)";
        print "Total number of samples above the threshold: " total_samples_above_threshold "/96 (" total_samples_above_threshold*100/96 "%)";
    }
    ' "$INPUT_VCF"
    #Total number of high-quality sites: 1402/1694 (82.7627%)
    #Total number of samples above the threshold:    96      out of  96      (100%)

然后我可以按站点缺失进行过滤，我通过执行以下脚本来运行该站点

./filter_ns.awk freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS.vcf > freebayes_midwest_lib1A-2B_snp_call_QUAL20_MINDP5_corrected_NS_FMISS0.1.vcf

脚本：

#!/usr/bin/awk -f

BEGIN {
    FS = "\t";
    OFS = FS;
    required_ns = 87;  # At least 87 non-missing samples required (more than 90% of 96)
}

# Print header lines as they are
/^#/ {
    print;
    next;
}

# Process non-header lines
{
    info_field = $8;
    split(info_field, info_array, ";");
    for (i in info_array) {
        split(info_array[i], kv, "=");
        if (kv[1] == "NS") {
            ns_value = kv[2];
            break;
        }
    }

    if (ns_value >= required_ns) {
        print;
    }
}

使用 vcf 中的四倍体基因型调用计算每个位点和样本缺失

问题描述投票：0回答：1

1个回答

最新问题

使用 vcf 中的四倍体基因型调用计算每个位点和样本缺失

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1