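The gatk MakeVcfSampleNameMap tool generates a tab-delimited file that maps sample names to their corresponding VCF files. I am running it in a Snakemake workflow, and the VCFs are in the directory input/.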
Expected output (sample_name_map.txt):
input/vcf1.vcf.gz control5-a
input/vcf2.vcf.gz control5-c
MakeVcfSampleNameMap accepts one file per --INPUT argument, so I have to pass multiple VCF files in the following form:
--INPUT {input.vcfs[0]}
--INPUT {input.vcfs[1]}
and so on. The current Snakemake rule:
rule MakeVcfSampleNameMap:
    input:
        vcfs=["input/vcf1.vcf.gz", "input/vcf2.vcf.gz"],
    output:
        "sample_name_map.txt",
    shell:
        "gatk MakeVcfSampleNameMap \
        --INPUT {input.vcfs[0]} \
        --INPUT {input.vcfs[1]} \
        --OUTPUT {output}"
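How can I use a loop in the Snakemake rule to pass all the VCF files? Listing each one by hand is cumbersome when there are many input files.
This Python script solves the problem, but I am open to suggestions on how to improve it: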
#!/usr/bin/env python3
import os
import sys

# Get the directory path from the command line arguments
directory = sys.argv[1]

# Get a list of all the VCF files in the directory
vcf_files = [f for f in os.listdir(directory) if f.endswith('.vcf.gz')]

# Create a list of tuples with the file paths and names
file_list = []
for vcf_file in vcf_files:
    # Strip the '.vcf.gz' suffix to get the name
    file_name = vcf_file[:-len('.vcf.gz')]
    file_path = os.path.join(directory, vcf_file)
    file_list.append((file_path, file_name))

# Sort the list by the second element of each tuple (the file name)
file_list.sort(key=lambda x: x[1])

# Write the paths and names as TSV to stdout, so the calling rule
# can redirect the result into its own output file
for file_path, file_name in file_list:
    sys.stdout.write(f'{file_path}\t{file_name}\n')
Then call this script from a Snakemake rule like this:
rule generate_sample_name_map:
    input:
        directory="input",
    output:
        "sample_name_map.txt",
    shell:
        "python MakeVcfSampleNameMap.py {input.directory} > {output}"
How about this (not tested, but hopefully you get the idea):
rule MakeVcfSampleNameMap:
    input:
        vcfs=["input/vcf1.vcf.gz", "input/vcf2.vcf.gz"],
    output:
        "sample_name_map.txt",
    params:
        gatk_input=lambda wc, input: ' '.join(['--INPUT %s' % x for x in input.vcfs]),
    shell:
        r"""
        gatk MakeVcfSampleNameMap {params.gatk_input} \
            --OUTPUT {output}
        """
If your input is a directory, then something like:
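# Assumes `import os` at the top of the Snakefile
rule MakeVcfSampleNameMap:
    input:
        vcfs='input_dir',
    output:
        "sample_name_map.txt",
    params:
        # os.listdir returns bare file names, so join each with the directory
        # and keep only the .vcf.gz files
        gatk_input=lambda wc, input: ' '.join(
            '--INPUT %s' % os.path.join(input.vcfs, x)
            for x in sorted(os.listdir(input.vcfs))
            if x.endswith('.vcf.gz')
        ),
    shell:
        r"""
        gatk MakeVcfSampleNameMap {params.gatk_input} \
            --OUTPUT {output}
        """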