我正在学习 Snakemake 来开发基因组学管道。由于输入/输出很快就会变得非常多样化,因此我想花一些额外的时间来了解构建 Snakemake 脚本的基础知识。我的目标是通过使用 python 对象来保持代码的明确性和可扩展性,但同时将其从 python 循环转换为 Snakemake 通配符,但我找不到破解它的方法。如何从Python对象派生snakemake通配符?
Python 类:
class Reference:
def __init__(self, name, species, source, genome_seq, genome_seq_url, transcript_seq, transcript_seq_url, annotation_gtf, annotation_gtf_url, annotation_gff, annotation_gff_url) -> None:
self.name = name
self.species = species
self.source = source
self.genome_seq = genome_seq
self.genome_seq_url = genome_seq_url
self.transcript_seq = transcript_seq
self.transcript_seq_url = transcript_seq_url
self.annotation_gtf = annotation_gtf
self.annotation_gtf_url = annotation_gtf_url
self.annotation_gff = annotation_gff
self.annotation_gff_url = annotation_gff_url
CSV 参考文件:
name,species,source,genome_seq,genome_seq_url,transcript_seq,transcript_seq_url,annotation_gtf,annotation_gtf_url,annotation_gff,annotation_gff_url
BDGP6_46,FruitFly,Ensembl,Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,Drosophila_melanogaster.BDGP6.46.111.gtf.gz,https://ftp.ensembl.org/pub/release-111/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gtf.gz,Drosophila_melanogaster.BDGP6.46.111.gff3.gz,https://ftp.ensembl.org/pub/release-111/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gff3.gz
Smake文件:
def get_references(references_path:str) -> dict:
refs_table = dict()
with open(references_path, 'r') as file:
reader = csv.DictReader(file)
for row in reader:
ref_data = Reference(
row['name'],
row['species'],
row['source'],
row['genome_seq'],
row['genome_seq_url'],
row['transcript_seq'],
row['transcript_seq_url'],
row['annotation_gtf'],
row['annotation_gtf_url'],
row['annotation_gff'],
row['annotation_gff_url']
)
refs_table[row['name']] = ref_data
return refs_table
references_table = get_references('references.csv')
rule all:
input:
genome_seq = expand("../resources/references/{ref_name}/{genome_seq}", zip,
genome_seq=[references_table[ref].genome_seq for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()]),
transcript_seq = expand("../resources/references/{ref_name}/{transcript_seq}", zip,
transcript_seq=[references_table[ref].transcript_seq for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gtf = expand("../resources/references/{ref_name}/{annotation_gtf}", zip,
annotation_gtf=[references_table[ref].annotation_gtf for ref in references_table],
ref_name=[references_table[ref].name for ref in references_table]),
annotation_gff = expand("../resources/references/{ref_name}/{annotation_gff}", zip,
annotation_gff=[references_table[ref].annotation_gff for ref in references_table.keys()],
ref_name=[references_table[ref].name for ref in references_table.keys()]),
当前使用动态规则的实现:
for ref_name, ref in references_table.items():
pathlib.Path(f"../resources/references/{ref_name}/").mkdir(parents=True, exist_ok=True)
pathlib.Path(f"../logs/download/refs/").mkdir(parents=True, exist_ok=True)
pathlib.Path(f"../times/download/refs/").mkdir(parents=True, exist_ok=True)
genome_seq = f"../resources/references/{ref_name}/{ref.genome_seq}"
transcript_seq = f"../resources/references/{ref_name}/{ref.transcript_seq}"
annotation_gtf = f"../resources/references/{ref_name}/{ref.annotation_gtf}"
annotation_gff = f"../resources/references/{ref_name}/{ref.annotation_gff}"
log_file = f"../logs/download/refs/{ref_name}.txt"
time_file = f"../times/download/refs/{ref_name}.txt"
genome_seq_url = ref.genome_seq_url
transcript_seq_url = ref.transcript_seq_url
annotation_gtf_url = ref.annotation_gtf_url
annotation_gff_url = ref.annotation_gff_url
rule_name = f"download_reference_{ref_name}"
rule:
name : rule_name
output:
genome_seq = genome_seq,
transcript_seq = transcript_seq,
annotation_gtf = annotation_gtf,
annotation_gff = annotation_gff
params:
genome_seq_url = genome_seq_url,
transcript_seq_url = transcript_seq_url,
annotation_gtf_url = annotation_gtf_url,
annotation_gff_url = annotation_gff_url,
log:
log_file = log_file
benchmark:
time_file
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.genome_seq_url} and {params.transcript_seq_url} and {params.annotation_gtf_url} and {params.annotation_gff_url}"
shell:
"""
wget {params.genome_seq_url} -O {output.genome_seq} &> {log.log_file}
wget {params.transcript_seq_url} -O {output.transcript_seq} &> {log.log_file}
wget {params.annotation_gtf_url} -O {output.annotation_gtf} &> {log.log_file}
wget {params.annotation_gff_url} -O {output.annotation_gff} &> {log.log_file}
"""
通过按照您的方式设置规则
all
,它已经期望表中所有 ref_name
的下载规则的输出,因此无需为每个 ref_name
生成新规则。
将下载规则分解为单独的规则也会更容易 - 这样它可以并行运行,并且您基本上可以将使用的相应字符串复制到
expand
语句中作为每个规则的预期输出。
最后,在大量规则属性(例如
input
和 params
)中,您可以使用函数或 lambda 表达式来生成要在规则本身中传递给 Snakemake 的内容。然后,您将 wildcards
对象传递给函数,并能够按名称访问您在规则中设置的通配符。
(对于输出、日志和基准测试,同时,由于我们要输出到一个新文件,Snakemake 希望文件名包含所有通配符,以绝对确保我们不会在只有一个通配符的规则执行过程中进行覆盖已更改。在日志和时间的情况下,这会导致一些稍微难看的文件名,因为您获得了所有扩展名,但应该可以帮助您开始。)
因此,您的规则将如下所示:
rule download_genome:
output:
genome_seq = "../resources/references/{ref_name}/{genome_seq}"
params:
genome_seq_url = lambda wildcards: references_table[wildcards.ref_name].genome_seq_url
log:
log_file = "../logs/download/refs/{ref_name}_{genome_seq}.txt"
benchmark:
"../times/download/refs/{ref_name}_{genome_seq}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.genome_seq_url}"
shell:
"""
wget {params.genome_seq_url} -O {output.genome_seq} &> {log.log_file}
"""
rule download_transcript:
output:
transcript_seq = "../resources/references/{ref_name}/{transcript_seq}"
params:
transcript_seq_url = lambda wildcards: references_table[
wildcards.ref_name].transcript_seq_url
log:
log_file = "../logs/download/refs/{ref_name}_{transcript_seq}.txt"
benchmark:
"../times/download/refs/{ref_name}_{transcript_seq}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.transcript_seq_url}"
shell:
"""
wget {params.transcript_seq_url} -O {output.transcript_seq} &> {log.log_file}
"""
rule download_annotation_gtf:
output:
annotation_gtf = "../resources/references/{ref_name}/{annotation_gtf}"
params:
annotation_gtf_url = lambda wildcards: references_table[
wildcards.ref_name].annotation_gtf_url
log:
log_file = "../logs/download/refs/{ref_name}_{annotation_gtf}.txt"
benchmark:
"../times/download/refs/{ref_name}_{annotation_gtf}.txt"
container:
"dockers/general_image"
threads:
1
message:
"Downloading {params.annotation_gtf_url}"
shell:
"""
wget {params.annotation_gtf_url} -O {output.annotation_gtf} &> {log.log_file}
"""