如何从Python对象派生snakemake通配符？

Question

我正在学习 Snakemake 来开发基因组学管道。由于输入/输出很快就会变得非常多样化，因此我想花一些额外的时间来了解构建 Snakemake 脚本的基础知识。我的目标是通过使用 python 对象来保持代码的明确性和可扩展性，但同时将其从 python 循环转换为 Snakemake 通配符，但我找不到破解它的方法。如何从Python对象派生snakemake通配符？

Python 类：

class Reference:
    def __init__(self, name, species, source, genome_seq, genome_seq_url, transcript_seq, transcript_seq_url, annotation_gtf, annotation_gtf_url, annotation_gff, annotation_gff_url) -> None:
        
        self.name = name
        self.species = species
        self.source = source
        self.genome_seq = genome_seq
        self.genome_seq_url = genome_seq_url
        self.transcript_seq = transcript_seq
        self.transcript_seq_url = transcript_seq_url
        self.annotation_gtf = annotation_gtf
        self.annotation_gtf_url = annotation_gtf_url
        self.annotation_gff = annotation_gff
        self.annotation_gff_url = annotation_gff_url

CSV 参考文件：

name,species,source,genome_seq,genome_seq_url,transcript_seq,transcript_seq_url,annotation_gtf,annotation_gtf_url,annotation_gff,annotation_gff_url
BDGP6_46,FruitFly,Ensembl,Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,https://ftp.ensembl.org/pub/release-111/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz,Drosophila_melanogaster.BDGP6.46.111.gtf.gz,https://ftp.ensembl.org/pub/release-111/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gtf.gz,Drosophila_melanogaster.BDGP6.46.111.gff3.gz,https://ftp.ensembl.org/pub/release-111/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.111.gff3.gz

Smake文件：

def get_references(references_path:str) -> dict:
    refs_table = dict()
    
    with open(references_path, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            ref_data = Reference(
                row['name'],
                row['species'],
                row['source'],
                row['genome_seq'],
                row['genome_seq_url'],
                row['transcript_seq'],
                row['transcript_seq_url'],
                row['annotation_gtf'],
                row['annotation_gtf_url'],
                row['annotation_gff'],
                row['annotation_gff_url']
            )

            refs_table[row['name']] = ref_data

        return refs_table
    
references_table = get_references('references.csv')

rule all:
    input:
        genome_seq     = expand("../resources/references/{ref_name}/{genome_seq}",         zip, 
                                genome_seq=[references_table[ref].genome_seq for ref in references_table.keys()], 
                                ref_name=[references_table[ref].name for ref in references_table.keys()]),

        transcript_seq = expand("../resources/references/{ref_name}/{transcript_seq}", zip, 
                                transcript_seq=[references_table[ref].transcript_seq for ref in references_table],
                                ref_name=[references_table[ref].name for ref in references_table]),

        annotation_gtf = expand("../resources/references/{ref_name}/{annotation_gtf}", zip, 
                                annotation_gtf=[references_table[ref].annotation_gtf for ref in references_table], 
                                ref_name=[references_table[ref].name for ref in references_table]),

        annotation_gff = expand("../resources/references/{ref_name}/{annotation_gff}", zip, 
                                annotation_gff=[references_table[ref].annotation_gff for ref in references_table.keys()], 
                                ref_name=[references_table[ref].name for ref in references_table.keys()]),

当前使用动态规则的实现：

for ref_name, ref in references_table.items(): 

    pathlib.Path(f"../resources/references/{ref_name}/").mkdir(parents=True, exist_ok=True) 
    pathlib.Path(f"../logs/download/refs/").mkdir(parents=True, exist_ok=True) 
    pathlib.Path(f"../times/download/refs/").mkdir(parents=True, exist_ok=True) 

    genome_seq     = f"../resources/references/{ref_name}/{ref.genome_seq}"
    transcript_seq = f"../resources/references/{ref_name}/{ref.transcript_seq}"
    annotation_gtf = f"../resources/references/{ref_name}/{ref.annotation_gtf}"
    annotation_gff = f"../resources/references/{ref_name}/{ref.annotation_gff}"

    log_file  = f"../logs/download/refs/{ref_name}.txt"
    time_file = f"../times/download/refs/{ref_name}.txt"

    genome_seq_url     = ref.genome_seq_url
    transcript_seq_url = ref.transcript_seq_url
    annotation_gtf_url = ref.annotation_gtf_url
    annotation_gff_url = ref.annotation_gff_url

    rule_name = f"download_reference_{ref_name}"

    rule:
        name : rule_name
        output: 
            genome_seq = genome_seq,
            transcript_seq = transcript_seq,
            annotation_gtf = annotation_gtf, 
            annotation_gff = annotation_gff
        params:
            genome_seq_url     = genome_seq_url,
            transcript_seq_url = transcript_seq_url,
            annotation_gtf_url = annotation_gtf_url,
            annotation_gff_url = annotation_gff_url,
        log:
            log_file = log_file
        benchmark:
            time_file
        container:
            "dockers/general_image"
        threads: 
            1
        message:
            "Downloading {params.genome_seq_url} and {params.transcript_seq_url} and {params.annotation_gtf_url} and {params.annotation_gff_url}"
        shell:
            """ 
            wget {params.genome_seq_url} -O {output.genome_seq}           &> {log.log_file}
            wget {params.transcript_seq_url} -O {output.transcript_seq}   &> {log.log_file}
            wget {params.annotation_gtf_url} -O {output.annotation_gtf}   &> {log.log_file}
            wget {params.annotation_gff_url} -O {output.annotation_gff}   &> {log.log_file}
            """

Answer 1

通过按照您的方式设置规则

all

，它已经期望表中所有

ref_name

的下载规则的输出，因此无需为每个

ref_name

生成新规则。

将下载规则分解为单独的规则也会更容易 - 这样它可以并行运行，并且您基本上可以将使用的相应字符串复制到

expand

语句中作为每个规则的预期输出。

最后，在大量规则属性（例如

input

和

params

）中，您可以使用函数或 lambda 表达式来生成要在规则本身中传递给 Snakemake 的内容。然后，您将

wildcards

对象传递给函数，并能够按名称访问您在规则中设置的通配符。

（对于输出、日志和基准测试，同时，由于我们要输出到一个新文件，Snakemake 希望文件名包含所有通配符，以绝对确保我们不会在只有一个通配符的规则执行过程中进行覆盖已更改。在日志和时间的情况下，这会导致一些稍微难看的文件名，因为您获得了所有扩展名，但应该可以帮助您开始。）

因此，您的规则将如下所示：


rule download_genome:
    output:
        genome_seq = "../resources/references/{ref_name}/{genome_seq}"
    params:
        genome_seq_url = lambda wildcards: references_table[wildcards.ref_name].genome_seq_url
    log:
        log_file = "../logs/download/refs/{ref_name}_{genome_seq}.txt"
    benchmark:
        "../times/download/refs/{ref_name}_{genome_seq}.txt"
    container:
        "dockers/general_image"
    threads:
        1
    message:
        "Downloading {params.genome_seq_url}"
    shell:
        """
        wget {params.genome_seq_url} -O {output.genome_seq}           &> {log.log_file}
        """

rule download_transcript:
    output:
        transcript_seq = "../resources/references/{ref_name}/{transcript_seq}"
    params:
        transcript_seq_url = lambda wildcards: references_table[
            wildcards.ref_name].transcript_seq_url
    log:
        log_file = "../logs/download/refs/{ref_name}_{transcript_seq}.txt"
    benchmark:
        "../times/download/refs/{ref_name}_{transcript_seq}.txt"
    container:
        "dockers/general_image"
    threads:
        1
    message:
        "Downloading {params.transcript_seq_url}"
    shell:
        """
        wget {params.transcript_seq_url} -O {output.transcript_seq}           &> {log.log_file}
        """

rule download_annotation_gtf:
    output:
        annotation_gtf = "../resources/references/{ref_name}/{annotation_gtf}"
    params:
        annotation_gtf_url = lambda wildcards: references_table[
            wildcards.ref_name].annotation_gtf_url
    log:
        log_file = "../logs/download/refs/{ref_name}_{annotation_gtf}.txt"
    benchmark:
        "../times/download/refs/{ref_name}_{annotation_gtf}.txt"
    container:
        "dockers/general_image"
    threads:
        1
    message:
        "Downloading {params.annotation_gtf_url}"
    shell:
        """
        wget {params.annotation_gtf_url} -O {output.annotation_gtf}           &> {log.log_file}
        """

如何从Python对象派生snakemake通配符？

问题描述投票：0回答：1

1个回答

最新问题

如何从Python对象派生snakemake通配符？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1