Snakemake-参数文件视为通配符

问题描述 投票:0回答:1

我已经在Snakemake中编写了一个管道。这是一条ATAC-seq管道(用于分析来自特定实验的基因组数据的生物信息学管道)。基本上,在合并对齐步骤之前,我使用{sample_id}通配符,以后再切换到{sample}通配符(将两个或多个sample_id合并为一个样本)。

[working DAG here(为简单起见,仅显示一个示例;橙色和蓝色{sample_id}被合并为一个绿色{sample}] >>

所有规则如下所示:

configfile: "config.yaml"
SAMPLES_DICT = dict()

with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
    input:
        # FASTQC output for RAW reads
        expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),

        # Trimming
        expand(os.path.join(config['TRIMMED'],
                            '{sample_id}_R{read}_val_{read}.fq.gz'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),

        # Alignment
        expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
               sample_id = SAMPLE_IDS),

        # Merging
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
               sample = SAMPLES),

        # Marking Duplicates
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
               sample = SAMPLES),

        # Filtering
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam'),
               sample = SAMPLES),
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam.bai'),
               sample = SAMPLES),

        # multiqc report
        "multiqc_report.html"

    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'

我知道它太乱了,我应该只保留必要的位,但是在这里我对蛇制作的理解失败了,因为我不知道什么我必须保留

以及应该删除的内容。

据我所知,这完全可以正常工作。

但是,我添加了一条规则:

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'

并加入规则all,经过过滤,在multiqc之前,我添加了:

    # Peak calling
    expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
           sample = SAMPLES),

相关的config.yaml片段:

# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"

# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"

# Path to filtered alignment
FILTERED: "alignment/filtered"

# Path to peaks
HMMRATAC: "peaks/hmmratac"

这是我得到的错误*(对于规则的每个输入和输出都会继续发生)。 *从技术上讲这是一个警告,但它会停止执行snakemake,因此我将其称为错误。

File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.

实际上不是...-我只是在这里提供绝对路径感到不安全。

几天来,我一直在努力解决这个错误。浏览了文档,听了介绍。我知道上面的描述还远远不够完美(不列颠哥伦比亚省,我什至不知道如何降低它的难度以提供最小的可重复示例...),但我很绝望,希望您对我耐心等待。

关于如何使用Google搜索,在哪里查找错误的任何建议,将不胜感激。

我已经在Snakemake中编写了一个管道。这是一条ATAC-seq管道(用于分析来自特定实验的基因组数据的生物信息学管道)。基本上,在合并对齐步骤之前,我使用{...] >>

从技术上讲,这是一个警告,但它会停止执行snakemake,因此我将其称为错误。

从snakemake发布日志以查看snakemake是否因错误而终止,如果发生错误则很有用。

但是,除了Eric C.建议使用wildcards.sample代替{sample}作为文件名,我认为这很可疑:

alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam

/mnt/通常位于文件系统的根目录,并且您在它之前添加了相对路径(alignment/filtered)。您确定它正确吗?

bioinformatics snakemake
1个回答
0
投票

从技术上讲,这是一个警告,但它会停止执行snakemake,因此我将其称为错误。

© www.soinside.com 2019 - 2024. All rights reserved.