Snakemake-参数文件视为通配符

问题描述投票：0回答：1

我已经在Snakemake中编写了一个管道。这是一条ATAC-seq管道（用于分析来自特定实验的基因组数据的生物信息学管道）。基本上，在合并对齐步骤之前，我使用{sample_id}通配符，以后再切换到{sample}通配符（将两个或多个sample_id合并为一个样本）。

[working DAG here（为简单起见，仅显示一个示例；橙色和蓝色{sample_id}被合并为一个绿色{sample}] >>

所有规则如下所示：

configfile: "config.yaml"
SAMPLES_DICT = dict()

with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
    input:
        # FASTQC output for RAW reads
        expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),

        # Trimming
        expand(os.path.join(config['TRIMMED'],
                            '{sample_id}_R{read}_val_{read}.fq.gz'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),

        # Alignment
        expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
               sample_id = SAMPLE_IDS),

        # Merging
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
               sample = SAMPLES),

        # Marking Duplicates
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
               sample = SAMPLES),

        # Filtering
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam'),
               sample = SAMPLES),
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam.bai'),
               sample = SAMPLES),

        # multiqc report
        "multiqc_report.html"

    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'
我知道它太乱了，我应该只保留必要的位，但是在这里我对蛇制作的理解失败了，因为我不知道什么我必须保留

以及应该删除的内容。

据我所知，这完全可以正常工作。

但是，我添加了一条规则：

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'
并加入规则all，经过过滤，在multiqc之前，我添加了：

    # Peak calling
    expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
           sample = SAMPLES),
相关的config.yaml片段：

# Path to blacklisted regions BLACKLIST: "/mnt/data/.../hg38.blacklist.bed" # Path to chromosome sizes GENOMES: "/mnt/data/.../hg38_sizes.genome" # Path to filtered alignment FILTERED: "alignment/filtered" # Path to peaks HMMRATAC: "peaks/hmmratac"

这是我得到的错误*（对于规则的每个输入和输出都会继续发生）。 *从技术上讲这是一个警告，但它会停止执行snakemake，因此我将其称为错误。

File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake. WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.

实际上不是...-我只是在这里提供绝对路径感到不安全。

几天来，我一直在努力解决这个错误。浏览了文档，听了介绍。我知道上面的描述还远远不够完美（不列颠哥伦比亚省，我什至不知道如何降低它的难度以提供最小的可重复示例...），但我很绝望，希望您对我耐心等待。

关于如何使用Google搜索，在哪里查找错误的任何建议，将不胜感激。

我已经在Snakemake中编写了一个管道。这是一条ATAC-seq管道（用于分析来自特定实验的基因组数据的生物信息学管道）。基本上，在合并对齐步骤之前，我使用{...] >>

从技术上讲，这是一个警告，但它会停止执行snakemake，因此我将其称为错误。

从snakemake发布日志以查看snakemake是否因错误而终止，如果发生错误则很有用。

但是，除了Eric C.建议使用wildcards.sample代替{sample}作为文件名，我认为这很可疑：

alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam

/mnt/通常位于文件系统的根目录，并且您在它之前添加了相对路径（alignment/filtered）。您确定它正确吗？

bioinformatics snakemake

1个回答

0
投票

从技术上讲，这是一个警告，但它会停止执行snakemake，因此我将其称为错误。

最新问题

© www.soinside.com 2019 - 2024. All rights reserved.