Snakemake: wildcard problem when trying to force a rule to run after another rule

Question

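I am trying to create a Snakefile that lets me run a workflow on any dataset I want. Here is a shortened version of what I am trying to do:

The command I use to run the Snakefile is: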

snakemake -j 20 -s /path/to/snakefile --config workdir=/path/to/workdir data_dir=/path/to/data/dir

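I have included the part of the Snakefile that is giving me trouble, plus the context needed to see how some of the wildcards are created (in case you spot something there):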

# Import functions
import os
from pathlib import Path

import subprocess
# Global variables
WORKDIR = Path(config['workdir']) 
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"

# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.{{extension}}'

SAMPLE_NMBR = glob_wildcards(SAMPLES)

# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))



# Define output for every rule
rule all:
    input:
        # get output for tool e
        expand("{workdir}/{sample_nmbr}/e_run/output_e.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # get output for tool c
        expand("{workdir}/{sample_nmbr}/c_run/output_c.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying e samples to the same directory
        expand("{workdir}/e_samples/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying c samples to the same directory
        expand("{workdir}/c_samples/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # Process the results
        expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)

INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'

# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT 
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {tool_c_output}
        tool_e {input.input_file} {tool_e_output}
        """

# copy output of the 2 tools to the same respective directory as preparation of the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e",
        checkpoint_copy_output: touch("{workdir}/copying_done.txt")
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """

# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: rules.copy_output.checkpoint_copy_output
    output:
        output_that_I_need = "{workdir}/last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """

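As additional explanation: the first rule is rule all, which naturally defines the outputs I want. Rule run_tool then runs two nondescript tools that produce output for each sample. Rule copy_output takes the outputs of run_tool and copies each output file into a directory holding the other outputs of the same tool (so you get one directory with all the output.c files and another with all the output.e files). Finally, the last rule is executed, but it has nothing in common with the previous rules in terms of wildcards, apart from the working directory.

That is why I included the checkpoint_copy_output line in rule copy_output: to force rule clean_data to run only once copy_output has finished. If I leave it out, clean_data runs before anything else and the Snakefile reports an error.

But when I include it, Snakemake throws an error at the shell line of rule copy_output: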

not all output log and benchmark files of rule copy_output contain the same wildcards


Including the checkpoint file as a param does not work either:

    params:
        checkpoint_extract = "{workdir}/extract_done.txt"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """

# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: "{workdir}/copying_done.txt"

From that I get the error:

Missing input files for rule clean_data: /path/to/workdir/copying_done.txt

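I am completely stuck on this and have not found anywhere online how it might be solved. I know that the wildcards need to be the same unless you use some more complicated code to get around it, but I have not been able to reproduce that. If anyone can tell me how to change my code or Snakefile setup to make this work, I would be very grateful.

Thanks in advance,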

马蒂斯

Answer 1

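I had to fix several lines to get this to execute, so I may well have changed something that matters for your original question. The main points:

Why use glob_wildcards to capture the extension if you only use .txt? (I assume that is just for your example.)
I am not sure why a checkpoint is needed. In your example, the files that the clean_data script needs are already rule outputs. If so, why not use the outputs of copy_output as the input of clean_data?
clean_data cannot work the way you wrote it, because the input would have two wildcards and the output only one. So either you produce one output file per {sample_nmbr} (as I did below), or, if you want all files as its input, you need to build a list as its input. That tells Snakemake the rule runs only once, with all the previous files as input and a single output.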
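
The wildcard mismatch behind both problems can be illustrated in plain Python. This is only a sketch of the idea, not how Snakemake implements the check: `wildcard_names` and `clean_data_ok` are hypothetical names introduced here, and the regex ignores constrained wildcards such as {name,regex}. It shows that the three outputs of copy_output from the question do not declare the same wildcards, and that a clean_data input containing {sample_nmbr} cannot be resolved from an output that only contains {workdir}.

```python
import re

# Hypothetical helper (not part of Snakemake): collect the wildcard names
# that appear as {name} placeholders in a path pattern.
def wildcard_names(pattern):
    return set(re.findall(r"\{([A-Za-z_]\w*)\}", pattern))

# Output patterns of copy_output as written in the question.
copy_outputs = [
    "{workdir}/c_together/{sample_nmbr}.c",
    "{workdir}/e_together/{sample_nmbr}.e",
    "{workdir}/copying_done.txt",
]

names_per_output = [wildcard_names(p) for p in copy_outputs]
# True only if every output declares exactly the same wildcard names.
outputs_consistent = all(n == names_per_output[0] for n in names_per_output)

# clean_data: every input wildcard must be determined by the output pattern.
clean_input = "{workdir}/c_together/{sample_nmbr}.c"
clean_output = "{workdir}/last_tool_output.txt"
unresolved = wildcard_names(clean_input) - wildcard_names(clean_output)
clean_data_ok = not unresolved

for pattern, names in zip(copy_outputs, names_per_output):
    print(pattern, sorted(names))
print("copy_output outputs share the same wildcards:", outputs_consistent)
print("wildcards clean_data cannot resolve:", sorted(unresolved))
```

Dropping copying_done.txt from the outputs of copy_output makes the first check pass, which is what the version further down does.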

See below for a version that seems to work (again, I may have missed the point):

# Import functions
import os
from pathlib import Path

import subprocess
# Global variables
WORKDIR = Path(config['workdir']) 
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"

# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.txt'

SAMPLE_NMBR = glob_wildcards(SAMPLES).sample_nmbr
print(SAMPLE_NMBR)
# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))


# Define output for every rule
rule all:
    input:
        expand("{workdir}/{sample_nmbr}_last_tool_output.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR)

INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'

# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT 
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {output.tool_c_output}
        tool_e {input.input_file} {output.tool_e_output}
        """

# copy output of the 2 tools to the same respective directory as preparation of the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e"
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output} {output.c_copied}
        cp {input.e_output} {output.e_copied}
        """


# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e"
    output:
        output_that_I_need = "{workdir}/{sample_nmbr}_last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """
