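I am trying to create a Snakefile that lets me run a workflow on any dataset I want. Here is a short version of what I am trying to do:
The command I use to run the Snakefile is as follows: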
snakemake -j 20 -s /path/to/snakefile --config workdir=/path/to/workdir data_dir=/path/to/data/dir
I have included the part of the Snakefile that is giving me trouble, along with the necessary context on how some of the wildcards are created:
# Import functions
import os
from pathlib import Path
import subprocess
# Global variables
WORKDIR = Path(config['workdir'])
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts
# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.{{extension}}'
SAMPLE_NMBR = glob_wildcards(SAMPLES)
# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))
# Define output for every rule
rule all:
    input:
        # get output for tool e
        expand("{workdir}/{sample_nmbr}/e_run/output_e.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # get output for tool c
        expand("{workdir}/{sample_nmbr}/c_run/output_c.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying e samples to the same directory
        expand("{workdir}/e_samples/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying c samples to the same directory
        expand("{workdir}/c_samples/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # Process the results
        expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)
INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'
# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {tool_c_output}
        tool_e {input.input_file} {tool_e_output}
        """
# copy output of the 2 tools to the same respective directory as preparation of the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e",
        checkpoint_copy_output: touch("{workdir}/copying_done.txt")
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: rules.copy_output.checkpoint_copy_output
    output:
        output_that_I_need = "{workdir}/last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """
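As extra explanation: the first rule is rule all, which naturally defines the outputs I want. Rule run_tool then runs two nondescript tools that produce an output for each sample. Rule copy_output takes the outputs of run_tool and copies each output file into a directory holding the other outputs of the same tool (so you end up with one directory containing all the output.c files and another containing all the output.e files). Finally, the last rule is executed, but it has nothing in common with the previous rules in terms of wildcards, apart from the working directory.
That is why I included the checkpoint_copy_output line in the copy_output rule: to force the clean_data rule to execute only once copy_output has finished. If I leave it out, clean_data runs before anything else and Snakemake reports an error.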
But when I include it, Snakemake throws an error at the shell line of rule copy_output:
not all output log and benchmark files of rule copy_output contain the same wildcards
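For context, this message refers to the requirement that every output (and log and benchmark) file of one rule contains exactly the same set of wildcards. Here, "{workdir}/copying_done.txt" lacks {sample_nmbr}, while the two copied files contain it. A minimal sketch of that comparison in plain Python follows; `wildcard_names` and `outputs_share_wildcards` are hypothetical helpers written for illustration, not Snakemake's own code, and they ignore wildcard constraints.

```python
import re


def wildcard_names(pattern):
    """Return the set of simple {name} wildcards found in a pattern."""
    # Assumption: plain {name} wildcards only, no {name,regex} constraints.
    return set(re.findall(r"\{(\w+)\}", pattern))


def outputs_share_wildcards(outputs):
    """True if every output pattern uses exactly the same wildcard names."""
    names = [wildcard_names(pattern) for pattern in outputs]
    return all(n == names[0] for n in names)


# Output patterns of rule copy_output as posted in the question
copy_outputs = [
    "{workdir}/c_together/{sample_nmbr}.c",
    "{workdir}/e_together/{sample_nmbr}.e",
    "{workdir}/copying_done.txt",  # has no {sample_nmbr}
]

print(outputs_share_wildcards(copy_outputs))      # False
print(outputs_share_wildcards(copy_outputs[:2]))  # True
```

With the flag file removed, the two remaining outputs agree on {workdir} and {sample_nmbr}, which is the situation the error message asks for.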
Including the checkpoint file as a param does not work either:
    params:
        checkpoint_extract = "{workdir}/extract_done.txt"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: "{workdir}/copying_done.txt"
From this I get the error:
Missing input files for rule clean_data: /path/to/workdir/copying_done.txt
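A likely reading of this second error: a path listed under params is only a value handed to the job. It is not declared as an output, so no rule is known to create copying_done.txt. The sketch below imitates that lookup in plain Python. `RULE_OUTPUTS`, `pattern_to_regex` and `find_producer` are hypothetical names, and the matching is a simplification of what Snakemake actually does.

```python
import re

# Declared output patterns per rule, taken from the Snakefile in the question
# (with the flag file moved to params, so it is no longer an output).
RULE_OUTPUTS = {
    "run_tool": [
        "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    ],
    "copy_output": [
        "{workdir}/c_together/{sample_nmbr}.c",
        "{workdir}/e_together/{sample_nmbr}.e",
    ],
    "clean_data": ["{workdir}/last_tool_output.txt"],
}


def pattern_to_regex(pattern):
    """Turn a pattern with {name} wildcards into a compiled regex."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions literal text.
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    return re.compile(f"^{regex}$")


def find_producer(target):
    """Return the first rule with an output pattern matching target, else None."""
    for rule, patterns in RULE_OUTPUTS.items():
        if any(pattern_to_regex(p).match(target) for p in patterns):
            return rule
    return None


print(find_producer("/path/to/workdir/c_together/1.c"))    # copy_output
print(find_producer("/path/to/workdir/copying_done.txt"))  # None
```

The second lookup finds no producer, which corresponds to the "Missing input files" report.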
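I am completely stuck on this and have not found anywhere online how it could be solved. I know that the wildcards need to be the same unless you use some more complex code to get around it, but I have not been able to reproduce that. If someone could tell me how to change my code or my Snakefile setup to make this work, I would be very grateful.
Thanks in advance,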
马蒂斯
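I had to fix several lines to get it to execute, so I may well have changed something that matters for your original question. The two main points: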
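One change that is visible when comparing the two versions is the use of glob_wildcards(SAMPLES).sample_nmbr instead of the bare glob_wildcards(SAMPLES). As far as I know, glob_wildcards returns a namedtuple with one list per wildcard, so calling set() on the whole object fails. The sketch below uses a stand-in namedtuple with made-up sample values rather than a real call, so it runs without Snakemake installed.

```python
from collections import namedtuple

# Stand-in for the object glob_wildcards returns: one list per wildcard.
# The values are invented for illustration.
Wildcards = namedtuple("Wildcards", ["sample_nmbr", "extension"])
found = Wildcards(sample_nmbr=["1", "2", "2"], extension=["txt", "txt", "log"])

# set(found) iterates over the fields, which are lists and thus unhashable.
try:
    tuple(set(found))
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Picking the field first yields the sample names themselves.
unique_samples = tuple(sorted(set(found.sample_nmbr)))
print(unique_samples)  # ('1', '2')
```

Sorting is optional. It only makes the order of the samples reproducible between runs.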
See below for a version that seems to work (again, maybe I have missed the point):
# Import functions
import os
from pathlib import Path
import subprocess
# Global variables
WORKDIR = Path(config['workdir'])
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"
# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.txt'
SAMPLE_NMBR = glob_wildcards(SAMPLES).sample_nmbr
print(SAMPLE_NMBR)
# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))
# Define output for every rule
rule all:
    input:
        expand("{workdir}/{sample_nmbr}_last_tool_output.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR)
INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'
# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {output.tool_c_output}
        tool_e {input.input_file} {output.tool_e_output}
        """
# copy output of the 2 tools to the same respective directory as preparation of the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e"
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output} {output.c_copied}
        cp {input.e_output} {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e"
    output:
        output_that_I_need = "{workdir}/{sample_nmbr}_last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """
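Note that the version above produces one final file per sample. If the final scripts are meant to run once over all samples, as the question describes, a commonly used alternative is to list every copied file as input of clean_data, so that it can only start after all copy_output jobs have finished and no flag file is needed. Below is a minimal sketch of building that list in plain Python. `copied_files` is a hypothetical helper, and the rule shown in the comments is an untested assumption about how it could be wired in.

```python
def copied_files(workdir, samples):
    """List every file that rule copy_output is expected to create."""
    files = []
    for sample in samples:
        files.append(f"{workdir}/c_together/{sample}.c")
        files.append(f"{workdir}/e_together/{sample}.e")
    return files


# In the Snakefile the list could serve as the input of the final rule
# (sketch, untested; rule all would then request last_tool_output.txt):
#
# rule clean_data:
#     input:
#         copied_files(WORKDIR, SAMPLE_NMBR)
#     output:
#         output_that_I_need = f"{WORKDIR}/last_tool_output.txt"

print(copied_files("/path/to/workdir", ["1", "2"]))
```

Snakemake's own expand() should produce an equivalent list. The helper only makes explicit which files the final rule would wait for.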