How do I set sufficient resources for pipe output in a snakemake workflow when using --slurm?


I have successfully run a fairly large Snakemake workflow on our slurm cluster using --profile together with --slurm, but if I leave either of them out of the command, I get errors and failed jobs.

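I am trying to eliminate the need for the --profile option and run with --slurm alone. I am new to this. I solved the first problem I ran into, but now I am dealing with a "not enough resources" error for 1 job and I am a bit stuck.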

This is the snakemake command I am currently using:

snakemake --use-conda --notemp --printshellcmds --directory .tests/test_1 --verbose --slurm --jobs unlimited --latency-wait 300

This is the rule that fails:

rule summed_bedgraph_to_bigwig:
    input:
        bdg=rules.sort_summed_bedgraph.output,
        # None of the reference wildcards are used in the output, so they must all be expanded
        lengths=expand(
            rules.create_chromosome_lengths_file.output,
            refdir=config["reference_directory"],
            reference=config["reference_name"],
        ),
    output:
        temp("results/bigwigs/all_summed.bw"),
    log:
        std="results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout",
        err="results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    shell:
        "bedGraphToBigWig {input.bdg:q} {input.lengths:q} {output:q} 1> {log.std:q} 2> {log.err:q}"

This is the failure I see in the console:

Error in rule summed_bedgraph_to_bigwig:
    message: SLURM-job '1152867' failed, SLURM status is: 'FAILED'
    jobid: 64
    input: results/bigwigs/all_sorted.bedGraph, input/reference/mm10.chr19.60m-end.chr_lengths.tsv
    output: results/bigwigs/all_summed.bw
    log: results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout, results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr, .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log (check log file(s) for error details)
    conda-env: /Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/conda/f9952137484b0d8eeee5d0959433abf7_
    shell:
        bedGraphToBigWig results/bigwigs/all_sorted.bedGraph input/reference/mm10.chr19.60m-end.chr_lengths.tsv results/bigwigs/all_summed.bw 1> results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout 2> results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

This is the console report of that job's submission (including the sbatch command printed because of the --verbose flag):

rule summed_bedgraph_to_bigwig:
    input: results/bigwigs/all_sorted.bedGraph, input/reference/mm10.chr19.60m-end.chr_lengths.tsv
    output: results/bigwigs/all_summed.bw
    log: results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout, results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
    jobid: 64
    reason: Missing output files: results/bigwigs/all_summed.bw; Input files updated by another job: results/bigwigs/all_sorted.bedGraph
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>

bedGraphToBigWig results/bigwigs/all_sorted.bedGraph input/reference/mm10.chr19.60m-end.chr_lengths.tsv results/bigwigs/all_summed.bw 1> results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout 2> results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources.
sbatch call: sbatch --job-name a93cfa76-5e46-4389-bf49-7860ef4442df -o .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/%j.log --export=ALL -A main -p main --mem 1000 --cpus-per-task=1 -D /Genomics/argo/users/rleach/ATACCompendium/.tests/test_1 --wrap='/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD/bin/python3.11 -m snakemake --snakefile '"'"'/Genomics/argo/users/rleach/ATACCompendium/workflow/Snakefile'"'"' --target-jobs '"'"'summed_bedgraph_to_bigwig:'"'"' --allowed-rules '"'"'summed_bedgraph_to_bigwig'"'"' --cores '"'"'all'"'"' --attempt 1 --force-use-threads  --resources '"'"'mem_mb=1000'"'"' '"'"'mem_mib=954'"'"' '"'"'disk_mb=1000'"'"' '"'"'disk_mib=954'"'"' --wait-for-files '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/tmp.x4tqnm8d'"'"' '"'"'results/bigwigs/all_sorted.bedGraph'"'"' '"'"'input/reference/mm10.chr19.60m-end.chr_lengths.tsv'"'"' '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/conda/f9952137484b0d8eeee5d0959433abf7_'"'"' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers '"'"'params'"'"' '"'"'software-env'"'"' '"'"'code'"'"' '"'"'mtime'"'"' '"'"'input'"'"' --skip-script-cleanup  --use-conda  --conda-frontend '"'"'mamba'"'"' --conda-base-path '"'"'/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD'"'"' --wrapper-prefix '"'"'https://github.com/snakemake/snakemake-wrappers/raw/'"'"' --printshellcmds  --latency-wait 300 --scheduler '"'"'ilp'"'"' --scheduler-solver-path '"'"'/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD/bin'"'"' --default-resources '"'"'mem_mb=max(2*input.size_mb, 1000)'"'"' '"'"'disk_mb=max(2*input.size_mb, 1000)'"'"' '"'"'tmpdir=system_tmpdir'"'"' --directory '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1'"'"'  --slurm-jobstep --jobs 1 --mode 2'
Job 64 has been submitted with SLURM jobid 1152867 (log: .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log).

The stdout and stderr files are empty, but this is the slurm log:

$ cat .tests/test_1/.snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954
Select jobs to execute...
WorkflowError:
Error grouping resources in group '508bacf5-7b11-4068-96f3-6f6d733f5e32': Not enough resources were provided. This error is typically caused by a Pipe group requiring too many resources. Note that resources are summed across every member of the pipe group, except for ['runtime'], which is calculated via max(). Excess Resources:
    mem_mib: 1908/954
    mem_mb: 2000/1000
    _cores: 2/1
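
To check my reading of that message, here is a minimal sketch of the arithmetic it describes: every resource is summed across the members of the pipe group, except runtime, which takes the maximum. The function names are made up for illustration and are not Snakemake's API. The assumption that the group has two members, each evaluated at the default 1000 MB and 1 core, is only my inference from the numbers in the log. Note that this would not match the mem_mb=48000 declared in sort_summed_bedgraph.

```python
# Sketch of the aggregation rule quoted in the WorkflowError above.
# Function names and per-job figures are illustrative assumptions, not Snakemake internals.

def group_resources(members, max_keys=("runtime",)):
    """Sum each resource across group members; keys in max_keys use max() instead."""
    total = {}
    for res in members:
        for key, value in res.items():
            if key in max_keys:
                total[key] = max(total.get(key, 0), value)
            else:
                total[key] = total.get(key, 0) + value
    return total


def excess(required, provided):
    """Return {resource: (required, provided)} for each resource that exceeds its limit."""
    return {
        key: (value, provided[key])
        for key, value in required.items()
        if key in provided and value > provided[key]
    }


# Hypothetical: two pipe-group members, each at the defaults shown in the job report.
member = {"mem_mb": 1000, "mem_mib": 954, "_cores": 1}
required = group_resources([member, member])

# What the submitted job was given, per "Provided resources" and "Provided cores" in the log.
provided = {"mem_mb": 1000, "mem_mib": 954, "_cores": 1}

print(excess(required, provided))
```

Under those assumptions the sketch reproduces the three "Excess Resources" lines (2000/1000, 1908/954 and 2/1), which suggests the submitted job was sized for a single member while the check inside it summed over the whole group.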

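I am confused by the hint in the WorkflowError. "Not enough resources" is caused by a group "requiring too many resources"? I cannot make sense of that. The rule that produces the input for the failing rule is indeed part of a group that is created automatically because of the pipe defined in the rule above it. For context, these are the rules that precede the one with the error: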

rule bigwigs_to_summed_bedgraph:
    input:
        expand(
            "results/bigwigs/{dataset}_peak_coverage.bw",
            dataset=DATASETS,
        ),
    output:
        pipe("results/bigwigs/all_summed.bedGraph"),
    params:
        nargs=len(DATASETS),
    log:
        std="results/bigwigs/logs/bigwigs_to_summed_bedgraph.stdout",
        err="results/bigwigs/logs/bigwigs_to_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    shell:
        # bigwigmerge requires a minimum of 2 files, but the original R script
        # supports 1, so to support 1 file here, we use bigwigtobedgraph
        """
        if [ {params.nargs:q} -eq 1 ]; then \
            bigWigToBedGraph {input:q} {output:q} 1> {log.std:q} 2> {log.err:q}; \
        else \
            bigWigMerge {input:q} {output:q} 1> {log.std:q} 2> {log.err:q}; \
        fi
        """


rule sort_summed_bedgraph:
    input:
        rules.bigwigs_to_summed_bedgraph.output,
    output:
        temp("results/bigwigs/all_sorted.bedGraph"),
    params:
        buffsize=lambda w, resources: resources.mem_mb - 1000,
    log:
        "results/bigwigs/logs/sort_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    resources:
        mem_mb=48000,
    shell:
        "sort -k1,1 -k2,2n --buffer-size={params.buffsize}M {input:q} 2> {log:q} 1> {output:q}"

I have tried setting --cores 12 and --local-cores 12, but I still get the same error.

What do I have to set on the command line to get past this error?

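UPDATE: I have narrowed the problem down to a recent, seemingly innocuous change that I had just made to the rule (sort_summed_bedgraph) that precedes the rule with the error (summed_bedgraph_to_bigwig). I had added a param so that I could compute the value I wanted to supply to --buffer-size. I just reverted that change, and the rule now works on our slurm cluster. This is the change that made it work: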

rule sort_summed_bedgraph:
    input:
        rules.bigwigs_to_summed_bedgraph.output,
    output:
        temp("results/bigwigs/all_sorted.bedGraph"),
    # params:
    #     buffsize=lambda w, resources: resources.mem_mb - 1000,
    log:
        "results/bigwigs/logs/sort_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    resources:
        mem_mb=48000,
    shell:
        # "sort -k1,1 -k2,2n --buffer-size={params.buffsize}M {input:q} 2> {log:q} 1> {output:q}"
        "sort -k1,1 -k2,2n --buffer-size=47000M {input:q} 2> {log:q} 1> {output:q}"
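
For reference, the hard-coded 47000 is exactly what the lambda yields when mem_mb is 48000. A minimal sketch with stand-in objects follows. SimpleNamespace only mimics the wildcards and resources arguments for illustration and is not what Snakemake actually passes. The second call uses the default mem_mb=1000 from the failing job's report. Whether Snakemake ever evaluates the param with that default is not something the logs confirm.

```python
from types import SimpleNamespace

# Same callable as the params entry in sort_summed_bedgraph.
buffsize = lambda w, resources: resources.mem_mb - 1000

# Stand-ins for the objects Snakemake passes in (an assumption for illustration).
wildcards = SimpleNamespace()
declared = SimpleNamespace(mem_mb=48000)  # value from the rule's resources block
default = SimpleNamespace(mem_mb=1000)    # value shown in the failing job's report

print(buffsize(wildcards, declared))  # 47000
print(buffsize(wildcards, default))   # 0
```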

But why would adding a param that uses a lambda to compute its value from the resources cause this error? Is it a bug, or is there some sense to it?
