Nextflow - I/O 和索引问题

Question

我是 nextflow 的新手，正在尝试构建一个管道。我在将输出传递到新进程、正确捕获输出以及使用 BWA-MEM2 索引指定方面遇到问题。

这是我的脚本：

#!/usr/bin/env nextflow
// Declare syntax version
nextflow.enable.dsl = 2

// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"

workflow {

    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 

    trimAndQC(read_pairs_ch)
    readMapping(params.refgenome, trimAndQC.out.trimmedreads)
}

process trimAndQC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    shell:
    """
    trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
    """
}

// BWA read mapping
process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path(refgenome)
    tuple val(sample_id), val(trimmedreads)

    output:
    path('${sample_id}_sorted.bam'), emit: sorted_bam

    script:
    def indexbase = refgenome.baseName
    
    """
    bwa-mem2 mem -t 2 '${indexbase}' '${trimmedreads}' | samtools sort -@2 -o '${sample_id}_sorted.bam' -
    """
}

首先，我收到此警告：

WARN: Input tuple does not match input set cardinality declared by process `readMapping` -- offending value: [sample_id, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_2_val_2.fq.gz]

这令人困惑，因为我的输出是一个元组，因此应该与

trimAndQC

的输出基数相匹配。顺便说一句，我注意到大多数脚本将输入元组（在本例中为

path(fastq)

）作为单个参数传递，但为了使命令正常工作，我必须使用

.get()

访问方法来“解压”元组，所以我不确定我在那里做错了什么。

其次，我收到关于

readMapping

进程输出的错误：

ERROR ~ Error executing process > 'readMapping (sample_id)'

Caused by:
  Missing output file(s) `${sample_id}_sorted.bam` expected by process `readMapping (sample_id)`

Command executed:

  bwa-mem2 mem -t 2 'hg38' '/Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz' | samtools sort -@2 -o sample_id_sorted.bam -

Command exit status:
  0

Command output:
  (empty)

我认为这可能是由于上面的基数警告而导致

sample_id

为空。

最后，我似乎无法为 BWA-MEM2 命令正确指定索引。错误：

Command error:
  Looking to launch executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42", simd = .sse42
  Launching executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42"
  -----------------------------
  Executing in SSE4.2 mode!!
  -----------------------------
  * SA compression enabled with xfactor: 8
  * Ref file: hg38
  * Entering FMI_search
  ERROR! Unable to open the file: hg38.bwt.2bit.64

Work dir:
  /Users/me/Desktop/fastq/work/23/f8b3b6eace6f11f8619ce435b16725

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

在我的 RefGenome 目录中，我在 AWS EC2 上索引了 hg38.fa 并下载到本地以构建管道原型。这些文件看起来具有以下扩展名：

hg38.fa.{0123,amb,ann,pac,bwt.2bit.64}

，如果我只是在终端中运行

bwa-mem2 mem

命令，似乎工作正常。

我尝试适应多个SO答案（例如，如何将Channel.fromFilePairs从一个进程链接到另一个进程，nextflow：如何传递之前进程中创建的文件的目录路径），我已经完成了EMBL培训（Nextflow），浏览了 nf-core 和 github 上的管道，并在提出这个问题之前花时间阅读了 nextflow 文档。我似乎误解了 nextflow 的一些固有原理，因此任何关于关注重点的建议将不胜感激。

最后，为了清晰性和可重复性，我使用 NCBI SRA (SRX317818) 的以下 fastq 进行原型设计。这些是我通过 Anconda 环境使用的 nextflow、trim-galore 和 bwa-mem2 的版本：

bwa-mem2                  2.2.1                hdf58011_5    bioconda
nextflow                  23.10.1              hdfd78af_0    bioconda
trim-galore               0.6.10               hdfd78af_0    bioconda

我已经在 Biostars 上交叉发布了这个问题（Nextflow - I/O 和索引问题）。根据这篇文章，交叉发布似乎在一定程度上是可以容忍的（交叉发布是一个问题吗......）。如果有必要我可以删除帖子。

Answer 1

感谢 Biostars 的 @dthorbur 帮助解决了这个问题。

我终于完全解决了我的所有问题，并想在这里记录下来，以防其他人将来遇到此类问题。

第一个问题是基数问题。正如@dthorbur 指出的，我的问题不是在进程之间保持相同的元组结构。上游过程的输出：

output:
tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz')

而下游流程的输入：

input:
tuple val(sample_id), val(trimmedreads)

结果很容易修复：

input:
tuple val(sample_id), path(read1), path(read2)

我遇到的下一个问题是将 BWA 的索引提取到我的工作环境中。我发现最直接、最简单的方法就是分别指定索引目录和fasta文件，如下所示：

// Set paths for index directory and the reference fasta
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
------------------------
...
// Designate the index within the processes like this
script:
"""
bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
"""
------------------------

// Define the channels and pass them to the process
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)
...
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)

我的最后一个问题是让 bwa_align 进程正常运行。我最初的错误与没有输出文件有关，我意识到这是输出文件周围和脚本块内的单引号的问题：

output:
path('${sample_id}_sorted.bam')

Nextflow 没有正确解释。一旦我切换到双引号，一切都很好。因此，我在编写 Nextflow 脚本时倾向于严格使用双引号。上面记录的各种错误消息证明了我的最后一个问题是通过创建新的 Anaconda 环境并重新安装所有依赖项来解决的。一个非常不满意的答案，没有触及问题的根源，但它仍然帮助了我。

这是我最终起作用的最终脚本：

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

// Define parameters
params.fastq = "$projectDir/fastq/*_{1,2}*"
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"

workflow {

    fastq_ch = Channel.fromFilePairs(params.fastq)
    index_ch = Channel.fromPath(params.index_dir)
    ref_ch = Channel.of(params.ref)

    trim_QC(fastq_ch)
    bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
}

process trim_QC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads

    script:
    def (read1, read2) = fastq
    """
    trim_galore --paired --fastqc ${read1} ${read2}
    """
}

process bwa_align {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path index_dir
    val ref
    tuple val(sample_id), path(read1), path(read2)

    output:
    path("${sample_id}_sorted.bam"), emit: sorted_bam

    script:
    """
    bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 o "${sample_id}_sorted.bam" -
    """
}

Nextflow - I/O 和索引问题

问题描述投票：0回答：1

1个回答

最新问题

Nextflow - I/O 和索引问题

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1