Snakemake:处理目录中的配对读取

问题描述 投票:0回答:1

存在一个包含配对读取的目录,配对的名称和数量未知,可能的文件扩展名是.fastq或.fastq.gz。

我需要有关如何最好地在目录中进行配对读取以便在 Snakemake 管道中进行处理的建议。

bioinformatics snakemake
1个回答
0
投票

有多种方法可以解决这个问题。我的首选方法是使用 CSV 来指定读取路径。这样做的另一个好处是,您可以将元数据添加到每组读取中,然后您可以在工作流程中访问这些元数据。

以下是三个样本的映射读取示例:

A, B, C

目录结构:

.
├── Snakefile
├── data
│   ├── genome
│   │   └── ref.fa
│   └── reads
│       ├── a_R1.fastq.gz
│       ├── a_R2.fastq.gz
│       ├── b_R1.fastq
│       ├── b_R2.fastq
│       ├── c_R1.fq.gz
│       └── c_R2.fq.gz
└── sample_reads.csv

sample_reads.csv

sample_id,read1,read2
a,data/reads/a_R1.fastq.gz,data/reads/a_R2.fastq.gz
b,data/reads/b_R1.fastq,data/reads/b_R2.fastq
c,data/reads/c_R1.fq.gz,data/reads/c_R2.fq.gz

Snakefile

import pandas as pd

sample_reads = pd.read_csv("./sample_reads.csv")


rule all:
    """
    Get all sample_ids from dataframe and use in expand to make sam 
    files for each sample.
    """
    input:
        expand(
            "results/alignments/{sample}.sam",
            sample=sample_reads["sample_id"].tolist(),
        ),


rule align_reads:
    """
    Use sample wildcard to lookup read paths in pandas dataframe.
    Need to use input function to access sample wildcard value.
    """
    input:
        read_1=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
            "read1"
        ].item(),
        read_2=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
            "read2"
        ].item(),
        ref="data/genome/ref.fa",
    output:
        sam="results/alignments/{sample}.sam",
    shell:
        "bwa mem {input.ref} {input.read_1} {input.read_2} > {output.sam}"
© www.soinside.com 2019 - 2024. All rights reserved.