I'm using snakemake to analyze several hundred files obtained from genome sequencing. Depending on each sample's sequencing quality, the platform had to sequence some of them a second time, which results in four to eight files per sample.
They look like this:
multiple_subdirectories/BM1P149_A2/111353_BY/
├── 111353_BY_run757_AATGGTAG_S238_L001_R1_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L001_R2_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L002_R1_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L002_R2_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L001_R1_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L001_R2_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L002_R1_001.fastq.gz
└── 111353_BY_run758_AATGGTAG_S85_L002_R2_001.fastq.gz
Inside a folder like BM1P149_A2 there are 96 samples such as 111353_BY, each with 4 or 8 files, and there are 6 folders like BM1P149_A2 in total.
I use snakemake wildcards to submit all my data to the same analysis. For the first two steps of my analysis, trimming and mapping, I had no problem, because every file is processed individually; the pipeline looks like this for the trimming:
(PLAQUES, SAMPLES, IDS, PAIRED,) = glob_wildcards(config["dataDirInput"]+"{plaque}/{sample}/{id}_R{paire}_001.fastq.gz")

rule all:
    input:
        expand(config["dataDirOutput"]+"{plaque}/{sample}/{id}_R{paire}_001_unpaired.fastq.gz", zip, plaque=PLAQUES, sample=SAMPLES, id=IDS, paire=PAIRED)

rule trimmomatic:
    input:
        read1=config["dataDirInput"]+"{plaque}/{sample}/{id}_R1_001.fastq.gz",
        read2=config["dataDirInput"]+"{plaque}/{sample}/{id}_R2_001.fastq.gz"
    output:
        read1_paired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R1_001_paired.fastq.gz",
        read1_unpaired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R1_001_unpaired.fastq.gz",
        read2_paired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R2_001_paired.fastq.gz",
        read2_unpaired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R2_001_unpaired.fastq.gz"
    threads: 20
    resources:
        mem_mb = 20,
        time_min = 300
    params:
        env=config["trimmomatic_env"]  # I use conda to call trimmomatic; this is how I point the rule at the right environment
    shell: "{params.env} PE {input.read1} {input.read2} {output.read1_paired} {output.read1_unpaired} {output.read2_paired} {output.read2_unpaired} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36"
# in the config file, which I pass when running the snakefile above:
trimmomatic_env:
    "/directory_of_the_conda_environment"
dataDirInput:
    "/multiple_subdirectories/Sample_GData"
dataDirOutput:
    "/multiple_subdirectories/Trimmed"
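As a side note, the `zip` mode of `expand` pairs the wildcard lists element-wise rather than taking their product. A minimal plain-Python sketch of what that pairing does (file names here are invented for illustration, not taken from the real data):

```python
# Plain-Python illustration of expand(..., zip, ...): the lists returned
# by glob_wildcards are paired element-wise, one target per input file.
PLAQUES = ["BM1P149_A2", "BM1P149_A2"]
SAMPLES = ["111353_BY", "111353_BY"]
IDS = ["111353_BY_run757_AATGGTAG_S238_L001",
       "111353_BY_run757_AATGGTAG_S238_L001"]
PAIRED = ["1", "2"]

template = "{plaque}/{sample}/{id}_R{paire}_001_unpaired.fastq.gz"
targets = [
    template.format(plaque=p, sample=s, id=i, paire=r)
    for p, s, i, r in zip(PLAQUES, SAMPLES, IDS, PAIRED)
]
print(targets)
```

Without `zip`, `expand` would build every combination of the four lists, requesting targets that do not correspond to any real file.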
For the next steps, after the mapping, I need to combine the mapped data from the two lanes and the two runs. My problem is that, having used wildcards to analyze each file individually at the start, I believe the wildcards are now bound to a single value (one of the two runs) for the rest of the analysis, so I can't use them to merge the files.
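One common workaround is to give the merge rule an input *function*: Snakemake calls it with the wildcards of the requested output, and the function can glob the filesystem to return however many per-run BAMs exist for that sample, so one-run and two-run samples go through the same rule. A sketch under assumed file names and layout (in a snakefile you would write `input: lambda wc: mapped_bams(wc, config["dataDirOutput"])`):

```python
import glob
import os
import tempfile
from types import SimpleNamespace

def mapped_bams(wildcards, outdir):
    """Collect every per-run, per-lane mapped BAM for one sample,
    however many runs it has. The naming scheme is a guess based on
    the files shown in the question."""
    pattern = os.path.join(outdir, wildcards.samp,
                           f"{wildcards.id}_run*_L*_mapped.bam")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no mapped BAMs match {pattern}")
    return files

# Throwaway demonstration: a sample sequenced in a single run (two lanes).
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "111353_BY"))
for lane in ("L001", "L002"):
    open(os.path.join(tmp, "111353_BY",
                      f"111353_BY_run757_{lane}_mapped.bam"), "w").close()

wc = SimpleNamespace(samp="111353_BY", id="111353_BY")
bams = mapped_bams(wc, tmp)
print(len(bams))  # 2 for a one-run sample, 4 for a two-run sample
```

Because the function runs at DAG-construction time, the intermediate BAMs must either already exist or be derivable; an alternative that avoids globbing outputs is to build a sample-to-runs dictionary from `glob_wildcards` on the *input* FASTQs and look the runs up in the function.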
To get around this, what I tried to do is use an if statement in the rule's shell block to count the files and merge either two or four of them, depending on whether the sample has one run or two, but I don't think I really understand how to use shell blocks in a snakefile. Also, I don't find this very elegant, especially since each run sequences a batch of 96 samples, so the pipeline would have to be changed for every batch of 96. This is the pipeline I came up with:
(SAMPLES, IDS, RUNS, TAGS, SSS, LANES, PAIRED,) = glob_wildcards(config["dataDirInput"]+"{samp}/{id}_run{run}_{tag}_{ss}_L{lane}_R{paire}_001.fastq.gz")
rule all:
    input:
        expand(config["dataDirOutput"]+"{samp}/{id}_merged.bam", zip, samp=SAMPLES, id=IDS)
rule trimmomatic:
    input:
        read1=expand(config["dataDirInput"]+"{samp}/{id}_run{run}_{tag}_{ss}_L{lane}_R1_001.fastq.gz", zip, samp=SAMPLES, id=IDS, run=RUNS, tag=TAGS, ss=SSS, lane=LANES),
        read2=expand(config["dataDirInput"]+"{samp}/{id}_run{run}_{tag}_{ss}_L{lane}_R2_001.fastq.gz", zip, samp=SAMPLES, id=IDS, run=RUNS, tag=TAGS, ss=SSS, lane=LANES)
    output:
        read1_paired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R1_001_paired.fastq.gz",
        read1_unpaired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R1_001_unpaired.fastq.gz",
        read2_paired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R2_001_paired.fastq.gz",
        read2_unpaired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R2_001_unpaired.fastq.gz"
    threads: 20
    resources:
        mem_mb = 20,
        time_min = 300
    params:
        env=config["trimmomatic_env"]
    shell: "{params.env} PE {input.read1} {input.read2} {output.read1_paired} {output.read1_unpaired} {output.read2_paired} {output.read2_unpaired} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36"
rule mapping:
    input:
        read1_paired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R1_001_paired.fastq.gz",
        read2_paired=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_R2_001_paired.fastq.gz"
    output:
        mapped=config["dataDirOutput"]+"{samp}/{id}_run{run}_L{lane}_mapped.bam"
    threads: 20
    resources:
        mem_mb = 20,
        time_min = 300
    params:
        env=config["bwa_env"]
    shell: "{params.env} mem -t 8 /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/REF_genomes/TAIR10_chr_all.fasta {input.read1_paired} {input.read2_paired} > {output.mapped}"
rule merging:
    input:
        mapped_L1_run1=config["dataDirOutput"]+"{samp}/{id}_run757_L001_mapped.bam", # run757 and run758 won't both be present for every sample; sometimes there is only one run
        mapped_L2_run1=config["dataDirOutput"]+"{samp}/{id}_run757_L002_mapped.bam", # so I think snakemake will miss two of these inputs when there's a single run
        mapped_L1_run2=config["dataDirOutput"]+"{samp}/{id}_run758_L001_mapped.bam", # and therefore won't be able to continue the job
        mapped_L2_run2=config["dataDirOutput"]+"{samp}/{id}_run758_L002_mapped.bam"
    output:
        merged=config["dataDirOutput"]+"{samp}/{id}_merged.bam"
    threads: 20
    resources:
        mem_mb = 20,
        time_min = 300
    params:
        env=config["samtools_env"]
    shell:
        """
        # count the files already produced for this sample; a single-run
        # sample has fewer of them (nullglob makes an empty glob expand
        # to nothing instead of to the literal pattern)
        shopt -s nullglob
        i=0
        for file in $(dirname {output.merged})/*
        do
            i=$((i+1))
        done
        n=6
        if [ "$i" -eq "$n" ]
        then
            {params.env} merge {output.merged} {input.mapped_L1_run1} {input.mapped_L2_run1}
        else
            {params.env} merge {output.merged} {input.mapped_L1_run1} {input.mapped_L2_run1} {input.mapped_L1_run2} {input.mapped_L2_run2}
        fi
        """
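For what it's worth, when a rule's input is a plain list of files, Snakemake expands `{input}` in the shell command to all of them separated by spaces, so with a variable-length input list a single `samtools merge {output} {input}` covers both the one-run and the two-run case, and the bash if/else disappears. A plain-Python sketch of that expansion (file names are invented):

```python
# Snakemake joins a list-valued {input} with spaces when formatting the
# shell command; one command then handles any number of runs per sample.
def merge_command(output, inputs):
    return "samtools merge {} {}".format(output, " ".join(inputs))

one_run = ["s_run757_L001_mapped.bam", "s_run757_L002_mapped.bam"]
two_runs = one_run + ["s_run758_L001_mapped.bam", "s_run758_L002_mapped.bam"]

print(merge_command("s_merged.bam", one_run))
print(merge_command("s_merged.bam", two_runs))
```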
Overall, I think I have a poor understanding of wildcards and I'm using them inefficiently; I'd like to know if you have any tips to overcome my problem and improve this. Thanks.