snakemake中如何使用通配符合并不同通配符的文件?

问题描述 投票:0回答:0

我正在使用 snakemake 来分析从基因组测序中获得的数百个文件。根据每个样本的测序质量,平台必须对其中一些样本进行第二次测序,从而产生四到八个文件。

长这样:

multiple_subdirectories/BM1P149_A2/111353_BY/
├── 111353_BY_run757_AATGGTAG_S238_L001_R1_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L001_R2_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L002_R1_001.fastq.gz
├── 111353_BY_run757_AATGGTAG_S238_L002_R2_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L001_R1_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L001_R2_001.fastq.gz
├── 111353_BY_run758_AATGGTAG_S85_L002_R1_001.fastq.gz
└── 111353_BY_run758_AATGGTAG_S85_L002_R2_001.fastq.gz

在文件夹 BM1P149_A2 中,有 96 个样本,如 111353_BY,有 4 个或 8 个文件,以及 6 个文件夹,如 BM1P149_A2。

我正在使用 snakemake 通配符将我的所有数据提交给相同的分析。对于我分析的前两个步骤,修剪和映射,我没有遇到任何问题,因为所有文件都是单独操作的,管道看起来像修剪的那样:

(PLAQUES,SAMPLES,IDS,PAIRED,) = glob_wildcards(config["dataDirInput"]+"{plaque}/{sample}/{id}_R{paire}_001.fastq.gz")

rule all:
  input: 
    expand(config["dataDirOutput"]+"{plaque}/{sample}/{id}_R{paire}_001_unpaired.fastq.gz", zip, plaque = PLAQUES, sample = SAMPLES, id = IDS, paire = PAIRED)

rule trimmomatic:
  input: 
    read1=config["dataDirInput"]+"{plaque}/{sample}/{id}_R1_001.fastq.gz",
    read2=config["dataDirInput"]+"{plaque}/{sample}/{id}_R2_001.fastq.gz"
  output: 
    read1_paired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R1_001_paired.fastq.gz",
    read1_unpaired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R1_001_unpaired.fastq.gz",
    read2_paired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R2_001_paired.fastq.gz",
    read2_unpaired=config["dataDirOutput"]+"{plaque}/{sample}/{id}_R2_001_unpaired.fastq.gz"
  threads: 20  
  resources:
    mem_mb = 20,
    time_min = 300 
  params: 
    env=config["trimmomatic_env"] # Here I'm using conda to call software trimmomatic for the trimming, and it's my way of associating  
  shell: "{params.env} PE {input.read1} {input.read2} {output.read1_paired} {output.read1_unpaired} {output.read2_paired} {output.read2_unpaired} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True
 LEADING:3 TRAILING:3 MINLEN:36"
 
# in the config file, that I specify when I run the snakefile above

trimmomatic_env:
   "/directory_of_the_conda_environment". 

dataDirInput:                                                                                                                                                                              
   "/multiple_subdirectories/Sample_GData"  
                                                                                                                                                                                           
dataDirOutput:                                                                                                                                                                             
   "/multiple_subdirectories/Trimmed"    

对于接下来的步骤,在映射之后,我需要结合来自两条车道和两次运行的映射数据。我遇到的问题是,在开始时使用通配符分析每个文件,我相信对于其余的分析,通配符被设置为一个值(两次运行之一),所以我不能使用通配符合并它们。

为了克服这个问题,我试图做的是在规则的 shell 块中使用 if 语句来计算文件数量并合并两个或四个文件,无论这个样本有两次运行还是只有一次,但我真的不知道我想我了解如何在 snakefile 中使用 shell 块。另外,我认为这不是很优雅,特别是因为每次运行都对一批 96 个样本进行测序,因此,必须为每批 96 个样本更改管道。这是我设计的管道:

(SAMPLES,IDS,RUNS,TAGS,SSS,LANES,PAIRED,) = glob_wildcards(config["dataDirInput"]+"{samp}/{id}_run[run]_[tag]_[ss]_L[lane]_R{paire}_001.fastq.gz")

rule all:
  input: 
    expand(config["dataDirOutput"]+"{samp}/{id}_merged.bam", zip, samp = SAMPLES, id = IDS) 

rule trimmomatic:
  input: 
    read1=expand(config["dataDirInput"]+"{samp}/{id}_run[run]_[tag]_[ss]_L[lane]_R1_001.fastq.gz", zip, samp = SAMPLES, id = IDS, run = RUNS, tag = TAGS, ss = SSS, lane = LANES),
    read2=expand(config["dataDirInput"]+"{samp}/{id}_run[run]_[tag]_[ss]_L[lane]_R2_001.fastq.gz", zip, samp = SAMPLES, id = IDS, run = RUNS, tag = TAGS, ss = SSS, lane = LANES)
  output: 
    read1_paired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R1_001_paired.fastq.gz",
    read1_unpaired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R1_001_unpaired.fastq.gz",
    read2_paired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R2_001_paired.fastq.gz",
    read2_unpaired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R2_001_unpaired.fastq.gz"
  threads: 20
  resources:
    mem_mb = 20,
    time_min = 300
  params: 
    env=config["trimmomatic_env"]
  shell: "{params.env} PE {input.read1} {input.read2} {output.read1_paired} {output.read1_unpaired} {output.read2_paired} {output.read2_unpaired} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True
 LEADING:3 TRAILING:3 MINLEN:36" 
 
 
rule mapping:
  input:  
    read1_paired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R1_001_paired.fastq.gz", 
    read2_paired=config["dataDirOutput"]+"{samp}/{id}_run[run]_L[lane]_R2_001_paired.fastq.gz"
  output: 
    mapped=config["dataDirOutput"]+"{sample}/{id}_run[run]_L[lane]_mapped.bam"
  threads: 20
  resources:
    mem_mb = 20,
    time_min = 300
  params: 
    env=config["bwa_env"]
  shell:"{params.env} mem -t 8 /store/EQUIPES/MRP/Partage/Arabidopsis_commun/Genome_sequences/REF_genomes/TAIR10_chr_all.fasta {input.read1_paired} {input.read2_paired} > {output.mapped}"


rule merging: 
   input:  
     mapped_L1_run1=config["dataDirOutput"]+"{sample}/{id}_run757_L001_mapped.bam", # here, run757 and run758 won't be present in every samples, sometimes only one
     mapped_L2_run1=config["dataDirOutput"]+"{sample}/{id}_run757_L002_mapped.bam", # therefore, I think snakemake will miss two inputs when there's a single run
     mapped_L1_run2=config["dataDirOutput"]+"{sample}/{id}_run758_L001_mapped.bam", # and therefore, won't be able to continue the job
     mapped_L2_run2=config["dataDirOutput"]+"{sample}/{id}_run758_L002_mapped.bam"
   output:  
     merged=config["dataDirOutput"]+"{sample}/{id}_merged.bam"
   threads: 20
   resources:
     mem_mb = 20,
     time_min = 300
   params: 
     env=config["samtools_env"]
   shell:"
   shopt -nullglob  
   for file in config["dataDirOutput"]+"{samp}" 
   do
        i=$((i+1))
   done
   n = 6    
   if [ $i = $n ]
   then 
    {params.env} merge {output.merged} {input.mapped_L1_run1} {input.mapped_L2_run1}
  
   else
    {params.env} merge {output.merged} {input.mapped_L1_run1} {input.mapped_L2_run1} {input.mapped_L1_run2} {input.mapped_L2_run2}
  
   fi
  
   "

总的来说,我认为我对通配符的使用理解很差,我使用它的方式很低效,我想知道你是否有任何提示来克服我的问题并改进它,谢谢。

python bash workflow wildcard snakemake
© www.soinside.com 2019 - 2024. All rights reserved.