如何只在TSV文件选择一个行的值？

Question

编辑：我要补充一点，这个问题基本上是正是我问：How to use list in Snakemake Tabular configuration, for describing of sequencing units for bioinformatic pipeline 问题是，当我尝试它适应我Snakefile我得到一个错误信息说'function' object has no attribute 'unique'指的是以下行：samples= list(units_table.SampleSM.unique()) 而在同一个线程约翰内斯链接到他的GATK管道也使用TSV文件用于此目的，但坦白说，我不明白这是怎么回事所以很遗憾没有帮助。如果我能得到我的Snakefile一些输入下面我会很感激。

我需要通过基于这样的TSV文件填充变量开始我的工作流程：

flowcell    sample  library lane    R1  R2
FlowCellX   SAMPLE1 libZ    L001    fastq/FCX/Sample1_L001.R1.fastq.gz  fastq/FCX/Sample1_L001.R2.fastq.gz
FlowCellX   SAMPLE1 libZ    L002    fastq/FCX/Sample1_L002.R1.fastq.gz  fastq/FCX/Sample1_L002.R2.fastq.gz
FlowCellX   SAMPLE2 libZ    L001    fastq/FCX/Sample2_L001.R1.fastq.gz  fastq/FCX/Sample2_L001.R2.fastq.gz
FlowCellX   SAMPLE2 libZ    L002    fastq/FCX/Sample2_L002.R1.fastq.gz  fastq/FCX/Sample2_L002.R2.fastq.gz
FlowCellY   SAMPLE1 libX    L001    fastq/FCY/Sample1_L001.R1.fastq.gz  fastq/FCY/Sample1_L001.R2.fastq.gz

我的问题是，我不知道如何从每对文件中的一行，使其只能选择值，其原因是，我需要通过选择不同的R1和R2的所有字段的值来构建一个读组。

我能让它目前做的唯一一件事就是每个条目从每一行和列在流动室，样品库，并结合车道，但我只是想它来选择每对输入文件的一行条目。

这里是我到目前为止的代码：

import pandas as pd
from snakemake.utils import validate, min_version

samples = pd.read_table("samples.tsv", sep='\t', dtype=str).set_index(["flowcell", "sample"], drop=False)

samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])  # enforce str in index


def get_fastq1(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample), ["R1"]].dropna()
def get_fastq2(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample), ["R2"]].dropna()

rule all:
    input:
        expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']),

rule BwaMem:
    input:
        fasta = "/references/Homo_sapiens_assembly38.fasta",
        fastq1 = get_fastq1,
        fastq2 = get_fastq2,
    params:
        rgs = repr('@RG:\tID:{flowcell}.{lane}\tSM:{sample}\tLB:PlaceHolder\tPU:{flowcell}.{lane}.{sample}\tPL:Illumina')
    output:
        "Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam",
    shell:
        "bwa mem -M -t 12 {input.fasta} \
        -R {params.rgs} \
        {input.fastq1} \
        {input.fastq2} | \
        samtools view -Sb - > {output}"

这个作品，如果我只有一个行中的TSV文件，但是当我添加第二行，所以我有这样的TSV文件：

flowcell    sample  library lane    R1  R2
FlowCellX   SAMPLE1 libZ    L001    fastq/Sample1_L001.R1.fastq.gz  fastq/Sample1_L001.R2.fastq.gz
FlowCellY   SAMPLE2 libZ    L002    fastq/Sample2_L002.R1.fastq.gz  fastq/Sample2_L00.R2.fastq.gz

它给了我此错误消息：

InputFunctionException in line 18 of /home/oskar/01-workspace/00-temp/GVP/Snakefile:
KeyError: 9
Wildcards:
sample=SAMPLE1
lane=L001
flowcell=FlowCellY

它试图SAMPLE1和L001与FlowCellY相结合，这不是我想要的，我希望它只是选择了FlowCellX和SAMPLE1输入文件L001，fastq/FCX/Sample1_L001.R1.fastq.gz和fastq/FCX/Sample1_L001.R2.fastq.gz。生成的输出文件和读取组是这样的：输出文件名：

SAMPLE1_FlowCellX_L001.mapped.bam

阅读组：

@RG: ID:FlowCellX.L001 SM:SAMPLE1 LB:PlaceHolder PU:FlowCellX.L001.SAMPLE1 PL:Illumina

我在想什么？

和第二个问题是，我不知道如何将{library}变量添加到读取群里的..LB:PlaceHolder\t是。如果我尝试把这个变量在expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", sample=samples['sample'], lane=samples['lane'], flowcell=samples['flowcell']),线，预计{library}是在文件名，这不是我想要的。我想这个问题可以解决以前的答案来解决一起？

Answer 1

一个问题是，在expand你rule all功能将创建样品，车道和流通池导致无效查询样纸（也可能不是你想要的）的所有组合。如果你把expand(...)在一个变量并打印你会看到结果列表。下面的固定采用zip功能的加入样品，车道和流通池以并行的方式。

紧靠{library}问题，你可以有类似get_RG功能get_fastq函数从给定的通配符样纸返回appropropriate值。下面看到的，有可能是更好的解决方案，但这个希望让你在正确的方向...

import pandas as pd

samples = pd.read_csv("samples.tsv", sep='\t', dtype=str).set_index(["flowcell", "sample", "lane"], drop=False)

samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])  # enforce str in index

def get_fastq1(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample, wildcards.lane), ["R1"]].dropna()

def get_fastq2(wildcards):
    return samples.loc[(wildcards.flowcell, wildcards.sample, wildcards.lane), ["R2"]].dropna()

def get_RG(wc):
    library= samples.loc[(wc.flowcell, wc.sample, wc.lane), ["library"]].unique()[0]
    rgs = r'@RG:\tID:%(flowcell)s.%(lane)s\tSM:%(sample)s\tLB:%(library)s\tPU:%(flowcell)s.%(lane)s.%(sample)s\tPL:Illumina' \
        % {'flowcell': wc.flowcell, 
           'lane': wc.lane, 
           'sample': wc.sample, 
           'library': library}
    return rgs    

rule all:
    input:
        expand("Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam", zip,
            sample=samples['sample'], lane=samples['lane'],
            flowcell=samples['flowcell']),

rule BwaMem:
    input:
        fasta = "Homo_sapiens_assembly38.fasta",
        fastq1 = get_fastq1,
        fastq2 = get_fastq2,
    params:
        rgs= get_RG,
    output:
        "Outputs/BwaMem/{sample}_{lane}_{flowcell}.mapped.bam",
    shell:
        r"bwa mem -M -t 12 {input.fasta} \
            -R '{params.rgs}' \
            {input.fastq1} \
            {input.fastq2} \
        | samtools view -Sb - > {output}"

如何只在TSV文件选择一个行的值？

问题描述投票：1回答：1

1个回答

最新问题

如何只在TSV文件选择一个行的值？

问题描述 投票：1回答：1

1个回答

最新问题

问题描述投票：1回答：1