.mcool 文件包含多种分辨率的矩阵。
一个 .mcool 文件的冷却器:
cooler ls ./../input/A001C007.hg38.nodups.pairs.mcool
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000
(EagleC)
对于 ./input/*.mcool 中的每个文件,如果 Cooler ls "$mcool_file" 在最后一个 / 之后以 5000、10000 或 50000 结尾,我想从 EagleC 运行 PredictSV。
如存储库中所示,单个 .mcool 文件
predictSV --hic-5k SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000 \
--hic-10k SKNAS-MboI-allReps-filtered.mcool::/resolutions/10000 \
--hic-50k SKNAS-MboI-allReps-filtered.mcool::/resolutions/50000 \
-O SK-N-AS -g hg38 --balance-type CNV --output-format full \
--prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
但是,对于文件列表,我需要编写一个for循环来迭代运行
predictSV
。
我的尝试:
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
# iterate over ids emitted from cooler ls for this file
hic5k_num=; hic10k_num=; hic50k_num=
while IFS= read -r id; do
id_suffix=${id##*/}
case $id_suffix in
5000) hic5k_num=$id_suffix;;
10000) hic10k_num=$id_suffix;;
50000) hic50k_num=$id_suffix;;
esac
done < <(cooler ls "$mcool_file")
echo predictSV \
${hic5k_num:+ --hic-5k "$hic5k_num"} \
${hic10k_num:+ --hic-10k "$hic10k_num"} \
${hic50k_num:+ --hic-50k "$hic50k_num"} \
-g hg38 \
-O "${mcool_file%%.*}" \
--balance-type CNV \
--output-format full \
--prob-cutoff-5k 0.8 \
--prob-cutoff-10k 0.8 \
--prob-cutoff-50k 0.99999
done
但是我的输出是:
predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
预期输出:
predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
编辑:
此外,我想将
-O
部分从 "${mcool_file%%.*}"
更改为以下内容:
EagleC_output/
+ /
文件中 mcool_file
括号后面的子字符串。例如,EagleC_output/A001C007
。
您可以使用 GNU Parallel 来简化您的 bash 脚本。
parallel
修改 bash 脚本:-O
解析 grep
输入文件名的动态
.mcool
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
OUTPUT=$(echo "$mcool_file" | grep -oE "[A-Z0-9]*\.hg38" | cut -d '.' -f 1)
cooler ls "$mcool_file" | grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' \
| tr '\n' '\t' \
| parallel --dry-run --colsep '\t' --link \
predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} \
-O 'EagleC_output/'$OUTPUT -g hg38 --balance-type CNV --output-format full \
--prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999 \
| tr -d "'"
done
提供:
predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C007 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C008 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
假设有两个文件名为:
A001C007.hg38.nodups.pairs.mcool
A001C008.hg38.nodups.pairs.mcool
其中包含
cooler ls
输出,显示您在问题中提供的每行分辨率。
(注意:只需从同一 bash 脚本中删除
--dry-run
选项,然后运行它,以便在第一次完成“空运行”后实际执行命令列表,以确保所有命令都以正确的方式执行语法。还有一个非常有用的--joblog
选项。)
我制作了一个名为“test.mcool”的测试文件,其中包含:
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000
模仿
cooler ls
的输出。
使用
bash grep
给出:
$ grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
然后可以通过管道传输到
parallel
,如下所示:
grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool | tr '\n' '\t' | parallel --dry-run --colsep '\t' --link predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} [...]
请参阅上面提供的 bash 脚本输出,其中还考虑了
predictSV
的输出选项,并且该命令设置为可以轻松使用您提供的 bash 脚本。
parallel
选项说明--dry-run
:仅打印出将要执行的命令(删除此选项将实际运行命令(并行))--link
:告诉 parallel
分开、定界(由 --colsep
选项指定;我使用 tr
将换行符转换为制表符),并按指示进行分区(使用大括号中包含的整数) )将管道输入输入到要并行执行的命令模式模板中。(额外说明:进一步的改进是额外循环遍历所有
./*.mcool
文件也并行(即运行嵌套的parallel
单个命令)。但这只有在您感兴趣的情况下才会感兴趣关注通过更完全实现 predictSV
命令的并行化来提高分析的整体计算时间。)
鉴于您想要执行的操作,变量名称中的
_num
后缀具有误导性。你不想存储数字;你想存储 ID。因此,您应该存储 hic5k_num=$id_suffix
,而不是设置 hic5k_num=$id
——或者,要使用误导性较小的变量名称,只需 hic5k=$id
:
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
# iterate over ids emitted from cooler ls for this file
hic5k_num=; hic10k_num=; hic50k_num=
while IFS= read -r id; do # id=./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
id_suffix=${id##*/} # id_suffix=200
id_part1=${id%%::*} # id_part1=./../input/A001C007.hg38.nodups.pairs.mcool
id_part2=${id##*::} # id_part2=/resolutions/200
# id_part1_basename=A001C007.hg38.nodups.pairs.mcool
id_part1_basename=${id_part1##*/}
# id_reassembled=A001C007.hg3.nodeps.pairs.mcool::/resolutions/200
id_reassembled=$id_part1_basename::$id_part2
case $id_suffix in
5000) hic5k=$id_reassembled;;
10000) hic10k=$id_reassembled;;
50000) hic50k=$id_reassembled;;
esac
done < <(cooler ls "$mcool_file")
mcool_basename=${mcool_file##*/}
printf '%q ' predictSV \
${hic5k:+ --hic-5k "$hic5k"} \
${hic10k:+ --hic-10k "$hic10k"} \
${hic50k:+ --hic-50k "$hic50k"} \
-g hg38 \
-O "EagleC_output/${mcool_basename%%.*}" \
--balance-type CNV \
--output-format full \
--prob-cutoff-5k 0.8 \
--prob-cutoff-10k 0.8 \
--prob-cutoff-50k 0.99999
printf '\n'
done