日常瞎掰
工欲善其事必先利其器,好用的工具对做事来说很重要,可以让你做起事来收到事半功倍的效果。前段时间分析数据时,需要给intervals
做基因注释,需要操作gtf文件,最后发现一款非常好用的python包 — pyranges
。该包可以很轻松地将bed
、gtf
、gff3
等格式文件读取为PyRanges
对象,该对象有点类似潘大师的Dataframe
,熟悉潘大师的同学应该能体会到Dataframe
操作数据的快乐。同样,皮软杰斯内置操作intervals
的方法也绝对能给你带来无语伦比的畅快。废话不多说,下面来感受一下使用皮软杰斯做注释基因有多方便。
读取文件
read_bed
方法可以用来读取bed
格式的文件,read_gtf
、read_gff3
分别用来读取gtf
、gff3
文件:
import pyranges as pr
bed = pr.read_bed('data.bed')
bed
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome | Start | End | Name | Score | Strand |
| (category) | (int32) | (int32) | (object) | (int64) | (category) |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1 | 212609534 | 212609559 | U0 | 0 | + |
| chr1 | 169887529 | 169887554 | U0 | 0 | + |
| chr1 | 216711011 | 216711036 | U0 | 0 | + |
| chr1 | 144227079 | 144227104 | U0 | 0 | + |
| ... | ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | U0 | 0 | - |
| chrY | 13517892 | 13517917 | U0 | 0 | - |
| chrY | 8010951 | 8010976 | U0 | 0 | - |
| chrY | 7405376 | 7405401 | U0 | 0 | - |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
gtf = pr.read_gtf('gencode.vM24.annotation.gtf')
gtf
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
| Chromosome | Source | Feature | Start | End | Score | Strand | frame | gene_id | gene_type | +15 |
| (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... |
|--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|
| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | HAVANA | transcript | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | HAVANA | exon | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | ENSEMBL | CDS | 90838871 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | start_codon | 90839174 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | stop_codon | 90838868 | 90838871 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | UTR | 90838868 | 90838871 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)
上面说过PyRanges
对象类似潘大师的Dataframe
,所以可以用类似的方法来过滤数据,如下面只保留gtf
里面Feature
属性为gene的行:
gene = gtf[gtf.Feature == 'gene']
gene
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
| Chromosome | Source | Feature | Start | End | Score | Strand | frame | gene_id | gene_type | +15 |
| (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|
| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... |
| chr1 | HAVANA | gene | 3252756 | 3253236 | . | + | . | ENSMUSG00000102851.1 | processed_pseudogene | ... |
| chr1 | HAVANA | gene | 3466586 | 3513553 | . | + | . | ENSMUSG00000089699.1 | lncRNA | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | HAVANA | gene | 90603500 | 90605864 | . | - | . | ENSMUSG00000099619.6 | lncRNA | ... |
| chrY | HAVANA | gene | 90665345 | 90667625 | . | - | . | ENSMUSG00000099399.6 | lncRNA | ... |
| chrY | HAVANA | gene | 90752426 | 90755467 | . | - | . | ENSMUSG00000095366.2 | lncRNA | ... |
| chrY | ENSEMBL | gene | 90838868 | 90839177 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)
基因注释
皮软杰斯里面有nearest
、k_nearest
两种方法可以用。通过函数名我们就可以猜到这两个函数都可以用来查找interval1
集合里每一个区间在interval2
集合里面距离最近的区间,默认是返回有交集的区间,而k_nearest
可以指定返回的区间个数。通过代码看一下区别:
bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])
bed
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int32) | (int32) | (category) |
|--------------+-----------+-----------+--------------|
| chr1 | 3074300 | 3074322 | + |
+--------------+-----------+-----------+--------------+
bed_test.nearest(gtf,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
bed_test.nearest(gene,overlap=False,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
bed_test.k_nearest(gene,k=5,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_b | End_b | Score | Strand_b | frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3205900 | 3671498 | . | - | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3252756 | 3253236 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3365730 | 3368549 | . | - | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
从上面的代码可知有无overlap=False
参数的区别,以及nearest
和k_nearest
函数的区别。当然,还有一些比较实用的参数如strandedness
:"same",指定链是否相同;how
:"upstream"、 "downstream"分别指定选取最近区间的方位,nb_cpu
:指定使用的cpu数。除此之外,k_nearest
还有一个参数ties
可以指定有距离相同的区间时如何返回,可选值为"first"、"last"、"different"。
suffix
参数指定当两个集合有相同列名时,后一个集合列名会添加的字段,默认是'_b'。不过,好像k_nearest
函数有BUG这个参数无效,但并不影响结果可以忽略。
从上面可以看出k_nearest
函数功能上略胜一筹,注释基因时用此函数相对更自由一些:
anno = bed.k_nearest(gene, ties='different')
anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]
anno
+--------------+-----------+-----------+--------------+----------------------+
| Chromosome | Start | End | Strand | gene_id |
| (category) | (int32) | (int32) | (category) | (object) |
|--------------+-----------+-----------+--------------+----------------------|
| chr1 | 212609534 | 212609559 | + | ENSMUSG00000102307.1 |
| chr1 | 169887529 | 169887554 | + | ENSMUSG00000103110.1 |
| chr1 | 216711011 | 216711036 | + | ENSMUSG00000102307.1 |
| chr1 | 144227079 | 144227104 | + | ENSMUSG00000100389.1 |
| ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | - | ENSMUSG00000102835.5 |
| chrY | 13517892 | 13517917 | - | ENSMUSG00000102174.1 |
| chrY | 8010951 | 8010976 | - | ENSMUSG00000099861.1 |
| chrY | 7405376 | 7405401 | - | ENSMUSG00000118447.1 |
+--------------+-----------+-----------+--------------+----------------------+
anno.to_csv("data_anno.txt", sep="\t", header=True)
结束语
pyranges
包的功能还是挺多的,不仅包含区间的交集、差集,还可以进行一些统计运算如Jaccard
、Fisher
等。虽然,目前有不少现成的软件如homer
、chipseeker
可以做基因注释,很多时候我们可以直接使用这些软件即可,但pyranges
还是值得学习收藏一下,也许做个性化数据处理的时候使用它会来得更为方便些。
往期回顾
scanpy踩坑实录
差异基因密度分布
R绘图配色总结
saddleplot | A/B compartments
双曲线火山图一键拿捏