热点新闻
bed基因注释
2023-07-23 05:30  浏览:1661  搜索引擎搜索“手机易展网”
温馨提示:信息一旦丢失不一定找得到,请务必收藏信息以备急用!本站所有信息均是注册会员发布如遇到侵权请联系文章中的联系方式或客服删除!
联系我时,请说明是在手机易展网看到的信息,谢谢。
展会发布 展会网站大全 报名观展合作 软文发布

日常瞎掰

  工欲善其事必先利其器,好用的工具对做事来说很重要,可以让你做起事来收到事半功倍的效果。前段时间分析数据时,需要给intervals做基因注释,需要操作gtf文件,最后发现一款非常好用的python包 — pyranges。该包可以很轻松地将bedgtfgff3等格式文件读取为PyRanges对象,该对象有点类似潘大师的Dataframe,熟悉潘大师的同学应该能体会到Dataframe操作数据的快乐。同样,皮软杰斯内置操作intervals的方法也绝对能给你带来无语伦比的畅快。废话不多说,下面来感受一下使用皮软杰斯做注释基因有多方便。






读取文件

  read_bed方法可以用来读取bed格式的文件,read_gtfread_gff3分别用来读取gtfgff3文件:

import pyranges as pr bed = pr.read_bed('data.bed') bed +--------------+-----------+-----------+------------+-----------+--------------+ | Chromosome | Start | End | Name | Score | Strand | | (category) | (int32) | (int32) | (object) | (int64) | (category) | |--------------+-----------+-----------+------------+-----------+--------------| | chr1 | 212609534 | 212609559 | U0 | 0 | + | | chr1 | 169887529 | 169887554 | U0 | 0 | + | | chr1 | 216711011 | 216711036 | U0 | 0 | + | | chr1 | 144227079 | 144227104 | U0 | 0 | + | | ... | ... | ... | ... | ... | ... | | chrY | 15224235 | 15224260 | U0 | 0 | - | | chrY | 13517892 | 13517917 | U0 | 0 | - | | chrY | 8010951 | 8010976 | U0 | 0 | - | | chrY | 7405376 | 7405401 | U0 | 0 | - | +--------------+-----------+-----------+------------+-----------+--------------+ Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. gtf = pr.read_gtf('gencode.vM24.annotation.gtf') gtf +--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+ | Chromosome | Source | Feature | Start | End | Score | Strand | frame | gene_id | gene_type | +15 | | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... | |--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------| | chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... | | chr1 | HAVANA | transcript | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... | | chr1 | HAVANA | exon | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... | | chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | | chrY | ENSEMBL | CDS | 90838871 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... | | chrY | ENSEMBL | start_codon | 90839174 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... | | chrY | ENSEMBL | stop_codon | 90838868 | 90838871 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... | | chrY | ENSEMBL | UTR | 90838868 | 90838871 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... | +--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+ Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. 15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

  上面说过PyRanges对象类似潘大师的Dataframe,所以可以用类似的方法来过滤数据,如下面只保留gtf里面Feature属性为gene的行:

gene = gtf[gtf.Feature == 'gene'] gene +--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+ | Chromosome | Source | Feature | Start | End | Score | Strand | frame | gene_id | gene_type | +15 | | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... | |--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------| | chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... | | chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... | | chr1 | HAVANA | gene | 3252756 | 3253236 | . | + | . | ENSMUSG00000102851.1 | processed_pseudogene | ... | | chr1 | HAVANA | gene | 3466586 | 3513553 | . | + | . | ENSMUSG00000089699.1 | lncRNA | ... | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | | chrY | HAVANA | gene | 90603500 | 90605864 | . | - | . | ENSMUSG00000099619.6 | lncRNA | ... | | chrY | HAVANA | gene | 90665345 | 90667625 | . | - | . | ENSMUSG00000099399.6 | lncRNA | ... | | chrY | HAVANA | gene | 90752426 | 90755467 | . | - | . | ENSMUSG00000095366.2 | lncRNA | ... | | chrY | ENSEMBL | gene | 90838868 | 90839177 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... | +--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+ Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. 15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

基因注释

  皮软杰斯里面有nearestk_nearest两种方法可以用。通过函数名我们就可以猜到这两个函数都可以用来查找interval1集合里每一个区间在interval2集合里面距离最近的区间,默认是返回有交集的区间,而k_nearest可以指定返回的区间个数。通过代码看一下区别:

bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322]) bed +--------------+-----------+-----------+--------------+ | Chromosome | Start | End | Strand | | (category) | (int32) | (int32) | (category) | |--------------+-----------+-----------+--------------| | chr1 | 3074300 | 3074322 | + | +--------------+-----------+-----------+--------------+ bed_test.nearest(gtf,suffix="_gtf") +--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+ | Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | frame | +18 | | (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... | |--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------| | chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... | +--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+ bed_test.nearest(gene,overlap=False,suffix="_gtf") +--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+ | Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | frame | +18 | | (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... | |--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------| | chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... | +--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+ bed_test.k_nearest(gene,k=5,suffix="_gtf") +--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+ | Chromosome | Start | End | Strand | Source | Feature | Start_b | End_b | Score | Strand_b | frame | +18 | | (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... | |--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------| | chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... | | chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... | | chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3205900 | 3671498 | . | - | . | ... | | chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3252756 | 3253236 | . | + | . | ... | | chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3365730 | 3368549 | . | - | . | ... | +--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+

  从上面的代码可知有无overlap=False参数的区别,以及nearestk_nearest函数的区别。当然,还有一些比较实用的参数如strandedness:"same",指定链是否相同;how:"upstream"、 "downstream"分别指定选取最近区间的方位,nb_cpu:指定使用的cpu数。除此之外,k_nearest还有一个参数ties可以指定有距离相同的区间时如何返回,可选值为"first"、"last"、"different"。
  suffix参数指定当两个集合有相同列名时,后一个集合列名会添加的字段,默认是'_b'。不过,好像k_nearest函数有BUG这个参数无效,但并不影响结果可以忽略。
  从上面可以看出k_nearest函数功能上略胜一筹,注释基因时用此函数相对更自由一些:

anno = bed.k_nearest(gene, ties='different') anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']] anno +--------------+-----------+-----------+--------------+----------------------+ | Chromosome | Start | End | Strand | gene_id | | (category) | (int32) | (int32) | (category) | (object) | |--------------+-----------+-----------+--------------+----------------------| | chr1 | 212609534 | 212609559 | + | ENSMUSG00000102307.1 | | chr1 | 169887529 | 169887554 | + | ENSMUSG00000103110.1 | | chr1 | 216711011 | 216711036 | + | ENSMUSG00000102307.1 | | chr1 | 144227079 | 144227104 | + | ENSMUSG00000100389.1 | | ... | ... | ... | ... | ... | | chrY | 15224235 | 15224260 | - | ENSMUSG00000102835.5 | | chrY | 13517892 | 13517917 | - | ENSMUSG00000102174.1 | | chrY | 8010951 | 8010976 | - | ENSMUSG00000099861.1 | | chrY | 7405376 | 7405401 | - | ENSMUSG00000118447.1 | +--------------+-----------+-----------+--------------+----------------------+ anno.to_csv("data_anno.txt", sep="\t", header=True)

结束语

  pyranges包的功能还是挺多的,不仅包含区间的交集、差集,还可以进行一些统计运算如JaccardFisher等。虽然,目前有不少现成的软件如homerchipseeker可以做基因注释,很多时候我们可以直接使用这些软件即可,但pyranges还是值得学习收藏一下,也许做个性化数据处理的时候使用它会来得更为方便些。


往期回顾

scanpy踩坑实录
差异基因密度分布
R绘图配色总结
saddleplot | A/B compartments
双曲线火山图一键拿捏

发布人:df89****    IP:120.230.72.***     举报/删稿
展会推荐
  • 隐居
  • 2023-07-22浏览:2200
让朕来说2句
评论
收藏
点赞
转发