设为首页收藏本站
开启辅助访问
切换到窄版

 找回密码
 注册

QQ登录

只需一步,快速开始

搜索
查看: 4400|回复: 6

如何用perl批量删除一组序列中指定区段的序列

[复制链接]
发表于 2012-1-2 11:18:26 | 显示全部楼层 |阅读模式
15金钱
我想去除一组序列(大约1000条)中指定区段序列,例如
file1中包含原始待处理序列,格式和内容如下:
>AT2G01210.1
MLASLIIFVALLCNVTVISGLN
>AT2G01820.1
MSNSHLGTLCFIISLLGLANFSLS
>AT2G01950.1
MTTSPIRVRIRTRIQISFIFLLTHL
file2中包含指定区段信息,格式和内容如下:
AT2G01210.1 1 12
AT2G01820.1 2 5,10 14,17 20
AT2G01950.1 3 7,12 15
要求得到file3,内容如下:
>AT2G01210.1
CNVTVISGLN
>AT2G01820.1
MLGTLLLFSLS
>AT2G01950.1
MTVRIRISFIFLLTHL
上面是个简单的例子,实际情况就是序列长一些,序列数多一些。
终端命令形式希望是cutseq.pl file1 file2 file3
哪位高手能够帮助小弟一下,写下perl命令,感激不尽,新的一年必将身体健康工作顺利!!!

 楼主| 发表于 2012-1-3 00:22:03 | 显示全部楼层
依然无人问津
回复

使用道具 举报

发表于 2012-1-5 10:05:51 | 显示全部楼层
身为版主,我来帮你解决。稍等。。。
回复

使用道具 举报

发表于 2012-1-5 11:07:31 | 显示全部楼层
写好了,测试通过。
  1. #! /usr/bin/perl -w

  2. use strict;

  3. #Author:liyinghong520@gmail.com
  4. #Date:2012-01-05

  5. open IN1,"<$ARGV[0]" or die;
  6. open IN2,"<$ARGV[1]" or die;
  7. open OUT,">$ARGV[2]" or die;
  8. my %proFa;
  9. my $id1;
  10. while(<IN1>){
  11.         chomp;
  12.         if(/^>(.+)$/){
  13.                 $id1=$1;
  14.         }else{
  15.                 $proFa{$id1}=$_;
  16.                 }
  17.         }
  18. my %proCut;
  19. while(<IN2>){
  20.         chomp;
  21.         my $temp=$_;
  22.         my $id2;
  23.         if($temp=~s/(.+?)\s+(.+)$/$2/){
  24.                 $id2=$1;
  25.                 }
  26.         my @array=split/\,/,$temp;
  27.         foreach(@array){
  28.                 my @arr=split/\s+/,$_;
  29.                 if($proCut{$id2}){
  30.                         my $refA=$proCut{$id2};
  31.                         my @A=@$refA;
  32.                         push @A,\@arr;
  33.                         $proCut{$id2}=\@A;
  34.                 }else{
  35.                         $proCut{$id2}=[\@arr];
  36.                         }
  37.                 }
  38.         }
  39. foreach my $idFa(keys %proFa){
  40.         my $seq=$proFa{$idFa};
  41.         my @array=split//,$seq;
  42.         my $refLoc=$proCut{$idFa};
  43.         my @Loc=@$refLoc;
  44.         foreach(@Loc){
  45.                 my $refa=$_;
  46.                 my @ar=@$refa;
  47.                 for(my $i=$ar[0]-1;$i<=$ar[1]-1;$i++){
  48.                         $array[$i]="";
  49.                         }
  50.                 }
  51.                 print OUT ">$idFa\n";
  52.                 print OUT @array;
  53.                 print OUT "\n";       
  54.         }
复制代码
回复

使用道具 举报

 楼主| 发表于 2012-1-9 02:59:36 | 显示全部楼层
多谢版主的鼎力相助,感激不尽,刚去外地开会回来,未能及时道谢,请不要见怪!
版主帮写的命令,对于那个例子是可以正常运行的,后来我运行正式数据,发现仍然报错,请斑竹运行程序处理如下内容,
#file1中的全场序列,内容如下,
>AT4G25390.1
MPSRSISAPVPVLAPAPIVSSLVPAAPSGHQNKTTRIFPPFVVAGAGAGFSLFITLSVCFCKFSRKRSSPPAENASSSPR
RPSPREFSYSSLRRATGSFSQANRLGQGGFGVVFRGTISGGENVAVKVMDSGSLQGEGEFQNELFFAAKLDSPHVVPVIG
FSHDRKRRRLLLVYKLMDNGNLQDALLHRRCPELMDWNRRFLVAVNIADGIKHLHSLEPPVIHGDIKPSNVLLDSLFSAK
IADFGLARLKAEQVEISVAPERDGDGSMVEEVESVVTTVTGYEDFNFGLVDQSPESVAKVPGSVSASPEATTVVSVSPEM
GEKTDEDGGSVVVMKKGKESESKDWWWKQESNVERGRVKEYVMQWIGSEVKKERPSRSDWIEAAALSSSSSKKLEKKTSK
RLDWWLSLEEEDENKKKKKRRMVREWWKDEYRRELAKKRKKKKKMTLEAEFCSDDGSSSVSQWRRGSGSGSSIDWWLDGL
SGERWLRARGNSHDSVSGEIAKSCGISSTPSMRGTVCYAAPEYCNLDNNVSEKCDVYSYGVLLLVLISGRRPLEMTGSAS
EIQRANLMSWARKLARRGKLVDLVDQKLQNLDQEQAVLCIKVALLCLQRLPISRPSMKEVLGMLKGEVNLPELPSEFSPS
PPLKTTRKQRR
>AT5G51770.1
MPSRTLPSPAPVNSPFSSSVTPHHETTTTRIFPPLAAVGFSLFITLSICFCKFNRKRRSPAAVASSSTPPQKQPLHEFSY
SSLRKATASFSPENRLGQGGFGSVFRGTLSPSSGGGNVAVKVMDSGSLQGEREFQNELFFAGKLDSPHVVSVIGFSRRRR
SRLILVYELMDIGNLQDALLHRRSPELMIWNRRFLVAIDIAKGIEHLHSLNPCVIHGDLKPSNVLLDRFFSAKISDFGLA
RLKSEHVEVKVVSESDVVEDYGSVVEEVESVVTNTTGCDESNFGFTDQSPVPLSSPEMVEQVPMTSPETVVSVSPEMGEK
GSVLEVGNVVRSKDWWWKQEGNVGRGKGKEYVMQWIGSEVKEERQSSDWIAETAEGGKKVEKKKSSKRLEWWLSLDEEKE
KGKKKKRRMVREWWKDEYRKELAKRMKKKKKKKTLESEFYSDDVSGSVDQRRHGDGEVYRKKRRGVSSNSIGSSIDWWLD
GLSGEQWRARRRNSQDSVKSCGVSSTPSMRGTMCYVAPECCGNNIDDVSEKSDVYSYGVLLLVLVSGRRPLEVTGPASEI
MLRANLMSWARKLARRGRLGDLVDEKLQLLDQEQAVLCIKVALQCLQKSPVSRPSMKDVLEMLTGAISPPDLPTEFSPSP
QTRFPFKARRKQNR
>AT4G25160.1
MSRSPDKLALPPPPPPPPSRTVVVALSGSSKSKYVVTWAIEKFATEGNVGFKLLHIHPMITSVPTPMGNAIPISEVRDDV
VTAYRQEILWQSEEMLKPYTKLFVRRKVAVEVLVIESDNVAAAIAEEVTRDSIDRIVIGGSSRSFFSRKADICSVISALM
PNFCTVYVVSKGKLSCVRPSDSDGNATIREDGSERTNSSSGSSGPTSDSSDVMSSAHDSQSRPLSLPVRRMQHFPAIAGQ
ASVPMETSSVGSDETRCMSLDAEEARDVSSINRSSTDTTSRWTPRRRDYEERKEAMSSSSSNREYGNFGTRFSWSGMGVD
TTHSRASQQASNMSDALSEQSYTDNQVNLNFEVEKLRAELRHVQEMYAVAQTETFDASRKLGELNQRRLEEAIKLEELKL
KEYEARELAEKEKQNFEKARRDAESMRERAEREIAQRREAERKSARDTKEKEKLEGTLGSPQLQYQHFAWEEIMAATSSF
SEELKIGMGAYGAVYKCNLHHTTAVVKVLQSAENQLSKQFQQELEILSKIRHPHLVLLLGACPEQGALVYEYMENGSLED
RLFQVNNSPPLPWFERFRIAWEVAAALVFLHKSKPKPIIHRDLKPANILLDHNFVSKVGDVGLSTMVQVDPLSTKFTIYK
QTSPVGTLCYIDPEYQRTGRISSKSDIYSFGMILLQLLTAKPAIALTHFVESAMDSNDEFLKILDQKAGNWPIEETRELA
ALALCCTELRGKDRPDLKDQILPALENLKKVAEKARNSFSGVSTQPPTHFICPLLKDVMNEPCVAADGYTYDRHAIEEWL
KEHNTSPMTDSPLHSKNLLPNYTLYTAIMEWRSTR
#ile2中包含去除序列的起始及终止信息,内容如下,
AT4G25160.1 480 744
AT4G25390.1 99 266,498 623
AT5G51770.1 90 261,496 623
#file3中应得到的结果如下(红色为已经去除区段,为方便版主分析结果,我这里仍然显示)
>AT4G25160.1
MSRSPDKLALPPPPPPPPSRTVVVALSGSSKSKYVVTWAIEKFATEGNVGFKLLHIHPMITSVPTPMGNAIPISEVRDDVVTAYRQEILWQSEEMLKPYTKLFVRRKVAVEVLVIESDNVAAAIAEEVTRDSIDRIVIGGSSRSFFSRKADICSVISALMPNFCTVYVVSKGKLSCVRPSDSDGNATIREDGSERTNSSSGSSGPTSDSSDVMSSAHDSQSRPLSLPVRRMQHFPAIAGQASVPMETSSVGSDETRCMSLDAEEARDVSSINRSSTDTTSRWTPRRRDYEERKEAMSSSSSNREYGNFGTRFSWSGMGVDTTHSRASQQASNMSDALSEQSYTDNQVNLNFEVEKLRAELRHVQEMYAVAQTETFDASRKLGELNQRRLEEAIKLEELKLKEYEARELAEKEKQNFEKARRDAESMRERAEREIAQRREAERKSARDTKEKEKLEGTLGSPQLQYQHFAWEEIMAATSS
FSEELKIGMGAYGAVYKCNLHHTTAVVKVLQSAENQLSKQFQQELEILSKIRHPHLVLLLGACPEQGALVYEYMENGSLEDRLFQVNNSPPLPWFERFRIAWEVAAALVFLHKSKPKPIIHRDLKPANILLDHNFVSKVGDVGLSTMVQVDPLSTKFTIYKQTSPVGTLCYIDPEYQRTGRISSKSDIYSFGMILLQLLTAKPAIALTHFVESAMDSNDEFLKILDQKAGNWPIEETRELAALALCCTELRGKDRPDLKDQILPALENLKKVAEKARNSFSGVSTQPPTHFICPLLKDVMNEPCVAADGYTYDRHAIEEWLKEHNTSPMTDSPLHSKNLLPNYTLYTAIMEWRSTR

>AT4G25390.1 MPSRSISAPVPVLAPAPIVSSLVPAAPSGHQNKTTRIFPPFVVAGAGAGFSLFITLSVCFCKFSRKRSSPPAENASSSPRRPSPREFSYSSLRRATGSFSQANRLGQGGFGVVFRGTISGGENVAVKVMDSGSLQGEGEFQNELFFAAKLDSPHVVPVIGFSHDRKRRRLLLVYKLMDNGNLQDALLHRRCPELMDWNRRFLVAVNIADGIKHLHSLEPPVIHGDIKPSNVLLDSLFSAKIADFGLARLKAEQVEISVAPERDGDGSMVEEVESVVTTVTGYEDFNFGLVDQSPESVAKVPGSVSASPEATTVVSVSPEMGEKTDEDGGSVVVMKKGKESESKDWWWKQESNVERGRVKEYVMQWIGSEVKKERPSRSDWIEAAALSSSSSKKLEKKTSKRLDWWLSLEEEDENKKKKKRRMVREWWKDEYRRELAKKRKKKKKMTLEAEFCSDDGSSSVSQWRRGSGSGSSIDWWLDGLSGERWLRARGNSHDSVSGEIAKSCGISSTPSMRGTVCYAAPEYCNLDNNVSEKCDVYSYGVLLLVLISGRRPLEMTGSASEIQRANLMSWARKLARRGKLVDLVDQKLQNLDQEQAVLCIKVALLCLQRLPISRPSMKEVLGMLKGEVNLPELPSEFSPSPPLKTTRKQRR
>AT5G51770.1 MPSRTLPSPAPVNSPFSSSVTPHHETTTTRIFPPLAAVGFSLFITLSICFCKFNRKRRSPAAVASSSTPPQKQPLHEFSYSSLRKATASFSPENRLGQGGFGSVFRGTLSPSSGGGNVAVKVMDSGSLQGEREFQNELFFAGKLDSPHVVSVIGFSRRRRSRLILVYELMDIGNLQDALLHRRSPELMIWNRRFLVAIDIAKGIEHLHSLNPCVIHGDLKPSNVLLDRFFSAKISDFGLARLKSEHVEVKVVSESDVVEDYGSVVEEVESVVTNTTGCDESNFGFTDQSPVPLSSPEMVEQVPMTSPETVVSVSPEMGEKGSVLEVGNVVRSKDWWWKQEGNVGRGKGKEYVMQWIGSEVKEERQSSDWIAETAEGGKKVEKKKSSKRLEWWLSLDEEKEKGKKKKRRMVREWWKDEYRKELAKRMKKKKKKKTLESEFYSDDVSGSVDQRRHGDGEVYRKKRRGVSSNSIGSSIDWWLDGLSGEQWRARRRNSQDSVKSCGVSSTPSMRGTMCYVAPECCGNNIDDVSEKSDVYSYGVLLLVLVSGRRPLEVTGPASEIMLRANLMSWARKLARRGRLGDLVDEKLQLLDQEQAVLCIKVALQCLQKSPVSRPSMKDVLEMLTGAISPPDLPTEFSPSPQTRFPFKARRKQNR
烦请楼主调试一下程序,再次多谢!!!
回复

使用道具 举报

发表于 2012-1-13 12:33:21 | 显示全部楼层
本帖最后由 haminy 于 2012-1-13 12:34 编辑

不能确定你的数据具体是什么样,所以我是这样的,可能要稍微改下程序:)不过应该不难!
file1的数据:
>AT2G01210.1    MLASLIIFVALLCNVTVISGLN
>AT2G01820.1    MSNSHLGTLCFIISLLGLANFSLS
>AT2G01950.1    MTTSPIRVRIRTRIQISFIFLLTHL

file2的数据:
AT2G01210.1 1 12
AT2G01820.1 2 5 10 14 17 20
AT2G01950.1 3 7 12 15

#!/usr/bin/perl
use strict;
use warnings;

my @data;

open (IN,"file1.txt");
open (INI,"file2.txt");
open (OUTPUT, ">output.txt")|| die "Cannot open the newfile: $!\n";

while(my $line=<INI>){
    my @cut=split(/\s/,$line);
    my @cut_new=shift(@cut);
    my $num=@cut;
   
print "$num\n@cut\n";
    my $line1=<IN>;
    my @cut1=split(/\t/,$line1);
    print "$cut1[1]$cut1[0]\n";
    foreach(@cut) {
        my $data=substr($cut1[1],"$_",1);
        
        push(@data,$data);
        }
    print OUTPUT "$cut1[0] \> @data\n";
   
   
}
close(IN);
close(INI);
close(OUTPUT);
回复

使用道具 举报

发表于 2012-1-18 12:47:25 | 显示全部楼层
原因很简单,你给的测试数据和实际数据不一样,file1中>这行最后面多一个空格。把程序第15行改成:
  1. if(/^>(.+)\s+$/){
复制代码
即可。
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 注册

本版积分规则

QQ|申请友链|小黑屋|手机版|Archiver|生物信息学论坛 ( 蜀ICP备09031721号  

GMT+8, 2017-2-23 15:22 , Processed in 0.103120 second(s), 20 queries .

Powered by Discuz! X3

© 2001-2013 Comsenz Inc.

快速回复 返回顶部 返回列表