设为首页收藏本站
开启辅助访问
切换到窄版

 找回密码
 注册

QQ登录

只需一步,快速开始

搜索
查看: 1478|回复: 0

Parsing BLAST Output by Perl

[复制链接]
发表于 2011-10-6 11:20:24 | 显示全部楼层 |阅读模式
Once you have run blast, its output is in a file.  You need to open this file, then read it line-by-line with a while statement.
Because you are reading line-by-line, most of the information you gather needs to be stored in global variables: variables declared with “my” outside the while loop.
Each new line needs to be examined for the type of information it contains using regular expressions.  It is important to be able to identify the start and end points of various sections, and respond appropriately.
Most information will be gathered using regular expression memory parentheses.
Variables need to be re-initialized at the end of each section (or the beginning of the next one if the end is not easily discerned).
We will store the information in a hash whose keys are the subject identifiers.  All hits for that subject will be stored in an array.
Assuming that you want the information in the detailed analysis of each hit, note that there are 3 important sections:
1. each new subject sequence starts with a “>” in the first column, followed by the rest of its FASTA comment line.  This can spill over multiple lines in the output.  Immediately following this line(s) is the “Length = “ line.
2. Within each subject sequence there are one or more hits.  Each hit starts with a “Score =“ line, followed by an “Identities = “ line, followed (for blastn but not blastp) with a “Strand = “ line.
3. Within each hit there are groups of 3 lines representing the aligned sequence, including the coordinates of the start and end position of each line.  There can be several such groups, for long hits.

1. Parsing New Subject Sequence Information
Each new subject sequence starts with a line having the entire FASTA comment line on it (i.e. a “>” in the first column), followed by zero or more additional comment lines, and ends with a “Length = “ line.
The useful pieces of information here are the comment line itself, the unique identifier for this subject sequence (part of the comment line), and the length.
Because these sections have easily recognized start and end points, you can deal with them as a unit, inputting new lines separately from the main while loop.



  1. my ($subject_ident, $comment_line, $subject_length);
  2. my %hits;  # to store all data from this file;
  3.     while (<INFILE>) {   
  4.             # code for processing other sections
  5.         if (/^>/) {   # finds the “>” in the first column of the line
  6.             $subject_length = 0;  # reset to 0
  7.             $comment_line = $_;
  8.             while (1) {  # endless loop to get all of comment line
  9.                 my $next_line = <INFILE>;
  10.                 if ($next_line =~ /Length = (\d+)/ ) {
  11.                     $subject_length = $1;
  12.                     last;  # break out of loop
  13.                 }
  14.                 else {  # continuation of comment line
  15.                     chomp $next_line;
  16.                     $next_line =~ s/^\s+/ /; # remove leading
  17.                     $next_line =~ s/\s+$/ /; # and trailing multiple spaces
  18.                     $comment_line .= $next_line;
  19.                 }  # end else
  20.             }  # end while(1) loop
  21.             $comment_line =~ />(\S+)\s/;
  22.             $subject_ident = $1;
  23.             $hits{$subject_ident}{subject_length} = $subject_length;
  24.             $hits{$subject_ident}{comment_line} = $comment_line;
  25.         } # end subject sequence info section
  26.       # additional processing of BLAST output file
  27. }   # end while <INFILE> loop
复制代码



2. Parsing Hit Information
Each hit starts with the same 3 lines (2 for blastp), Score, Identities, and Strand.  If you detect the Score line, you can get the next two lines without using the overall while loop:



  1. my ($e_value, $identities, $gaps, $aligned_length, $orientation);
  2. while (<INFILE>) {
  3.     if (/Score = /) {
  4.          my $score_line = $_;
  5.          my $identities_line = <INFILE>;
  6.          my $strand_line = <INFILE>;
  7.               # now process the information in these 3 lines
  8.          $score_line =~ /Score.*Expect = ([-.e\d]+)/;  # note the regex here!
  9.          $e_value = $1;
  10.          $identities_line =~ /Identities = (\d+)\/(\d+)/;  # note the escaped slash   
  11.          ($identities, $aligned_length) = ($1, $2);
  12.          $gaps = 0;  # set a default
  13.          if ($identities_line =~ /Gaps = (\d+)/ ) {  # Gaps are not always present
  14.              $gaps = $1;
  15.          }
  16.           $orientation = 'Plus';
  17.          if ($strand_line =~ /Plus \/ (Plus|Minus)/) {  # only for blastn
  18.             $orientation = $1;
  19.          }
  20.          push @{ $hits{$subject_ident}{hit_data} }, [$e_value, $identities, $gaps, $aligned_length, $orientation];
  21.      }  # end if /Score/
  22. }  # end while <INFILE>
复制代码



3. Parsing the Aligned Sequences
This section is the hardest to deal with, because the end point is not clearly delimited.
We solve this problem by noting that there are 3 ways the aligned sequence section can end: it can be followed by another hit (a Score = line), by a new subject (a line with > in the first column) or by the statistical information at the end of the file (identified by “Lambda” in the first column).
So, near the beginning of the overall while loop we will insert a section that recognizes any of these patterns, and writes the aligned sequence information to the array of information from the previous hit.
However, note that this should only be done if there is a previous hit, that is, not for the first hit in the first subject sequence.  We can detect this because the $subject_ident variable will be undef before the first subject line is processed, and there will be no $hits{$subject_ident}{hit_data}.



  1. my ($query_start, $query_end, $subject_start, $subject_end);
  2. while (<INFILE>) {
  3.      if ($subject_ident & exists $hits{$subject_ident}{hit_data} && (/^>/ || /Score =/ || /^Lambda/ ) ) {  # but not for the first hit of the first subject
  4.         push @{ $hits{$subject_ident}[-1] }, $query_start, $query_end, $subject_start, $subject_end;   
  5.         last if (/^Lambda/);  # end of hit processing
  6.       $query_start = $query_end =  $subject_start = $subject_end = 0;  # reset
  7.     }
  8.     # process the file
  9. }  # end while <INFILE>
复制代码



4.Printing Out the Information
After the end of the while <INFILE> loop.
Very simple format.
This whole program could be a subroutine, with a reference to the %hits hash returned.



  1. foreach my $subject_ident (sort keys %hits) {
  2.     print "$subject_ident:\n";
  3.     print "     $hits{$subject_ident}{comment_line}\n";
  4.     print "     Length  = $hits{$subject_ident}{subject_length}\n";
  5.     foreach my $hit ( @{$hits{$subject_ident}{hit_data}} ) {
  6.         print "@$hit\n";
  7.     }
  8.     print "\n";
  9. }
复制代码
您需要登录后才可以回帖 登录 | 注册

本版积分规则

QQ|申请友链|小黑屋|手机版|Archiver|生物信息学论坛 ( 蜀ICP备09031721号  

GMT+8, 2017-6-23 00:26 , Processed in 0.104490 second(s), 20 queries .

Powered by Discuz! X3

© 2001-2013 Comsenz Inc.

快速回复 返回顶部 返回列表