formatdb, Building a Local BLAST, and Batch BLAST

Posted 2011-6-7 23:36:54
Original source: http://liucheng.name/478/
How to run batch BLAST
Original source: http://liucheng.name/1221/
Introduction to BLAST formatdb


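formatdb:
Before running a BLAST search against a nucleotide or protein sequence database, the database must be processed with formatdb, that is, formatted. Every BLAST search depends on this step.
Formatting a sequence database: formatdb

A brief introduction to formatdb

formatdb accepts input in either ASN.1 or FASTA format. The step is required whether the database holds nucleotide or protein sequences, and whether the search is run with blastall, blastpgp, or MegaBLAST.

formatdb command-line options

formatdb -    Prints all formatdb options with descriptions (see Appendix 2).
The options let you format the source database the way you need.

Main options

-i  Name of the source database to format (required)
-p  Type of file: T - protein, F - nucleotide [T/F] Optional, default = T
-a  Input database is in ASN.1 format (otherwise FASTA is expected): T - True, F - False [T/F] Optional, default = F
-o  Parse options: T - True: parse sequence identifiers and build indexes; F - False: do neither [T/F] Optional, default = F
Example command

formatdb -i ecoli.nt -p F -o T
After this, blastall can use the database directly.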

Detailed reference:

Command Line Options
A list of the command line options and the current version for formatdb may be obtained by executing formatdb without options, as in:
formatdb -
The formatdb options are summarized below:
formatdb 2.2.5 arguments:
-t Title for database file [String]
Optional
-i Input file(s) for formatting (this parameter must be set)
[File In]
-l Logfile name: [File Out]
Optional
default = formatdb.log
-p Type of file
T - protein
F - nucleotide [T/F] Optional
default = T
-o Parse options
T - True: Parse SeqId and create indexes.
F - False: Do not parse SeqId. Do not create indexes.
[T/F] Optional default = F
If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "E) Note on creating custom databases" below.
-a Input file is database in ASN.1 format (otherwise FASTA is expected)
T - True,
F - False.
[T/F] Optional default = F
-b ASN.1 database in binary mode
T - binary,
F - text mode.
[T/F] Optional default = F
A source ASN.1 database may be represented in two formats - ascii text and binary. The "-b" option, if TRUE, specifies that input ASN.1 database is in binary format. The option is ignored in case of FASTA input database.
-e Input is a Seq-entry [T/F]
Optional
default = F
A source ASN.1 database (either text ascii or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
-n Base name for BLAST files [String]
Optional
This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli':
formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
The same option names the database when the FASTA data is piped in on standard input:
uncompress -c nr.z | formatdb -i stdin -o T -n nr
This can be used in situations where the original FASTA file is not required other than by formatdb. This can help in a situation where disk-space is tight.
-v Database volume size in millions of letters [Integer] Optional
default = 0
range from 0 to
This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
-s Create indexes limited only to accessions - sparse [T/F]
Optional
default = F
This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequences sets like the EST's where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
-L Create an alias file with this name
use the gifile arg (below) if set to calculate db size
use the BLAST db specified with -i (above) [File Out] Optional
This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.
-F Gifile (file containing list of gi's) [File In] Optional
This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.
-B Binary Gifile produced from the Gifile specified above [File Out]
Optional
This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
Notes/Troubleshooting:
A.) Note on -o option:
It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ftp.NCBI.nih.gov/blast/db/README. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers for the following cases:
1.) ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE.
2.) query-anchored alignments are desired (i.e., the '-m' option with a non-zero value is used).
3.) The gi's are desired as part of the output (i.e., '-I' is used).
4.) fastacmd will be used to fetch sequences from the database by accession or gi.
See Appendix 1: The Files Produced by Formatdb for more information on the -o T option.
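Since -o T relies on the first word of each FASTA definition line being a unique identifier, it can help to check a file for repeated identifiers before formatting it. The following is only a minimal sketch, assuming a POSIX shell with awk, sort, and uniq; the function name dup_ids and the file demo.fa are made up for illustration and are not part of BLAST.

```shell
# Print every identifier (first word after '>') that occurs more than once.
dup_ids() {
    awk '/^>/ { id = substr($1, 2); print id }' "$1" | sort | uniq -d
}

# Hypothetical demo file with one repeated identifier.
workdir=$(mktemp -d)
cat > "$workdir/demo.fa" <<'EOF'
>seq1 first record
ACGTACGT
>seq2 second record
GGCCTTAA
>seq1 repeated identifier
TTTTAAAA
EOF

dup_ids "$workdir/demo.fa"    # lists seq1
```

An empty result suggests the identifiers are unique; it does not check that they follow the parseable defline format.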
B.) Note on "SORTFiles failed" message:
Formatdb will use the 'standard' temporary directory to sort the string indices on disk. Under UNIX this directory is often /var/tmp and if there is not enough space there, then the error message: "ERROR: [000.000] SORTFiles failed" will be issued. This can be avoided by setting the TMPDIR environment variable to a partition with more free space. This message may also often be avoided by using the sparse option (-s) for formatdb described above.
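The TMPDIR workaround can be scripted. This is a minimal sketch, assuming a POSIX shell and a df that supports -P and -k; the helper free_kb is invented here, and the scratch directory is a throwaway created with mktemp purely so the example runs anywhere. In practice it would be a directory on a partition with more free space.

```shell
# Report free space, in kilobytes, on the file system holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Stand-in for a directory on a roomier partition (hypothetical).
scratch=$(mktemp -d)
echo "free space in $scratch: $(free_kb "$scratch") KB"

# formatdb sorts its string indices in the temporary directory, so point
# TMPDIR at the scratch area before running formatdb from this shell.
TMPDIR=$scratch
export TMPDIR
```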
C.) Note on formatting large (4 Gig and larger) FASTA files:
A single BLAST database can contain up to 4 billion letters. If one wishes to formatdb a FASTA file containing more letters than this, several databases, each of a maximum size of 4 billion bases, will be produced. This will be done automatically if the -v argument is not set. One may also specify a smaller size for the volume databases by using the -v option:
formatdb -i hugefasta -p F -v 2000000000
This command line will format the "hugefasta" FASTA file as a number of database "volumes," each containing a maximum of two billion base pairs, as specified by the "-v" option. Two billion is the current limitation on the NCBI toolkit command-line parser. The volumes will have names consisting of the root database name, "hugefasta" followed by a two-digit volume extension, followed by the usual BLAST database extensions. These smaller databases can be searched as if they were a single entity using:
blastall -i infile -d hugefasta -p blastn -o out
In this case, BLAST recognizes that the database "hugefasta" has been partitioned into several volumes because it detects a file with the name of the root database followed by an extension of "nal" (for protein databases, the extension is "pal"). This file specifies a database list to be searched when the root database name is specified to BLAST. BLAST sequentially searches each database listed in this "nal" file and generates output that is indistinguishable from that of a single database search. A sample "nal" file, resulting from formatting the datafile "hugefasta" into three volumes, is given below. The "DBLIST" line can also be edited to specify additional databases to be searched.
#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE hugefasta
#
DBLIST hugefasta.00 hugefasta.01 hugefasta.02
#
#GILIST
#
#OIDLIST
#
The "nal" and "pal" files can also be used to simplify searches of multiple databases created separately. For instance, a file called "multi.nal" containing the following lines could be created from scratch using a text editor.
#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE multi
#
DBLIST part1 part2 part3
#
#GILIST
#
#OIDLIST
#
The "multi.nal" file would allow the three databases, "part1", "part2", and "part3", to be searched by specifying a single database name, "multi", on the blastall command line as follows:
blastall -i infile -d multi -p blastn -o out
The reason for using database volumes, as opposed to simply making the indices in the BLAST databases large enough to handle all conceivable databases with an eight-byte 'integer', is that this would have doubled the size of the indices for all searches no matter how small the database. Hence very large FASTA files are broken down into a couple of databases.
Formatdb must be able to open files larger than 2 Gig in order to work on very large files. This is not a problem on a 64-bit OS or on 32-bit OSes that allow binaries to be made large-file aware. The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled large-file aware.
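An alias file such as the multi.nal example above can also be written from a script rather than a text editor. This is a minimal sketch, assuming a POSIX shell; part1, part2, part3, and multi are the names from the example, while the helper alias_dbs is invented here to read the DBLIST line back. The commented-out GILIST and OIDLIST lines of the sample are left out on the assumption that comment lines are optional.

```shell
workdir=$(mktemp -d)

# Write the alias file; only TITLE and DBLIST carry information here.
cat > "$workdir/multi.nal" <<'EOF'
#
# Alias file written by hand
#
TITLE multi
#
DBLIST part1 part2 part3
#
EOF

# List the databases named on the DBLIST line, one per line.
alias_dbs() {
    awk '$1 == "DBLIST" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

alias_dbs "$workdir/multi.nal"
```

Reading the list back this way is a cheap check that every volume or database named in the alias file actually exists before a search is started.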
D.) Note on running formatdb on a database without uncompressing it:
Under UNIX it is possible to uncompress a database on the fly and pipe it to formatdb. This can reduce the disk-space needed for running formatdb on a large database. In addition, some operating systems cannot write files larger than 2 Gig to disk. To circumvent this on Unix or Linux systems, use a "pipe" system such as:
uncompress -c nt.Z | formatdb -i stdin -o T -p F -n "nt" -v 100000000
In this case, no file is written which is larger than 1 Gig and an arbitrarily large database is formatted as a set of 1 Gig volumes. Note the use of the '-n' option that specifies the name of the resulting BLAST database. Note also that 'stdin' specifies that input will be coming from 'standard input'. The nt FASTA file is not needed for running BLAST searches and nt.Z may be deleted after formatdb has been run.
E) Note on creating custom databases:
With Standalone BLAST it is possible to take any custom file of FASTA sequences and use these as a database source file for searching. All BLAST database source files must be in FASTA format. In order to use the formatdb option -o T, especially for use with NCBI tool kit retrieval tools the FASTA defline must follow a specific format.
F) Note on creating an alias file for a GI list:
Formatdb can now produce a BLAST database alias file that specifies a (real) BLAST database to search as well as a GI list to limit the search. This is useful if one often searches a subset of a database (e.g., based on organism or a curated list). The alias file makes the search appear as if one were searching a real database rather than the subset of one. The procedure to produce an alias file for searching (protein) nr limiting it to a list of zebrafish GI's would be:
1.) obtain the list of zebrafish GI's from Entrez or some other source and keep it in a file called "zebrafish.gi.in".
2.) invoke formatdb to convert the text GI list to the more efficient binary format:
formatdb -F zebrafish.gi.in -B zebrafish.gi
3.) invoke formatdb with the following options:
formatdb -i nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"
This will produce the alias file zebrafish.pal listing the database title, the real database to be searched, the GI file, and some statistics:
#
# Alias file created Thu Jul 5 15:04:29 2001
#
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
One can search this by invoking (for example):
blastall -p blastp -d zebrafish -i MYQUERY -o MYOUTPUT
NOTE: One may wish to prepare the alias file in one directory, but move it to a different production directory that does not contain the real database. In that case you may use the '-n' option to specify a path to the real database in the production environment. In the example below the -n option is used to specify that the nr database can really be found at a relative path of ../../newest_blast/blast
formatdb -i nr -n ../../newest_blast/blast/nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"
and the alias file will be:
#
# Alias file created Wed Nov 28 13:55:41 2001
#
#
TITLE My zebrafish database
#
DBLIST ../../newest_blast/blast/nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
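
Before converting a text GI list with -F and -B, it may be worth normalizing it. This is a rough sketch, assuming the text list holds one numeric GI per line (that layout is an assumption here, as is the helper name clean_gi_list); the GI values are made up.

```shell
# Keep only purely numeric lines, drop duplicates, and sort numerically.
clean_gi_list() {
    grep -E '^[0-9]+$' "$1" | sort -n -u
}

workdir=$(mktemp -d)
# Hypothetical input with a duplicate, a stray word, and an empty line.
printf '%s\n' 222 111 abc 222 '' 333 > "$workdir/zebrafish.gi.in"

clean_gi_list "$workdir/zebrafish.gi.in" > "$workdir/zebrafish.gi.clean"
cat "$workdir/zebrafish.gi.clean"
```

Lines with trailing spaces or Windows line endings are dropped by this filter, so a large difference in line counts before and after is a hint to inspect the input.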


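Building a local BLAST
BLAST comes in two forms: the online service (http://blast.NCBI.nlm.nih.gov/Blast.cgi) and local BLAST (also called standalone BLAST).
For a small number of sequences that need to be compared against an entire nucleotide or protein database (or that of a single species), online BLAST is the usual choice: it is simple, visual, and easy to pick up. For large numbers of sequences, for databases that cannot be made public yet, or for custom databases, local BLAST is the usual choice: it is efficient and more flexible, though somewhat more fiddly to operate.
For online BLAST, see: an illustrated guide to NCBI online BLAST
For the BLAST software download, see: download the latest local BLAST
What you need to build a local BLAST:
Download the BLAST software
Build a local BLAST database
Learn the common BLAST options and commands
How to build a local BLAST database:
Suppose you have sequence data (sequence.fa, multiple sequences in FASTA format) and want to turn it into a BLAST database. Typical commands are:
Nucleotide sequences:
$ ./formatdb -i sequence.fa -p F -o T
Protein sequences:
$ ./formatdb -i sequence.fa -p T -o T
(Use -o F instead of -o T if the sequence identifiers should not be parsed and indexed.)
Running BLAST:
Once the standalone BLAST package is unpacked and a matching database (db) is available, you can start running BLAST analyses.
The standalone package bundles the basic BLAST analyses, including blastn, blastp, and blastx, into the single program blastall.
A typical blastn command looks like this:
(query sequences in seq.fa, database nt_db)
$ ./blastall -p blastn -i seq.fa -d nt_db -W 7 -e 10 -o seq.blastn.out
(This runs a blastn search of the nucleotide sequences in seq.fa against the nt_db database with a word size of 7 and an E-value cutoff of 10, and saves the results to seq.blastn.out.)
Common blastall options:
-p Program name; one of blastn, blastp, blastx, tblastn, tblastx
-d Database name, default nr
-i Query sequence file, default stdin
-e E-value cutoff, default 10
-o Output file, default stdout
-F Filter option, default T
-a Number of CPUs to use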

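Using local BLAST in detail

blastall -p blastn -i myRNA.fasta -d humanRNA.fasta -o myresult.blastout -a 2 -F F -T T -e 1e-10
Explanation:

blastall: the name of the program for running BLAST locally from the command line. (Tip: running blastall with no arguments prints help for all of its options, in English.)
-p: short for "program". It selects the subprogram to use, depending on what is being compared: nucleotide against nucleotide, protein against protein, a translated nucleotide query against protein, and so on. blastn compares nucleotide with nucleotide and blastp compares protein with protein; there are five subprograms in total.
-i: short for "input", the file holding your own query sequences (FASTA format).
-d: short for "database", the target database to search, humanRNA.fasta in this example (do not forget to run formatdb on it first).
-o: short for "output", the name of the results file. Name it however you like; it may include a path (so may the -i and -d arguments).
* In practice you will set these four options on every run. Strictly speaking only -p has no default (-d defaults to nr, -i to stdin, -o to stdout). The options below are tuning knobs; if you leave them out, blastall falls back to its defaults.
-a: the number of CPUs to use. My machine has two CPUs, so I use -a 2 to run the search in parallel and speed it up. If your computer has a single CPU you can omit it; the default is 1.
-F: short for "filter". blastall filters out simple repeats and low-complexity regions; the default is T. (Several options take just two values, T or F: T is true, meaning the feature is on; F is false, meaning it is off.)
-T: short for HTML. It controls whether the results file is written as HTML; the default is F. If you want to read the results in IE, I suggest -T T.
-e: the Expectation value; the default is 10, and I use 1e-10 (10^-10).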
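Because a database that was never run through formatdb is an easy mistake to make, a quick check before calling blastall can help. This is a minimal sketch, assuming the usual index extensions written by formatdb (.nin, or .nal for a multi-volume alias, for nucleotide databases; .pin or .pal for protein ones). The function name db_is_formatted is made up, and the files are created with touch only so the example runs without BLAST installed.

```shell
# Succeed if a formatted database with this base name exists.
# $1 = database base name (as passed to -d), $2 = n (nucleotide) or p (protein)
db_is_formatted() {
    [ -e "$1.${2}in" ] || [ -e "$1.${2}al" ]
}

workdir=$(mktemp -d)
# Empty stand-ins for the files formatdb would write for a nucleotide database.
touch "$workdir/humanRNA.fasta.nhr" "$workdir/humanRNA.fasta.nin" "$workdir/humanRNA.fasta.nsq"

if db_is_formatted "$workdir/humanRNA.fasta" n; then
    echo "database is formatted"
else
    echo "run formatdb -i humanRNA.fasta -p F first"
fi
```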
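BLASTALL usage
a. Formatting the sequence database
Formatting a sequence database: formatdb
A brief introduction to formatdb:
formatdb accepts input in either ASN.1 or FASTA format. The step is required whether the database holds nucleotide or protein sequences, and whether the search is run with blastall, blastpgp, or MegaBLAST.
formatdb command-line options:
formatdb -    Prints all formatdb options with descriptions (see Appendix 2).
Main options:
-i Name of the source database to format (required)
-p Type of file: nucleotide or protein sequence database
    T - protein   F - nucleotide [T/F] Optional default = T
-a Input database is in ASN.1 format (otherwise FASTA is expected)
    T - True,     F - False.    [T/F] Optional default = F
-o Parse options
    T - True: parse sequence identifiers and build indexes
    F - False: do neither
   [T/F] Optional default = F
Example command:
formatdb -i ecoli.nt -p F -o T
Running this command creates, in the current directory, the 7 files used for BLAST searches. Once the formatdb command has finished, ecoli.nt is no longer needed and can be removed. blastall can then use the database directly.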
b. A quick look at common blastall options
-p Program Name [String]
Name of the program to run; choose one of blastn, blastp, blastx, tblastn, tblastx as needed.
-d Database [String] default = nr
Name of the sequence database to search.
-i Query File [File In] default = stdin
Query sequence file; test.txt in this article's example.
-e Expectation value (E) [Real]   default = 10.0
The number of matches expected to occur by chance when searching a given database.
-m alignment view options:
0 = pairwise (default)
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = query-anchored no identities and blunt ends
6 = flat query-anchored, no identities and blunt ends
7 = XML Blast output
8 = tabular (tab-separated)
9 = tabular with comment lines
10 = ASN, text
11 = ASN, binary
[Integer] default = 0
Example of -m 8 output:
A_query    B_Sbjct    97.61    585    3    3    309    886    94498    95078    0.0    1017
A_query    B_Sbjct    100.00    303    0    0    913    1215    95092    95394    2e-172    601
A_query    B_Sbjct    100.00    209    0    0    1    209    94196    94404    3e-116    414
A_query    B_Sbjct    100.00    123    0    0    1234    1356    95413    95535    6e-65    244
A_query    B_Sbjct    100.00    41    0    0    210    250    94096    94136    5e-16    81.8
A_query    B_Sbjct    100.00    35    0    0    251    285    94440    94474    2e-12    69.9
A_query    B_Sbjct    100.00    29    0    0    885    913    95747    95775    7e-09    58.0
A_query    A_query    97.61    585    3    3    309    886    403    983    0.0    1017
A_query    A_query    100.00    303    0    0    913    1215    997    1299    2e-172    601
A_query    A_query    100.00    209    0    0    1    209    101    309    3e-116    414
A_query    A_query    100.00    123    0    0    1234    1356    1318    1440    6e-65    244
A_query    A_query    100.00    41    0    0    210    250    1    41    5e-16    81.8
A_query    A_query    100.00    35    0    0    251    285    345    379    2e-12    69.9
A_query    A_query    100.00    29    0    0    885    913    1652    1680    7e-09    58.0

The output has 12 columns:
Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
-------------------------------------------------------------------------------------------------------------
-o BLAST report Output File [File Out] Optional default = stdout
File the BLAST report is written to.
-F Filter query sequence (DUST with blastn, SEG with others) [String] default = T
Filters low-complexity regions out of the query, since they can distort the alignment results. blastn queries are filtered with DUST, all others with SEG. See the documentation of DUST and SEG for details.
-G Cost to open a gap (zero invokes default behavior) [Integer]   default = 0
Gap opening penalty.
-E Cost to extend a gap (zero invokes default behavior) [Integer] default = 0
Gap extension penalty.
-T Produce HTML output [T/F] default = F
Writes the report as a web page.
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
blastn 30, megablast 20, tblastx 0, all others 15 [Integer] default = 0
-I Show GI's in deflines [T/F]   default = F
Shows GI numbers in the definition lines; off by default.
-q Penalty for a nucleotide mismatch (blastn only) [Integer] default = -3
Score deducted for a mismatched base; 3 points by default.
-r Reward for a nucleotide match (blastn only) [Integer] default = 1
Score added for a matched base; 1 point by default.
-g Perform gapped alignment (not available with tblastx) [T/F] default = T
Whether to run a gapped alignment; on by default.
-a Number of processors to use [Integer] default = 1
Number of processors; one by default.
-B Number of concatenated queries, for blastn and tblastn [Integer] Optional default = 0
Number of queries concatenated into one search; a single query by default.
-M Matrix [String] default = BLOSUM62
Scoring matrix.
-W Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer] default = 0
Word size.
-w Frame shift penalty (OOF algorithm for blastx) [Integer] default = 0
Frame-shift penalty.
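
The 12-column -m 8 layout lends itself to filtering with awk. This is a minimal sketch, assuming tab-separated output in the column order listed above; the function name filter_hits and the thresholds are illustrative, and the sample rows are copied from the example output.

```shell
# Keep hits with % identity >= $2 and e-value <= $3.
# Columns: 3 = % identity, 11 = e-value.
filter_hits() {
    awk -F '\t' -v min_id="$2" -v max_e="$3" \
        '($3 + 0) >= (min_id + 0) && ($11 + 0) <= (max_e + 0)' "$1"
}

workdir=$(mktemp -d)
printf 'A_query\tB_Sbjct\t97.61\t585\t3\t3\t309\t886\t94498\t95078\t0.0\t1017\n'     >  "$workdir/hits.tsv"
printf 'A_query\tB_Sbjct\t100.00\t303\t0\t0\t913\t1215\t95092\t95394\t2e-172\t601\n' >> "$workdir/hits.tsv"
printf 'A_query\tB_Sbjct\t100.00\t29\t0\t0\t885\t913\t95747\t95775\t7e-09\t58.0\n'   >> "$workdir/hits.tsv"

# Hits that are 100% identical with an e-value of at most 1e-10.
filter_hits "$workdir/hits.tsv" 100 1e-10
```

Adding 0 to each field forces awk to compare numbers rather than strings, which matters for e-values written in scientific notation.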

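Batch BLAST

Quite a few people have asked how to run BLAST in batch, so here is a short explanation. I will skip the other BLAST basics; if they are unfamiliar, search Google for local BLAST usage.
This post assumes you already know the basics of BLAST.
Batch BLAST
simply means running BLAST on many sequences. Frankly I am not sure why so many people ask about it, because a batch BLAST works exactly like a single one.
As we all know, blastn with default options is run like this:
blastall -p blastn -d BlastDB -i in_file.fasta  >blast_output
When in_file.fasta holds a single sequence, that is a single BLAST. in_file.fasta can also hold many FASTA sequences, and then it is a batch BLAST.
The awkward part is the output. One result is easy to read, but with thousands of queries nobody can read them one by one. BLAST anticipated this, which is where the -m8 option comes in. The -b5 option shows alignments for only the top 5 matches (the default is 250; 500 is the default for the one-line descriptions set with -v).
blastall -p blastn -d BlastDB -i in_file.fasta  -m8 -b5 >blast_output
The -m8 output has 12 tab-separated columns, as follows:
Example
Query_id,Subject_id,%identity,alignment_length,mismatches,gap_openings,q.start,q.end,s.start,s.end,e-value,bit_score
A_query    B_Sbjct    97.61    585    3    3    309    886    94498    95078    0.0    1017
A_query    B_Sbjct    100.00    303    0    0    913    1215    95092    95394    2e-172    601
A_query    B_Sbjct    100.00    209    0    0    1    209    94196    94404    3e-116    414
A_query    B_Sbjct    100.00    123    0    0    1234    1356    95413    95535    6e-65    244
Results in this form are easy to feed into later analysis.
The recommended command line is:
blastall -p blastn -d BlastDB -i in_file.fasta  -m8 -b5 -b1 -a2 -FF >blast_output
-a2 uses two CPUs to speed things up. -FF turns off filtering of simple repeats and low-complexity sequence (filtering is on by default).
For the remaining options, just run blastall on its own to see them.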
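With thousands of queries, a common next step is to reduce the table to one line per query. This is a rough sketch, assuming tab-separated -m8 output where column 1 is the query and column 12 the bit score; the function name best_hits and the query and subject names are invented for illustration.

```shell
# Print the highest-scoring hit (column 12, bit score) for each query
# (column 1), keeping queries in the order they first appear.
best_hits() {
    awk -F '\t' '
        !($1 in best) || ($12 + 0) > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}

workdir=$(mktemp -d)
# Made-up hits: two for query q1, one for query q2.
printf 'q1\tsA\t98.00\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n'  >  "$workdir/blast_output"
printf 'q1\tsB\t100.00\t150\t0\t0\t1\t150\t1\t150\t1e-80\t300\n' >> "$workdir/blast_output"
printf 'q2\tsC\t95.00\t60\t3\t0\t1\t60\t5\t64\t1e-20\t90.5\n'    >> "$workdir/blast_output"

best_hits "$workdir/blast_output"
```

The comparison is done inside awk rather than with sort -n so that bit scores printed in scientific notation are still compared as numbers.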