Biskit :: Mod :: SequenceSearcher :: SequenceSearcher :: Class SequenceSearcher
[hide private]
[frames] | no frames]

Class SequenceSearcher

source code

Take a sequence and return a list of nonredundant homolog sequences as fasta file. The search can be performed using three different methods:
  • localBlast Uses Bio.Blast.NCBIStandalone.blastall (Biopython) to perform the search.
  • localPSIBlast Uses Bio.Blast.NCBIStandalone.blastpgp (Biopython)
  • remoteBlast Uses Bio.Blast.NCBIWWW.qblast (Biopython) which performs a BLAST search using the QBLAST server at NCBI.



  • Note: the blast sequence database has to be built with -o potion e.g. cat db1.dat db2.dat | formatdb -i stdin -o T -n indexed_db

    To Do: copy blast output

    Instance Methods [hide private]
      __init__(self, outFolder='.', clusterLimit=50, verbose=0, log=None)
      prepareFolders(self)
    Create needed output folders if not there.
    [Bio.Fasta.Record] getRecords(self)
    Get all homologues.
    [Bio.Fasta.Record] getClusteredRecords(self)
    Get representative of each cluster.
      __blast2dict(self, parsed_blast, db)
    Convert parsed blast result into dictionary of FastaRecords indexed by sequence ID.
      remoteBlast(self, seqFile, db, method, e=0.01, **kw)
    Perform a remote BLAST search using the QBLAST server at NCBI.
      localBlast(self, seqFile, db, method='blastp', resultOut=None, e=0.01, **kw)
    Performa a local blast search (requires that the blast binaries and databases are installed localy).
      localPSIBlast(self, seqFile, db, resultOut=None, e=0.01, **kw)
    Performa a local psi-blast search (requires that the blast binaries and databases are installed localy).
    [str] getSequenceIDs(self, blast_records)
    Extract sequence ids from BlastParser result.
    Bio.Fasta.Record fastaRecordFromId(self, db, id)
    Use:
    {str: Bio.Fasta.Record} fastaFromIds(self, db, id_lst, fastaOut=None)
    Use:
      copyClusterOut(self, raw=None)
    Write clustering results to file.
      reportClustering(self, raw=None)
    Report the clustering result.
      clusterFasta(self, fastaIn=None, simCut=1.75, lenCut=0.9, ncpu=1)
    Cluster sequences.
      clusterFastaIterative(self, fastaIn=None, simCut=1.75, lenCut=0.9, ncpu=1)
    Run cluterFasta iteratively, with tighter clustering settings, until the number of clusters are less than self.clusterLimit.
    str selectFasta(self, ids_from_cluster)
    Select one member of cluster of sequences.
      writeFasta(self, frecords, fastaOut)
    Create fasta file for given set of records.
      writeFastaAll(self, fastaOut=None)
    Write all found template sequences to fasta file.
      writeFastaClustered(self, fastaOut=None)
    Write non-redundant set of template sequences to fasta file.
      __writeBlastResult(self, parsed_blast, outFile)
    Write the result from the blast search to file (similar to the output produced by a regular blast run).
      writeClusteredBlastResult(self, allFile, clustFile, selection)
    Reads the blast.out file and keeps only centers.

    Class Variables [hide private]
      F_RESULT_FOLDER = '/sequences'
      F_FASTA_ALL = F_RESULT_FOLDER+ '/all.fasta'
      F_FASTA_NR = F_RESULT_FOLDER+ '/nr.fasta'
      F_CLUSTER_RAW = F_RESULT_FOLDER+ '/cluster_raw.out'
      F_CLUSTER_LOG = F_RESULT_FOLDER+ '/cluster_result.out'
      F_BLAST_OUT = F_RESULT_FOLDER+ '/blast.out'
      F_CLUSTER_BLAST_OUT = F_RESULT_FOLDER+ '/cluster_blast.out'
      F_FASTA_TARGET = '/target.fasta'

    Method Details [hide private]

    __init__(self, outFolder='.', clusterLimit=50, verbose=0, log=None)
    (Constructor)

    source code 
    Parameters:
    • outFolder (str) - project folder (results are put into subfolder) ['.']
    • clusterLimit (int) - maximal number of returned sequence clusters (default: 50)
    • verbose (1|0) - keep temporary files (default: 0)
    • log (LogFile) - log file instance, if None, STDOUT is used (default: None)

    prepareFolders(self)

    source code 

    Create needed output folders if not there.

    getRecords(self)

    source code 

    Get all homologues.
    Returns: [Bio.Fasta.Record]
    list of Bio.Fasta.Record with all found protein homologues
    Raises:

    getClusteredRecords(self)

    source code 

    Get representative of each cluster.
    Returns: [Bio.Fasta.Record]
    list with Bio.Fasta.Record with best record of each cluster
    Raises:

    __blast2dict(self, parsed_blast, db)

    source code 

    Convert parsed blast result into dictionary of FastaRecords indexed by sequence ID. Writes all the sequences in fasta format to F_FASTA_ALL and the raw blast result to F_BLAST_OUT.
    Parameters:
    • parsed_blast (Bio.Blast.Record.Blast) - parsed blast result
    • db (str) - database

    remoteBlast(self, seqFile, db, method, e=0.01, **kw)

    source code 

    Perform a remote BLAST search using the QBLAST server at NCBI. Uses Bio.Blast.NCBIWWW.qblast (Biopython) for the search
    Parameters:
    • seqFile (str) - file name with search sequence as FASTA
    • db ([str]) - database(s) to search in, e.g. ['swissprot', 'pdb']
    • method (str) - search method, e.g. 'blastp', 'fasta'
    • e (float) - expectation value cutoff
    • kw (any) - optional keywords:
            program        BLASTP, BLASTN, BLASTX, TBLASTN, or TBLASTX.
            database       Which database to search against.
            sequence       The sequence to search.
            ncbi_gi        TRUE/FALSE whether to give 'gi' identifier.
                           (default: FALSE)
            descriptions   Number of descriptions to show.  Def 500.
            alignments     Number of alignments to show.  Def 500.
            expect         An expect value cutoff.  Def 10.0.
            matrix         Specify an alt. matrix
                           (PAM30, PAM70, BLOSUM80, BLOSUM45).
            filter         'none' turns off filtering. Default uses 'seg'
                           or 'dust'.
            format_type    'HTML', 'Text', 'ASN.1', or 'XML'.  Def. 'HTML
      

    Note: Using the remoteBlast is asking for trouble, as every change in the output file might kill the parser. If you still want to use remoteBlast we strongly recomend that you install BioPython from CVS. Information on how to do this you will find on the BioPython homepage.

    To Do: Remote Blasting is running as expected but sequences are still retrieved from a local database. Implement remote collection of fasta seuqences from NCBI (there should be something like in Biopython). Otherwise something like this will also work:

     ## collect the entry with gi 87047648
     url = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=text&db=protein&dopt=FASTA&dispmax=20&uid=87047648'
     import urllib
     handle = urllib.urlopen(url)
     lines = handle.readlines()
     seq = ''
     for i in range(len(lines)):
         if re.match( "[<]pre[>][<]div class='recordbody'[>]{2}gi", l[i] ):
             i+= 1
             while not re.match( "^[<]/pre[>][<]/div[>]", l[i]):
                 seq += l[i][:-1]
                 i+= 1
     print seq       
    

    localBlast(self, seqFile, db, method='blastp', resultOut=None, e=0.01, **kw)

    source code 

    Performa a local blast search (requires that the blast binaries and databases are installed localy). Uses Bio.Blast.NCBIStandalone.blastall (Biopython) for the search.
    Parameters:
    • seqFile (str) - file name with search sequence in FASTA format
    • db ([str]) - database(s) to search, e.g. ['swissprot', 'pdb']
    • method (str) - search program to use, e.g. 'blastp', 'fasta' (default: blastp)
    • e (float) - expectation value cutoff
    • resultOut (str) - save blast output to this new file
    • kw (any) - optional keywords:
             --- Scoring ---
             matrix         Matrix to use (default BLOSUM62).
             gap_open       Gap open penalty (default 0).
             gap_extend     Gap extension penalty (default 0).
      
             --- Algorithm ---
             gapped         Whether to do a gapped alignment. T/F 
                             (default T)
             wordsize       Word size (blastp default 11).
             keep_hits      Number of best hits from a region to keep
                             (default off).
             xdrop          Dropoff value (bits) for gapped alignments
                             (blastp default 25).
             hit_extend     Threshold for extending hits (blastp default 11)
      
             --- Processing ---
             filter         Filter query sequence? (T/F, default F)
             restrict_gi    Restrict search to these GI's.
             believe_query  Believe the query defline? (T/F, default F)
             nprocessors    Number of processors to use (default 1).
      
             --- Formatting ---
             alignments     Number of alignments. (default 250)
      
    Raises:

    localPSIBlast(self, seqFile, db, resultOut=None, e=0.01, **kw)

    source code 

    Performa a local psi-blast search (requires that the blast binaries and databases are installed localy). Uses Bio.Blast.NCBIStandalone.blastpgp (Biopython) for the search
    Parameters:
    • seqFile (str) - file name with search sequence in FASTA format
    • db ([str]) - database(s) to search e.g. ['swissprot', 'pdb']
    • e (float) - expectation value cutoff (default: 0.001)
    • resultOut (str) - save blast output to this new file
    • kw (any) - optional keywords:
         --- Scoring --- 
         matrix           Matrix to use (default BLOSUM62).
         gap_open         Gap open penalty (default 11).
         gap_extend       Gap extension penalty (default 1).
         window_size      Multiple hits window size (default 40).
         npasses          Number of passes (default 1).
         passes           Hits/passes (Integer 0-2, default 1).
      
         --- Algorithm --- 
         gapped           Whether to do a gapped alignment (T/F, default T).
         wordsize         Word size (default 3).
         keep_hits        Number of beset hits from a region to keep (def 0)
         xdrop            Dropoff value (bits) for gapped alignments
                          (def 15)
         hit_extend       Threshold for extending hits (default 11).
         nbits_gapping    Number of bits to trigger gapping (default 22).
         pseudocounts     Pseudocounts constants for multiple passes
                          (def 9).
         xdrop_final      X dropoff for final gapped alignment (default 25).
         xdrop_extension  Dropoff for blast extensions (default 7).
         model_threshold  E-value threshold to include in multipass model
                          (default 0.005).
         required_start   Start of required region in query (default 1).
         required_end     End of required region in query (default -1).
      
         --- Processing --- 
         filter           Filter query sequence with SEG? (T/F, default F)
         believe_query    Believe the query defline? (T/F, default F)
         nprocessors      Number of processors to use (default 1).
      
         --- Formatting --- 
         alignments       Number of alignments (default 250).
      
    Raises:

    getSequenceIDs(self, blast_records)

    source code 

    Extract sequence ids from BlastParser result.

    Use:
      getSequenceIDs( Bio.Blast.Record.Blast ) -> [ str ]
    
    Parameters:
    • blast_records (Bio.Blast.Record.Blast) - blast search result
    Returns: [str]
    list of sequence IDs
    Raises:

    fastaRecordFromId(self, db, id)

    source code 

    Use:
      fastaRecordFromId( db, id ) -> Bio.Fasta.Record
    
    Parameters:
    • db (str) - database
    • id (str) - sequence database ID
    Returns: Bio.Fasta.Record
    fasta record
    Raises:
    • BlastError - if can't fetch fasta record from database

    fastaFromIds(self, db, id_lst, fastaOut=None)

    source code 

    Use:
      fastaFromIds( id_lst, fastaOut ) -> { str: Bio.Fasta.Record }
    
    Parameters:
    • db (str) - database
    • id_lst ([str]) - sequence database IDs
    Returns: {str: Bio.Fasta.Record}
    dictionary mapping IDs to Bio.Fasta.Records
    Raises:

    copyClusterOut(self, raw=None)

    source code 

    Write clustering results to file.
    Parameters:
    • raw (1|0) - write raw clustering result to disk (default: None)

    reportClustering(self, raw=None)

    source code 

    Report the clustering result.

    Writes:
    Parameters:
    • raw (1|0) - write raw clustering result to disk (default: None)

    clusterFasta(self, fastaIn=None, simCut=1.75, lenCut=0.9, ncpu=1)

    source code 

    Cluster sequences. The input fasta titles must be the IDs. fastaClust( fastaIn [, simCut, lenCut, ncpu] )
    Parameters:
    • fastaIn (str) - name of input fasta file
    • simCut (double) - similarity threshold (score < 3 or %identity) (default: 1.75)
    • lenCut (double) - length threshold (default: 0.9)
    • ncpu (int) - number of CPUs
    Raises:

    clusterFastaIterative(self, fastaIn=None, simCut=1.75, lenCut=0.9, ncpu=1)

    source code 

    Run cluterFasta iteratively, with tighter clustering settings, until the number of clusters are less than self.clusterLimit.
    Parameters:
    • fastaIn (str) - name of input fasta file
    • simCut (double) - similarity threshold (score < 3 or %identity) (default: 1.75)
    • lenCut (double) - length threshold (default: 0.9)
    • ncpu (int) - number of CPUs

    Note: Older versions of T-Coffee can't handle more that approx. 50 sequences (out of memory). This problem is taken care of in versions > 3.2 of T-Coffee.

    selectFasta(self, ids_from_cluster)

    source code 

    Select one member of cluster of sequences.
    Parameters:
    • ids_from_cluster ([str]) - list of sequence ids defining the cluster
    Returns: str
    id of best in cluster

    writeFasta(self, frecords, fastaOut)

    source code 

    Create fasta file for given set of records.
    Parameters:
    • frecords ([Bio.Blast.Record]) - list of Bio.Blast.Records
    • fastaOut (str) - file name

    writeFastaAll(self, fastaOut=None)

    source code 

    Write all found template sequences to fasta file.
    Parameters:
    • fastaOut (str OR None) - write all fasta records to file (default: F_FASTA_ALL)

    writeFastaClustered(self, fastaOut=None)

    source code 

    Write non-redundant set of template sequences to fasta file.
    Parameters:
    • fastaOut (str) - write non-redundant fasta records to file (default: F_FASTA_NR)

    __writeBlastResult(self, parsed_blast, outFile)

    source code 

    Write the result from the blast search to file (similar to the output produced by a regular blast run).

    writeBlastResult( parsed_blast, outFile )
    Parameters:
    • parsed_blast (Bio.Blast.Record.Blast) - Bio.Blast.Record.Blast
    • outFile (str) - file to write the blast result to

    writeClusteredBlastResult(self, allFile, clustFile, selection)

    source code 

    Reads the blast.out file and keeps only centers.
    Parameters:
    • allFile (file) - all blast results
    • clustFile (str) - output file name
    • selection (list) - write only sequences in list

    Class Variable Details [hide private]

    F_RESULT_FOLDER

    Value:
    '/sequences'                                                           
          

    F_FASTA_ALL

    Value:
    F_RESULT_FOLDER+ '/all.fasta'                                          
          

    F_FASTA_NR

    Value:
    F_RESULT_FOLDER+ '/nr.fasta'                                           
          

    F_CLUSTER_RAW

    Value:
    F_RESULT_FOLDER+ '/cluster_raw.out'                                    
          

    F_CLUSTER_LOG

    Value:
    F_RESULT_FOLDER+ '/cluster_result.out'                                 
          

    F_BLAST_OUT

    Value:
    F_RESULT_FOLDER+ '/blast.out'                                          
          

    F_CLUSTER_BLAST_OUT

    Value:
    F_RESULT_FOLDER+ '/cluster_blast.out'                                  
          

    F_FASTA_TARGET

    Value:
    '/target.fasta'