Getting Started

The ppalign package consists of the following components:

  • ppalign

  • ppblast

  • The ppalign library

    ppalign

    ppalign can be used to analyse the posterior probabilities of a user-supplied alignment. Alternatively, the user may provide a pair of unaligned sequences (option --optimize [algo]). In this case ppalign first determines the optimal alignment and then computes the position-wise posterior probabilities for this alignment.

    Three algorithms are available:

    • Global alignment (option --algo global)
    • Global alignment restricted to the aligned part of the alignment. In this case the padding gaps at the beginning and the end of the alignment are ignored (option --algo global_bound). This option can be combined, for example, with the option --optimize local.
    • Probabilities for the start and end points of local alignments (option --algo local_start). In a second step, the user may then realign the sequences on a range chosen according to these distributions (options --restrict_start [i1 j1] and --restrict_end [i2 j2]).

    Example usage

    Use ppalign to compute the posterior probabilities for a given alignment. It can be used interactively; for instance, the following command computes the posterior probabilities for a protein alignment using the blosum62 matrix.

    ppalign -s blosum62 -a aa -f text

    Then paste or type the alignment in FASTA format. When the second sequence is complete, press Ctrl+D to indicate the end of the alignment. Alternatively, you may provide an alignment from a file via -i filename.fasta.

    >Query Sequence 
    G-YATTIIPRIYTYYVSTALFAIFGIRML----REGLKMSPDEGQEELEEVQAEIKKKDEELQRSKLANGAADVEAG
    >Subject Sequence 
    GRIVPNLISRKHTNSAATVLYAFFGLRLLYIAWRSDSKVSQKKEMEEVEE----------------------KLESG
    >
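Because ppalign expects an already aligned pair here, both rows must have the same number of columns. The following Python sketch checks this before the input is handed to ppalign. It is not part of the ppalign package; the helper names parse_fasta and check_pairwise_alignment are hypothetical and chosen for illustration only.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # Drop records without sequence data (e.g. a trailing lone ">").
    return [(h, s) for h, s in records if s]


def check_pairwise_alignment(records):
    """Return the alignment length, or raise ValueError if the input
    is not a pair of rows with the same number of columns."""
    if len(records) != 2:
        raise ValueError(f"expected 2 sequences, found {len(records)}")
    (_, row_a), (_, row_b) = records
    if len(row_a) != len(row_b):
        raise ValueError(
            f"aligned rows differ in length: {len(row_a)} vs {len(row_b)}"
        )
    return len(row_a)


# The example alignment from the text; gap runs are written as
# multiplications so that their lengths are explicit.
QUERY = ("G-YATTIIPRIYTYYVSTALFAIFGIRML" + "-" * 4 + "REGLKMS"
         + "PDEGQEELEEVQAEIKKKDEELQRSKLANGAADVEAG")
SUBJECT = ("GRIVPNLISRKHTNSAATVLYAFFGLRLLYIAWRSDSKVS"
           + "QKKEMEEVEE" + "-" * 22 + "KLESG")
FASTA_TEXT = f">Query Sequence\n{QUERY}\n>Subject Sequence\n{SUBJECT}\n"

columns = check_pairwise_alignment(parse_fasta(FASTA_TEXT))
print(columns)  # 77
```

Once the check passes, the text can be saved to a file and passed to ppalign with -i filename.fasta as described above.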
        

    If you do not know the alignment, you can let ppalign optimize it before the actual computation. Just use

    ppalign -s blosum62 -a aa -f text --optimize global

    and provide a pair of unaligned sequences. You will get the following result:

            PpAlign_Program: ppalign
             PpAlign_Version: 1.0
                    Alphabet: protein
                 Scorematrix: blosum62
                    Gap_Open: 11
               Gap_Extension: 1
    -------------
                    QueryDef: Query Sequence 
                  SubjectDef: Subject Sequence 
                   Align-len: 77
                    Identity: 20
                        Gaps: 27
                       Score: 38
               AvgPosterProb: 0.516278
              ---------------------------------------- 
              #     #######################            
              #    ########################            
              #    ########################            
              #   #########################            
              #   #########################            
              #  ##########################            
              #  ##########################            
              #############################            
              #############################            
              ######################################## 
           1  G-YATTIIPRIYTYYVSTALFAIFGIRML----REGLKMS 
           1  GRIVPNLISRKHTNSAATVLYAFFGLRLLYIAWRSDSKVS 
    
              ------------------------------------- 
                                              ##### 
                                              ##### 
                        #                     ##### 
                        ###                   ##### 
                        #######               ##### 
                        #########             ##### 
                        #############         ##### 
                        ###################   ##### 
                        ########################### 
              ##################################### 
          36  PDEGQEELEEVQAEIKKKDEELQRSKLANGAADVEAG 
          41  QKKEMEEVEE----------------------KLESG 
      

    In this output, the confidence (posterior probability) of each alignment position is shown as a vertical bar of "#" symbols, where each "#" corresponds to a 10% bin. ppalign reports the usual alignment characteristics such as length, number of gaps, number of matches and score, and additionally the average posterior probability.
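The Align-len, Identity and Gaps fields of the report can be recomputed from the two alignment rows alone. The Python sketch below does this for the example alignment. It is an illustration, not ppalign code: the function name alignment_summary is hypothetical, and it assumes that Identity counts columns with identical residues and that Gaps counts gap characters in both rows, which is consistent with the numbers shown above.

```python
def alignment_summary(query, subject, gap="-"):
    """Return length, identity and gap counts of a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("alignment rows must have the same length")
    # A column is an identity if both rows carry the same residue.
    identity = sum(
        1 for q, s in zip(query, subject) if q == s and q != gap
    )
    # Gaps are counted as single gap characters in either row.
    gaps = query.count(gap) + subject.count(gap)
    return {"Align-len": len(query), "Identity": identity, "Gaps": gaps}


# The two rows of the example alignment; gap runs are written as
# multiplications so that their lengths are explicit.
query = ("G-YATTIIPRIYTYYVSTALFAIFGIRML" + "-" * 4 + "REGLKMS"
         + "PDEGQEELEEVQAEIKKKDEELQRSKLANGAADVEAG")
subject = ("GRIVPNLISRKHTNSAATVLYAFFGLRLLYIAWRSDSKVS"
           + "QKKEMEEVEE" + "-" * 22 + "KLESG")

summary = alignment_summary(query, subject)
print(summary)  # {'Align-len': 77, 'Identity': 20, 'Gaps': 27}
```

The score and the average posterior probability cannot be reproduced this way, since they depend on the score matrix, the gap costs and the pair HMM used by ppalign.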

    • With the option -f xml you obtain a structured, machine-readable XML document.
    • To produce a human-readable HTML page, supply -f html -o example.html. In this example, we also used the options --sampling 10 --marg_decode to determine alternative alignments.
    • You may also try our ppALIGN webserver to test some features of ppalign, including local alignment.

    ppblast

    Let us start with a simple example of a nucleotide sequence similarity search using nblast (nucleotide BLAST), available on the NCBI web server.

    • First we search a DNA sequence against a DNA database. In our example we searched the human beta globin (gi|455025) against the mouse genome database.
    • If available, you may use the command-line version of BLAST (blastn for a DNA search) and request XML output (-m 7 for the legacy blastall program, -outfmt 5 for BLAST+). Redirect the XML output into a file (e.g. nblast.xml).
    • If you have used the web server, download the result in XML format (save it, for example, as nblast.xml). This option can be found on the result page of the BLAST web server (Download -> XML).
    • Run ppblast on the BLAST output
      ppblast -i nblast.xml -o ppblast.xml
      You have now created the extended BLAST output ppblast.xml, which contains the posterior probabilities. Note: If this step fails with the error message
      error in constructing pair hmm:
      1-2 * nu < 0
      choose larger gap costs!
           
      you probably used the default gap costs (0 for open and 0 for extension), which lie deep in the so-called linear regime. We recommend overriding these values with the command line arguments --open and --ext, for example
      ppblast -i nblast.xml --open 2 --ext 1 -o ppblast.xml
    • To produce a human readable HTML page and use more options, you may also try
      ppblast -i nblast.xml --open 2 --ext 1 -f html \
      --sampling 10 --expected --marg_decode -o ppblast.html
           
      In addition to the posterior probabilities, we have also sampled alternative alignments from the posterior distribution (--sampling 10), computed the expected score with respect to the pair HMM (--expected), and obtained the maximal averaged marginalized posterior alignments (--marg_decode).
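When ppblast is called from a script, the command lines above can be assembled programmatically. The following Python sketch builds the argument list using only the options mentioned in this section. The wrapper functions ppblast_command and run_ppblast are hypothetical helpers, not part of the ppalign package, and the sketch assumes that the ppblast executable is on the PATH.

```python
import shutil
import subprocess


def ppblast_command(blast_xml, output, gap_open=None, gap_ext=None,
                    fmt=None, sampling=None, expected=False,
                    marg_decode=False):
    """Build a ppblast argument list from the options used in the text."""
    cmd = ["ppblast", "-i", blast_xml]
    if gap_open is not None:
        cmd += ["--open", str(gap_open)]
    if gap_ext is not None:
        cmd += ["--ext", str(gap_ext)]
    if fmt is not None:
        cmd += ["-f", fmt]
    if sampling is not None:
        cmd += ["--sampling", str(sampling)]
    if expected:
        cmd.append("--expected")
    if marg_decode:
        cmd.append("--marg_decode")
    cmd += ["-o", output]
    return cmd


def run_ppblast(cmd):
    """Run ppblast; raises if the executable is missing or exits non-zero."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ppblast was not found on the PATH")
    subprocess.run(cmd, check=True)


# Reproduces the HTML example from the text.
cmd = ppblast_command("nblast.xml", "ppblast.html", gap_open=2, gap_ext=1,
                      fmt="html", sampling=10, expected=True,
                      marg_decode=True)
print(" ".join(cmd))
```

Calling run_ppblast(cmd) then executes the command. With check=True a non-zero exit status raises subprocess.CalledProcessError, which may be a convenient place to retry with larger --open and --ext values if the default gap costs were rejected.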

    The ppalign library


    To use the library, we recommend looking at the examples in the source distribution and at the API documentation. If you are using the GNU Compiler Collection, you may link your own programs against the ppALIGN library as follows:

    g++ -L/path/to/libppalign -I/path/to/include/ppalign mysrc.cpp -lppalign 
      

    You may try one of the examples.