You can choose between a full alignment (SLOW), which uses a stringent algorithm to create the phylogenetic tree, and a fast algorithm to create the alignment (FAST).
You can let the program detect the type of sequence by leaving the default option (AUTOMATIC) or, for complex sequences, explicitly select the type as protein (PROTEIN) or DNA (DNA). A sequence is considered DNA when at least 85% of its letters are A, C, G, T or U.
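The 85% rule above can be sketched in a few lines of Python. This is only an illustration of the stated threshold, not the program's actual code; the function name `guess_sequence_type` and the decision to count only alphabetic characters are assumptions.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Return "DNA" if at least `threshold` of the letters are A, C, G, T or U."""
    # Assumption: only alphabetic characters count; gaps, digits and spaces are skipped.
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(ch in "ACGTU" for ch in letters)
    return "DNA" if nucleotide_count / len(letters) >= threshold else "PROTEIN"


print(guess_sequence_type("ACGTACGTAC"))   # all letters are nucleotides
print(guess_sequence_type("ACGTNNACGT"))   # only 80% are A, C, G, T or U
```

The second call shows why a DNA sequence with many ambiguity codes may need the type set by hand: it falls below the threshold and would be treated as protein.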
You can choose the format in which the sequence alignment is returned. The options are:
Clustal or ALN: This is a self-explanatory alignment. The alignment is written out in blocks. Identities are highlighted and (if you use a PAM 250 matrix) positions in the alignment where all of the residues are "similar" to each other (PAM 250 score of 8 or more) are indicated.
GCG or MSF: In version 7 of the Wisconsin GCG package, a new multiple sequence format was introduced. This is the MSF (Multiple Sequence Format) format. It can be used as input to the GCG sequence editor or any of the GCG programs that make use of multiple alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG PACKAGE OR LATER.
Phylip: This format can be used by the Phylip package of Joe Felsenstein (see the references/algorithms section for details of how to get it). Phylip allows you to do a huge range of phylogenetic analyses (we just offer one method in this program) and is probably the most widely used set of programs for drawing trees. It also works on just about every computer you can think of, providing you have a decent Pascal compiler.
PIR: This is the usual NBRF/PIR format with gaps indicated by hyphens ("-"). As we have stressed before, this format is EXACTLY compatible with the sequence input format. Therefore you can read in these alignments again for profile alignments or for calculating phylogenetic trees.
You can decide the order in which the sequences appear in the alignment. The options are:
ALIGNED: the order depends on the alignment scores, from the most distant sequences to the least distant.
INPUT: the sequences appear in the same order in which the user entered them.
FAST PAIRWISE ALIGNMENT
The K-Tuple Size can be 1 or 2 for proteins and 1 to 4 for DNA. Increase it to increase speed; decrease it to improve sensitivity.
The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.
The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If the sequences are of very different lengths, percentage scores make more sense.
The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity.
The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.
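To make the roles of the K-tuple size and the "best diagonals" more concrete, here is a minimal Python sketch that counts k-tuple matches on each diagonal of the imaginary dot-matrix plot and keeps the best ones. It is a simplified illustration, not the program's implementation: the function name `top_diagonals` and its parameters are hypothetical, and the window and gap penalty steps are left out.

```python
from collections import Counter


def top_diagonals(seq_a, seq_b, ktuple=2, top=3):
    """Count k-tuple matches per diagonal and return the `top` best diagonals."""
    # Index every k-tuple of seq_b by its start positions.
    positions = {}
    for j in range(len(seq_b) - ktuple + 1):
        positions.setdefault(seq_b[j:j + ktuple], []).append(j)

    # A match starting at (i, j) lies on diagonal i - j of the dot-matrix plot.
    hits = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], []):
            hits[i - j] += 1
    return hits.most_common(top)


print(top_diagonals("ACGTACGT", "ACGTACGT", ktuple=2, top=3))
print(top_diagonals("ACGTACGT", "ACGTACGT", ktuple=4, top=3))
```

With a larger `ktuple` there are fewer matches to process, which is faster but can miss weak similarities; keeping fewer diagonals has the same trade-off.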
For protein comparisons, a weight matrix is used to differentially weight different pairs of aligned amino acids. The default is the well-known Dayhoff PAM 250 matrix. We also offer a PAM 100 matrix and an identity matrix (all weights are the same for exact matches), or you can give the name of a file with your own matrix. You can also choose from these other series:
Henikoff BLOSUM. These appear to be the best matrices for similarity searches in databases (homologue searches).
GONNET. These matrices were derived using almost the same procedure as the Dayhoff matrices, but they are more up to date and are based on a larger data set, so they appear to be more sensitive.
IDENTITY MATRIX (ID). The score is 10 for two identical amino acids and 0 in all other cases.
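As an illustration of how a weight matrix is applied, the sketch below builds the identity matrix described above (10 for identical amino acids, 0 otherwise) and sums the weights over the aligned columns of two sequences. The names `identity_matrix` and `score_pair` are hypothetical, and skipping columns that contain a gap is an assumption made to keep the example short.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix(match=10, mismatch=0):
    """Build a weight matrix keyed by (residue, residue) pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}


def score_pair(row_a, row_b, matrix):
    """Sum the matrix weights over aligned columns, skipping columns with a gap."""
    return sum(matrix[(a, b)]
               for a, b in zip(row_a, row_b)
               if a != "-" and b != "-")


matrix = identity_matrix()
print(score_pair("MKV-L", "MRVAL", matrix))  # three identical columns
```

A PAM, BLOSUM or GONNET matrix would be used in the same way, only with different weights for each pair of residues.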
Reduce this penalty to encourage gaps of all sizes; increase it to discourage them. Terminal gaps are penalized the same as all others unless END GAPS is not selected. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
Reduce this penalty to encourage longer gaps; increase it to shorten them. Terminal gaps are penalized the same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
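The interplay of the two penalties can be illustrated with a small cost function. This is a sketch under a stated assumption: the first penalty is charged once per gap and the second once per gap position. The exact formula used by the program is not given here, and the function name and default values are hypothetical.

```python
def gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap of `length` positions under an assumed open + extend model."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


for length in (1, 5, 20):
    print(length, gap_cost(length), gap_cost(length, gap_open=2), gap_cost(length, gap_extend=0))
```

Lowering `gap_open` makes every gap cheaper regardless of its size, while lowering `gap_extend` mainly benefits long gaps, which matches the behaviour described above.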
You will need to provide an alignment to use this option. The alignment must be in one of the following formats:
NBRF / PIR
EMBL / SwissProt
GCG / MSF
The method used is the NJ (Neighbor-Joining) method of Saitou and Nei. First, the distances (percentage of divergence) between all pairs of sequences in the multiple alignment are calculated; then the NJ method is applied to the resulting distance matrix.
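The first step, computing the percentage of divergence between all pairs of aligned sequences, can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and ignoring columns where either sequence of the pair has a gap is an assumption. The Neighbor-Joining step itself is not shown.

```python
def percent_divergence(row_a, row_b):
    """Percentage of compared positions at which two aligned sequences differ."""
    compared = 0
    differences = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: positions with a gap in either sequence are skipped
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * differences / compared


def distance_matrix(alignment):
    """Distances for every unordered pair of names in a {name: aligned sequence} dict."""
    names = list(alignment)
    return {(x, y): percent_divergence(alignment[x], alignment[y])
            for i, x in enumerate(names) for y in names[i + 1:]}


alignment = {"a": "ACGT-A", "b": "ACGTTA", "c": "AGGTTT"}
for pair, distance in distance_matrix(alignment).items():
    print(pair, round(distance, 2))
```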
You can choose one of the following tree formats with this option:
You will need a program capable of showing the information, such as Tree-View, in order to see these trees.
As sequences diverge, substitutions accumulate. It becomes increasingly likely that more than one substitution (as a result of a mutation) will have happened at a site where you observe just one difference now. This option allows you to use formulae developed by Motoo Kimura to correct for this effect. It has the effect of stretching long branches in trees while leaving short ones relatively untouched. The desired effect is to try to make distances proportional to time since divergence.
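To show the stretching effect numerically, the sketch below uses one widely quoted form of Kimura's correction for protein sequences, K = -ln(1 - D - D²/5), where D is the observed fraction of differing sites. Which of Kimura's formulae the program applies to a given data set is not stated here, so treat the choice of formula, like the function name, as an assumption.

```python
import math


def kimura_protein_distance(observed):
    """Corrected distance from the observed fraction of differing sites (0 to 1)."""
    inner = 1.0 - observed - 0.2 * observed ** 2
    if inner <= 0:
        raise ValueError("sequences are too divergent for this correction")
    return -math.log(inner)


for observed in (0.05, 0.10, 0.30, 0.50):
    print(observed, round(kimura_protein_distance(observed), 4))
```

Small distances stay almost unchanged, while large ones grow markedly, which is the branch-stretching behaviour described above.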
This option allows you to ignore all alignment positions (columns) where there is a gap in any sequence. This guarantees that "like" is compared with "like" in all distances, i.e. the same positions are used to calculate all distances. It also means that the distances will be "metric". The disadvantage of using this option is that you throw away much of the data if there are many gaps. If the total number of gaps is small, it has little effect.
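Removing every column that contains a gap in any sequence can be sketched as follows. The function name `drop_gap_columns` is hypothetical, and the example assumes that all rows of the alignment have the same length.

```python
def drop_gap_columns(rows):
    """Keep only the alignment columns in which no sequence has a gap."""
    kept = [column for column in zip(*rows) if "-" not in column]
    if not kept:
        return ["" for _ in rows]
    # Transpose the surviving columns back into one string per sequence.
    return ["".join(chars) for chars in zip(*kept)]


print(drop_gap_columns(["AC-GT", "ACTG-", "A-TGT"]))
```

In this small example three of the five columns are discarded, which illustrates how much data can be lost when gaps are frequent.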
UPLOAD: you can upload a file from your computer containing the sequences you want to align. All the sequences must be in the same file, and in one of these formats: NBRF/PIR, EMBL/SwissProt or FASTA (Pearson and Lipman, 1988). The sequences can be entered in upper or lower case. The symbols recognized for proteins are A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, and for DNA/RNA, A, C, G, T and U. All other letters of the alphabet are treated as X for proteins, or as N for DNA/RNA. Other symbols (spaces, numbers, ...) are ignored, except the hyphen "-", which can be used to specify a gap. This can be especially useful for two reasons: 1) you can fix the position of some gaps before doing the alignment; 2) the resulting alignment can be written in NBRF format using hyphens for the gaps, so these alignments can be used as input to make phylogenetic trees.
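The symbol rules above can be summarized in a short Python sketch. It follows the description literally (unknown letters become X or N, hyphens are kept, everything else is dropped); the function name `normalize_sequence` is hypothetical and this is not the program's own code.

```python
import string

PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDE_SYMBOLS = set("ACGTU")


def normalize_sequence(raw, is_protein):
    """Apply the input rules: case-insensitive, unknown letters to X/N, '-' kept."""
    allowed = PROTEIN_SYMBOLS if is_protein else NUCLEOTIDE_SYMBOLS
    unknown = "X" if is_protein else "N"
    cleaned = []
    for ch in raw.upper():
        if ch == "-":
            cleaned.append(ch)                      # a hyphen marks a gap
        elif ch in string.ascii_uppercase:
            cleaned.append(ch if ch in allowed else unknown)
        # spaces, digits and any other symbols are ignored
    return "".join(cleaned)


print(normalize_sequence("acg-t 12 rn", is_protein=False))
print(normalize_sequence("mkb*z-l", is_protein=True))
```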
FASTA (PEARSON and LIPMAN, 1988) FORMAT: The sequences are delimited by an angle bracket ">" in column 1. The text immediately after the ">" is used as a title. Everything from the following line until the next ">" or the end of the file is one sequence.
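A minimal reader for the FASTA layout just described might look like this. It is a sketch, not the program's parser: the function name `parse_fasta` is hypothetical, and the example assumes that titles are unique (a repeated title would overwrite the earlier record).

```python
def parse_fasta(text):
    """Return {title: sequence} for FASTA text with '>' in column 1 of each title line."""
    records = {}
    title = None
    for line in text.splitlines():
        if line.startswith(">"):
            title = line[1:].strip()
            records[title] = []
        elif title is not None:
            # Every line up to the next '>' belongs to the current sequence.
            records[title].append(line.strip())
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 human\nACGT\nAC-T\n>seq2\nGGCC\n"
print(parse_fasta(example))
```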
NBRF/PIR FORMAT: This is similar to FASTA format, but immediately after the ">" you find the characters "P1;" if the sequences are protein or "DL;" if they are nucleic acid. Clustalv looks for the ";" character as the third character after the ">". If it finds one, it assumes that the format is NBRF; if not, FASTA format is assumed. The text after the ";" is treated as a sequence name, while the entire next line is treated as a title. The sequence is terminated by a star "*" and the next sequence can then begin (with a >P1; etc). This is just the basic format description (there are other variations and rules).
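The format check and the basic NBRF/PIR layout can be sketched as below. Only the basic rules described above are covered; the function names `detect_format` and `parse_pir` are hypothetical, and the other variations of the format are not handled.

```python
def detect_format(text):
    """Return "NBRF" if ';' is the third character after the leading '>', else "FASTA"."""
    first_line = text.lstrip().split("\n", 1)[0]
    if first_line.startswith(">") and len(first_line) > 3 and first_line[3] == ";":
        return "NBRF"
    return "FASTA"


def parse_pir(text):
    """Return {name: sequence}; the title line after each header is skipped."""
    records = {}
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            name = line[4:].strip()
            next(lines, "")                      # the entire next line is a title
            residues = []
            for seq_line in lines:
                chunk, star, _ = seq_line.partition("*")
                residues.append("".join(chunk.split()))
                if star:                         # '*' terminates the sequence
                    break
            records[name] = "".join(residues)
    return records


example = ">P1;HBA\nHemoglobin alpha\nMVLS\nPADK*\n>DL;GENE1\nSome gene\nACGT*\n"
print(detect_format(example), parse_pir(example))
```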
EMBL/SWISSPROT FORMAT: Do not try to create files with this format unless you have utilities to help. If you are just using an editor, use one of the above formats. If you do use this format, the program will ignore everything between the ID line (line beginning with the characters "ID") and the SQ line. The sequence is then read from between the SQ line and the "//" characters.
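The reading rule for this format (take the sequence between the SQ line and the "//" characters) can be sketched as follows. The function name `parse_embl_sequences` is hypothetical, and dropping everything except letters from the sequence lines (spaces and position counts) is an assumption made for the example.

```python
def parse_embl_sequences(text):
    """Return the sequences found between each SQ line and the following '//'."""
    sequences = []
    current = None                      # None while outside an SQ ... // block
    for line in text.splitlines():
        if line.startswith("SQ"):
            current = []
        elif line.startswith("//"):
            if current is not None:
                sequences.append("".join(current))
            current = None
        elif current is not None:
            # Assumption: keep letters only, dropping spaces and position counts.
            current.append("".join(ch for ch in line if ch.isalpha()))
    return sequences


example = (
    "ID   TEST1 standard; DNA; 8 BP.\n"
    "DE   example entry\n"
    "SQ   Sequence 8 BP;\n"
    "     acgt acgt        8\n"
    "//\n"
)
print(parse_embl_sequences(example))
```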