

Contents


(1) THE SEARCH INTERFACE

  a. GENERAL SEARCH

Search Options

Select organisms/genes
Filter for diversity values
Filter for degree of confidence on the polymorphic set
Advanced options

Results pages format
 

  b. COMPARATIVE SEARCH

Search Options

Select organisms
Select diversity parameters
Filter for degree of confidence on the polymorphic set
Advanced options

Results pages format

  c. GRAPHICAL SEARCH

Search Options

Select organisms/genes
Select distribution
Filter for degree of confidence on the polymorphic set
Advanced options

Results pages format

  d. SEARCH BY MAMPOL OR GENBANK ACCESSION NUMBERS

 

(2) THE ANALYSIS SECTION

(a) SEQUENCE COMPARISON:

CLUSTALW: Multiple Sequences Alignment
JALVIEW: Multiple Sequence Alignment Viewer and Editor

(b) NUCLEOTIDE DIVERSITY ANALYSIS:

SNPs - Graphic
PDA Server

 

(3) THE STATISTICS SECTION

 

(4) THE LINKS SECTION

 

(5) HOW TO DOWNLOAD THE DATABASE?

 

(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT

 

(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT

 

(8) MamPol DATA MODEL

 


(1) THE SEARCH INTERFACE

The MamPol Search Tool is the web interface that lets you retrieve both secondary information (diversity measures) and related primary information (sequences, references, genes and aberrations) from the MamPol database. The search options are explained below for the General, Comparative and Graphical searches, together with quick searches by MamPol or GenBank accession numbers.

 

a. GENERAL SEARCH

Search Options

Select organisms/genes

Here you can select from the lists the set of organisms and/or genes you want to include in your results. Click the "Sp" button to select species or the "Tax" button to select taxonomic groups; the latter list is expandable. Click the "List" button or the links to "Gene alias lists" to select genes. In any list, click "Add selected organisms/genes" at the end of the page. If you leave the boxes empty, all organisms/genes will be included.

You can choose to view nuclear genes, mitochondrial genes or both types.
 

Filter for diversity values

Here you can define threshold values for the different parameters of nucleotide diversity, linkage disequilibrium and codon bias. The parameters are distributed into four categories:

  • Nucleotide polymorphism

  • Synonymous and non-synonymous polymorphisms

  • Linkage disequilibrium

  • Codon bias

Each category can accept two different values related by a boolean operator (and, or, not). You can use both or only one, or leave them empty or 0 as defaults. Be aware that only D (linkage disequilibrium) and Tajima's D (polymorphism) can accept negative values.

Note that synonymous and non-synonymous polymorphisms and codon bias estimates are calculated on CDS (coding regions) only, and linkage disequilibrium only on exons and introns.

Please refer to http://pda.uab.es/pda/pda_help.asp#param for a more extensive explanation of the parameters and their references.
 

Filter for degree of confidence on the polymorphic set

We assess the confidence of each polymorphic set by taking into account the quality of each alignment and the source of the sequences.

  • Quality of the alignments: To assess the quality of an alignment we used three criteria: the number of sequences included in the alignment, the percentage of gaps or ambiguous bases within the alignment and the percentage of difference between the shortest and the longest sequences. For each criterion three qualitative categories were defined: low quality, medium quality and high quality:

Number of sequences

Low quality: 2-5 sequences

Medium quality: 6-10 sequences

High quality: more than 10 sequences


Percentage of gaps / ambiguous bases within the alignment

Low quality: ≥30% (high percentage)

Medium quality: ≥10% and <30% (medium percentage)

High quality: <10% (low percentage)


Percentage of difference in length between the shortest and the longest sequences

Low quality: ≥30% (high percentage)

Medium quality: ≥10% and <30% (medium percentage)

High quality: <10% (low percentage)

 

  • Data source confidence: The following four criteria were used to determine whether the study had a polymorphism goal:

  1. One or more sequences from the alignment can be found in the PopSet database in NCBI.

  2. All the sequences from the alignment have consecutive GenBank accession numbers (for example, AF254110-AF254111-AF254112-AF254113-AF254114-AF254115-AF254116-AF254117-AF254118-etc.)

  3. All the sequences from the alignment share one or more references (in which they were published)

  4. At least one of their references (shared or not) is from one of these journals, which typically publish polymorphism studies: Genetics, Mol. Biol. Evol., J. Mol. Evol. or Mol. Phylogenet. Evol.

Each criterion is assigned one of two values: true (meets the requirement) or false (does not meet the requirement).
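The alignment-quality thresholds listed in this section can be sketched in a few lines of Python. This is only an illustration of the published thresholds, not MamPol's own code; the function name and the dictionary keys are hypothetical.

```python
def classify_alignment_quality(n_sequences, pct_gaps, pct_length_diff):
    """Map the three alignment criteria to 'low', 'medium' or 'high' quality.

    Hypothetical helper: it only mirrors the thresholds given in this help page.
    """

    def by_count(n):
        # 2-5 sequences: low; 6-10: medium; more than 10: high
        if n > 10:
            return "high"
        if n >= 6:
            return "medium"
        return "low"

    def by_percentage(pct):
        # >=30%: low quality; >=10% and <30%: medium quality; <10%: high quality
        if pct >= 30:
            return "low"
        if pct >= 10:
            return "medium"
        return "high"

    return {
        "number_of_sequences": by_count(n_sequences),
        "gaps_or_ambiguous_bases": by_percentage(pct_gaps),
        "length_difference": by_percentage(pct_length_diff),
    }


# Example: 8 sequences, 4.2% gaps, 31% length difference
print(classify_alignment_quality(8, 4.2, 31.0))
```

Note that the three criteria are reported separately; this help page does not describe a way of combining them into a single score, so the sketch does not attempt one.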


Advanced options

In this part of the form you can define other advanced options for your search:

  • Regions to be included: please select one or more regions from which you want to retrieve the diversity values.

  • Order polymorphic sets: retrieve the results ordered by Organism, Gene or Setcode.

  • Polymorphic sets per page: number of polymorphic sets to be displayed on each results page.

 
 

Results pages format

The different polymorphic sets are displayed on a table showing the basic parameters of each analysis. From this table you can (see the figure below):

  1. Retrieve the complete results of a selected analysis. They will be displayed in a new page, from which you can also edit and/or reanalyze the specific alignment, as well as access all the primary related information of sequences (MamPol, GenBank and EMBL) and references (MamPol, PubMed and Medline).

  2. Retrieve the history of the corresponding polymorphic set: previous analyses on that polymorphic set that have now been updated.

  3. Reanalyze a polymorphic set using the PDA software. All the sequences belonging to the selected polymorphic sets will be included on the PDA form as a list of Accession numbers from the MamPol database. Then you can define your preferred parameters to reanalyze the sequences again.

  4. Get the Sequences in the FASTA format. All the sequences from the selected analysis units will be shown in the FASTA format.


The complete results of a selected analysis (1) contain information on the sequences used and the estimates, as well as the alignments in different formats:

 

 
 

b. COMPARATIVE SEARCH

Search Options

Select organisms

Here you can select from the lists the set of organisms you want to include in your results. Click the "Sp" button to select species or the "Tax" button to select taxonomic groups; the latter list is expandable. In any list, click "Add selected organisms" at the end of the page. You must select at least one organism or taxonomic group, for which all available estimates will be averaged.

You can choose to view nuclear genes, mitochondrial genes or both types.
 

Select the diversity parameters

Here you can select the diversity parameters you want to include into the comparison. The parameters are distributed into three categories:

  • Polymorphism

  • Synonymous and non-synonymous polymorphisms

  • Codon bias

Note that synonymous and non-synonymous polymorphisms and codon bias estimates are calculated on CDS (coding regions) only.
 

Filter for degree of confidence on the polymorphic set

Please refer to the same section in the General Search help.


Advanced options

In this part of the form you can define other advanced options for your search:

  • Regions to be included: please select one or more regions from which you want to retrieve the diversity values. Note that different regions will be analyzed separately.

 
 

Results pages format

The results are represented on a table as shown on the figure:


The table contains the number of polymorphic sets and the number of analysis units included in the estimates shown. Estimates in this table are averages computed in two steps: first within each polymorphic set (e.g. all exons of the same gene are averaged to obtain a single estimate for exons of that gene), and then across genes, so that the resulting per-gene estimates are averaged again to obtain a single final estimate (shown in the table above). Thus, every gene carries the same weight in the average shown. Each of these averages has a link to the Graphical Search to view the distribution of the values included in the computed average.

Tajima's D is a special case in this table. The numbers of Tajima's tests shown are those that gave significant values at the 95% confidence level (e.g. a row showing 3 - 2 means that 3 tests gave a significantly negative Tajima's D, and 2 gave a significantly positive Tajima's D).
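As an illustration of the two-step averaging described above, here is a minimal Python sketch. It assumes each analysis unit is given as a (polymorphic set identifier, estimate) pair for one region type; the function name, the identifiers and the values are hypothetical and are not taken from the database.

```python
from collections import defaultdict
from statistics import mean


def two_step_average(analysis_units):
    """Average within each polymorphic set first, then across sets.

    analysis_units: iterable of (set_id, estimate) pairs (hypothetical input format).
    """
    values_by_set = defaultdict(list)
    for set_id, value in analysis_units:
        values_by_set[set_id].append(value)

    # Step 1: one estimate per polymorphic set (i.e. per gene and species)
    per_set = [mean(values) for values in values_by_set.values()]

    # Step 2: average of the per-set estimates, so every gene has the same weight
    return {
        "estimate": mean(per_set),
        "polymorphic_sets": len(per_set),
        "analysis_units": sum(len(v) for v in values_by_set.values()),
    }


# Made-up values: two exons of one gene, one exon of another gene
units = [("SET000001", 0.010), ("SET000001", 0.030), ("SET000002", 0.040)]
summary = two_step_average(units)
print(summary)
```

With these made-up values the two-step estimate is close to 0.03, whereas a flat average over the three analysis units would be about 0.027; the difference is what "every gene carries the same weight" means in practice.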

 
 

c. GRAPHICAL SEARCH

Search Options

Select organisms/genes

Please refer to the same section in the General Search help.
 

Select a distribution

Select one parameter from one list. The distribution of this parameter will be displayed in the results.
 

Filter for degree of confidence on the polymorphic set

Please refer to the same section in the General Search help.
 

Advanced options

In this part of the form you can define other advanced options for your search:

  • Regions to be included: please select one or more regions from which you want to retrieve the diversity values.

  • Type of representation: Histogram (items ordered by value of the parameter) or Frequency (items ordered by frequency of the categories).

  • Number of categories: number of categories in which the values will be distributed in the graph.
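The two representation types and the number of categories can be pictured with a short Python sketch. It assumes equal-width categories between the smallest and largest value, which is an assumption made for illustration and not a documented detail of the MamPol graphs; the function names are hypothetical.

```python
def bin_values(values, n_categories):
    """Distribute values into n_categories equal-width classes (assumed binning)."""
    lowest, highest = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical
    width = (highest - lowest) / n_categories or 1.0
    counts = [0] * n_categories
    for value in values:
        # The maximum value is placed in the last class
        index = min(int((value - lowest) / width), n_categories - 1)
        counts[index] += 1
    return [
        ((lowest + i * width, lowest + (i + 1) * width), count)
        for i, count in enumerate(counts)
    ]


def frequency_order(bins):
    """Reorder classes from most to least frequent ('Frequency' view)."""
    return sorted(bins, key=lambda item: item[1], reverse=True)


histogram = bin_values([0.0, 1.0, 1.5, 4.0], n_categories=2)  # ordered by value
by_frequency = frequency_order(histogram)                     # ordered by count
print(histogram)
print(by_frequency)
```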

 
 

Results pages format

The results are represented on a histogram or frequency representation as shown on the figure. You can retrieve all analysis units for each class by clicking in the frequency range at the left, the histogram bar, or the count number at the right:

 

 

d. SEARCH BY MAMPOL OR GENBANK ACCESSION NUMBERS

At the top of the General Search (section "Search by Id"), enter any MamPol accession (e.g. SET000033 for polymorphic sets, MAMpol000025 for analysis units, MAMseq001739 for sequences) or GenBank accession (e.g. AF175215 for sequence accession numbers, AF175215.1 for sequence versions, 6002968 for sequence GIs) and click the 'Go' button. You will retrieve all related analysis units from MamPol.
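If you prepare identifiers programmatically before pasting them into the search box, a sketch like the following can tell the accepted kinds apart. The MamPol patterns follow the examples above (a prefix plus six digits); the GenBank patterns are simplified assumptions based only on the examples shown here and do not cover every accession format GenBank uses. The function name is hypothetical.

```python
import re

# (label, pattern) pairs, tried in order
ACCESSION_PATTERNS = [
    ("MamPol polymorphic set", re.compile(r"SET\d{6}")),
    ("MamPol analysis unit", re.compile(r"MAMpol\d{6}")),
    ("MamPol sequence", re.compile(r"MAMseq\d{6}")),
    # Assumed, simplified GenBank formats (1-2 letters followed by 5-6 digits)
    ("GenBank sequence version", re.compile(r"[A-Z]{1,2}\d{5,6}\.\d+")),
    ("GenBank accession number", re.compile(r"[A-Z]{1,2}\d{5,6}")),
    ("GenBank GI", re.compile(r"\d+")),
]


def classify_accession(text):
    """Return a label for the identifier, or 'unknown' if nothing matches."""
    text = text.strip()
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "unknown"


for example in ["SET000033", "MAMpol000025", "MAMseq001739",
                "AF175215", "AF175215.1", "6002968"]:
    print(example, "->", classify_accession(example))
```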

 


(2) ANALYSIS SECTION

This section provides a collection of programs for sequence analysis.

(a) SEQUENCE COMPARISON:

CLUSTALW: Multiple Sequences Alignment

ClustalW is a multiple sequence alignment program. The version available here uses default parameters optimized for the alignment of Drosophila polymorphic sequences (as manually checked). It aligns sequences while avoiding gaps as far as the chosen parameter values allow, and it can also construct phylogenetic trees. See the Clustal help for more information.


JALVIEW: Multiple Sequence Alignment Viewer and Editor

Jalview is a multiple sequence alignment viewer and editor. Alignments can be divided into subfamilies using a tree or by hand. Conservation can then be calculated using physico-chemical properties within subfamilies or across the whole alignment. Principal component analysis can also be used as an alternative way of clustering the sequences. An SRS server can be used to fetch and display the sequence features and any PDB structures listed. See the Jalview help for more information.

 

(b) NUCLEOTIDE DIVERSITY ANALYSIS:

SNPs - Graphic: Analysis of nucleotide diversity in Sliding Windows

This is a web module that estimates several measures of DNA sequence polymorphism. The analyses can be performed with the sliding-window method, and the results are plotted graphically. The input is a set of aligned DNA sequences in FASTA format. The output is a web page, kept on the server for 24 hours, where the results are displayed as text and graphs. See the SNPs-Graphic help for more information.
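To give an idea of what a sliding-window analysis computes, here is a minimal Python sketch that estimates nucleotide diversity (the average proportion of pairwise differences per site) window by window. It is not the SNPs-Graphic implementation: the window size, the step, and the choice to skip columns containing gaps or ambiguous bases are all assumptions made for this illustration, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Average proportion of pairwise differences over fully resolved columns."""
    # Keep only alignment columns where every sequence has A, C, G or T
    sites = [col for col in zip(*seqs) if all(base in "ACGT" for base in col)]
    if len(seqs) < 2 or not sites:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    differences = sum(1 for col in sites for i, j in pairs if col[i] != col[j])
    return differences / (len(pairs) * len(sites))


def sliding_windows(seqs, window=10, step=5):
    """Return (first position, last position, diversity) for each window (1-based)."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [seq[start:start + window] for seq in seqs]
        results.append((start + 1, start + len(chunk[0]), pairwise_diversity(chunk)))
    return results


# Toy alignment: two aligned sequences that differ at the last position
toy_alignment = ["AAAA", "AAAT"]
for first, last, pi in sliding_windows(toy_alignment, window=2, step=2):
    print(f"{first}-{last}: {pi:.2f}")
```

Smaller windows give finer resolution along the sequence but noisier estimates, which is why the window size and step are usually left as user-adjustable parameters.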


PDA Server: Pipeline Diversity Analysis

PDA, "Pipeline Diversity Analysis", is a collection of programs and modules mainly written in Perl that automatically can:

  1. search for polymorphic sequences in a large database, and

  2. estimate their genetic diversity.

PDA has a user-friendly, web-based interface where the user can select the sequences to be analyzed and the parameters to be used. Sequences can be retrieved from either GenBank or the MamPol database as a list of accession numbers or as a set of organisms and/or genes. Low-quality sequences from large-scale sequencing projects (i.e. working drafts), which contain most of the missing data, are excluded from the analysis. Alternatively, sequences can be entered manually in FASTA or GenBank format. All sequences are grouped by organism and gene, and the groups are aligned using the ClustalW algorithm. Afterwards, different analyses of polymorphism at synonymous and non-synonymous sites, linkage disequilibrium and codon bias are performed. See the PDA help for more information.

 


(3) STATISTICS

The MamPol Statistics Section shows the contents of the database and includes tabular and graphic information on the secondary and primary databases: the number of polymorphic sets and analysis units available, classified by functional region, species, gene, quality of alignment and confidence of data source; the total number of sequences and references; etc. All the information, tables and graphs are updated on a daily basis, after the database itself has been updated.

Separate pages show the statistics for nuclear genes, for mitochondrial genes and for both types together. The same division is made for the order Rodentia, the order Primates and the rest of the class Mammalia.

 


(4) LINKS

This section offers a selected collection of web addresses, especially related to the study of nucleotide polymorphism and bioinformatics. These are distributed in different categories:


 


(5) HOW TO DOWNLOAD THE DATABASE?

The database can be freely downloaded using our Download page. It contains a compressed gzip copy of each MySQL database (db_name.contents.gz).

Download the files and load them into a new database in your MySQL server, as follows:

  1. Create a new database:  mysqladmin create db_name -u root -p 

  2. Decompress and load the database:  gzip -d < /PATH/db_name.contents.gz | mysql db_name -u root -p

Note that you must do it from an account with privileges to create a new database in your MySQL server.

 


(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT

The results stored in the Mammalia Polymorphism Database are obtained by an automatic analysis process using PDA (Casillas & Barbadilla 2004) (http://pda.uab.es). We strongly recommend that users of this database take the following steps to assess the confidence in any analysis unit.
 

1. Review the parameters on the QUALITY OF THE ALIGNMENT (Figure 1a, Figure 2a):

To assess the quality of an alignment we used three criteria: the number of sequences included in the alignment, the percentage of gaps or ambiguous bases within the alignment and the percentage of difference between the shortest and the longest sequences. For each criterion three qualitative categories were defined: low quality, medium quality and high quality:

Number of sequences

Low quality: 2-5 sequences

Medium quality: 6-10 sequences

High quality: more than 10 sequences


Percentage of gaps / ambiguous bases within the alignment

Low quality: ≥30% (high percentage)

Medium quality: ≥10% and <30% (medium percentage)

High quality: <10% (low percentage)


Percentage of difference in length between the shortest and the longest sequences

Low quality: ≥30% (high percentage)

Medium quality: ≥10% and <30% (medium percentage)

High quality: <10% (low percentage)

 

2. Check the ALIGNMENT and the DND TREE FILE (Figure 2b):

The ALIGNMENT (generated with MUSCLE) is given in CLUSTAL, FASTA and JALVIEW formats. JALVIEW is recommended, because it allows you to view the alignment in color, edit it manually, export it in different formats, etc. However, if you just want to download the alignment to use it in another program, we recommend downloading the FASTA file.

You can open the DND Tree File as text, but if TREEVIEW is installed on your computer, you will be able to see it graphically.

 

3. Review the parameters on the QUALITY OF THE DATA SOURCE (Figure 1b):

The following four criteria were used to determine if the study had a polymorphism goal:

  1. One or more sequences from the alignment can be found in the PopSet database in NCBI.

  2. All the sequences from the alignment have consecutive GenBank accession numbers (for example, AF254110-AF254111-AF254112-AF254113-AF254114-AF254115-AF254116-AF254117-AF254118-etc.)

  3. All the sequences from the alignment share one or more references (in which they were published)

  4. At least one of their references (shared or not) is from one of these journals, which typically publish polymorphism studies: Genetics, Mol. Biol. Evol., J. Mol. Evol. or Mol. Phylogenet. Evol.

Each criterion is assigned one of two values: true (meets the requirement) or false (does not meet the requirement).
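Criterion 2 (consecutive GenBank accession numbers) is easy to check yourself for any list of accessions. The sketch below is one possible way to do it, assuming that "consecutive" means a shared letter prefix with numeric parts that form an unbroken run; how MamPol implements the check internally is not described here, and the function name is hypothetical.

```python
import re


def has_consecutive_accessions(accessions):
    """True if all accessions share a prefix and their numbers form an unbroken run."""
    parsed = []
    for accession in accessions:
        match = re.fullmatch(r"([A-Z]+)(\d+)", accession)
        if match is None:
            return False  # unexpected format: treat the criterion as not met
        parsed.append((match.group(1), int(match.group(2))))

    # All accessions must share the same letter prefix (e.g. "AF")
    if len({prefix for prefix, _ in parsed}) != 1:
        return False

    numbers = sorted(number for _, number in parsed)
    return all(later - earlier == 1 for earlier, later in zip(numbers, numbers[1:]))


print(has_consecutive_accessions(["AF254110", "AF254111", "AF254112"]))
print(has_consecutive_accessions(["AF254110", "AF254112"]))
```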

 

4. Review the ORIGIN OF THE SEQUENCES (Figure 2c):

In the main results page, three parameters are given when available in the GenBank annotations: the country, strain and population variant of each sequence. For a complete description of the sequences, you can follow the links to the MamPol, GenBank and EMBL databases.

 

5. Check the RESULTS OF THE ANALYSES (Figure 2d):

Check the results of polymorphism, linkage disequilibrium and codon bias, especially when they show extreme values. In those cases, the program may have grouped together sequences from different origins, or maybe the alignment is poor.

 

6. REANALYZE THE DATA if needed:

Two programs are available to reanalyze your data from the results:

  • PDA (Figure 2e): Any analysis unit can be interactively reanalyzed using PDA, where the user can freely set the sequences and parameter values. With this option, the set of sequences is taken as input in the PDA submission page. Any subset of sequences can then be included in or excluded from the analysis, and the default parameters can be modified.

  • SNPs-Graphic (Figure 2f): This is a web module that estimates several measures of DNA sequence polymorphism. The analyses can be performed with the sliding-window method, and the results are plotted graphically. The input is a set of aligned DNA sequences in FASTA format. The output is a web page, kept on the server for 24 hours, where the results are displayed as text and graphs. See the SNPs-Graphic help for more information.

 

Figure 1  

Figure 2  

 

 


(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT

After the grouping and alignment of sequences, a further step is taken before estimating the polymorphism parameters. It is referred to here as AMNIS (Algorithm for the Maximization of the Number of Informative Sites):

  1. First, sequences are grouped again by their length, so that sequences in the same group do not differ by more than 20% of their length.

  2. The number of informative sites in each cumulative group of sequences is calculated (e.g. group 1 (the longest sequences), groups 1 + 2, groups 1 + 2 + 3, etc.).

  3. Finally, the program will use the set of sequences with the greatest number of informative sites (in some cases discarding the shortest sequences).

Example:

>LDseq000001
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT   50
>LDseq000002
AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT   50
>LDseq000003
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGG----   46
>LDseq000004
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGC--------   42
>LDseq000005
AGCATCG-------------------------------------------    7
>LDseq000006
AGCATCG-------------------------------------------    7
>LDseq000007
AGCATCG-------------------------------------------    7
>LDseq000008
AGCATCG-------------------------------------------    7

In this example, the first four sequences would be assigned to group 1, and the last four sequences to group 2. The number of informative sites (without gaps) using the first four sequences (group 1) is:

Informative sites group 1 = 42 non-gapped positions * 4 sequences = 168

Using the cumulative set of sequences of groups 1 + 2, we have more sequences, but fewer non-gapped positions:

 Informative sites group 1+2 = 7 non-gapped positions * 8 sequences = 56

Therefore, we will have more informative sites by using the four long sequences only and discarding the short ones, rather than using the complete set of eight sequences. MamPol would show the alignment with all the sequences, but would use the four long sequences only to calculate the polymorphism estimates (n = 4 in the results).

To distinguish the sequences that were used in the analyses from those that were discarded, MamPol uses a color code:

   one color for sequences that were included in the estimates, and
   another color for sequences that were NOT included in the estimates

You can find this information in the REPORT for each analysis unit.
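The three AMNIS steps and the worked example above can be reproduced with a short Python sketch. It is an illustration written from the description in this section, not the code used by MamPol or PDA. In particular, the grouping rule (a sequence joins the current group if it is at most 20% shorter than the longest sequence of that group) is an assumption about how "do not differ by more than 20%" is applied, and all function names are hypothetical.

```python
def ungapped_length(seq):
    """Number of non-gap characters in an aligned sequence."""
    return len(seq) - seq.count("-")


def group_by_length(aligned, max_diff=0.20):
    """Step 1: group sequence names by length, longest group first (assumed rule)."""
    names = sorted(aligned, key=lambda n: ungapped_length(aligned[n]), reverse=True)
    groups = []
    for name in names:
        length = ungapped_length(aligned[name])
        if groups:
            longest = ungapped_length(aligned[groups[-1][0]])
            if longest - length <= max_diff * longest:
                groups[-1].append(name)
                continue
        groups.append([name])
    return groups


def informative_sites(aligned, names):
    """Step 2: non-gapped alignment columns multiplied by the number of sequences."""
    columns = zip(*(aligned[name] for name in names))
    non_gapped = sum(1 for column in columns if "-" not in column)
    return non_gapped * len(names)


def amnis(aligned, max_diff=0.20):
    """Step 3: keep the cumulative set with the most informative sites."""
    best_names, best_score = [], -1
    cumulative = []
    for group in group_by_length(aligned, max_diff):
        cumulative = cumulative + group
        score = informative_sites(aligned, cumulative)
        if score > best_score:
            best_names, best_score = list(cumulative), score
    return best_names, best_score


# The example alignment from this section (50 columns)
base = "AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT"
example = {
    "LDseq000001": base,
    "LDseq000002": "AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT",
    "LDseq000003": base[:46] + "-" * 4,
    "LDseq000004": base[:42] + "-" * 8,
    "LDseq000005": base[:7] + "-" * 43,
    "LDseq000006": base[:7] + "-" * 43,
    "LDseq000007": base[:7] + "-" * 43,
    "LDseq000008": base[:7] + "-" * 43,
}

selected, score = amnis(example)
print(selected, score)
```

Run on the example alignment, the sketch keeps the four long sequences with 168 informative sites, since adding the four short sequences would reduce the count to 56.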

 


(8) MamPol DATA MODEL

No standard data model exists for the storage and representation of haplotypic data with associated diversity estimates, so we have defined a new data model for the secondary database. It is based on two basic units: the POLYMORPHIC SETS (each group of sequences belonging to the same gene and species) and the ANALYSIS UNITS (or ALIGNMENTS) (different subgroups of the corresponding polymorphic set, defined according to the functional region (gene, CDSs, exons, etc.) and the percentage of homology between sequence pairs). All subsequent diversity data are estimated and annotated in different joined tables in a MySQL database, related by index tables. Storing the diversity estimates in a database makes them permanently available and allows the re-analysis of all or part of the sequences.

The database content is updated daily, and records are assigned unique and permanent MamPol identification numbers to facilitate cross-database referencing. Each new item is assigned a unique, incrementing MamPol identifier: a six-digit number preceded by the string SET for polymorphic sets, by MAMpol for analysis units, by MAMseq for individual sequences, and by MAMref for references.

The database can be freely downloaded via our Download page (see the corresponding section in this help) as a compressed gzip file, with the following structure of related tables:


Figure legend:

  • Blue tables: primary database (information is retrieved from external databases: GenBank, NCBI PopSet)

  • Green tables: secondary database (information is newly generated with PDA)

  • Purple table = polsets: main table with the first basic storing unit: POLYMORPHIC SET (for a given gene and species)

  • Red table = index_analysis: main table with the second basic storing unit: ANALYSIS UNIT (or ALIGNMENT) (for a given polymorphic set)

The database contains two copies of each green table. The second copy is labeled with _old and contains the older results of analyses that have since been rerun (the newest results are stored in the first copy of the table). This makes it possible to trace the history of a polymorphic set or analysis unit, including all the previous results and the date on which each was obtained.

 

 




DGM UAB