Contents
(1) THE SEARCH INTERFACE
- GENERAL SEARCH
  Search Options
    Select organisms/genes
    Filter for diversity values
    Filter for degree of confidence on the polymorphic set
    Advanced options
  Results pages format
- COMPARATIVE SEARCH
  Search Options
    Select organisms
    Select diversity parameters
    Filter for degree of confidence on the polymorphic set
    Advanced options
  Results pages format
- GRAPHICAL SEARCH
  Search Options
    Select organisms/genes
    Select distribution
    Filter for degree of confidence on the polymorphic set
    Advanced options
  Results pages format
- SEARCH BY MAMPOL OR GENBANK ACCESSION NUMBERS
(2) THE ANALYSIS SECTION
(a) SEQUENCE COMPARISON:
  CLUSTALW: Multiple Sequences Alignment
  JALVIEW: Multiple Sequence Alignment Viewer and Editor
(b) NUCLEOTIDE DIVERSITY ANALYSIS:
  SNPs - Graphic
  PDA Server
(3) THE STATISTICS SECTION
(4) THE LINKS SECTION
(5) HOW TO DOWNLOAD THE DATABASE?
(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT
(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT
(8) MamPol DATA MODEL
(1) THE SEARCH INTERFACE
The MamPol Search Tool is the web interface that allows you to retrieve both secondary information (diversity measures) and related primary information (sequences, references, genes and aberrations) from the MamPol database. The search options are explained below for General, Comparative and Graphical searches, as well as for quick searches by MamPol or GenBank accession numbers.
a. GENERAL SEARCH
Search Options
Select organisms/genes
Here you can select from the lists the set of organisms and/or genes you want to include in your results. Click the "Sp" button to select species or the "Tax" button to select taxonomic groups; the latter list is expandable. Click the "List" button or the links to "Gene alias lists" to select genes. In any list, click on "Add selected organisms/genes" at the end of the page. If you leave the boxes empty, all organisms/genes will be included.
You can choose to view nuclear genes, mitochondrial genes or both.
Filter for diversity values
Here you can define threshold values for the different parameters of nucleotide diversity, linkage disequilibrium and codon bias. The parameters are distributed into four categories:
Each category can accept two different values related by a boolean operator (and, or, not). You can use both values or only one, or leave them empty or at 0, which are the defaults. Be aware that only D (linkage disequilibrium) and Tajima's D (polymorphism) can accept negative values.
Note that synonymous and non-synonymous polymorphisms and codon bias estimates are calculated on CDS (coding regions) only, and linkage disequilibrium only on exons and introns.
Please refer to http://pda.uab.es/pda/pda_help.asp#param for a more extensive explanation of the parameters and their references.
Filter for degree of confidence on the polymorphic set
We assess the confidence of each polymorphic set by taking into account the quality of each alignment and the source of the sequences.
- Quality of the alignments: To assess the quality of an alignment we used three criteria: the number of sequences included in the alignment, the percentage of gaps or ambiguous bases within the alignment, and the percentage of difference in length between the shortest and the longest sequences. For each criterion three qualitative categories were defined: low quality, medium quality and high quality:
Criterion | Low quality | Medium quality | High quality |
Number of sequences | Low (2-5) | Medium (6-10) | High (>10) |
Percentage of gaps / ambiguous bases within the alignment | High (≥30%) | Medium (≥10%-<30%) | Low (<10%) |
Percentage of difference in length between the shortest and the longest sequences | High (≥30%) | Medium (≥10%-<30%) | Low (<10%) |
- Quality of the data source: The following four criteria were used to determine if the study had a polymorphism goal:
  - One or more sequences from the alignment can be found in the PopSet database in NCBI.
  - All the sequences from the alignment have consecutive GenBank accession numbers (for example, AF254110-AF254111-AF254112-AF254113-AF254114-AF254115-AF254116-AF254117-AF254118-etc.).
  - All the sequences from the alignment share one or more references (in which they were published).
  - At least one of their references (shared or not) is from one of these journals, which typically publish polymorphism studies: Genetics, Mol. Biol. Evol., J. Mol. Evol. or Mol. Phylogenet. Evol.
Each criterion is assigned one of two values: true (meets the requirement) or false (does not meet the requirement).
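As a rough illustration, the thresholds and criteria described in this section can be written down in a few lines of Python. This is only a sketch, not the code used by MamPol or PDA: all function names, argument names and dictionary keys are hypothetical, and the accession check assumes identifiers made of a letter prefix followed by digits.

```python
import re

# Journals named in the text as typically publishing polymorphism studies.
POLYMORPHISM_JOURNALS = {
    "Genetics", "Mol. Biol. Evol.", "J. Mol. Evol.", "Mol. Phylogenet. Evol.",
}

def quality_by_sequence_count(n_sequences):
    """2-5 sequences: low; 6-10: medium; more than 10: high."""
    if n_sequences > 10:
        return "high"
    if n_sequences >= 6:
        return "medium"
    return "low"

def quality_by_percentage(percent):
    """Applies to both the gap/ambiguity and the length-difference criteria."""
    if percent >= 30:
        return "low"
    if percent >= 10:
        return "medium"
    return "high"

def consecutive_accessions(accessions):
    """True if all accessions share one prefix and their numbers form a run."""
    parsed = [re.fullmatch(r"([A-Z]+)(\d+)", acc) for acc in accessions]
    if not parsed or not all(parsed):
        return False
    prefixes = {m.group(1) for m in parsed}
    numbers = sorted(int(m.group(2)) for m in parsed)
    return len(prefixes) == 1 and all(
        later - earlier == 1 for earlier, later in zip(numbers, numbers[1:])
    )

def data_source_criteria(in_popset, accessions, references_per_sequence, journals):
    """Return the four true/false data-source criteria as a dictionary."""
    if references_per_sequence:
        shared = set.intersection(*(set(refs) for refs in references_per_sequence))
    else:
        shared = set()
    return {
        "popset": bool(in_popset),
        "consecutive_accessions": consecutive_accessions(accessions),
        "shared_reference": bool(shared),
        "polymorphism_journal": any(j in POLYMORPHISM_JOURNALS for j in journals),
    }

# Hypothetical alignment of three sequences sharing one reference.
flags = data_source_criteria(
    in_popset=True,
    accessions=["AF254110", "AF254111", "AF254112"],
    references_per_sequence=[["ref1"], ["ref1", "ref2"], ["ref1"]],
    journals=["Genetics"],
)
print(quality_by_sequence_count(3), quality_by_percentage(12.5), flags)
```

Keeping the three alignment criteria and the four data-source criteria as separate values mirrors the way they are reported individually rather than as a single score.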
Advanced options
In this part of the form you can define other advanced options for your search:
- Regions to be included: please select one or more regions from which you want to retrieve the diversity values.
- Order polymorphic sets: retrieve the results ordered by Organism, Gene or Setcode.
- Polymorphic sets per page: number of polymorphic sets to be displayed on each results page.

Results pages format
The different polymorphic sets are displayed in a table showing the basic parameters of each analysis. From this table you can (see the figure below):
- Retrieve the complete results of a selected analysis. They will be displayed on a new page, from which you can also edit and/or reanalyze the specific alignment, as well as access all the related primary information on sequences (MamPol, GenBank and EMBL) and references (MamPol, PubMed and Medline).
- Retrieve the history of the corresponding polymorphic set: previous analyses of that polymorphic set that have since been updated.
- Reanalyze a polymorphic set using the PDA software. All the sequences belonging to the selected polymorphic sets will be included on the PDA form as a list of accession numbers from the MamPol database. You can then define your preferred parameters to reanalyze the sequences.
- Get the sequences in FASTA format. All the sequences from the selected analysis units will be shown in FASTA format.

The complete results of a selected analysis (1) contain information on the sequences used and the estimates, as well as the alignments in different formats:


b. COMPARATIVE SEARCH
Search Options
Select organisms
Here you can select from the lists the set of organisms you want to include in your results. Click the "Sp" button to select species or the "Tax" button to select taxonomic groups; the latter list is expandable. In any list, click on "Add selected organisms" at the end of the page. You must select at least one organism or taxonomic group for which you want all available estimates to be averaged.
You can choose to view nuclear genes, mitochondrial genes or both.
Select the diversity parameters
Here you can select the diversity parameters you want to include in the comparison. The parameters are distributed into three categories:
Note that synonymous and non-synonymous polymorphisms and codon bias estimates are calculated on CDS (coding regions) only.
Filter for degree of confidence on the polymorphic set
Please refer to the same section in the General Search help.
Advanced options
In this part of the form you can define other advanced options for your search:

Results pages format
The results are presented in a table, as shown in the figure:

The table contains the number of polymorphic sets and the number of analysis units included in the estimates shown. Estimates in this table are averages computed in two steps: first within each polymorphic set (e.g. all exons of the same gene are averaged to obtain a single estimate for exons of that gene), and then across genes, where the resulting per-gene estimates are averaged again to obtain a single final estimate (shown in the table above). Thus, every gene carries the same weight in the average shown. Each of these averages has a link to the Graphical Search to view the distribution of the values included in the computed average.
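The effect of this two-step averaging can be seen in a small Python sketch. The function name, the gene names and the values are hypothetical and serve only to show why each gene ends up with the same weight, regardless of how many analysis units it contributes.

```python
from collections import defaultdict
from statistics import mean

def two_step_average(estimates):
    """Average (gene, value) pairs per gene first, then across genes."""
    by_gene = defaultdict(list)
    for gene, value in estimates:
        by_gene[gene].append(value)
    per_gene = [mean(values) for values in by_gene.values()]
    return mean(per_gene)

# Hypothetical exon estimates: gene A has two exons, gene B has one.
exon_estimates = [("geneA", 0.002), ("geneA", 0.004), ("geneB", 0.010)]

final = two_step_average(exon_estimates)              # (0.003 + 0.010) / 2
pooled = mean(value for _, value in exon_estimates)   # all three values pooled
print(round(final, 6), round(pooled, 6))
```

With simple pooling, gene A would count twice as much as gene B; with the two-step average both genes contribute equally to the final estimate.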
Tajima's D is a special case in this table. The numbers of Tajima's tests shown in the table are those that gave significant values at the 95% confidence level (e.g. a row showing 3 - 2 means that 3 tests gave a significantly negative Tajima's D, and 2 gave a significantly positive Tajima's D).

c. GRAPHICAL SEARCH
Search Options
Select organisms/genes
Please refer to the same section in the General Search help.
Select a distribution
Select one parameter from one list. The distribution of this parameter will be displayed in the results.
Filter for degree of confidence on the polymorphic set
Please refer to the same section in the General Search help.
Advanced options
In this part of the form you can define other advanced options for your search:
- Regions to be included: please select one or more regions from which you want to retrieve the diversity values.
- Type of representation: Histogram (items ordered by value of the parameter) or Frequency (items ordered by frequency of the categories).
- Number of categories: number of categories into which the values will be distributed in the graph.
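To make the "Number of categories" option concrete, the following Python sketch distributes a list of values into a chosen number of categories. It assumes equal-width categories spanning the observed range, which is only one plausible choice; the binning actually used by the Graphical Search is not described here, and the function name and values are hypothetical.

```python
def bin_values(values, n_categories):
    """Count values in n equal-width categories between min and max."""
    lowest, highest = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (highest - lowest) / n_categories or 1.0
    counts = [0] * n_categories
    for value in values:
        # The maximum value is placed in the last category.
        index = min(int((value - lowest) / width), n_categories - 1)
        counts[index] += 1
    return [
        (lowest + i * width, lowest + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]

# Hypothetical parameter values split into 5 categories.
for start, end, count in bin_values([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], 5):
    print(f"{start:g}-{end:g}: {count}")
```

Listing the categories in this order corresponds to the Histogram representation; sorting the same tuples by their count would correspond to the Frequency representation.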

Results pages format
The results are displayed as a histogram or frequency representation, as shown in the figure. You can retrieve all analysis units for each class by clicking on the frequency range at the left, the histogram bar, or the count at the right:


d. SEARCH BY MAMPOL OR GENBANK ACCESSION NUMBERS
At the top of the General Search (section "Search by Id"), enter any MamPol accession (e.g. SET000033 for polymorphic sets, MAMpol000025 for analysis units, MAMseq001739 for sequences) or GenBank accession (e.g. AF175215 for sequence accession numbers, AF175215.1 for sequence versions, 6002968 for sequence GIs) and click the 'Go' button. You will retrieve all related analysis units from MamPol.
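The identifier formats listed above are regular enough to be recognized with simple patterns. The Python sketch below is an illustration only: the function name and labels are hypothetical, the MamPol patterns follow the examples given (a prefix plus six digits), and the GenBank pattern is simplified to the classic nucleotide formats of one letter plus five digits or two letters plus six digits.

```python
import re

# Ordered so that the more specific patterns are tried first.
ID_PATTERNS = [
    ("MamPol polymorphic set", re.compile(r"SET\d{6}")),
    ("MamPol analysis unit", re.compile(r"MAMpol\d{6}")),
    ("MamPol sequence", re.compile(r"MAMseq\d{6}")),
    ("GenBank sequence version", re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\.\d+")),
    ("GenBank accession number", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("GenBank GI", re.compile(r"\d+")),
]

def classify_identifier(text):
    """Return a label for the identifier type, or None if unrecognized."""
    text = text.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return None

for example in ["SET000033", "MAMpol000025", "MAMseq001739",
                "AF175215", "AF175215.1", "6002968"]:
    print(example, "->", classify_identifier(example))
```

A check like this can be useful before submitting a long list of identifiers, to catch entries that match none of the accepted forms.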

(2) ANALYSIS SECTION
This section provides a collection of programs for sequence analysis.
(a) SEQUENCE COMPARISON:
CLUSTALW: Multiple Sequences Alignment
ClustalW is a multiple sequence alignment program. It is available here with default parameters optimized for the alignment of Drosophila polymorphic sequences (as manually checked). It aligns different sequences while avoiding gaps as much as possible, depending on the parameter values chosen. It can also construct phylogenetic trees. See the Clustal help for more information.

JALVIEW: Multiple Sequence Alignment Viewer and Editor
Jalview is a multiple sequence alignment viewer and editor. Alignments can be divided into subfamilies using a tree or by hand. Conservation can then be calculated using physico-chemical properties within subfamilies or across the whole alignment. Principal component analysis can also be used as an alternative way of clustering the sequences. An SRS server can be used to fetch and display the sequence features and any PDB structures listed. See the Jalview help for more information.

(b) NUCLEOTIDE DIVERSITY ANALYSIS:
SNPs - Graphic: Analysis of nucleotide diversity in Sliding Windows
This is a web module that estimates several measures of DNA sequence polymorphism and can perform these analyses by the sliding-window method, producing graphic representations. Aligned DNA sequences are supplied as input in FASTA format. The output is a web page, kept on the server for 24 hours, where results are displayed as text and graphs. See the SNPs-Graphic help for more information.

PDA Server: Pipeline Diversity Analysis
PDA, "Pipeline Diversity Analysis", is a collection of programs and modules, mainly written in Perl, that can automatically:
- search for polymorphic sequences in a large database, and
- estimate their genetic diversity.
PDA has a user-friendly, web-based interface where the user can select the sequences to be analyzed and the parameters to be used. Sequences can be retrieved from either GenBank or the MamPol database as a list of accession numbers or a set of organisms and/or genes. Low-quality sequences coming from large-scale sequencing projects (i.e. working drafts), which contain most of the missing data, will be excluded from the analysis. Alternatively, sequences can be entered manually in FASTA or GenBank format. All sequences will be grouped by organism and gene, and the groups will be aligned using the ClustalW algorithm. Afterwards, different analyses of polymorphism in synonymous and non-synonymous sites, linkage disequilibrium and codon bias will be performed.
See the PDA help for more information.

(3) STATISTICS
The MamPol Statistics Section shows the contents of the database and includes tabular and graphic information on the secondary and primary databases: number of polymorphic sets and analysis units available, classified by functional regions, species, genes, quality of alignments and confidence of data source, as well as the total number of sequences and references, etc. All the information, tables and graphs are updated on a daily basis, after the database itself has been updated.
There are separate pages showing the statistics for nuclear genes, mitochondrial genes and both types of genes. The same division is made for the order Rodentia, the order Primates and the rest of the class Mammalia.

(4) LINKS
This section offers a selected collection of web addresses, especially those related to the study of nucleotide polymorphism and bioinformatics. These are distributed into different categories:

(5) HOW TO DOWNLOAD THE DATABASE?
The database can be freely downloaded from our Download page. It contains a gzip-compressed copy of each MySQL database (db_name.contents.gz). Download the files and load them into a new database on your MySQL server, as follows:
- Create a new database:
  mysqladmin create db_name -u root -p
- Decompress and load the database:
  gzip -d < /PATH/db_name.contents.gz | mysql db_name -u root -p
Note that you must do this from an account with privileges to create a new database on your MySQL server.

(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT
The results stored in the Mammalia Polymorphism Database are obtained by an automatic analysis process using PDA (Casillas & Barbadilla 2004) (http://pda.uab.es). We strongly recommend that users of this database follow the steps below in order to assess the confidence on any analysis unit.
1. Review the parameters about the QUALITY OF THE ALIGNMENT (Figure 1a, Figure 2a):
To assess the quality of an alignment we used three criteria: the number of sequences included in the alignment, the percentage of gaps or ambiguous bases within the alignment, and the percentage of difference in length between the shortest and the longest sequences. For each criterion three qualitative categories were defined: low quality, medium quality and high quality:
Criterion | Low quality | Medium quality | High quality |
Number of sequences | Low (2-5) | Medium (6-10) | High (>10) |
Percentage of gaps / ambiguous bases within the alignment | High (≥30%) | Medium (≥10%-<30%) | Low (<10%) |
Percentage of difference in length between the shortest and the longest sequences | High (≥30%) | Medium (≥10%-<30%) | Low (<10%) |
2. Check the ALIGNMENT and the DND TREE FILE (Figure 2b):
The ALIGNMENT (generated with MUSCLE) is given in CLUSTAL, FASTA and JALVIEW formats. JALVIEW is recommended because it allows you to view the alignment in color, edit it manually, export it in different formats, etc. However, if you just want to download the alignment to use it in another program, we recommend downloading the FASTA file.
You can open the DND Tree File as text, but if TREEVIEW is installed on your computer you will be able to view it graphically.
3. Review the parameters about the QUALITY OF THE DATA SOURCE (Figure 1b):
The following four criteria were used to determine if the study had a polymorphism goal:
- One or more sequences from the alignment can be found in the PopSet database in NCBI.
- All the sequences from the alignment have consecutive GenBank accession numbers (for example, AF254110-AF254111-AF254112-AF254113-AF254114-AF254115-AF254116-AF254117-AF254118-etc.).
- All the sequences from the alignment share one or more references (in which they were published).
- At least one of their references (shared or not) is from one of these journals, which typically publish polymorphism studies: Genetics, Mol. Biol. Evol., J. Mol. Evol. or Mol. Phylogenet. Evol.
Each criterion is assigned one of two values: true (meets the requirement) or false (does not meet the requirement).
4. Review the ORIGIN OF THE SEQUENCES (Figure 2c):
In the main results page, three parameters are given when available in the GenBank annotations: the country, strain and population variant of each sequence. For a complete description of the sequences, you can follow the links to the MamPol, GenBank and EMBL databases.
5. Check the RESULTS OF THE ANALYSES (Figure 2d):
Check the results of polymorphism, linkage disequilibrium and codon bias, especially when they show extreme values. In those cases, the program may have grouped together sequences from different origins, or the alignment may be poor.
6. REANALYZE THE DATA if needed:
Two programs are available to reanalyze your data from the results:
- PDA (Figure 2e):
Any analysis unit can be interactively reanalyzed using PDA, where the user can freely set the sequences and parameter values. With this option, the set of sequences is taken as input in the PDA submission page. Any subset of sequences can then be included in or excluded from the analysis, and the default parameters can be modified.
- SNPs-Graphic (Figure 2f):
This is a web module that estimates several measures of DNA sequence polymorphism and can perform these analyses by the sliding-window method, producing graphic representations. Aligned DNA sequences are supplied as input in FASTA format. The output is a web page, kept on the server for 24 hours, where results are displayed as text and graphs. See the SNPs-Graphic help for more information.
Figure 1

Figure 2


(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT
After the grouping and alignment of sequences, a further step is taken before estimating the polymorphism parameters. It is referred to here as AMNIS (Algorithm for the Maximization of the Number of Informative Sites):
- First, sequences are grouped again by their length, so that sequences in the same group do not differ by more than 20% of their length.
- The number of informative sites in each cumulative group of sequences is calculated (e.g. group 1 (the longest sequences), groups 1 + 2, groups 1 + 2 + 3, etc.).
- Finally, the program uses the set of sequences with the greatest number of informative sites (in some cases discarding the shortest sequences).
Example:
>LDseq000001
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT 50
>LDseq000002
AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT 50
>LDseq000003
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGG---- 46
>LDseq000004
AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGC-------- 42
>LDseq000005
AGCATCG------------------------------------------- 7
>LDseq000006
AGCATCG------------------------------------------- 7
>LDseq000007
AGCATCG------------------------------------------- 7
>LDseq000008
AGCATCG------------------------------------------- 7
In this example, the first four sequences would be assigned to group 1, and the last four sequences to group 2. The number of informative sites (without gaps) using the first four sequences (group 1) is:
Informative sites group 1 = 42 non-gapped positions * 4 sequences = 168
Using the cumulative set of sequences of groups 1 + 2, we have more sequences, but fewer non-gapped positions:
Informative sites group 1+2 = 7 non-gapped positions * 8 sequences = 56
Therefore, we obtain more informative sites by using only the four long sequences and discarding the short ones than by using the complete set of eight sequences. MamPol would show the alignment with all the sequences, but would use only the four long sequences to calculate the polymorphism estimates (n = 4 in the results).
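The three AMNIS steps and the worked example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the PDA or MamPol implementation: the function names are hypothetical, sequence length is taken as the number of non-gap characters, and the 20% rule is read as "each sequence is at least 80% as long as the longest sequence of its group", which is only one possible interpretation.

```python
def ungapped_length(seq):
    """Number of non-gap characters in an aligned sequence."""
    return sum(1 for base in seq if base != "-")

def group_by_length(aligned, max_diff=0.20):
    """Group sequences, longest first, by similar length (assumed reading)."""
    groups = []
    for seq in sorted(aligned, key=ungapped_length, reverse=True):
        longest_in_group = ungapped_length(groups[-1][0]) if groups else None
        if groups and ungapped_length(seq) >= (1 - max_diff) * longest_in_group:
            groups[-1].append(seq)
        else:
            groups.append([seq])
    return groups

def informative_sites(seqs):
    """Columns without any gap, multiplied by the number of sequences."""
    ungapped_columns = sum(1 for column in zip(*seqs) if "-" not in column)
    return ungapped_columns * len(seqs)

def amnis(aligned):
    """Return the cumulative set of groups with the most informative sites."""
    best, best_score, cumulative = [], -1, []
    for group in group_by_length(aligned):
        cumulative = cumulative + group
        score = informative_sites(cumulative)
        if score > best_score:
            best, best_score = list(cumulative), score
    return best, best_score

# The eight aligned sequences of the example (LDseq000001 to LDseq000008).
alignment = [
    "AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGGTTTT",
    "AGCATCGATCATCGTGTACGTACGTACGATCAGCCGATGCGCGGGGTTTT",
    "AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGCGGGG----",
    "AGCATCGATCATCATCTACGTACGTACGATCAGCCGATGCGC--------",
] + ["AGCATCG" + "-" * 43] * 4

selected, score = amnis(alignment)
print(len(selected), score)  # 4 sequences kept, 168 informative sites
```

Scoring cumulative sets rather than individual groups means the longest sequences are always retained, and shorter groups are added only while they increase the product of non-gapped positions and number of sequences.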
To distinguish the sequences that were used in the analyses from those that were discarded, MamPol uses a color code:
● for sequences that were included in the estimates, and
● for sequences that were NOT included in the estimates
You can find this information in the REPORT for each analysis unit.

(8) MamPol DATA MODEL
No standard data model exists for the storage and representation of haplotypic data with associated diversity estimates, so we have defined a new data model for the secondary database. It is based on two basic units: the POLYMORPHIC SETS (each a group of sequences belonging to the same gene and species) and the ANALYSIS UNITS (or ALIGNMENTS) (different subgroups of the corresponding polymorphic set, according to the functional region (gene, CDSs, exons, etc.) and the percentage of homology between sequence pairs). All subsequent diversity data are estimated and stored in different joined tables in a MySQL database, related by index tables. Storing diversity estimates in a database makes them permanently available and allows the re-analysis of all or part of the sequences.
The database content is updated daily, and records are assigned unique and permanent MamPol identification numbers to facilitate cross-database referencing. Each new item is assigned a unique, increasing MamPol identifier: a six-digit number preceded by the string SET for polymorphic sets, by MAMpol for analysis units, by MAMseq for individual sequences, and by MAMref for references.
The database can be freely downloaded from our Download page (see the corresponding section in this help) as a compressed gzip file, with the following structure of related tables:

Figure legend:
- Blue tables: primary database (information is retrieved from external databases: GenBank, NCBI PopSet)
- Green tables: secondary database (information is newly generated with PDA)
- Purple table = polsets: main table with the first basic storage unit: POLYMORPHIC SET (for a given gene and species)
- Red table = index_analysis: main table with the second basic storage unit: ANALYSIS UNIT (or ALIGNMENT) (for a given polymorphic set)
The database contains two copies of each green table. The second copy is labeled with _old and contains the older information for analyses that have been reanalyzed (the newest information is stored in the first copy of the table). This makes it possible to trace the history of a polymorphic set or analysis unit, including all the previous results with the corresponding date on which they were analyzed.
