"Biologists should realize that before long we shall have a subject which might be called "protein taxonomy" -- the study of amino acid sequences of the proteins of an organism and the comparison of them between species. It can be argued that these sequences are the most delicate expression possible of the phenotype of an organism and that vast amounts of evolutionary information may be hidden away within them". Francis Crick, 1957


Phytome Documentation



Overview

Phytome contains publicly available protein-coding sequence data from 136 plant species (119 angiosperms and 17 non-angiosperm land plants). The sequence data are derived from a combination of publicly available sequence databases and include gene predictions from genomic DNA, full-length cDNA sequences, and Expressed Sequence Tags (ESTs).

Phytome supports phylogenetic and functional analyses of predicted protein sequences. Multiple sequence alignments and phylogenies have been inferred for related families of proteins. Motif and domain structures have also been inferred for a representative of each subfamily and used to generate automated Gene Ontology classifications. Organellar Unipeptides, transposable elements and other repetitive elements have been identified.

The results of these analyses can be explored in a variety of ways via this website. We hope to make Phytome a useful engine for discovery in plant comparative genomics. If you see a feature that does not work the way you would like, or have suggestions for additional features, please let us know.

An overview of the Phytome analysis pipeline is shown in the image below.


Phytome analysis pipeline. The analysis pipeline is indicated by black arrows among the data components. Future improvements include the addition of sequence-based comparative mapping data and tools.



Unigenes

Source of data

Unigene assemblies have been obtained from the following external sources:

For many species, different unigene assemblies are available. To populate Phytome, one source was chosen to be primary for each species based on having an assembly containing the largest number of component sequences (see table below). A small number of unigenes are excluded due to the absence of consensus sequence or assembly information. For the special cases of Arabidopsis and rice, the primary sources are the set of predicted protein sequences from TAIR (Arabidopsis genome release, version 5.0) and TIGR (rice genome release, version 3.0), respectively. Mimulus unigenes were obtained from EST build 6 at mimulusevolution.org.



Searching Unigenes

Unigenes, unlike Unipeptides, cannot be downloaded from Phytome. However, Phytome does store the IDs (e.g. GenBank accession numbers) of the component sequences used to assemble each unigene. One can retrieve all primary and secondary source unigenes containing a particular component sequence. This enables the user to see the correspondence between unigenes from different sources.
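The lookup described above amounts to an inverted index from component sequence IDs to unigenes. The following is a minimal sketch of that idea, not Phytome's implementation; the unigene IDs, accession numbers and the function name index_by_component are hypothetical.

```python
from collections import defaultdict


def index_by_component(unigenes):
    """Map each component sequence ID to the set of unigene IDs containing it."""
    index = defaultdict(set)
    for unigene_id, components in unigenes.items():
        for component_id in components:
            index[component_id].add(unigene_id)
    return index


# Hypothetical unigenes from a primary and two secondary sources.
unigenes = {
    "primary:U1": ["ACC001", "ACC002"],
    "secondary:C7": ["ACC002", "ACC003"],
    "secondary:C9": ["ACC004"],
}

index = index_by_component(unigenes)
# Unigenes from different sources that share component sequence ACC002.
shared = sorted(index["ACC002"])
print(shared)
```

Two unigenes correspond to each other, in this sense, whenever they appear together under at least one component sequence ID.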


Unipeptides

Source of data

Unipeptide sequences have been derived from translated Unigenes. Phytome currently uses homology-based gene prediction to derive Unipeptides for species other than Arabidopsis. The procedure was as follows. First, Swissprot (release 47.6)/Trembl (release 30.6) was searched with each Unigene sequence using BLASTX (Altschul et al. 1997). The top three matches with E-values less than 1e-5 were used as templates for translation.

In most cases (Arabidopsis and rice excepted), Unipeptides are then inferred from the Unigene sequences. A multi-stage homology search is done against several protein sequence databases using BLAST. First, Uniprot/Swissprot plus TrEMBL plant proteins are searched. If a nearly perfect match is found to a protein from the same species, this protein (or a consensus of all such proteins) is used as the Unipeptide for that Unigene. Failing that, the top three homologs are input to a homology-guided translation step using ESTWise (Birney et al. 2004), drawing on the following three datasets in descending order of priority: (i) all Uniprot/Swissprot plus TrEMBL plant records, (ii) non-plant records in Uniprot/Swissprot, or (iii) non-plant records in Uniprot/TrEMBL.

Some Unigenes do not produce a corresponding Unipeptide in Phytome for any of several (not mutually exclusive) reasons: they may lack a coding sequence (by consisting entirely of the 5'- or 3'-untranslated region, or of an RNA gene), their coding sequence may be too short, homologs cannot be found, or the homology-based translation fails.
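The tiered decision described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the hit records, the 0.98 identity cutoff standing in for "nearly perfect", and the function name choose_translation_source are all hypothetical, and the BLAST and ESTWise runs themselves are not shown.

```python
def choose_translation_source(species, tiers, near_perfect_identity=0.98,
                              max_templates=3):
    """Decide how a Unigene would be turned into a Unipeptide.

    tiers holds BLAST hits for the three datasets in descending priority:
    plant records, non-plant Swissprot records, non-plant TrEMBL records.
    Each hit is a dict with 'id', 'species', 'identity' and 'evalue'.
    """
    plant_hits = tiers[0]
    # A nearly perfect same-species match is used directly.
    same_species = [h for h in plant_hits
                    if h["species"] == species
                    and h["identity"] >= near_perfect_identity]
    if same_species:
        return "direct", [h["id"] for h in same_species]
    # Otherwise take the best hits from the highest-priority tier that has any.
    for hits in tiers:
        if hits:
            best = sorted(hits, key=lambda h: h["evalue"])[:max_templates]
            return "homology_guided", [h["id"] for h in best]
    # No homologs at all: no Unipeptide is produced.
    return "none", []


plant = [
    {"id": "P1", "species": "Zea mays", "identity": 0.80, "evalue": 1e-40},
    {"id": "P2", "species": "Oryza sativa", "identity": 0.75, "evalue": 1e-50},
]
guided = choose_translation_source("Zea mays", [plant, [], []])
exact = {"id": "P3", "species": "Zea mays", "identity": 0.99, "evalue": 1e-90}
direct = choose_translation_source("Zea mays", [plant + [exact], [], []])
missing = choose_translation_source("Zea mays", [[], [], []])
print(guided, direct, missing)
```

The third outcome corresponds to the "homologs cannot be found" case; failures inside the translation step itself are outside the scope of this sketch.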

In Phytome version 2, 1,070,0355 Unigenes were used for ESTwise translations, derived from 5,017,744 ESTs. This resulted in 793,706 Unipeptides. Attrition was either due to BLAST failure (no or too few hits) or to ESTwise failure.

Searching Unipeptides

Unipeptides can be searched based on

  • Unipeptide ID: This is useful if you already know the Unipeptide ID from a previous visit to Phytome. It is not recommended to randomly search for Unipeptide IDs because some IDs do not exist in the current database build.
  • Component Sequences: This will return a Unipeptide obtained from a unigene assembled using the given DNA record from GenBank or another source (e.g. Arabidopsis AT number).
  • Gene/marker alias: Phytome stores some alternate names for Unipeptides and for unigenes in non-primary source databases that share at least one Component Sequence (which we refer to as "secondary sources").
  • InterPro ID or term: The results of InterProScan are available for one exemplar from each subfamily.
  • Gene Ontology ID or term: The results of InterPro2GO mappings are available for one exemplar of each subfamily.

Keyword searches for Interpro and GO terms may be qualified with the terms "contains", "starts with", "ends with" or "is". In addition, MySQL-style regular expressions can be input as queries.
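One plausible way to read the four qualifiers is as SQL LIKE patterns. The sketch below shows that mapping only; how Phytome actually builds its queries is not documented here, and the function name like_pattern is hypothetical. The backslash escaping follows MySQL's default LIKE escape character.

```python
def like_pattern(term, qualifier):
    """Translate a search qualifier into a SQL LIKE pattern for the term."""
    # Escape LIKE wildcards so that they match literally.
    escaped = (term.replace("\\", "\\\\")
                   .replace("%", "\\%")
                   .replace("_", "\\_"))
    templates = {
        "contains": "%{}%",
        "starts with": "{}%",
        "ends with": "%{}",
        "is": "{}",
    }
    return templates[qualifier].format(escaped)


print(like_pattern("kinase", "contains"))
print(like_pattern("zinc_finger", "starts with"))
```

A pattern built this way would be passed to the database as a bound parameter rather than concatenated into the query text.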

The "Unipeptide Page" contains

  • The family and subfamily ID
  • Interpro and Gene Ontology results if the Unipeptide is the exemplar of its subfamily
  • The species of origin
  • A link to the primary source for this unigene sequence
  • A list of constituent ESTs for this Unipeptide
  • A list of secondary sources (i.e. related Unigenes from all sources that share at least one component sequence).
  • Hits to organellar and repeat sequence databases
  • Unipeptide sequence (available for download in FASTA format)


BLAST Search

You may search the Unipeptides in the full Phytome dataset, or one species at a time, using BLAST (Altschul et al. 1997). Use BLASTX for nucleotide queries and BLASTP for protein queries. The BLAST output has been customized to organize the hits by Phytome family.

If you are a registered user, you may also perform a batch BLAST search by uploading a multi-FASTA sequence file. This functionality can be accessed via the "Advanced Features" page.


Interpro, Gene Ontology and other Functional Annotations

Source of data

Sequences for which InterPro and Gene Ontology results are available are marked with a red arrow and the text "InterPro" next to their PhytomeID. InterproScan (Zdobnov and Apweiler 2001) was used to predict domains and functional motifs only for the longest sequence within each Subfamily. Gene Ontology (Ashburner et al. 2000) terms were assigned to Unipeptides using InterPro2GO.

The following databases were searched

Phobius (Kall et al. 2004) was used to predict transmembrane domains and signal peptides. The results are shown together with those for InterproScan.

Where an organellar (plastid or mitochondrial) genome was available for a given species, the Unipeptides from that species were searched against the organellar proteome using BLAST. Positive and negative hits are indicated in the Unipeptide and Family pages by appropriate symbols.

Plastid genomes were available from the following species:

  • Amborella trichopoda (NC_005086), Arabidopsis thaliana (NC_000932), Chlamydomonas reinhardtii (NC_005353), Cucumis sativus (NC_007144), Glycine max (NC_007942), Gossypium hirsutum (NC_007944), Helianthus annuus (NC_007977), Lactuca sativa (NC_007578), Lotus corniculatus var. japonicus (NC_002694), Lycopersicon esculentum (NC_007898), Marchantia polymorpha (NC_001319), Nicotiana sylvestris (NC_007500), Nicotiana tabacum (NC_001879), Oryza sativa (japonica cultivar-group) (NC_001320), Panax ginseng (NC_006290), Saccharum officinarum (NC_006084), Scenedesmus obliquus (NC_008101), Spinacia oleracea (NC_002202), Triticum aestivum (NC_002762), Vitis vinifera (NC_007957), Zea mays (NC_001666).

Mitochondrial genomes were available from the following species:

  • Arabidopsis thaliana (NC_001284), Beta vulgaris subsp. vulgaris (NC_002511), Chlamydomonas reinhardtii (NC_001638), Marchantia polymorpha (NC_001660), Nicotiana tabacum (NC_006581), Oryza sativa (indica cultivar-group) (NC_007886), Physcomitrella patens (NC_007945), Triticum aestivum (NC_007579), Zea mays subsp. mays (NC_007982).

All Unipeptides were searched against the TIGR repeat database and positive hits were classified as "transposon", "retrotransposon", "MITE", and "other". Positive hits are indicated by the appropriate symbols on the Unipeptide and Family pages.

The following repeats databases were used

  • Arabidopsis_GSS_Repeats.v2, Arabidopsis_Repeats.v2, Brassica_GSS_Repeats.v2, Brassica_Repeats.v2, Brassicaceae_Repeats.v2, Fabaceae_Repeats.v2, Glycine_GSS_Repeats.v2, Glycine_Repeats.v2, Gramineae_Repeats.v3.1, Hordeum_Repeats.v3.0, Lotus_GSS_Repeats.v2, Lotus_Repeats.v2, Lycopersicon_GSS_Repeats.v2, Lycopersicon_Repeats.v2, Medicago_GSS_Repeats.v2, Medicago_Repeats.v2, Oryza_GSS_Repeats.v2, Oryza_Repeats.v3.1, Solanaceae_Repeats.v2, Solanum_Repeats.v2, Sorghum_GSS_Repeats.v2, Sorghum_Repeats.v3.0, Triticum_GSS_Repeats.v2, Triticum_Repeats.v3.0, Zea_GSS_Repeats.v2, Zea_Repeats.v3.0.

Searching Interpro and GO Assignments

InterPro and GO IDs and descriptions may be searched from the "Unipeptide Search Tab" and "Family Search Tab". Both InterPro and GO results are shown by default on the "Unipeptide Page", and InterPro results are summarized for each subfamily on the "Family Page". A mouseover of the InterPro graphic will show the start and stop coordinates and the E-value for the match in the status bar of your browser. Only high-confidence matches are reported.


Family Classification

Source of data

In version 1 of Phytome, an all-by-all BLASTP of Unipeptides from all species was used as input to Tribe-MCL (Enright et al. 2002), which outputs non-overlapping clusters of sequences that can be considered approximate Unipeptide Families. Tribe-MCL heuristically takes into account both the strength of the pairwise matches and the interconnectivity among the members of a cluster. The inflation value, which is a tunable parameter affecting the stringency of the clustering, was initially set to three. Clusters with >400 members were iteratively broken into smaller clusters using inflation values of four or five.
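To make the role of the inflation value concrete, here is a minimal sketch of a generic Markov clustering loop (alternating expansion and inflation of a column-stochastic matrix). It is a toy illustration in plain Python, not the Tribe-MCL program, and it omits the BLASTP score transformation and the iterative re-clustering of large clusters; the function names and the toy similarity matrix are assumptions.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(matrix):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, inflation):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in matrix])


def markov_cluster(similarity, inflation=3.0, iterations=30, tol=1e-6):
    """Cluster nodes of a weighted similarity graph; returns lists of indices."""
    n = len(similarity)
    # Add self-loops before normalizing, as is customary for MCL.
    matrix = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each non-empty row of the final matrix lists the members of a cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > tol)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two groups of three mutually similar sequences.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(similarity, inflation=3.0)
print(clusters)
```

Higher inflation values sharpen the contrast between strong and weak connections within each column, which in general yields more, smaller clusters; that is why raising the value was used to break up the largest clusters.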

For Phytome version 2, the process of family assignment builds upon existing families as much as possible. Phytome version 2 Unipeptides that have not changed in sequence retain their earlier family membership. New and updated Unipeptides are first searched against profile hidden Markov models (HMMs) generated using HMMer (Eddy 1998) for the large Unipeptide Families from version 1. Those that clearly fall into existing families need not be clustered. Those that do not match an existing profile HMM are searched against version 1 families that were too small to have HMMs, and against each other, using BLAST. They are assigned to one of the small families if they are closer in sequence to one member than current family members are to each other. If not, they are clustered into new Unipeptide Families as in version 1.
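The rule for placing a sequence into a small family can be sketched as a distance comparison. The reading used here is an assumption: the new sequence qualifies when its distance to its nearest family member is smaller than the largest distance between existing members. The distance values, family IDs and function name are hypothetical.

```python
def assign_to_small_family(candidates):
    """Pick the small family a new Unipeptide would join, or None.

    candidates maps a family ID to a pair of lists:
    (distances from the new sequence to each member,
     pairwise distances among the existing members).
    """
    eligible = {}
    for family_id, (to_members, among_members) in candidates.items():
        if not to_members or not among_members:
            continue
        nearest = min(to_members)
        # Assumed criterion: closer to one member than members are to each other.
        if nearest < max(among_members):
            eligible[family_id] = nearest
    if not eligible:
        return None  # falls through to de novo clustering
    return min(eligible, key=eligible.get)


candidates = {
    "F1": ([0.9, 0.8], [0.3]),
    "F2": ([0.2, 0.6], [0.4]),
}
chosen = assign_to_small_family(candidates)
unplaced = assign_to_small_family({"F1": ([0.9, 0.8], [0.3])})
print(chosen, unplaced)
```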

Clustering was initiated with 793,706 Unipeptides and yielded 25,763 Families of size two or greater.

Searching families

Families may be searched based on

  • Family ID: This is useful if you know a particular family ID based on a previous search of Phytome.
  • InterPro ID or term: InterPro information exists for one member from each subfamily.
  • Gene Ontology ID or term: Gene Ontology information exists for one member from each subfamily.

Keyword searches for Interpro and GO terms may be qualified with the terms "contains", "starts with", "ends with" or "is". In addition, MySQL-style regular expressions can be input as queries.

The "Family Page" includes

  • Related families (identified using BLASTP and HMMer)
  • A list of Subfamilies.
  • A list of family members excluded from the reduced alignment.
  • A list of those species represented within the Family.
  • Organellar and repeat annotations.
The tabs below allow one to view
  • A list of Unipeptides which can be sorted either by Subfamily or by species depending on which tab is selected. The user may select those Unipeptides to include in a multiple alignment and/or phylogeny.
  • InterPro and GO assignments for an exemplar of each subfamily.

By selecting multiple Unipeptides and proceeding to the "Alignment Page", the user can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the unigene source names and those of their component sequences.


Searching by Species

The user may search for families that do or do not contain members from a particular species or clade. The user selects species/clades to include or exclude using radio buttons to the right of each taxon name. If the default "maybe" is selected, Phytome will return a family regardless of whether there are members from that taxon. Note that taxa with small numbers of Unipeptides will necessarily lack members in most families, so it is recommended to use "maybe" instead of "yes" for most searches. Clades at all levels can be expanded or collapsed.
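The yes/no/maybe logic can be sketched as a simple filter over the set of taxa represented in each family. The option strings "yes" and "no" for the include and exclude buttons, the family IDs and the function name are assumptions made for illustration; only "maybe" is named in the text above.

```python
def family_matches(family_taxa, choices):
    """Return True if a family satisfies the per-taxon choices.

    choices maps a taxon name to "yes" (must be present),
    "no" (must be absent) or "maybe" (no constraint).
    """
    for taxon, choice in choices.items():
        present = taxon in family_taxa
        if choice == "yes" and not present:
            return False
        if choice == "no" and present:
            return False
    return True


# Hypothetical families and the taxa represented in each.
families = {
    "FAM1": {"Arabidopsis thaliana", "Oryza sativa", "Zea mays"},
    "FAM2": {"Arabidopsis thaliana"},
    "FAM3": {"Oryza sativa", "Zea mays"},
}
choices = {
    "Arabidopsis thaliana": "no",
    "Oryza sativa": "yes",
    "Zea mays": "maybe",
}
selected = sorted(f for f, taxa in families.items()
                  if family_matches(taxa, choices))
print(selected)
```

Because every "yes" is a hard requirement, each additional "yes" on a sparsely sampled taxon can only shrink the result set, which is the reason for preferring "maybe" in most searches.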


Alignments

Source of data

A multiple sequence alignment (MSA) is produced for every Unipeptide Family. Owing to the well-known difficulties of automated de novo MSA on large protein families, especially when there are many incomplete sequences, different software programs and parameters need to be applied depending on the context. The vast majority of de novo MSAs for Phytome version 1 were produced using MAFFT, which is both extremely rapid and has been shown to produce alignments comparable with the most successful general-purpose programs available (Katoh et al. 2005). For each family, MAFFT was run once with two iterations and once with three; the better alignment, as determined by the average sum of pairs (SP) score with the PAM250 substitution matrix (Dayhoff 1978), was retained. Alignments of families with 20–700 members generated by MAFFT were subsequently refined using RASCAL (Thompson et al. 2003), but since we encountered cases in which applying RASCAL yielded a lower quality MSA than was input, the refined alignment was used only when it had a higher SP score. For a small number of families, the MAFFT alignment was found to be inadequate, and T-COFFEE (Notredame et al. 2000) or DIALIGN (Morgenstern 1999) was used to generate alignments for these.
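The selection between two candidate alignments by average SP score can be sketched as below. Two simplifications are assumptions of this sketch: a toy match/mismatch function stands in for the PAM250 matrix, and residue pairs involving a gap are simply skipped. The function names are hypothetical.

```python
from itertools import combinations


def average_sp_score(alignment, score):
    """Average substitution score over all aligned residue pairs."""
    total, pairs = 0.0, 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" or b == "-":
                continue  # assumption: gap-containing pairs are not scored
            total += score(a, b)
            pairs += 1
    return total / pairs if pairs else 0.0


def toy_score(a, b):
    """Stand-in for a PAM250 lookup: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def pick_better(aln_a, aln_b, score):
    """Keep whichever alignment has the higher average SP score."""
    if average_sp_score(aln_a, score) >= average_sp_score(aln_b, score):
        return aln_a
    return aln_b


aln_a = ["MKV-", "MKVL", "MRVL"]
aln_b = ["-MKV", "MKVL", "MRVL"]
score_a = average_sp_score(aln_a, toy_score)
score_b = average_sp_score(aln_b, toy_score)
best = pick_better(aln_a, aln_b, toy_score)
print(score_a, score_b, best)
```

The same comparison applies to the refinement step: a refined alignment replaces the input only when its score is higher.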

Profile HMMs (Eddy 1998) were trained for Unipeptide Families of Phytome version 1 and used to identify new family members, using full alignments. Profile HMMs were also used to guide full alignments of version 2 Unipeptide Families using HMMer, rather than recalculating each MSA from scratch. Families for which no prior profile HMM is available are aligned as in version 1. Profiles are calculated for new families and for old ones that change substantially in membership between versions.

Generating reduced alignments

To ensure positional homology of columns used for reconstructing phylogenies, we developed a program named REAP (Hartmann, Phillips and Vision, unpublished) that pruned the full MSA by

  • excluding extremely "gappy" and divergent columns
  • excluding sequences that had little overlap with the rest of the sequences in the family or were obviously misaligned.

Specifically, within a sliding window of size 6, REAP retained a column when the sum-of-pairs score within the column was greater than 0.5 (using a PAM250 matrix) and the percentage of gaps did not exceed 75%. REAP then removed sequences that had more than 80% gaps among the retained columns and removed misaligned sequences that had a low average sum-of-pairs score relative to all other sequences.
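The gap-based part of this pruning can be sketched in a few lines. This is not REAP itself: the sketch applies only the 75% column gap limit and the 80% sequence gap limit quoted above, and it leaves out the sliding window, the sum-of-pairs column score and the misalignment check. The sequence names and function names are hypothetical.

```python
def gap_fraction(characters):
    """Fraction of gap characters in a column or sequence."""
    return sum(c == "-" for c in characters) / len(characters)


def reduce_alignment(alignment, max_column_gaps=0.75, max_sequence_gaps=0.80):
    """Drop gappy columns, then drop sequences that are mostly gaps.

    alignment maps sequence names to equal-length aligned strings.
    Returns (kept column indices, retained sequences, excluded names).
    """
    names = list(alignment)
    columns = list(zip(*(alignment[name] for name in names)))
    kept = [i for i, column in enumerate(columns)
            if gap_fraction(column) <= max_column_gaps]
    if not kept:
        return kept, {}, names  # no columns retained, nothing to align
    reduced = {name: "".join(alignment[name][i] for i in kept)
               for name in names}
    retained = {name: seq for name, seq in reduced.items()
                if gap_fraction(seq) <= max_sequence_gaps}
    excluded = [name for name in names if name not in retained]
    return kept, retained, excluded


alignment = {
    "a": "MKVLA-",
    "b": "MKVLA-",
    "c": "MRVLAG",
    "d": "MKVLA-",
    "e": "------",
}
kept, retained, excluded = reduce_alignment(alignment)
print(kept, retained, excluded)
```

In this toy input the last column is dropped because four of five sequences have a gap there, and sequence "e" is excluded because it consists entirely of gaps in the retained columns.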

Searching Alignments

Reduced alignments are displayed after the user selects which Unipeptides to include from the "Family Page". In the default display, each set of contiguous columns excluded from the full alignment is shown by an integer flanked by the neighboring columns of aligned residues. Both full and reduced alignments (in FASTA format) are available for download as text files. Alignments can also be viewed interactively using the JalView Java applet (Clamp et al. 2004). This requires the J2SE Java Runtime Environment (JRE), which allows end-users to run Java applications, to be installed.
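The default display, in which each run of excluded columns is replaced by a count, can be sketched as follows. The square brackets around the count and the function name are assumptions of this sketch; the actual page layout may differ.

```python
def render_reduced(sequence, kept_columns):
    """Show retained residues, replacing each excluded run by its length."""
    kept = set(kept_columns)
    parts, skipped = [], 0
    for i, residue in enumerate(sequence):
        if i in kept:
            if skipped:
                parts.append(f"[{skipped}]")
                skipped = 0
            parts.append(residue)
        else:
            skipped += 1
    if skipped:
        parts.append(f"[{skipped}]")
    return "".join(parts)


rendered = render_reduced("MKVLAGHT", [0, 1, 5, 6, 7])
print(rendered)
```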

In some cases, no alignment columns are retained by REAP, in which case the reduced alignment and phylogeny are unavailable.


Phylogenies

Source of Data

The programs Protdist, Neighbor, and Retree of the PHYLIP package v3.62 (Felsenstein 2004) were used to calculate amino acid substitution distance matrices from the reduced alignments, to construct unrooted neighbor-joining phylogenies, and to obtain midpoint roots, respectively.

Viewing the Phylogenies

Phylogenies can be viewed and manipulated using the ATV Java applet (Zmasek and Eddy 2002). This requires the J2SE Java Runtime Environment (JRE), which allows end-users to run Java applications, to be installed. Phylogenies can also be downloaded as a text file in Newick format. The Unipeptides included in the phylogenetic tree, and in the Newick file available for download, are those that were selected from the "Family Information Page".


Subfamily classification

Subfamilies were identified to facilitate analysis within large families. They are solely defined by the phylogenetic structure of the family and no functional coherence within a subfamily should be assumed. To obtain subfamilies, each midpoint-rooted tree was traversed by a reverse breadth-first search (i.e. from the leaves to the root). During this traversal, monophyletic clades containing up to 50 sequences were selected and defined as subfamilies. Sequences excluded from the multiple sequence alignment by REAP were placed into a separate Subfamily (numbered 0). It is possible that these sequences can be aligned correctly with manual intervention.
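A sketch of the clade selection is given below. It swaps in a top-down recursion for the leaves-to-root traversal described above, on the assumption that the intended result is the set of largest clades that do not exceed the size limit. The nested-tuple tree representation, the leaf names and the function names are hypothetical.

```python
def leaves(node):
    """Collect leaf names under a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def subfamilies(tree, max_size=50):
    """Split a rooted tree into maximal clades of at most max_size leaves."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    members = leaves(tree)
    if len(members) <= max_size:
        return [members]  # the whole clade fits in one subfamily
    groups = []
    for child in tree:
        groups.extend(subfamilies(child, max_size))
    return groups


# Toy rooted tree with seven sequences and a limit of three per subfamily.
tree = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))
groups = subfamilies(tree, max_size=3)
print(groups)
```

Every resulting group is monophyletic by construction, since each one is the complete leaf set of a single internal node or a single leaf.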


Batch download

Registered users can download batch files that provide information on Phytome Unipeptides, Families, and unigene assemblies. Files available for download are

  • contig2contig.dat.gz : shows the correspondence among unigenes from Plant Genome Network, Plant Genome Database, NCBI, TIGR, SPUTNIK as determined by shared Component Sequences.
  • family2unipeptide.dat.gz: lists the Unipeptides in each Family and the primary unigene source for each Unipeptide.
  • unipeptide2gbaccession.dat.gz: lists the Component Sequences for each Unipeptide.

In addition, we provide connection strings that allow external pages to link to Phytome Unipeptides and Families.


Hyperlinks to Unipeptides and Families

To link to a Phytome entity from an external webpage, use the following strings. Substitute actual IDs for "Unipeptide_ID" and "Family_ID".

  • Unipeptide: http://www.phytome.org/search-unipeptide.php?keyword="Unipeptide_ID"&unipep_search_type=unipeptideid
  • Family: http://www.phytome.org/search-family.php?keyword="Family_ID"&family_search_type=familyid
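Links of this form can be generated programmatically. The sketch below builds them from the two patterns above; it assumes the ID replaces the quoted placeholder without the quotation marks, and the example IDs and function names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.phytome.org"


def unipeptide_url(unipeptide_id):
    """Build a link to a Unipeptide from its ID."""
    query = urlencode({"keyword": unipeptide_id,
                       "unipep_search_type": "unipeptideid"})
    return f"{BASE_URL}/search-unipeptide.php?{query}"


def family_url(family_id):
    """Build a link to a Family from its ID."""
    query = urlencode({"keyword": family_id,
                       "family_search_type": "familyid"})
    return f"{BASE_URL}/search-family.php?{query}"


# Hypothetical IDs, for illustration only.
print(unipeptide_url("12345"))
print(family_url("678"))
```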


References

  • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-402
  • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25-9
  • Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res 14:988-995
  • Clamp M, Cuff J, Searle SM, Barton GJ (2004) The Jalview Java alignment editor. Bioinformatics 20:426-7
  • Dayhoff MO (1978) Survey of new data and computer methods of analysis. In MO Dayhoff, ed, Atlas of Protein Sequence and Structure.
  • Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-776
  • Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575-1584
  • Felsenstein J (2004) PHYLIP (Phylogeny Inference Package). Distributed by the author. Department of Genome Sciences, University of Washington, Seattle
  • Kall L, Krogh A, Sonnhammer ELL (2004) A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology 338:1027-1036
  • Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511-518
  • Lassmann T, Sonnhammer ELL (2002) Quality assessment of multiple alignment programs. Febs Letters 529:126-130
  • Morgenstern B (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211-218
  • Nicholas HB, Jr., Ropelewski AJ, Deerfield DW, 2nd (2002) Strategies for multiple sequence alignment. Biotechniques 32:572-590
  • Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3:131-144
  • Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302:205-217
  • Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504
  • Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-80
  • Thompson JD, Plewniak F, Poch O (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27:2682-2690
  • Thompson JD, Thierry JC, Poch O (2003) RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 19:1155-1161
  • Zdobnov EM, Apweiler R (2001) InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17:847-8
  • Zmasek CM, Eddy SR (2002) RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3:14
 
Department of Biology | University of North Carolina at Chapel Hill