Similar literature
1.
Large-scale DNA sequencing is creating a sequence infrastructure of great benefit to protein biochemistry. Concurrent with the application of large-scale DNA sequencing to whole genome analysis, mass spectrometry has attained the capability to rapidly, and with remarkable sensitivity, determine weights and amino acid sequences of peptides. Computer algorithms have been developed to use the two different types of data generated by mass spectrometers to search sequence databases. When a protein is digested with a site-specific protease, the molecular weights of the resulting collection of peptides, the mass map or fingerprint, can be determined using mass spectrometry. The molecular weights of the set of peptides derived from the digestion of a protein can then be used to identify the protein. Several different approaches have been developed. Protein identification using peptide mass mapping is an effective technique when studying organisms with completed genomes. A second method is based on the use of data created by tandem mass spectrometers. Tandem mass spectra contain highly specific information in the fragmentation pattern as well as sequence information. This information has been used to search databases of translated protein sequences as well as nucleotide databases such as expressed sequence tag (EST) sequences. The ability to search nucleotide databases is an advantage when analyzing data obtained from organisms whose genomes are not yet completed but for which a large amount of expressed gene sequence is available (e.g., human and mouse). Furthermore, a strength of using tandem mass spectra to search databases is the ability to identify proteins present in fairly complex mixtures.
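
The peptide mass mapping idea described in this abstract can be illustrated with a minimal sketch. It is not the algorithm of any program mentioned in the abstract. The function names, the trypsin rule (cleave after K or R unless the next residue is P), the 0.5 Da tolerance, and the simple hit-count score are all assumptions chosen for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Split a sequence after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def score(observed_masses, protein, tolerance=0.5):
    """Count observed masses that match any theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(mass - t) <= tolerance for t in theoretical)
        for mass in observed_masses
    )


def identify(observed_masses, database, tolerance=0.5):
    """Return the name of the database entry with the most matching masses."""
    return max(
        database,
        key=lambda name: score(observed_masses, database[name], tolerance),
    )
```

Here `database` is a hypothetical dictionary mapping protein names to sequences. A real search would also need to handle missed cleavages, modifications, and a statistical score rather than a raw hit count.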

2.
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, Email, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov

3.
The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as "bags of genes" and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of clustering of genes that have orthologs in other genomes.
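
The "bags of genes" comparison in this abstract can be sketched as a set operation. This is a hedged illustration, not the authors' procedure: it assumes each genome has already been reduced to a set of ortholog-group identifiers (the ortholog assignment itself is not shown), and normalizing by the smaller genome is an assumption made here.

```python
def shared_fraction(genome_a, genome_b):
    """Shared ortholog groups divided by the size of the smaller genome."""
    a, b = set(genome_a), set(genome_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def pairwise_shared(genomes):
    """Shared fraction for every unordered pair of named genomes."""
    names = sorted(genomes)
    return {
        (x, y): shared_fraction(genomes[x], genomes[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

`genomes` is a hypothetical dictionary such as `{"g1": {"og1", "og2"}, "g2": {"og2"}}`. A different normalization (for example by the union of both sets) would give a different but related distance.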

4.
We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF also can generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs often can represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains more than 50,000 motifs. For each alignment, the database contains several motifs whose probabilities of matching a false positive range from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while generating very few false predictions. IDENTIFY assigns biological functions to 25-30% of all proteins encoded by the Saccharomyces cerevisiae genome and by several bacterial genomes. In particular, IDENTIFY assigned functions to 172 proteins of unknown function in the yeast genome.

5.
6.
We describe a computer program, named DNA-Protein Search (DPS), for comparing a megabase DNA sequence with a protein sequence database. The DPS program addresses the problems of frameshifts and introns in the DNA sequence. The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome, the 0.58-megabase sequence of the Mycoplasma genitalium genome, and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII. The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.

7.
A method is described for searching protein sequence databases using tandem mass spectra of tryptic peptides. The approach uses a de novo sequencing algorithm to derive a short list of possible sequence candidates which serve as query sequences in a subsequent homology-based database search routine. The sequencing algorithm employs a graph theory approach similar to previously described sequencing programs. In addition, amino acid composition, peptide sequence tags and incomplete or ambiguous Edman sequence data can be used to aid in the sequence determinations. Although sequencing of peptides from tandem mass spectra is possible, one of the frequently encountered difficulties is that several alternative sequences can be deduced from one spectrum. Most of the alternative sequences, however, are sufficiently similar for a homology-based sequence database search to be possible. Unfortunately, the available protein sequence database search algorithms (e.g. Blast or FASTA) require a single unambiguous sequence as input. Here we describe how the publicly available FASTA computer program was modified in order to search protein databases more effectively in spite of the ambiguities intrinsic in de novo peptide sequencing algorithms.

8.
Raw sequence data representing the majority of a bacterial genome can be obtained at a tiny fraction of the cost of a completed sequence. To demonstrate the utility of such a resource, 870 single-stranded M13 clones were sequenced from a shotgun library of the Salmonella typhi Ty2 genome. The sequence reads averaged over 400 bases and sampled the genome with an average spacing of once every 5,000 bases. A total of 339,243 bases of unique sequence was generated (approximately 7% representation). The sample of 870 sequences was compared to the complete Escherichia coli K-12 genome and to the rest of the GenBank database, which can also be considered a collection of sampled sequences. Despite the incomplete S. typhi data set, interesting categories could easily be discerned. Sixteen percent of the sequences determined from S. typhi had close homologs among known Salmonella sequences (P < 1e-40 in BlastX or BlastN), reflecting the proportion of these genomes that have been sequenced previously; 277 sequences (32%) had no apparent orthologs in the complete E. coli K-12 genome (P > 1e-20), of which 155 sequences (18%) had no close similarities to any sequence in the database (P > 1e-5). Eight of the 277 sequences had similarities to genes in other strains of E. coli or plasmids, and six sequences showed evidence of novel phage lysogens or sequence remnants of phage integrations, including a member of the lambda family (P < 1e-15). Twenty-three sample sequences had a significantly closer similarity to a sequence in the database from organisms other than the E. coli/Salmonella clade (which includes Shigella and Citrobacter). These sequences are new candidate lateral transfer events to the S. typhi lineage or deletions on the E. coli K-12 lineage. Eleven putative junctions of insertion/deletion events greater than 100 bp were observed in the sample, indicating that well over 150 such events may distinguish S. typhi from E. coli K-12. The need for automatic methods to more effectively exploit sample sequences is discussed.
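
The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract. Treating the "approximately 7%" representation as exactly 0.07 and scaling counts linearly from the sample to the whole genome are assumptions made here, so the results are rough estimates.

```python
def percent(part, whole):
    """Share of `whole` represented by `part`, in percent."""
    return 100.0 * part / whole


READS = 870              # sampled clones
UNIQUE_BASES = 339_243   # unique sequence generated
REPRESENTATION = 0.07    # "approximately 7%", taken as exact here

# Genome size implied by the unique bases and the stated representation.
implied_genome_size = UNIQUE_BASES / REPRESENTATION

# Fractions of the 870 reads in the two categories named in the abstract.
no_ecoli_ortholog = percent(277, READS)
no_database_match = percent(155, READS)

# Linear scaling of the 11 observed insertion/deletion junctions.
projected_indels = 11 / REPRESENTATION

print(f"implied genome size: {implied_genome_size / 1e6:.2f} Mb")
print(f"no E. coli ortholog: {no_ecoli_ortholog:.1f}%")
print(f"no database match:   {no_database_match:.1f}%")
print(f"projected junctions: {projected_indels:.0f}")
```

Under these assumptions the implied genome size is a little under 5 Mb and the linear projection of junctions lands just above 150, in line with the abstract's rounded percentages and its estimate.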

9.
IMGT, the international ImMunoGeneTics database, is an integrated database specialising in Immunoglobulins (Ig), T cell Receptors (TcR) and Major Histocompatibility Complex (MHC) of all vertebrate species, created by Marie-Paule Lefranc, CNRS, Montpellier II University, Montpellier, France (lefranc@ligm.crbm.cnrs-mop.fr). IMGT includes three databases: LIGM-DB (for Ig and TcR), MHC/HLA-DB and PRIMER-DB (the last two in development). IMGT comprises expertly annotated sequences and alignment tables. LIGM-DB contains more than 23 000 Immunoglobulin and T cell Receptor sequences from 78 species. MHC/HLA-DB contains Class I and Class II Human Leucocyte Antigen alignment tables. An IMGT tool, DNAPLOT, developed for Ig, TcR and MHC sequence alignments, is also available. IMGT works in close collaboration with the EMBL database. The goals of IMGT are to establish common access to all immunogenetics data, including nucleotide and protein sequences, oligonucleotide primers, gene maps and other genetic data of Ig, TcR and MHC molecules, and to provide user-friendly graphical data access. IMGT has important implications for medical research (repertoire in autoimmune diseases, AIDS, leukemias, lymphomas), therapeutic approaches (antibody engineering), genome diversity and genome evolution studies. IMGT is freely available at http://imgt.cnusc.fr:8104

10.
HSSP is a derived database merging structural three-dimensional (3-D) and sequence one-dimensional (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family. The list of homologues is the result of a database search in Swissprot using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures covering 27% of all Swissprot-stored sequences.

11.
The goal of the fungal mitochondrial genome project (FMGP) is to sequence complete mitochondrial genomes for a representative sample of the major fungal lineages; to analyze the genome structure, gene content, and conserved sequence elements of these sequences; and to study the evolution of gene expression in fungal mitochondria. By using our new sequence data for evolutionary studies, we were able to construct phylogenetic trees that provide further solid evidence that animals and fungi share a common ancestor to the exclusion of chlorophytes and protists. With a database comprising multiple mitochondrial gene sequences, the level of support for our mitochondrial phylogenies is unprecedented, in comparison to trees inferred with nuclear ribosomal RNA sequences. We also found several new molecular features in the mitochondrial genomes of lower fungi, including: (1) tRNA editing, which is the same type as that found in the mitochondria of the amoeboid protozoan Acanthamoeba castellanii; (2) two novel types of putative mobile DNA elements, one encoding a site-specific endonuclease that confers mobility on the element, and the other constituting a class of highly compact, structured elements; and (3) a large number of introns, which provide insights into intron origins and evolution. Here, we present an overview of these results, and discuss examples of the diversity of structures found in the fungal mitochondrial genome.

12.
It is well established that sequence templates such as those in the PROSITE and PRINTS databases are powerful tools for predicting the biological function and tertiary structure for newly derived protein sequences. The number of X-ray and NMR protein structures is increasing rapidly and it is apparent that a 3D equivalent of the sequence templates is needed. Here, we describe an algorithm called TESS that automatically derives 3D templates from structures deposited in the Brookhaven Protein Data Bank. While a new sequence can be searched for sequence patterns, a new structure can be scanned against these 3D templates to identify functional sites. As examples, 3D templates are derived for enzymes with an O-His-O "catalytic triad" and for the ribonucleases and lysozymes. When these 3D templates are applied to a large data set of nonidentical proteins, several interesting hits are located. This suggests that the development of a 3D template database may help to identify the function of new protein structures, if unknown, as well as to design proteins with specific functions.

13.
14.
PRINTS is a database of protein family 'fingerprints' offering a diagnostic resource for newly-determined sequences. By contrast with PROSITE, which uses single consensus expressions to characterise particular families, PRINTS exploits groups of motifs to build characteristic signatures. These signatures offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 800 fingerprints have been constructed and stored in PRINTS. The current version, 17.0, encodes approximately 4500 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via the UCL Bioinformatics World Wide Web (WWW) Server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/. We have recently enhanced the usefulness of PRINTS by making available new, intuitive search software. This allows both individual query sequence and bulk data submission, permitting easy analysis of single sequences or complete genomes. Preliminary results indicate that use of the PRINTS system is able to assign additional functions not found by other methods, and hence offers a useful adjunct to current genome analysis protocols.

15.
16.
Neisseria meningitidis isolates are conventionally classified by serosubtyping, which characterizes the reactivities of the PorA outer membrane protein variable-region (VR) epitopes with monoclonal antibodies (MAbs). A newer method (PorA VR typing) uses predicted amino acid sequences derived from DNA sequence analysis. The resulting classification schemes are not standardized, offering conflicting and sometimes irreconcilable data from the two methods. In this paper, we propose a standardization of the PorA VR typing nomenclature that incorporates serologic information from traditional PorA serosubtyping with molecular data from predicted VR sequences. We performed a comprehensive literature and database search, generating a collection of strains and DNA sequences that reflects the diversity within PorA that exists to date. We have arranged this information in a comprehensive logical model that includes both serosubtype and PorA VR type assignments. Our data demonstrate that the current panel of serosubtype-defining MAbs underestimates PorA VR variability by at least 50%. Our proposal for VR typing is informative because amino acid sequence and serologic information, when serosubtype-defining MAbs are available, can be deduced simultaneously from the PorA VR designation. This scheme will be useful in future classification and applied epidemiologic studies of N. meningitidis, being a systematic way of selecting PorA vaccine candidates and analyzing vaccine coverage and failure.

17.
The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, was established in 1988 to perform basic research in the field of computational molecular biology as well as build and distribute molecular biology databases. The basic research has led to new algorithms and analysis tools for interpreting genomic data and has been instrumental in the discovery of human disease genes for neurofibromatosis and Kallmann syndrome. The principal database responsibility is the National Institutes of Health (NIH) genetic sequence database, GenBank. NCBI, in collaboration with international partners, builds, distributes, and provides online and CD-ROM access to over 112,000 DNA sequences. Another major program is the integration of multiple sequence databases and related bibliographic information and the development of network-based retrieval systems for Internet access.

18.
Recent increases in the number of genome sequencing projects mean that the amount of protein sequence in databases is increasing at an astonishing pace. In proteome studies, this is facilitating the identification of proteins from molecularly well-defined organisms. However, in studies of proteins from the majority of organisms, proteins must be identified by comparing analytical data to sequences in databases from other species. This process is known as cross-species protein identification. Here we present a new program, MultiIdent, which uses multiple protein parameters such as amino acid composition, peptide masses, sequence tags, estimated protein pI and mass, to achieve cross-species protein identification. The program is structured so that protein amino acid composition, which is highly conserved across species boundaries, first generates a set of candidate proteins. These proteins are then queried with other protein parameters such as sequence tags and peptide masses. A final list of database entries which considers all analytical parameters is presented, ranked by an integrated score. We illustrate the power of the approach with the identification of a set of standard proteins, and the identification of proteins from dog heart separated by two-dimensional gel electrophoresis. The MultiIdent program is available on the world-wide web at: http://www.expasy.ch/sprot/multiident.html.
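
The first stage described in this abstract, generating candidates from amino acid composition, can be sketched as follows. This is not MultiIdent's scoring scheme, which the abstract does not specify. The Euclidean distance, the function names, and the `top` cutoff are assumptions chosen for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence.

    Assumes the sequence contains at least one standard residue.
    """
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def distance(comp_a, comp_b):
    """Euclidean distance between two composition vectors."""
    return sum((x - y) ** 2 for x, y in zip(comp_a, comp_b)) ** 0.5


def candidates(query_composition, database, top=3):
    """Names of the `top` database entries closest in composition."""
    ranked = sorted(
        database,
        key=lambda name: distance(query_composition, composition(database[name])),
    )
    return ranked[:top]
```

In a pipeline like the one the abstract outlines, the returned candidates would then be filtered with further evidence such as peptide masses or sequence tags. Here `database` is a hypothetical dictionary of protein names to sequences.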

19.
Comparative analysis of the complete sequences of seven bacterial and three archaeal genomes leads to the first generalizations of emerging genome-based microbiology. Protein sequences are, generally, highly conserved, with ~70% of the gene products in bacteria and archaea containing ancient conserved regions. In contrast, there is little conservation of genome organization, except for a few essential operons. The most striking conclusions derived by comparison of multiple genomes from phylogenetically distant species are that the number of universally conserved gene families is very small and that multiple events of horizontal gene transfer and genome fusion are major forces in evolution.

20.
Searches of the major public protein databases with core and linker chicken and human histone sequences have resulted in the compilation of an annotated set of histone protein sequences. In addition, new database searches with two distinct motif search algorithms have identified several members of the histone fold family, including human DRAP1 and yeast CSE4. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, links to the Entrez integrated information retrieval system, structures for histone and histone fold proteins, and the ability to visualize structural data through Cn3D. The database currently contains >1000 protein sequences, which are searchable by protein type, accession number, organism name, or any other free text appearing in the definition line of the entry. All sequences and alignments in this database are available through the World Wide Web at http://www.nhgri.nih.gov/DIR/GTB/HISTONES or http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES
