Similar literature
 20 similar articles found (search time: 31 ms)
1.
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
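
The figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only re-derives the stated false positive rate and the "about twice" comparison from the numbers in the abstract; the variable names are illustrative and nothing here reproduces the authors' benchmark.

```python
# Numbers taken directly from the abstract above.
false_positives = 9
possible_false_pairs = 432_680

# "about 1/50,000": how many possible false pairs per accepted false positive.
one_in = possible_false_pairs / false_positives

# Fraction of true homologous relationships found at that error rate.
profile_methods = {"SAM-T98": 0.35, "PSI-BLAST": 0.30, "ISS": 0.25}
pairwise_methods = {"FASTA": 0.17, "GAP-BLAST": 0.15}

mean_profile = sum(profile_methods.values()) / len(profile_methods)
mean_pairwise = sum(pairwise_methods.values()) / len(pairwise_methods)
improvement = mean_profile / mean_pairwise

print(f"false positive rate: about 1 in {one_in:,.0f}")
print(f"profile vs pairwise coverage: {improvement:.2f}x")
```

Averaging the three multiple-sequence methods against the two pairwise methods is one possible reading of "about twice"; the abstract does not say how the comparison was aggregated.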

2.
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.

3.
The set of proteins which are conserved across families of microbes contains important targets of new anti-microbial agents. We have developed a simple and efficient computational tool which determines concordances of putative gene products that show sets of proteins conserved across one set of user-specified genomes and not present in another set of user-specified genomes. The thresholds and the homology scoring criterion are selectable to allow the user to decide the stringency of the homologies. The system uses a relational database to store protein coding regions from different genomes, and to store the results of a complete comparison of all sequences against all sequences using the FASTA program. Using Web technology, the display of all the related proteins for a given sequence and calculation of multiple sequence alignments (using CLUSTALW) can be performed with the click of a button. The current database holds 97 365 sequences from 19 complete or partial genomes and 8 798 905 FASTA comparison results. An example concordance is presented which demonstrates that the target of the quinolone antibiotics could have been identified using this tool.
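
The concordance query described in this abstract amounts to a filter over precomputed comparison results: keep proteins with a hit in every required genome and no hit in any excluded genome. The following is a minimal sketch of that idea, not the authors' system. The record layout `(query, genome, score)`, the function names, the score threshold, and the toy identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def genomes_with_hits(comparisons, min_score):
    """Map each query protein to the genomes where it has a hit scoring >= min_score."""
    hits = defaultdict(set)
    for query, genome, score in comparisons:
        if score >= min_score:
            hits[query].add(genome)
    return hits


def concordance(hits, required, excluded):
    """Proteins with a hit in every required genome and in none of the excluded ones."""
    return {
        query
        for query, genomes in hits.items()
        if required <= genomes and not (excluded & genomes)
    }


# Toy comparison results (hypothetical identifiers and scores).
comparisons = [
    ("protA", "genome1", 950.0),
    ("protA", "genome2", 900.0),
    ("protA", "genome3", 40.0),   # below threshold, so treated as absent
    ("protB", "genome1", 800.0),
    ("protB", "genome2", 700.0),
    ("protB", "genome3", 650.0),  # present in the excluded genome
    ("protC", "genome1", 500.0),  # missing from genome2
]

hits = genomes_with_hits(comparisons, min_score=100.0)
selected = concordance(hits, required={"genome1", "genome2"}, excluded={"genome3"})
print(sorted(selected))
```

Raising or lowering `min_score` plays the role of the user-selectable stringency mentioned in the abstract; in the toy data, `protA` qualifies only because its weak match in `genome3` falls below the threshold.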

4.
A "single-base sequence" is a DNA sequence in which the identities and locations of bases of only one type have been determined. We present experimental procedures for single-base sequencing and describe the effective use of existing software (FASTA) in similarity comparisons of single-base sequences. We determined the theoretical and experimental minimum sequence lengths required for identification of a sequence within a large dataset and optimized the FASTA parameters for use in single-base similarity comparisons. Single-base sequences have been used to identify cDNAs occurring in a database. Single-base sequencing could be used to reduce the redundancy of "shot-gun sequencing."

5.
Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models.
The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another or all the immunoglobulin-like folds into a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the "structural universe" is populated.

6.
HSSP is a derived database merging structural three-dimensional (3-D) and sequence one-dimensional (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family. The list of homologues is the result of a database search in Swissprot using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures covering 27% of all Swissprot-stored sequences.

7.
A method is described for searching protein sequence databases using tandem mass spectra of tryptic peptides. The approach uses a de novo sequencing algorithm to derive a short list of possible sequence candidates which serve as query sequences in a subsequent homology-based database search routine. The sequencing algorithm employs a graph theory approach similar to previously described sequencing programs. In addition, amino acid composition, peptide sequence tags and incomplete or ambiguous Edman sequence data can be used to aid in the sequence determinations. Although sequencing of peptides from tandem mass spectra is possible, one of the frequently encountered difficulties is that several alternative sequences can be deduced from one spectrum. Most of the alternative sequences, however, are sufficiently similar for a homology-based sequence database search to be possible. Unfortunately, the available protein sequence database search algorithms (e.g. BLAST or FASTA) require a single unambiguous sequence as input. Here we describe how the publicly available FASTA computer program was modified in order to search protein databases more effectively in spite of the ambiguities intrinsic in de novo peptide sequencing algorithms.

8.
9.
We have designed and implemented a system to carry out cross-genome comparisons of open reading frames (ORFs) from multiple genomes. This implementation includes a genome profiling system that allows us to explore pairwise comparisons at different levels of match similarity and ask biologically motivated queries involving number and identity of ORFs, their function, functional category, distribution in genomes or in biological domains, and statistics on their matches and match families. This analysis required precise definition of new classification terms and concepts. We define the terms genomic signature, summary signature, biologic domain signature, domain class, match level, match family, and extended match family, then use these terms to define concepts, including genomically universal proteins and proteins characteristic of sets of genomes. We initiate an analysis based on automated FASTA (Pearson, 1996) comparison of 22,419 conceptually translated protein sequences from nine microbial genomes.

10.
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions, stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.

11.
The amino acid sequences of the amidinotransferases and the nucleotide sequences of their genes or cDNA from four Streptomyces species (seven genes) and from the kidneys of rat, pig, human and human pancreas were compared. The overall amino acid and nucleotide sequences of the prokaryotes and eukaryotes were very similar and, further, three regions were identified that were highly identical. Evidence is presented that there is virtually zero chance that the overall and high identity regions of the amino acid sequence similarities and the overall nucleotide sequence similarities between Streptomyces and mammals represent a random match. Both rat and lamprey amidinotransferases were able to use inosamine phosphate, the amidine group acceptor of Streptomyces. We have concluded that the structure and function of the amidinotransferases and their genes have been highly conserved through evolution from prokaryotes to eukaryotes. The evolution has occurred with: (1) a high degree of retention of nucleotide and amino acid sequences; (2) a high degree of retention of the primitive Streptomyces guanine + cytosine (G + C) third codon position composition in certain high identity regions of the eukaryote cDNA; (3) a decrease in the specificities for the amidine group acceptors; and (4) most of the mutations silent in the regions suggested to code for active sites in the enzymes.

12.
The heterochromatic Responder (Rsp) locus of Drosophila melanogaster is the target of the two distorter loci Sd and E(SD). Rsp is located in a specific heterochromatic region of the second chromosome and is made up of AT-rich satellite sequences whose abundance is related to its sensitivity to the distorter chromosomes. Here we report that a cluster of Rsp sequences is also located in the third chromosome. The third-chromosome cluster has the same flanking sequences as the clone originally used to identify the Rsp elements, and one of the flanking sequences is a rearranged 412 retrotransposon. The presence of a second, unlinked Rsp-sequence cluster makes re-interpretation necessary for some earlier experiments in which segregation of the third chromosome had not been followed and raises interesting possibilities for the origin of the Rsp locus.

13.
Analysis of the organization of nucleotide sequences in the mouse genome is carried out on total DNA at different fragment sizes, reannealed to an intermediate Cot value, by Ag+-Cs2SO4 density gradient centrifugation. According to nuclease S-1 resistance and kinetic renaturation curves, the mouse genome appears to be made up of non-repetitive DNA (76% of total DNA), middle repetitive DNA (average repetition frequency 2 × 10^4 copies, 15% of total DNA), highly repetitive DNA (8% of total DNA) and fold-back DNA (renatured density 1.701 g/ml, 1% of total DNA). Non-repetitive sequences are intercalated with short middle repetitive sequences. One third of the non-repetitive sequences are longer than 4500 nucleotides, another third are between 1800 and 4500 nucleotides long, and the remainder are shorter than 1800 nucleotides. Middle repetitive sequences are transcribed in vivo. The majority of the transcribed repeated sequences appear not to be linked to the bulk of non-repeated sequences at a DNA size of 1800 nucleotides. The organization of the mouse genome analyzed by Ag+-Cs2SO4 density gradient centrifugation of reannealed DNA appears to be substantially different from that previously observed in the human genome using the same technique.

14.
Wolbachia are a group of intracellular inherited bacteria that infect a wide range of arthropods. They are associated with a number of different reproductive phenotypes in their hosts, such as cytoplasmic incompatibility, parthenogenesis and feminization. While it is known that the bacterial strains responsible for these different host phenotypes form a single clade within the alpha-Proteobacteria, until now it has not been possible to resolve the evolutionary relationships between different Wolbachia strains. To address this issue we have cloned and sequenced a gene encoding a surface protein of Wolbachia (wsp) from a representative sample of 28 Wolbachia strains. The sequences from this gene were highly variable and could be used to resolve the phylogenetic relationships of different Wolbachia strains. Based on the sequence of the wsp gene from different Wolbachia isolates we propose that the Wolbachia pipientis clade be initially divided into 12 groups. As more sequence information becomes available we expect the number of such groups to increase. In addition, we present a method of Wolbachia classification based on the use of group-specific wsp polymerase chain reaction (PCR) primers which will allow Wolbachia isolates to be typed without the need to clone and sequence individual Wolbachia genes. This system should facilitate future studies investigating the distribution and biology of Wolbachia strains from large samples of different host species.

15.
Using a maximum-likelihood formalism, we have developed a method with which to reconstruct the sequences of ancestral proteins. Our approach allows the calculation of not only the most probable ancestral sequence but also of the probability of any amino acid at any given node in the evolutionary tree. Because we consider evolution on the amino acid level, we are better able to include effects of evolutionary pressure and take advantage of structural information about the protein through the use of mutation matrices that depend on secondary structure and surface accessibility. The computational complexity of this method scales linearly with the number of homologous proteins used to reconstruct the ancestral sequence.
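
To make "the probability of any amino acid at any given node" concrete, here is a deliberately simplified sketch for one alignment column. It assumes a star-shaped tree (one ancestor, all descendants as direct leaves), a uniform prior, and a toy substitution model in which a residue is retained with probability `p_same` and otherwise changes to any other residue with equal probability. None of this is the authors' model, which uses structure-dependent mutation matrices on a full evolutionary tree; the function name and parameters are hypothetical.

```python
def ancestral_posterior(leaf_states, alphabet, p_same):
    """Posterior probability of each ancestral state for one alignment column.

    Toy assumptions: star tree, uniform prior over the alphabet, and an
    equal-rates model where a leaf keeps the ancestral state with probability
    p_same and otherwise shows any other state with equal probability.
    """
    k = len(alphabet)
    p_diff = (1.0 - p_same) / (k - 1)
    joint = {}
    for ancestor in alphabet:
        likelihood = 1.0 / k                      # uniform prior
        for leaf in leaf_states:                  # one factor per descendant
            likelihood *= p_same if leaf == ancestor else p_diff
        joint[ancestor] = likelihood
    total = sum(joint.values())
    return {state: value / total for state, value in joint.items()}


# One column observed in three descendants, over a reduced 4-letter alphabet.
posterior = ancestral_posterior("AAG", alphabet="ACGT", p_same=0.7)
most_probable = max(posterior, key=posterior.get)
print(most_probable, round(posterior[most_probable], 3))
```

The inner loop multiplies one factor per descendant sequence, so the cost per column grows linearly with the number of sequences, in line with the scaling the abstract reports for the full method.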

16.
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database.

17.
In the first report in this series we presented dendrograms based on 152 individual proteins of the EF-hand family. In the second we used sequences from 228 proteins, containing 835 domains, and showed that eight of the 29 subfamilies are congruent and that the EF-hand domains of the remaining 21 subfamilies have diverse evolutionary histories. In this study we have computed dendrograms within and among the EF-hand subfamilies using the encoding DNA sequences. In most instances the dendrograms based on protein and on DNA sequences are very similar. Significant differences between protein and DNA trees for calmodulin remain unexplained. In our fourth report we evaluate the sequences and the distribution of introns within the EF-hand family and conclude that exon shuffling did not play a significant role in its evolution.

18.
In a similar manner to sequence database searching, it is also possible to compare three-dimensional protein structures. Such methods can be extremely useful because a structural similarity may represent a distant evolutionary relationship that is undetectable by sequence analysis. In this review, we summarise the most popular structure comparison methods, show how they can be used for database searching, and then describe some of the most advanced attempts to develop comprehensive protein structure classifications. With such data, it is possible to identify distant evolutionary relationships, provide libraries of unique folds for structure prediction, estimate the total number of folds that exist, and investigate the preference for certain types of structures over others.

19.
We compare the sequences for the mitochondrial cytochrome oxidase II gene of 13 species of the Drosophila obscura group. The survey includes six members of the D. affinis subgroup, four of the D. pseudoobscura subgroup, and three of the D. obscura subgroup. In all species, the gene is 688 nucleotides in length, encoding a protein of 229 amino acids plus the first position T of the stop codon. The sequences show the typical high-transition bias for closely related species, but that bias is essentially eliminated for species pairs of > 5% sequence divergence. The phylogenetic relationships in the species group are inferred using both neighbor-joining and maximum parsimony. The two procedures give comparable results, showing that the D. affinis and D. pseudoobscura subgroups are monophyletic groupings that appear to have closer affinities to one another than either has to the D. obscura subgroup. We use transversion distances to estimate times of divergence, on the basis of three different estimates of the time of separation of the D. obscura species group from the D. melanogaster group. If that event occurred 35 Mya, then we can estimate the origin of the nearctic forms at approximately 22 Mya and the separation of the D. affinis and D. pseudoobscura subgroups at approximately 17 Mya.
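
Two quantities in this abstract are easy to illustrate. The gene length follows from 229 codons plus one extra base (229 × 3 + 1 = 688), and the transition bias and transversion distances both rest on classifying each difference between two aligned sequences. The sketch below shows that classification for a pair of short made-up sequences. The function name and the example strings are hypothetical, and the distance is a plain uncorrected proportion, which may differ from the estimator the authors used.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned DNA sequences.

    Transitions stay within a chemical class (A<->G or C<->T); transversions
    cross between purines and pyrimidines. Gaps and ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Gene length check from the abstract: 229 codons plus the first stop-codon base.
gene_length = 229 * 3 + 1

# Made-up aligned fragments, for illustration only.
ts, tv = count_substitutions("AACCGGTT", "GACTGTTA")
transversion_distance = tv / 8  # uncorrected proportion of transversion sites
print(gene_length, ts, tv, transversion_distance)
```

Transversions accumulate more slowly than transitions and so saturate later, which is consistent with the abstract's choice of transversion distances for dating the deeper divergences.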

20.
Bacterial cDNA expression libraries are made to reproduce protein sequences present in the mRNA source tissue. However, there is no control over which frame of the cDNA is translated, because translation of the cDNA must be initiated on vector sequence. In a library of nondirectionally cloned cDNAs, only some 8% of the protein sequences produced are expected to be correct. Directional cloning can increase this by a factor of two, but it does not solve the frame problem. We have therefore developed and tested a library construction methodology using a novel vector, pKE-1, with which translation in the correct reading frame confers kanamycin resistance on the host. Following kanamycin selection, the cDNA libraries contained 60-80% open, in-frame clones. These, compared with unselected libraries, showed a 10-fold increase in the number of matches between the cDNA-encoded proteins made by the bacteria and database protein sequences. cDNA sequencing programs will benefit from the enrichment for correct coding sequences, and screening methods requiring protein expression will benefit from the enrichment for authentic translation products.
