LION Bioscience AG, 69123 Heidelberg, Germany
Address all correspondence and requests for reprints to: Manfred Koegl, Phenex Pharmaceuticals AG, Im Neuenheimer Feld 515, 69120 Heidelberg, Germany. E-mail: manfred.koegl{at}phenex-pharma.com.
| ABSTRACT |
|---|
In total, 4360 abstracts were retrieved containing data on protein interactions for nuclear receptors. The resulting database contains all reported protein interactions involving nuclear receptors from 1966 to September 2001. Remarkably, the annual increase in number of reported interactors for nuclear receptors has been following an exponential growth curve in the years 1991 to 2001.
Apparent in the data set is the high complexity of protein interactions for nuclear receptors. The number of interactions correlates with the number of published papers for a given receptor, suggesting that the number of reported interactors is a reflection of the intensity of research dedicated to a given receptor. Indeed, comparison of the retrieved data to a systematic yeast two-hybrid-based interaction analysis suggests that most NRs are similar with respect to the number of interacting proteins. The data set obtained serves as a source for information on NR interactions, as well as a reference data set for the improvement of advanced text-mining methods.
| INTRODUCTION |
|---|
The responsiveness of NRs to small molecule ligands makes them excellent drug targets, as exemplified by the many successful drugs that target nuclear receptors (4). Interestingly, different ligands for the same NR can have diverse biological effects. For example, the natural ligand estradiol activates the estrogen receptor (NR3A/ER) in the breast epithelium as well as in bone, whereas the synthetic ligand raloxifene represses ER function in the breast, but has agonistic effects in bone (5, 6, 7). The binding of the different ligands seems to induce different conformations of the receptor, which then result in an altered preference of the receptor for the available cofactors (8, 9, 10, 11, 12). These ligand-dependent cofactor preferences are part of a possible explanation for differences in the biological effects of different ligands on the same NR (13). Therefore, protein-protein interactions involving NRs and cofactors are of particular interest in drug discovery.
Their outstanding biological and pharmaceutical importance has made NRs, and particularly steroid hormone receptors, a very well researched group of proteins. Although there are a number of excellent NR-specific databases (14, 15, 16), there is no publicly available resource dealing with the cofactor specificity of each NR. Primary scientific literature is the best source of information in this case. A widely used literature resource is MEDLINE (>11 million abstracts accessible via PubMed, at http://www.ncbi.nlm.nih.gov/Entrez), which represents a vast corpus of medical and molecular biology literature available electronically. However, the sheer size of the data poses problems. For example, a simple MEDLINE query for the term "estrogen receptor" yields more than 12,000 citations, and queries for other NRs likewise return several thousand abstracts. The returned abstracts then have to be filtered to retrieve the papers covering only certain aspects of NR biology. At present, however, there is no easy way to extract all published cofactors for a given NR, or to retrieve only the abstracts concerning NR-protein interactions. This shortcoming is reflected in recent interest in the development of text-mining technologies (17, 18, 19, 20, 21). Successful applications of text mining for protein interactions have been reported (22, 23).
In an attempt to overcome the above-mentioned difficulties, we applied automated text-mining methods for the retrieval of all abstracts reporting NR-cofactor interactions. The automatically retrieved data were quality controlled and completed by biologists. In the course of the project, a dictionary was created containing NRs, cofactors, and other NR-binding proteins (for simplicity, all proteins interacting with NRs, including cofactors, are referred to as "NRBPs" in this paper) and their synonyms, as well as expressions describing protein-protein interactions. The project yielded a database resource containing all NR-NRBP interactions for the protein names contained in the dictionary that were published in MEDLINE abstracts between 1966 and September 2001.
This paper describes our general approach, as well as the extraction and curation processes, and discusses the resulting protein interaction data.
| RESULTS |
|---|
NR Dictionary
The trioccurrence approach requires the compilation of a list of terms to be searched for in the sentences. Rather than a flat list of terms, we built a hierarchically structured dictionary so that the entities could be described as precisely as possible. We focused exclusively on the three mammals most studied with respect to NRs, i.e. human, mouse, and rat. If the species is not specified in the selected sentence but the context makes clear that it is a mammalian species, the generic term "mammal" is applied.
At the end of the project, the dictionary contained 563 terms for 49 NR orthologs (plus 11,928 synonyms) and 570 NRBPs (plus 4,415 synonyms). For details on the dictionary, see Materials and Methods.
Extraction Process
The extraction process was run on all MEDLINE abstracts available at the beginning of the project, i.e. those found in the literature database from 1966 until September 10, 2001. The abstracts and sentences of interest were retrieved by applying the following sequence of operations (see Materials and Methods):
A flow chart describing the complete process is presented in Fig. 1.
Curation Process
The curation by domain experts is essential to select relevant interactions because the mere presence of three entities of interest in the same sentence does not guarantee that these entities describe a protein-protein interaction. The process leads to the production of a high-precision data set of protein-protein interactions in the NR domain that also represents a training set for further text-mining investigations. The curation additionally permits the evaluation of the automatic extraction technique.
Two biologists curated the whole set of automatically extracted abstracts independently of each other with the help of a customized graphical user interface optimized for their needs. A third biologist resolved the conflicts between the two curators and handled improvements and modifications of the dictionary in collaboration with them.
Curators also added novel interactions that had not been found by the automatic process but were present within the selected abstracts. This was, in some cases, due to the lack of a protein name in the dictionary, which was added at that time. In other cases, the three terms necessary to express a protein-protein interaction were found in separate sentences, which explains why the automatic extraction program could not pick up these interactions. To get a consistent data set, curators also removed interactions referring to nonmammal proteins. Thus, there are no data on insect NRs, for example, in our curated data set.
Primary Results of the Extraction
A total of 4,360 abstracts were retrieved and processed, containing 15,608 automatically extracted trioccurrences, which represents an average of 3.6 trioccurrences per abstract. After curation, 3308 trioccurrences were classified as showing a positive interaction (A binds to B) and 143 as denying a relation (A does not bind to B), corresponding to an overall precision of 22%. The curators furthermore added 3556 trioccurrences expressing a positive interaction and 163 expressing a negative interaction. The complete data set of validated interactions obtained after curation, i.e. 7170 interactions, is available as supplemental data, which are published on The Endocrine Society's Journals Online web site at http://mend.endojournals.org. Interactions that are denied in a paper are included as a supplemental table in this file.
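The figures in this paragraph can be cross-checked with a few lines of arithmetic. The following Python sketch uses only the counts reported above; the variable names are ours and purely illustrative.

```python
# Counts reported in the text (variable names are illustrative).
abstracts = 4360
extracted = 15608        # automatically extracted trioccurrences
auto_positive = 3308     # curated as "A binds to B"
auto_negative = 143      # curated as "A does not bind to B"
added_positive = 3556    # positive interactions added by the curators
added_negative = 163     # negative interactions added by the curators

# Average number of trioccurrences per abstract.
per_abstract = extracted / abstracts

# Precision of the automatic step: validated extractions / all extractions.
precision = (auto_positive + auto_negative) / extracted

# Size of the final curated data set.
validated_total = auto_positive + auto_negative + added_positive + added_negative

print(f"{per_abstract:.1f} trioccurrences per abstract")
print(f"precision: {precision:.0%}")
print(f"validated interactions: {validated_total}")
```

Note that the precision counts both confirmed and denied interactions as correct extractions, because in both cases the sentence makes a statement about binding between the two tagged proteins.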
The most frequently used terms to describe an interaction were "dimerize" and "interact." Together, they accounted for 57% of all trioccurrences. The most reliable term for the automated retrieval of interactions was "dimerize," whereas "link," "couple," and "affinity" led to a low percentage (<5%) of good extractions.
NR-NR and NRBP-NRBP Interactions
Figure 2 and supplemental table entitled "NR-NR-interactions" (see supplemental data) show the interactions of NRs with other NRs extracted from the abstracts. We have ordered NRs according to the official NR nomenclature in all Pivot tables (e.g. in Fig. 2). Because the official NR nomenclature reflects the phylogenetic relationship of NRs, these tables represent the protein interactions in an order based on phylogeny. For example, the tendency of the NR3 and NR1A families to form homodimers is clearly visible in the diagonal of the grid, as well as the heterodimerization of members of the NR1 family with NR2B/retinoid X receptor. Smaller groups of interaction are the heterodimers involving the NR1A/thyroid hormone receptor (TR), NR3A/ER, and NR2F/chicken ovalbumin upstream promoter transcription factor families. A similar plot is available for the interested reader in the online supplemental data for interactions within NRBPs. We believe that the digest of the literature presented here and below should be of interest as a reference especially for novices in this well-researched area of biology, complementing the excellent reviews that are available (e.g. Ref. 2).
, NR2A2/hepatocyte nuclear factor (HNF)4, and NR2F6/EAR2 we did not find any reported interactions in the literature. The number of NR families published to interact with a given NRBP is plotted in Fig. 3B.
, NR1I1/vitamin D receptor, NR1A/TRs, NR1C/PPARs, NR1B/RARs], or have a high constitutive NRBP-binding activity (NR5A2/liver receptor homolog, NR2A/HNF, NR3B/ERRs, NR1F/RORs). Details and results of the screens are summarized in Table 1.
| DISCUSSION |
|---|
Limits of the Text-Mining Method
The extraction method applied ensures a high recall, at the expense of the precision, which reached a level of 22% before curation. Although promising, for many applications of the data a precision of 22% is too low. More sophisticated methods are presently becoming available that should allow the precision to improve. However, to test such methods, a reference set of data is needed to determine the precision and recall of the method in order to allow assessment of potential improvements in the procedure. With the generation of the quality-controlled set of data presented here, we provide such a reference, which is, to our knowledge, the first such resource.
The presence of a high number of trioccurrences picked up that do not denote protein-protein interactions in the sentence is mainly due to the fact that the applied method does not analyze the sentence structure, but rather extracts all the possible triplet combinations. One of the frequently found problems is the coordination problem, exemplified by the following sentence: "A binds to B and C." The extraction program will provide the curators with two correct trioccurrences, "A, B, bind" and "A, C, bind", but also with "B, C, bind," which is obviously not correct. Another problem is the presence of more than three entities from the dictionary in the same sentence, but spread over several phrases, leading to the extraction of a high number of false trioccurrences.
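The coordination problem can be reproduced in a few lines. The sketch below is not the original extraction program; it is a minimal Python illustration, with a hypothetical function name, of what happens when every pair of tagged proteins is combined with every interaction term without regard to sentence structure.

```python
from itertools import combinations

def trioccurrences(proteins, interaction_terms):
    """Return every (protein, protein, term) triplet, ignoring sentence structure."""
    return [(p1, p2, term)
            for p1, p2 in combinations(proteins, 2)
            for term in interaction_terms]

# Entities tagged in the sentence "A binds to B and C."
triplets = trioccurrences(["A", "B", "C"], ["bind"])

# The sentence supports A-B and A-C only; B-C is a false positive
# that a curator has to reject.
supported = {("A", "B"), ("A", "C")}
false_positives = [t for t in triplets if (t[0], t[1]) not in supported]

print(triplets)
print(false_positives)
```

With n tagged proteins and k interaction terms in one sentence, this scheme yields n(n-1)/2 × k candidates, which is why sentences containing many dictionary entries contribute disproportionately to the number of false trioccurrences.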
Technologies able to extract information from text more sophisticated than the one presented in this paper are becoming available at present. Information Extraction is one of them (18, 19). Information Extraction takes advantage of Natural Language Processing techniques to produce a structured representation of pieces of free text. The input text is syntactically and semantically analyzed to locate the entities of interest. This approach is expected to result in a higher precision, but most likely at the expense of the recall (Kirsch, H., and S. Albert, unpublished data). In another scenario, automatic clustering of documents allows the user to have a good overview on a large collection of documents and to make use of content words and content similarities between documents. In any case, the imprecise use of the scientific language in abstracts will put an implicit limit to even the most advanced methods of text mining. Reference to gene/protein families instead of precise identification of genes/proteins including variants, lack of species information, and unresolved or unused nomenclatures will not permit the extraction of precise data from abstracts.
The Complexity of Protein Interaction Networks
From our analysis the complexity of protein interactions of NRs is evident, and it appears that the number of reported interactions is likely to increase. This is, to a great extent, a reflection of the coming of age of high-throughput methods to detect protein-protein interactions, mostly the Y2H system, first published by Fields and Song in 1989 (27), but also recent improvements in mass spectrometry-based methods (28, 29). Thus, this surge in reported protein interaction data driven by proteomic methods parallels the increase in DNA sequence data generated by advanced DNA-sequencing technology. In contrast to DNA sequence information, however, systematic and comprehensive databases for protein interactions are only beginning to emerge (30, 31).
The Need for Systematic Databases
At present, automated methods can only deliver data with a limited precision. Some reliability can be gained for well researched interactions, e.g. by scoring the number of publications on a given interaction. For example, the automatic extraction of the interaction of NR1A/TR with NCoR is plausible, because it has been found in 60 different statements. Even though truth may not depend on the number of times a fact is stated, scientific consensus usually implies reliability, exceptions notwithstanding. For less well researched data that are not mentioned several times in the scientific literature, methods such as the one presented here can merely guide a scientist to the appropriate literature. We believe that this is of use, especially in combination with meaningful clustering methods of abstracts. However, to arrive at the creation of complete and reliable databases, e.g. on protein interactions, we believe that data will have to be entered manually. In a preferred setting, newly discovered data on protein-protein interactions will have to be deposited at a central resource at the time they are discovered and published, as has become good practice for newly discovered nucleotide sequences.
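Scoring an interaction by the number of statements that support it amounts to a simple tally. The following Python sketch is one possible way to do this, under the assumption that each extracted statement is available as an unordered protein pair; the function name is hypothetical, and the pair "X"/"Y" is a placeholder rather than real data.

```python
from collections import Counter

def support_counts(statements):
    """Count extracted statements per unordered protein pair."""
    return Counter(frozenset(pair) for pair in statements)

# 60 statements for NR1A/TR with NCoR (the count given in the text),
# plus a single statement for a placeholder pair.
statements = [("NR1A/TR", "NCoR")] * 59 + [("NCoR", "NR1A/TR")] + [("X", "Y")]
counts = support_counts(statements)

# Pairs mentioned only once would merely point a reader to the literature.
well_supported = [set(pair) for pair, n in counts.items() if n >= 2]
print(counts[frozenset(("NR1A/TR", "NCoR"))], well_supported)
```

Using `frozenset` makes "A binds B" and "B binds A" count toward the same pair; the threshold of two statements is arbitrary and chosen only for the example.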
| MATERIALS AND METHODS |
|---|
as bait, we have used a yeast two-hybrid screen to clone a novel protein, termed PGC-2, containing a partial SCAN domain." Each term denoting a protein is assigned to one species. Unfortunately, the precise denotation of proteins is often not possible from MEDLINE abstracts. For example, the description of interactions involving "estrogen receptor" (ER) is ambiguous because "estrogen receptor" is a generic term grouping several entities: there are two genes that could be referred to, ERα and ERβ. In addition, there are ERβ1, ERβ2, ... and ERβ5, which are splice variants produced from the ERβ locus. Because, in general, authors of scientific abstracts do not assign a high resolution to the terms they are using, we decided to link each biological entity in the dictionary to one of the three following classes (Fig. 6 [...] and ERβ), or gene variant level (ERβ1 to ERβ5), to be able to extract the correct information from the abstracts and to interpret and discuss the results in a suitable way. In addition, the following relations between the above cited entities were considered: 1) "kind-of" or "is-a" relation: one entity is an example of another entity ("human androgen receptor" is an "androgen receptor," "androgen receptor" is a "nuclear receptor"); 2) "part-of" relation: one entity is a part of another entity ("human androgen receptor" is a part of "human," "NFκB subunit p50" is a part of "NFκB"). This second relation is used to relate proteins to species in which they exist and to complexes.
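The two relations can be pictured as links between dictionary entries. The sketch below is a minimal Python illustration, not the data model actually used in the project; the `Entry` class, its field names, and the `ancestors` helper are assumptions made for this example, and only the androgen receptor entries quoted above are included.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    name: str
    is_a: list = field(default_factory=list)      # "kind-of" parents
    part_of: list = field(default_factory=list)   # species or complexes

dictionary = {
    e.name: e for e in [
        Entry("nuclear receptor"),
        Entry("androgen receptor", is_a=["nuclear receptor"]),
        Entry("human androgen receptor",
              is_a=["androgen receptor"], part_of=["human"]),
    ]
}

def ancestors(name):
    """Follow is-a links upward and return all more generic terms."""
    found = []
    stack = list(dictionary[name].is_a)
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(dictionary[parent].is_a)
    return found

print(ancestors("human androgen receptor"))
print(dictionary["human androgen receptor"].part_of)
```

Walking the is-a links lets an interaction reported for a specific entity be attributed to its more generic term as well, whereas the part-of links record the species or complex the entity belongs to.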
Extraction of the Trioccurrences
MEDLINE abstracts were downloaded according to the rules described at http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html and indexed under SRS (Sequence Retrieval System, http://srs.ebi.ac.uk).
During the trioccurrence extraction, all the NRs and NRBPs (= protein 1) were searched for in the sentences in combination with all the proteins entered in the dictionary (= protein 2: NRs, NRBPs, NR/NR complexes, NR/NRBP complexes, NRBP/NRBP complexes), and the 13 families of verbs/nouns and method names described above were used to pinpoint an interaction.
Some word-derivation rules necessary for the extraction were included in the corpus extraction program (see Fig. 1). They are of three types and allow the following:
In the last case, the terms without "s" as the last character were also matched if they were immediately followed by an "s" and only then by anything but a-z. No context was applied to verbs because they did not generate immediately obvious numbers of false positives.
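The rule for a trailing "s" translates naturally into a regular expression. The following Python sketch is our reading of that rule, not the original program: the function name is hypothetical, and the left-hand boundary check and the case-sensitive matching are assumptions that the text does not specify.

```python
import re

def term_pattern(term):
    """Match `term`, optionally followed by one "s", but not by another letter a-z."""
    # Terms already ending in "s" get no optional plural suffix.
    plural = "" if term.endswith("s") else "s?"
    return re.compile(
        r"(?<![A-Za-z])"      # assumption: term must not start mid-word
        + re.escape(term)
        + plural
        + r"(?![a-z])"        # "only then by anything but a-z"
    )

pattern = term_pattern("estrogen receptor")

plural_hit = pattern.search("Both estrogen receptors bind the coactivator.")
hyphen_hit = pattern.search("an estrogen receptor-associated protein")
no_hit = pattern.search("an estrogen receptorlike sequence")
print(plural_hit.group(), hyphen_hit.group(), no_hit)
```

The negative lookahead rejects longer words that merely begin with a dictionary term while still accepting the term before punctuation, hyphens, or digits.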
Throughout the project, the extraction process was continuously refined and improved. The first important improvement concerning the trioccurrence extraction is the application of "stop lists" (see Fig. 1). These lists are used as an exclusion criterion for the terms they contain, as described below.
Stop List I.
The trioccurrence extraction method has the side effect that search terms can be found in an unsuitable context: for example, the term "binding" found in a string like "fatty-acid binding" or "binding to DNA" is of no interest for protein-protein interactions. The terms "X frequency" or "anti-X" or "binding of X to DNA," where X can be any of the dictionary's proteins, are also of no interest. To accelerate the curation process and improve its efficiency, a stop list containing nonuseful strings was built. At the end, this list was composed of approximately 1000 strings. If a word from the dictionary is included in one of these strings in the text, it is not involved in any trioccurrence.
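One simple way to realize such a stop list is to blank out the stop strings before dictionary terms are searched for. The Python sketch below illustrates this idea under that assumption; it is not the original implementation, the function name is hypothetical, the list holds only the two example strings quoted above, and the test sentence is invented.

```python
import re

STOP_LIST_I = ["fatty-acid binding", "binding to DNA"]  # two of the ~1000 strings

def mask_stop_strings(sentence, stop_strings=STOP_LIST_I):
    """Blank out stop-list strings so dictionary terms inside them cannot be tagged."""
    # Longest strings first, so that overlapping entries are masked completely.
    for s in sorted(stop_strings, key=len, reverse=True):
        sentence = re.sub(re.escape(s),
                          lambda m: " " * len(m.group()),
                          sentence,
                          flags=re.IGNORECASE)
    return sentence

original = "Binding to DNA was unaffected, but binding of SRC-1 increased."
masked = mask_stop_strings(original)
print(repr(masked))
```

Replacing each stop string with spaces of the same length keeps character offsets unchanged, so positions found in the masked sentence still refer to the original text.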
Stop List II.
When considering protein names, it is necessary to handle acronyms (ER for "estrogen receptor" or VDR for "vitamin D receptor"). An acronym is generally short and can therefore have several meanings: ER is an acronym for "estrogen receptor" but also for "endoplasmic reticulum." Thus, acronyms in connection with unwanted meanings were entered into stop list II. This stop list also contains term pairs, including one term from the dictionary, which, according to our experience, increased the ratio of wrong interactions (e.g. androgen receptor and carcinoma, estrogen receptor and metastasis, progesterone receptor and cancer). The program did not consider trioccurrences when acronyms of interest were accompanied by unwanted meanings or when terms from the dictionary were accompanied by unwanted terms at the sentence level.
Stop List III.
This stop list contains protein pairs that were, according to our experience, never found to be involved in an interaction. For example, the pairs "estrogen receptor-androgen receptor" or "glucocorticoid receptor-progesterone receptor" are used to describe coexpression results, gene regulation, ligand binding to NRs, association with diseases, etc. but never a protein-protein interaction. Trioccurrences containing one of these couples and a term expressing an interaction were not considered and hence reduced the curation effort, at the expense of potentially missing some of these NR-NR interactions.
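Stop lists II and III both act as filters on already extracted trioccurrences. The following Python sketch shows one way such filters could be written; the data structures and the function name `keep` are assumptions for illustration, and the lists contain only the examples named in the text.

```python
# Stop list II: dictionary term -> unwanted terms in the same sentence.
STOP_LIST_II = {
    "ER": {"endoplasmic reticulum"},
    "androgen receptor": {"carcinoma"},
    "estrogen receptor": {"metastasis"},
    "progesterone receptor": {"cancer"},
}

# Stop list III: unordered protein pairs never found to interact.
STOP_LIST_III = {
    frozenset({"estrogen receptor", "androgen receptor"}),
    frozenset({"glucocorticoid receptor", "progesterone receptor"}),
}

def keep(trioccurrence, sentence):
    """Return False if stop list II or III excludes this trioccurrence."""
    protein1, protein2, _term = trioccurrence
    if frozenset({protein1, protein2}) in STOP_LIST_III:
        return False
    text = sentence.lower()
    for protein in (protein1, protein2):
        if any(unwanted in text for unwanted in STOP_LIST_II.get(protein, ())):
            return False
    return True

print(keep(("ER", "SRC-1", "bind"), "ER binds SRC-1."))
print(keep(("ER", "SRC-1", "bind"), "ER and SRC-1 bind in the endoplasmic reticulum."))
print(keep(("estrogen receptor", "androgen receptor", "interact"), "Any sentence."))
```

As noted in the text, the pair filter trades recall for curation effort: a genuine interaction between two listed proteins would be discarded along with the uninformative co-mentions.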
Table 2 shows the reduction in the number of trioccurrences and number of abstracts to be curated after applying each of the stop lists and a combination of the three lists. A limited sample of the extracted corpus was analyzed, out of which approximately one third of the trioccurrences and abstracts were eliminated from the curation process. This reduction is directly linked to the increase of the precision of the automated extraction.
We defined the complete protein list of interest and queried MEDLINE for abstracts that contain at least one of the protein names. The resulting corpus of abstracts was then analyzed as follows. Abstracts were split into sentences, and in these sentences we marked, with an appropriate tag, words found in a list of terms (term list in Fig. 1) derived from the dictionary. The term list was obtained by applying a series of word-derivation rules as described above to three lists coming from the dictionary: protein 1 list, protein 2 list, and interaction terms. The final term list was obtained, after the application of the derivation rules, by the exclusion of all the strings contained in Stop List I. The text resulting from the tagging step contained only sentences with at least one trioccurrence where the dictionary terms were clearly marked. With LION's FSA (Finite State Automata) technology it was possible to filter 2.5 million selected abstracts for terms of the dictionary within a few hours. We eliminated the unwanted trioccurrences by applying Stop Lists II and III and finally extracted and dumped the ones to be curated into a relational database.
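The splitting and tagging steps can be sketched with ordinary regular expressions. The Python code below is a stand-in for illustration only: the original work used LION's FSA technology, whereas this sketch uses the standard `re` module, a naive sentence splitter, a three-entry term list, and tag names (`P1`, `P2`, `INT`) that are our own invention.

```python
import re

TERMS = {  # tiny stand-in for the term list; tag names are illustrative
    "thyroid hormone receptor": "P1",
    "NCoR": "P2",
    "binds": "INT",
}

def split_sentences(abstract):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]

def tag(sentence):
    """Wrap each term in <TAG>...</TAG> and report which tags were seen."""
    seen = set()
    for term, label in TERMS.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(sentence):
            seen.add(label)
            sentence = pattern.sub(
                lambda m, l=label: f"<{l}>{m.group()}</{l}>", sentence)
    return sentence, seen

def tagged_sentences(abstract):
    """Keep only sentences containing protein 1, protein 2, and an interaction term."""
    out = []
    for s in split_sentences(abstract):
        tagged, seen = tag(s)
        if {"P1", "P2", "INT"} <= seen:
            out.append(tagged)
    return out

abstract = "The thyroid hormone receptor binds NCoR. Expression of NCoR was measured."
result = tagged_sentences(abstract)
print(result)
```

A production splitter would need to handle abbreviations such as "e.g." and "Fig.", which this one would wrongly treat as sentence ends.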
Curation Process
The Curation Interface.
The curation interface allowed curators to select and curate the computer-extracted abstracts and trioccurrences. In the curation process, a trioccurrence is selected, and the respective sentence is checked, as well as the abstract, when necessary. The curators check and correct the accuracy of the protein names in the dictionary, to which they have access at any time. The curators can choose between different relation states for each trioccurrence. "Shows the relation" and "shows the negative relation" are chosen for a reported or denied interaction, respectively. The state "no relation shown" is used when there is no relation linking the three entities in the sentence. "Shows the relation, but not interesting" is used when an interaction involving proteins belonging to species other than mammal is stated. "No relation shown but true with respect to text mining" is chosen when the interaction between the entities is hypothetical (e.g. "our hypothesis was then A binds to B," "we investigated whether A binds to B," ... ) or when the interaction is not physical (functional interaction, synergistic interaction, interaction between signaling systems, interaction between genes, ... ). Note that the latter two states are not of interest for building the nuclear receptors database; they serve to determine when the text-mining method applied has worked correctly, in a technical sense, in selecting these trioccurrences for curation. The state "maybe" could be chosen by the curators, but it is changed to another state during the conflict resolution. "Duplication" is used to discard the trioccurrences added twice by mistake. The curators have the ability to create new trioccurrences in case the automatic program missed some (because proteins or interaction terms were not in the dictionary at the time of extraction) or, more frequently, when more than one sentence is needed for expressing an interaction.
The string "implied interaction" can be used as an interaction term for manually added trioccurrences when a verb/noun is not precisely stated but the interaction between two proteins is obvious ("More recently, our lab has identified ARA267, a SET domain containing protein, and supervillin, an F-actin binding protein, as AR coregulators."). The Results chapter takes into consideration the fact that some of the available interactions could not be found by the automatic method but were added manually.
Resolution of Curation Conflicts.
The results provided by two independent curators were merged, and the conflicting trioccurrences appeared as "unchecked" in the curation interface. A third biologist curated them, taking into consideration the relation states that the curators were choosing when the decision was uncertain.
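The merge step can be expressed as a comparison of two mappings from trioccurrence to relation state. The Python sketch below is an illustration under stated assumptions rather than the actual curation software: the function name and integer identifiers are hypothetical, and treating "maybe" and entries present for only one curator as "unchecked" is our interpretation.

```python
def merge_curations(curator_a, curator_b):
    """Merge two {trioccurrence_id: state} mappings; disagreements become "unchecked"."""
    merged = {}
    for key in curator_a.keys() | curator_b.keys():
        a, b = curator_a.get(key), curator_b.get(key)
        if a == b and a != "maybe":
            merged[key] = a
        else:
            # Conflict, uncertainty, or entry seen by only one curator:
            # left for the third biologist to resolve.
            merged[key] = "unchecked"
    return merged

curator_a = {1: "shows the relation", 2: "no relation shown", 3: "maybe"}
curator_b = {1: "shows the relation", 2: "shows the relation", 3: "maybe"}
merged = merge_curations(curator_a, curator_b)
to_resolve = sorted(k for k, state in merged.items() if state == "unchecked")
print(merged, to_resolve)
```

Keeping both curators' original choices alongside the merged state would let the third curator see which states were picked when the decision was uncertain, as described above.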
Y2H Screening
The ligand binding domains of NRs were cloned into pGBT9 and transformed into yeast CG1945 using standard methods. All Y2H libraries used for screening were bought as pretransformed libraries in the yeast strain Y187 from CLONTECH (Palo Alto, CA). Culture and transformation of yeast cells were according to the instructions provided by CLONTECH. For screening, diploid cells containing both the NR and the library clones were generated by mating of yeast cells in Erlenmeyer flasks and selected for clones containing interacting hybrid proteins on selective medium lacking leucine, tryptophan, and histidine, containing 4-methyl umbelliferyl-[...]-D-galactoside (50 µM) and various amounts of 3-aminotriazole (Sigma, St. Louis, MO) in 96-well microtiter plates as described previously (32). Where appropriate, the ligand for the respective NR was added as indicated in the legend of Table 1. Positive cells were identified by measuring fluorescence at 460 nm (excitation at 365 nm) and passaged to new wells twice. Cells were then transferred to agar plates lacking leucine, tryptophan, and histidine using a manual 96-pin replicator and regrown before isolating the library insert via PCR using generic primers as recommended by CLONTECH. PCR products were resolved by agarose gel electrophoresis, and all reactions were collected where a single clear band was apparent. Inserts were sequenced at GATC Biotech AG (Konstanz, Germany) and analyzed by sequence comparison to public databases using the BLAST algorithm (33). Clones that corresponded to untranslated regions or the noncoding strand were discarded. All compounds were from Sigma.
Nomenclature
All NR names refer to the official NR nomenclature (34).
Web Site References
Web site references are as follows: http://www.ncbi.nlm.nih.gov/Entrez, Entrez search and retrieval system homepage; http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html, Entrez programming utilities; http://srs.ebi.ac.uk, SRS entry page at the European Bioinformatics Institute.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Current address for M.A.: Phenex Pharmaceuticals AG, Im Neuenheimer Feld 515, 69120 Heidelberg, Germany.
Current address for B.H.: Cellzome AG, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
Current address for D.R.-S. and H.K.: EMBL-Outstation Hinxton, Hinxton, Cambridge CB10 1SD, United Kingdom.
The retrieved data are available online (published as supplemental data on The Endocrine Society's Journals Online web site at http://mend.endojournals.org).
Abbreviations: ER, Estrogen receptor; ERR, estrogen-related receptor; HNF, hepatocyte nuclear factor; LXR, liver X receptor; MR, mineralocorticoid receptor; NCoR, nuclear receptor corepressor; NR, nuclear receptor; NRBP, NR-binding protein; PPAR, peroxisome proliferator-activated receptor; RAR, retinoic acid receptor; ROR, retinoid-related orphan receptor; SRC-1, steroid receptor coactivator 1; TR, thyroid hormone receptor; TRAP, TR-associated protein; Y2H, yeast two hybrid.
Received for publication December 17, 2002. Accepted for publication April 28, 2003.
| REFERENCES |
|---|
with transcriptional coactivators. J Biol Chem 275:33201-33204
and estrogen receptor-β: correlations with biological character and distinct differences among SRC coactivator family members. Endocrinology 141:3534-3545