Cancer Gene Promoter Related Motif Search
The cancer gene promoter related motif search (CAPRIS) database contains cancer gene classes that are grouped based on their 600bp promoter region (-500 to +100) and motifs extracted from these groups.
The CAPRIS database was constructed from an analysis of 2036 cancer-related gene promoter sequences using machine learning techniques. The sequences were labeled according to their relation to 23 cancer types and then clustered on a self-organizing map (SOM). A total of 168 clusters were formed from neighboring SOM nodes across the 23 cancer types. These cancer gene promoter groups were then analyzed with the MEME motif extraction tool to obtain common sequence motifs.
The 168 extracted clusters and their motifs were stored in a searchable and downloadable relational database. The database can be queried by gene name, ID, cancer type, and nucleotide sequence fragment, the last through an internal BLAST. Users can run an NCBI-BLAST analysis against the extracted motifs as well as against the 2036 cancer-related gene promoter sequences of the selected 600 bp region.
Cancer types were categorized according to the Cancer by Body Location/System listing of the National Cancer Institute (U.S. National Institutes of Health, www.cancer.gov). On this basis, we composed 23 classes of cancer types (Table 2.1). Cancer-specific genes were then extracted from NCBI (http://www.ncbi.nlm.nih.gov/). The Gene Bibliography Database in NCBI was searched to determine whether a gene is related to any of the cancer types listed below. All gene references were scanned using regular expressions. A total of 5474 cancer gene relations were extracted from this database. Since different cancer types share some of the same genes, these relations correspond to 2080 distinct genes.
Table 2.1 23 Cancer Types
| 1. ALL | 2. AML |
| 3. Bone | 4. Brain |
| 5. Breast | 6. CLL |
| 7. CML | 8. Digestive |
| 9. Endocrine | 10. Eye |
| 11. Genitourinary | 12. Germcell |
| 13. Gynecologic | 14. Head and Neck |
| 15. Hodgkins Lymphoma | 16. Leukemia |
| 17. Lung | 18. Lymphoma |
| 19. Musculoskeletal | 20. Neurologic |
| 21. Non-Hodgkins Lymphoma | 22. Respiratory |
| 23. Skin | |
Table 2.2 Subcancer Types
| Cancer Type | Subcancer Types |
| 1. ALL | Acute Lymphoblastic Leukemia |
| 2. AML | Acute Myeloid Leukemia |
| 3. Bone | Ewing's Family of Tumors, Osteosarcoma |
| 4. Brain | Brain Tumor, Brain Stem Glioma, Cerebral Astrocytoma, Malignant Glioma, Ependymoma, Medulloblastoma, Pineoblastoma, Hypothalamic Glioma |
| 5. Breast | Breast Cancer |
| 6. CLL | Chronic Lymphocytic Leukemia |
| 7. CML | Chronic Myelogenous Leukemia |
| 8. Digestive/Gastrointestinal | Anal Cancer, Bile Duct Cancer, Carcinoid Tumor, Colon Cancer, Esophageal Cancer, Gallbladder Cancer, Liver Cancer, Pancreatic Cancer, Rectal Cancer, Stomach Cancer |
| 9. Endocrine | Adrenocortical Carcinoma, Carcinoid Tumor, Islet Cell Carcinoma, Parathyroid Cancer, Pheochromocytoma, Pituitary Tumor, Thyroid Cancer |
| 10. Eye | Melanoma, Retinoblastoma |
| 11. Genitourinary | Bladder Cancer, Kidney Cancer, Penile Cancer, Prostate Cancer, Renal Pelvis & Ureter Cancer, Testicular Cancer, Urethral Cancer |
| 12. Germcell | Germ Cell Tumor, Testicular Cancer |
| 13. Gynecologic | Cervical Cancer, Endometrial Cancer, Gestational Trophoblastic Tumor, Ovarian Epithelial Cancer, Ovarian Germ Cell Tumor, Uterine Sarcoma, Vaginal Cancer, Vulvar Cancer |
| 14. Head and Neck | Hypopharyngeal Cancer, Laryngeal Cancer, Lip & Oral Cavity Cancer, Neck Cancer, Nasopharyngeal Cancer, Oropharyngeal Cancer, Parathyroid Cancer, Salivary Gland Cancer, Paranasal Sinus & Nasal Cavity Cancer |
| 15. Hodgkins Lymphoma | Hodgkin's Lymphoma |
| 16. Leukemia | Acute Lymphoblastic Leukemia, Acute Myeloid Leukemia, Chronic Lymphocytic Leukemia, Hairy Cell Leukemia, Chronic Myelogenous Leukemia |
| 17. Lung | Lung Cancer |
| 18. Lymphoma | T-Cell Lymphoma, Hodgkin's Lymphoma, Mycosis Fungoides, Non-Hodgkin's Lymphoma, Sezary Syndrome, Nervous System Lymphoma, Waldenström's Macroglobulinemia |
| 19. Musculoskeletal | Ewing's Family of Tumors, Osteosarcoma, Rhabdomyosarcoma, Soft Tissue Sarcoma, Uterine Sarcoma |
| 20. Neurologic | Brain Tumor, Brain Stem Glioma, Cerebellar Astrocytoma, Ependymoma, Medulloblastoma, Pineoblastoma, Neuroblastoma, Pituitary Tumor, Visual Pathway & Hypothalamic Glioma |
| 21. Non-Hodgkins Lymphoma | Non-Hodgkin's Lymphoma |
| 22. Respiratory/Thoracic | Lung Cancer, Malignant Mesothelioma, Thymoma & Thymic Carcinoma |
| 23. Skin | Cutaneous T-Cell Lymphoma, Kaposi's Sarcoma, Melanoma, Merkel Cell Carcinoma, Skin Cancer |
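The regular-expression scan of gene references mentioned before the tables can be illustrated with a minimal Python sketch. It is not the original CAPRIS code: the dictionary holds only a small excerpt of Table 2.2, and the function names and the sample reference titles are hypothetical. It assumes that a gene is related to a class whenever any subcancer term of that class occurs in one of its references.

```python
import re

# Illustrative excerpt of Table 2.2 (class -> subcancer terms); not the full table.
SUBTYPES = {
    "Bone": ["Ewing's Family of Tumors", "Osteosarcoma"],
    "Eye": ["Melanoma", "Retinoblastoma"],
    "Skin": ["Melanoma", "Merkel Cell Carcinoma", "Skin Cancer"],
}


def build_patterns(subtypes):
    """Compile one case-insensitive alternation per cancer class."""
    patterns = {}
    for cancer_class, terms in subtypes.items():
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[cancer_class] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def classify_reference(text, patterns):
    """Return the sorted list of classes whose terms occur in the text."""
    return sorted(c for c, pattern in patterns.items() if pattern.search(text))


if __name__ == "__main__":
    patterns = build_patterns(SUBTYPES)
    # Hypothetical reference titles, used only to exercise the matcher.
    print(classify_reference("A gene deleted in osteosarcoma cell lines", patterns))
    print(classify_reference("Expression profiling in uveal melanoma", patterns))
```

Because a term such as Melanoma appears under more than one class, a single reference can produce several gene-cancer relations, which is consistent with the 5474 relations covering only 2080 distinct genes.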
From the NCBI database, only the cancer-specific gene IDs (numbers) and symbol names could be extracted. The next step of the extraction process was therefore to find the RefSeq identifiers of the genes from these symbol names, using the HUGO Gene Symbol List table. The extracted RefSeq IDs were then fed into Promoser, a large-scale mammalian promoter and transcription start site identification service, to obtain the promoter region sequences of these cancer genes. The length of the extracted promoter sequences was set to 600 bp (500 upstream and 100 downstream of the transcription start site, i.e. -500 to +100).
Table 2.3 shows the number of promoter sequences extracted for the genes in our database. All the genes extracted by the process described above can be accessed by gene ID, name, cancer type(s) and promoter region on the Gene Resources page (http://www.i-cancer.org/gensor.htm).
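The 600 bp promoter window (-500 to +100 around the transcription start site) can be sketched as follows. This is only an illustration of the coordinate arithmetic, not a description of how Promoser works internally; the function names are hypothetical, and the sketch assumes a 0-based TSS index with the TSS itself counted as the first downstream base.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def promoter_window(chrom_seq, tss, strand, upstream=500, downstream=100):
    """Extract the window from -upstream to +downstream around a 0-based TSS.

    The result is always read 5'->3' on the gene's strand, so the TSS
    ends up at index `upstream` of the returned string.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher genomic coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window runs off the end of the sequence")
    window = chrom_seq[start:end]
    return window if strand == "+" else reverse_complement(window)


if __name__ == "__main__":
    # Toy chromosome with a single G marking the TSS at index 700.
    chrom = "A" * 700 + "G" + "A" * 799
    plus = promoter_window(chrom, 700, "+")
    print(len(plus), plus[500])
```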
Table 2.3 Number of extracted sequences
| Cancer Type | # of cancer genes from NCBI | # of promoter sequences extracted |
| ALL | 127 | 121 |
| AML | 191 | 187 |
| Bone | 76 | 76 |
| Brain | 263 | 262 |
| Breast | 579 | 571 |
| CLL | 113 | 107 |
| CML | 41 | 41 |
| Digestive | 634 | 619 |
| Endocrine | 77 | 77 |
| Eye | 334 | 332 |
| Genitourinary | 475 | 467 |
| Germcell | 25 | 25 |
| Gynecologic | 131 | 130 |
| Head and Neck | 38 | 38 |
| Hodgkins Lymphoma | 39 | 39 |
| Leukemia | 590 | 576 |
| Lung | 289 | 287 |
| Lymphoma | 270 | 265 |
| Musculoskeletal | 98 | 98 |
| Neurologic | 389 | 387 |
| Non-Hodgkins Lymphoma | 41 | 41 |
| Respiratory | 318 | 314 |
| Skin | 336 | 334 |
| Total | 5474 | 5394 |
| Total genes in CAPRIS | 2080 | 2036 |
Similar promoter region sequences from the same cancer type were clustered using a Self-Organizing Map (SOM) (Figure 3.1). Each SOM node represents a group of genes sharing common promoter sequences that are specific to that cancer type. Finally, a total of 168 clusters were constructed from neighboring SOM nodes for the 23 cancer types.
Figure 3.1
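As a rough illustration of SOM clustering of promoter sequences, the sketch below trains a small map with NumPy. The text does not say how sequences were encoded or which SOM settings were used, so the 3-mer frequency encoding, the 3x3 grid, the learning schedule, and all function names are assumptions made for this example only.

```python
import itertools

import numpy as np


def kmer_vector(seq, k=3):
    """Encode a DNA sequence as normalized k-mer frequencies (assumed encoding)."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k].upper())
        if j is not None:  # skip k-mers containing N or other symbols
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    dist = ((weights - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a rectangular SOM with a Gaussian neighborhood and linear decay."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3
            bmu = np.array(best_matching_unit(weights, x))
            grid_dist2 = ((grid - bmu) ** 2).sum(axis=-1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights


if __name__ == "__main__":
    # Toy sequences, not real promoters.
    seqs = ["ACGT" * 10, "ACGT" * 10, "TTTTGGGG" * 5]
    data = np.array([kmer_vector(s) for s in seqs])
    som = train_som(data)
    for s, x in zip(seqs, data):
        print(s[:8], best_matching_unit(som, x))
```

Sequences that land on the same or neighboring nodes would then be grouped into one cluster, in the spirit of the neighboring-node clusters described above.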
Then we analyzed these cancer gene promoter groups with the MEME motif extraction tool in order to obtain common sequence motifs (Figure 3.2). The cancer gene groups, clustered according to their selected promoter region, and their extracted motifs were stored in a searchable and downloadable MySQL relational database.
Figure 3.2 MEME result of a cancer gene promoter group
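Preparing each promoter cluster for motif extraction might look like the following sketch, which writes one FASTA text per cluster and assembles a MEME command line without running it. The record layout and function names are hypothetical. The `-dna`, `-oc` and `-nmotifs` options are taken from the MEME Suite command-line tool as commonly documented; the settings actually used for CAPRIS are not stated in the text.

```python
from collections import defaultdict


def fasta_by_cluster(records, width=60):
    """Group (cluster_id, gene_symbol, sequence) records into FASTA text per cluster."""
    grouped = defaultdict(list)
    for cluster_id, gene, seq in records:
        lines = [f">{gene}"]
        lines += [seq[i:i + width] for i in range(0, len(seq), width)]
        grouped[cluster_id].append("\n".join(lines))
    return {cid: "\n".join(entries) + "\n" for cid, entries in grouped.items()}


def meme_command(fasta_path, out_dir, n_motifs=3):
    """Build (but do not run) a MEME invocation; n_motifs is an assumed setting."""
    return ["meme", fasta_path, "-dna", "-oc", out_dir, "-nmotifs", str(n_motifs)]


if __name__ == "__main__":
    # Toy records with placeholder gene names and sequences.
    records = [
        ("cluster_1", "GENE_A", "ACGTACGT"),
        ("cluster_1", "GENE_B", "GGCCGGCC"),
        ("cluster_2", "GENE_C", "TTAATTAA"),
    ]
    for cluster_id, text in fasta_by_cluster(records).items():
        print(cluster_id)
        print(text, end="")
    print(" ".join(meme_command("cluster_1.fa", "meme_cluster_1")))
```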
There are two databases on the CAPRIS webpage. Motif Search contains the cancer gene groups, clustered according to each gene's selected promoter region, together with their extracted motifs. Gene Resources contains all the genes finally extracted from the CancerGene Database, each with its gene ID, name, cancer type(s) and promoter region.
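To make the kinds of lookup concrete, here is a small sketch of querying genes by cancer type and cancer types by gene. The real CAPRIS schema is not published in this text, so the table and column names are invented, the rows are placeholder data, and Python's built-in sqlite3 stands in for the MySQL database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, promoter TEXT);
        CREATE TABLE gene_cancer (gene_id INTEGER, cancer_type TEXT, cluster_id TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [(1, "GENE_A", "ACGTACGT"), (2, "GENE_B", "TTGACCAA")],
    )
    conn.executemany(
        "INSERT INTO gene_cancer VALUES (?, ?, ?)",
        [(1, "Breast", "breast_01"), (1, "Lung", "lung_03"), (2, "Breast", "breast_02")],
    )
    return conn


def genes_for_cancer(conn, cancer_type):
    """Return the gene symbols related to one cancer type."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene g "
        "JOIN gene_cancer gc ON gc.gene_id = g.gene_id "
        "WHERE gc.cancer_type = ? ORDER BY g.symbol",
        (cancer_type,),
    )
    return [row[0] for row in rows]


def cancers_for_gene(conn, symbol):
    """Return the cancer types a gene symbol is related to."""
    rows = conn.execute(
        "SELECT gc.cancer_type FROM gene_cancer gc "
        "JOIN gene g ON g.gene_id = gc.gene_id "
        "WHERE g.symbol = ? ORDER BY gc.cancer_type",
        (symbol,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(genes_for_cancer(conn, "Breast"))
    print(cancers_for_gene(conn, "GENE_A"))
```

Keeping gene-to-cancer relations in a separate table reflects the fact that one gene can belong to several cancer types.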
Figure 4.1 Use Case Diagram of CAPRIS
The clusters of each cancer type were then merged again, and the same genes were clustered using k-means in MATLAB. This gave two different clusterings of the same genes, which were finally compared using the Rand index (Table 5.1).
P and Q are two different clusterings of the same genes.
$N_{11}$: the number of point pairs that are in the same cluster under both P and Q
$N_{00}$: the number of point pairs that are in different clusters under both P and Q
$N_{10}$: the number of point pairs that are in the same cluster under P but not under Q
$N_{01}$: the number of point pairs that are in the same cluster under Q but not under P
Rand Index:
$$R(P, Q) = \frac{N_{11} + N_{00}}{N_{11} + N_{00} + N_{10} + N_{01}}$$
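The four pair counts defined above can be computed directly from two label lists. The sketch below uses the standard definition of the Rand index, the fraction of point pairs on which the two clusterings agree, (N11 + N00) / (N11 + N00 + N10 + N01); the function names and the toy labels are illustrative only.

```python
from itertools import combinations


def pair_counts(p, q):
    """Count point pairs as (N11, N00, N10, N01) for clusterings p and q."""
    if len(p) != len(q):
        raise ValueError("both clusterings must label the same points")
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            n11 += 1
        elif not same_p and not same_q:
            n00 += 1
        elif same_p:
            n10 += 1
        else:
            n01 += 1
    return n11, n00, n10, n01


def rand_index(p, q):
    """Fraction of point pairs on which the two clusterings agree."""
    n11, n00, n10, n01 = pair_counts(p, q)
    total = n11 + n00 + n10 + n01
    if total == 0:
        raise ValueError("need at least two points")
    return (n11 + n00) / total


if __name__ == "__main__":
    som_labels = [0, 0, 1, 1]     # toy labels standing in for one clustering
    kmeans_labels = [1, 1, 0, 0]  # same partition with different label names
    print(rand_index(som_labels, kmeans_labels))
```

The index depends only on which points are grouped together, not on the label names, so a value of 1 means the two clusterings define the same partition.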
Table 5.1 Rand Index of Cancer Types
| Cancer Type | Rand Index | Cancer Type | Rand Index |
| ALL | 0.96 | Gynecologic | 0.97 |
| AML | 0.95 | Head and Neck | 1 |
| Bone | 0.78 | Hodgkins Lymphoma | 1 |
| Brain | 0.93 | Leukemia | 0.95 |
| Breast | 0.93 | Lung | 0.92 |
| CLL | 0.96 | Lymphoma | 0.91 |
| CML | 1 | Musculoskeletal | 0.98 |
| Digestive | 0.87 | Neurologic | 0.94 |
| Endocrine | 1 | Non-Hodgkins Lymphoma | 0.74 |
| Eye | 0.94 | Respiratory | 0.90 |
| Genitourinary | 0.90 | Skin | 0.92 |
| Germcell | 1 | | |
| Name | Institution |
| Rengul Cetin Atalay | Bilkent University |
| Volkan Atalay | Middle East Technical University |
| Allan Dickerman | Virginia Bioinformatics Institute |
| Steve Akman | Wake Forest University School of Medicine |
| M. Erkut Erdem | Middle East Technical University |
| I. Aykut Erdem | Middle East Technical University |
| Murat Iskar | Bilkent University |