Multi-source data integration and biomarker identification provide critical data for Alzheimer’s disease research
Alzheimer’s disease (AD), a highly prevalent neurodegenerative disease, is widely recognized as a major and escalating challenge to healthcare systems worldwide. AD is the most common type of dementia, accounting for 60-80 percent of age-related dementia cases. It is the sixth leading cause of death in the U.S. overall, and the fifth leading cause of death for those age 65 and older.
The direct cost of caring for AD patients, whether provided by family members or healthcare professionals, exceeds $100 billion per year, and this figure is expected to rise dramatically as the population ages over the next several decades. To avert a healthcare crisis, AD researchers have recently intensified their efforts to prevent, delay, or cure the disease. These efforts have generated a large amount of data, including brain neuroimages, that provides unprecedented opportunities to investigate AD-related questions with greater confidence and precision.
In AD patients, neurons and their connections are progressively destroyed, leading to loss of cognitive function and ultimately death. It is well accepted that the underlying disease pathology most likely precedes the onset of cognitive symptoms by many years.
Clinical and research studies commonly acquire complementary brain imaging, neuropsychological, and genetic data from each participant for a more accurate and rigorous assessment of disease status and likelihood of progression.
Jieping Ye at CIDSE is collaborating with researchers from Johnson & Johnson to develop sparse learning methods for multi-source data integration and biomarker identification.
“The integration of multiple heterogeneous sources will not only provide more accurate information on AD progression and pathology, but also effectively predict cognitive decline before the onset or in the earliest stages of disease,” states Ye.
In addition, a software system that integrates genetic data, biological biomarkers, and multi-modal imaging analysis for studies of brain function and disease has recently been developed.
Missing data presents a special challenge to current large-scale biomedical data integration and is ubiquitous in real-world biomedical applications. Values may be missing because of the high cost of certain measurements, poor data quality, patient dropout from the study, and similar factors. A commonly adopted approach is to remove all samples with missing values, but doing so discards a vast amount of useful information and dramatically reduces the number of samples in the training set.
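To make this trade-off concrete, the sketch below contrasts complete-case deletion with simple mean imputation on a synthetic dataset. The three sources, their feature counts, and the missingness rate are invented for illustration; when each entry can be missing independently, very few participants have fully complete records.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 participants, 3 data sources with made-up
# feature counts, and ~30% of entries missing at random in each source.
n = 100
sources = {"imaging": 5, "genetic": 4, "cognitive": 3}
X = rng.normal(size=(n, sum(sources.values())))
X[rng.random(X.shape) < 0.3] = np.nan  # mark ~30% of entries missing

# Complete-case analysis: keep only participants with no missing values.
complete = X[~np.isnan(X).any(axis=1)]
print(f"complete cases kept: {complete.shape[0]} of {n}")

# Simple alternative: column-mean imputation keeps every participant.
col_mean = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_mean, X)
print(f"samples after imputation: {X_imputed.shape[0]} of {n}")
```

With 12 features each missing with probability 0.3, a row is complete with probability 0.7^12 (about 1.4%), which is why complete-case analysis leaves almost nothing to train on.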
An alternative is to estimate the missing entries from the observed values, and many algorithms have been proposed for this purpose. Most existing methods, however, become less effective when a large fraction of the data is missing. Ye and his group have developed a novel multi-task feature learning framework to integrate multiple incomplete data sources.
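The group's published algorithm is not reproduced here; the minimal sketch below illustrates the general idea behind multi-task feature learning: an L2,1 (row-wise group) penalty selects features jointly across tasks, solved by proximal gradient descent. In the incomplete-data setting, one could treat each subset of participants sharing the same pattern of available sources as a task (an assumption made for illustration). All names and parameters are hypothetical.

```python
import numpy as np

def prox_l21(W, thresh):
    """Row-wise soft-thresholding: the proximal operator of the L2,1
    norm, which zeroes out entire feature rows across all tasks."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return W * scale

def multitask_feature_learning(tasks, lam=1.0, lr=0.1, iters=300):
    """L2,1-regularized multi-task least squares via proximal gradient.
    `tasks` is a list of (X_t, y_t) pairs sharing the same d features;
    returns a d-by-T weight matrix with jointly sparse rows."""
    d = tasks[0][0].shape[1]
    W = np.zeros((d, len(tasks)))
    for _ in range(iters):
        grad = np.zeros_like(W)
        for t, (X, y) in enumerate(tasks):
            grad[:, t] = X.T @ (X @ W[:, t] - y) / len(y)
        W = prox_l21(W - lr * grad, lr * lam)  # gradient step, then prox
    return W

# Synthetic demo: two tasks whose targets depend only on features 0 and 1.
rng = np.random.default_rng(1)
tasks = []
for _ in range(2):
    X = rng.normal(size=(80, 6))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=80)
    tasks.append((X, y))

W = multitask_feature_learning(tasks)
selected = np.flatnonzero(np.linalg.norm(W, axis=1) > 1e-3)
print("jointly selected feature rows:", selected)
```

Because the penalty acts on whole rows of W, a feature is either used by all tasks or discarded by all of them, which is what allows the tasks (here, participant groups with different available sources) to share statistical strength in feature selection.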