Whole-genome studies generate large amounts of data, ranging from millions of individual DNA sequences to information on where and how many of the thousands of genes are expressed, down to specific locations in the genome. Because of the amount and complexity of the data, comparing different conditions, or studies from different laboratories, can be very challenging.

Phys.org reports that a team of researchers from Pennsylvania State University developed a new statistical model that allows for a more efficient way to uncover biologically meaningful changes in genomic data across multiple conditions, such as different cell types or tissues.

 CLIMB: Novel Statistical Model Provides More Efficient Way of Analyzing Large-Scale Genomic Data
(Photo : Pixabay/PhotoMIX-Company)

CLIMB Statistical Model Benefits

Qunhua Li, associate professor of statistics at Penn State, said that it is difficult to analyze data from multiple conditions together in a way that is both statistically powerful and computationally efficient. Existing methods are computationally expensive and produce results that are difficult to interpret.

That is why the team developed the Composite LIkelihood eMpirical Bayes (CLIMB) statistical model, which improves on existing methods by being computationally efficient and producing biologically interpretable results. The team tested the method on three types of genomic data collected from hematopoietic cells, although it can also be used to analyze other 'omic' data.

The CLIMB statistical method uses principles from two traditional techniques typically used in analyzing data across multiple conditions. One of these techniques is pairwise comparisons between conditions.

According to a paper on ScienceDirect, pairwise comparison is a basic, simple strategy for entity resolution in which a similarity score is computed for each pair. However, the results become increasingly challenging to interpret as more conditions are added.

The second technique combines the activity pattern of each subject across conditions into an "association vector," recording, for example, whether a gene is up-regulated, down-regulated, or unchanged in each cell type. But the number of possible combinations makes this approach extremely computationally intensive.
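To give a sense of the scale involved, the short Python sketch below counts how many association vectors and pairwise comparisons exist for a given number of conditions. It assumes exactly three possible states per condition (down-regulated, no change, up-regulated), as in the example above; the function names are illustrative and do not come from the CLIMB software.

```python
from itertools import product
from math import comb

# Assumed encoding of the three states: -1 = down-regulated,
# 0 = no change, 1 = up-regulated.
STATES = (-1, 0, 1)


def n_association_vectors(n_conditions):
    """Number of distinct association vectors across n_conditions."""
    return len(STATES) ** n_conditions


def n_pairwise_comparisons(n_conditions):
    """Number of unordered pairs of conditions."""
    return comb(n_conditions, 2)


def enumerate_vectors(n_conditions):
    """List every association vector (only practical for small n)."""
    return list(product(STATES, repeat=n_conditions))


if __name__ == "__main__":
    for d in (3, 5, 10, 17):
        print(d, n_pairwise_comparisons(d), n_association_vectors(d))
```

With 17 conditions there are only 136 pairs to compare, but more than 129 million possible association vectors, which is why exhaustively evaluating every vector quickly becomes impractical.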

Hillary Koch, now a senior statistician at Moderna and a graduate student at Penn State at the time of the research, explained that CLIMB combines both approaches: it uses pairwise analyses to identify patterns and then analyzes association vectors. This helps researchers identify sets of genes that are collectively up-regulated in some cells but down-regulated in others.
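The general idea of using pairwise results to narrow down the association vectors can be illustrated with a toy Python sketch. This is not the authors' implementation and leaves out the statistical modeling entirely; it simply assumes that a pairwise step has already produced, for each pair of conditions, a set of plausible two-condition patterns, and then keeps only the full vectors that agree with every pair. The `allowed` patterns below are made up for illustration.

```python
from itertools import combinations, product

# Assumed encoding: -1 = down-regulated, 0 = no change, 1 = up-regulated.
STATES = (-1, 0, 1)


def consistent_vectors(n_conditions, allowed_pairs):
    """Keep association vectors compatible with all pairwise patterns.

    allowed_pairs maps each (i, j) with i < j to the set of
    (state_i, state_j) patterns considered plausible for that pair.
    """
    keep = []
    for vec in product(STATES, repeat=n_conditions):
        if all((vec[i], vec[j]) in allowed_pairs[(i, j)]
               for i, j in combinations(range(n_conditions), 2)):
            keep.append(vec)
    return keep


# Hypothetical pairwise results for three conditions.
allowed = {
    (0, 1): {(0, 0), (1, 1)},
    (0, 2): {(0, 0), (1, -1)},
    (1, 2): {(0, 0), (1, -1)},
}

candidates = consistent_vectors(3, allowed)
print(candidates)
```

In this toy case, only 2 of the 27 possible vectors survive the pairwise filter. The brute-force enumeration used here would not scale to many conditions; it is only meant to show how pairwise information can shrink the set of vectors that need to be analyzed.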

Li noted that the CLIMB method gave more specific results than the pairwise method when tested on RNA sequencing data, which measure the amount of RNA made from all the genes. CLIMB produced a narrower list of 2,000 to 3,000 genes identified in both analyses, compared with the 6,000 to 7,000 genes from the pairwise method.


Using CLIMB in Other Experimental Technologies

As Phys.org reported, the team also applied the CLIMB statistical method to data from other experimental technologies. For example, using ChIP-seq data, they explored how binding of the CTCF protein does or does not change across 17 cell populations, all of which are derived from the same hematopoietic stem cell.

The CLIMB analysis was able to identify distinct categories of CTCF-bound sites in which some show roles for this transcription factor in all blood cells while others reveal roles in specific cell types.

The team also tested the CLIMB statistical method on DNase-seq data to compare the accessibility of chromatin in 38 human cell types. Koch said that in all three tests they checked whether the results had biological relevance by comparing them to independent data. They found that CLIMB's results corresponded to the independent data.

As the next step of their research, they plan to improve the computational speed of the CLIMB statistical method and increase the number of conditions it can handle.

The findings are discussed in full in their study titled "CLIMB: High-dimensional association detection in large scale genomic data," which was published in Nature Communications.

