Feature selection in RNA-seq and proteomics with MADVAR in R

Published on November 7, 2025 by Champions Oncology | Read in 4 minutes

High-dimensional omics datasets often include many features that contribute little to downstream analysis. This can blur structure in unsupervised tasks, slow computation, and complicate model training. The MADVAR study introduces two simple, data-driven procedures that set feature selection thresholds from the distribution of the data itself, rather than relying on fixed heuristics. The first procedure, madvar, computes a variance cutoff as the median plus a multiple of the median absolute deviation (MAD). The second, intersectDistributions, fits a two-component Gaussian mixture to the variance or another continuous score, and uses the intersection point between the components as the cutoff. Both methods are implemented in an R package and were evaluated across public datasets that include TCGA gene expression, GTEx proteomics, and CPTAC phosphoproteomics. The paper reports improvements in unsupervised clustering quality and competitive supervised performance with fewer features, while keeping runtime and memory use modest.
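The MAD-based rule can be illustrated with a few lines of base R. This is a minimal sketch written for this post, not the package's own code: the toy matrix, the variable names, and the use of `stats::mad()` with its default scaling constant are all assumptions, and the packaged madvar function may differ in detail.

```r
# Toy expression matrix: 200 features (rows) x 30 samples (columns).
# The first 150 features are nearly constant; the last 50 vary strongly.
set.seed(1)
quiet <- matrix(rnorm(150 * 30, mean = 5, sd = 0.1), nrow = 150)
noisy <- matrix(rnorm(50 * 30, mean = 5, sd = 2), nrow = 50)
expr <- rbind(quiet, noisy)
rownames(expr) <- paste0("feature_", seq_len(nrow(expr)))

# Per-feature variance across samples.
feature_var <- apply(expr, 1, var)

# Cutoff = median + k * MAD of the variances.
# Assumption: stats::mad() rescales by 1.4826 by default; whether the
# package applies the same scaling is not stated in this post.
k <- 2
cutoff <- median(feature_var) + k * mad(feature_var)

# Keep only features whose variance exceeds the cutoff.
keep <- feature_var > cutoff
filtered <- expr[keep, , drop = FALSE]

cat("cutoff:", signif(cutoff, 3), "\n")
cat("kept", sum(keep), "of", nrow(expr), "features\n")
```

Because most features in the toy matrix are near-invariant, the median and MAD are both driven by that low-variance bulk, so the cutoff lands just above it and the strongly varying features pass.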

What the Paper Tested

The benchmarking examined unsupervised structure and supervised classification. For unsupervised analysis, the authors applied filtering and then assessed cluster quality with connectivity, the Dunn index, and the Biological Homogeneity Index. Across datasets, the variance-based approaches produced tighter or more homogeneous clusters on these metrics. For supervised analysis, they trained random forest models with repeated runs. Both approaches produced low out-of-bag error rates. Retaining more features sometimes improved accuracy, but MADVAR often matched the mixture-based approach while selecting fewer features. The paper also documents practical defaults, such as Ward.D for hierarchical clustering with Euclidean distance, and explains how to pass either a raw matrix or a precomputed variance vector into the functions. Source code and documentation are available on GitHub.
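To make the unsupervised comparison concrete, the sketch below clusters simulated samples with Ward.D on Euclidean distances before and after a MAD-based variance filter, and scores each result with a hand-written Dunn index. It is an illustration under stated assumptions, not the paper's benchmarking code: the data are simulated, and `dunn_index` and `cluster_dunn` are hypothetical helper names defined here.

```r
set.seed(42)
group <- rep(c(0, 1), each = 20)          # two groups of 20 samples

# 480 uninformative features plus 20 features shifted by 6 in group 1.
noise <- matrix(rnorm(480 * 40), nrow = 480)
# rep(..., each = 20) lines the shift up with the matrix columns (samples).
signal <- matrix(rnorm(20 * 40), nrow = 20) + rep(6 * group, each = 20)
expr <- rbind(noise, signal)

# Dunn index: smallest between-cluster distance divided by the largest
# within-cluster distance. Higher means more compact, better separated.
dunn_index <- function(d, clusters) {
  dm <- as.matrix(d)
  pairs <- upper.tri(dm)                  # count each sample pair once
  same <- outer(clusters, clusters, "==")
  min(dm[pairs & !same]) / max(dm[pairs & same])
}

cluster_dunn <- function(x, k = 2) {
  d <- dist(t(x))                         # Euclidean distance between samples
  hc <- hclust(d, method = "ward.D")
  dunn_index(d, cutree(hc, k = k))
}

# MAD-based variance filter (median + 2 * MAD), as a base R stand-in.
feature_var <- apply(expr, 1, var)
cutoff <- median(feature_var) + 2 * mad(feature_var)
filtered <- expr[feature_var > cutoff, , drop = FALSE]

dunn_before <- cluster_dunn(expr)
dunn_after <- cluster_dunn(filtered)
cat("features:", nrow(expr), "->", nrow(filtered), "\n")
cat("Dunn index:", round(dunn_before, 2), "->", round(dunn_after, 2), "\n")
```

In this simulation the group difference is carried by a small set of high-variance features, so removing the low-variance bulk shrinks within-cluster distances far more than between-cluster distances. Real data will not always behave this way, which is why the comparison is worth running rather than assuming.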

When These Methods are a Good Fit

These procedures are particularly well suited for rapid, large-scale preprocessing, when analyses require a quick, efficient, and transparent approach to feature selection prior to dimensionality reduction, clustering, or model fitting. They also integrate naturally into interpretability-oriented pipelines, since thresholds based on medians or mixture-model intersections are simple to explain and justify to collaborators. Because the logic operates on continuous feature scores, the same framework can be applied seamlessly to any quantitative data type that can be summarized by variability.

Things to keep in mind

Variance is a proxy for informativeness, not a guarantee. Low variance does not always mean a feature is uninformative. Some biomarkers remain stable yet become predictive in combination with others. If domain knowledge indicates a feature should be preserved, the package allows must-keep lists. The mixture-based method assumes the variance distribution resembles a two-component mixture. If the fit is poor, the intersection may not be meaningful, so density plots are worth inspecting before adopting the cutoff. Downstream metric choice also matters. Gains in the Biological Homogeneity Index or Dunn index describe cluster characteristics, which may not translate to improvements on other endpoints such as survival modeling or dose-response prediction. Finally, supervised performance can depend on class imbalance and sample size. If your data are skewed, tune the learner and validate with a scheme that reflects your use case.
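The mixture-based cutoff and its dependence on a good fit can be sketched in base R. The example below fits a two-component Gaussian mixture with a small hand-written EM loop and finds the point between the two means where the weighted densities cross. Everything here is an assumption made for illustration: the simulated scores, the `fit_two_gaussians` helper, and the EM details are not taken from the package, which may fit the mixture differently.

```r
# Simulated continuous scores: a tight low component and a broader high one.
set.seed(3)
scores <- c(rnorm(700, mean = 0, sd = 0.5), rnorm(300, mean = 4, sd = 1))

# Hypothetical helper: EM for a two-component univariate Gaussian mixture.
fit_two_gaussians <- function(x, iters = 200) {
  mid <- median(x)
  mu <- c(mean(x[x <= mid]), mean(x[x > mid]))   # crude starting means
  sigma <- rep(sd(x), 2)
  w <- c(0.5, 0.5)
  for (i in seq_len(iters)) {
    # E-step: responsibility of component 2 for each point.
    d1 <- w[1] * dnorm(x, mu[1], sigma[1])
    d2 <- w[2] * dnorm(x, mu[2], sigma[2])
    r2 <- d2 / (d1 + d2)
    # M-step: update weights, means, and standard deviations.
    w <- c(mean(1 - r2), mean(r2))
    mu <- c(weighted.mean(x, 1 - r2), weighted.mean(x, r2))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r2)),
               sqrt(weighted.mean((x - mu[2])^2, r2)))
  }
  list(w = w, mu = mu, sigma = sigma)
}

fit <- fit_two_gaussians(scores)

# Difference of the weighted log-densities; zero where the components cross.
log_ratio <- function(x) {
  (log(fit$w[1]) + dnorm(x, fit$mu[1], fit$sigma[1], log = TRUE)) -
    (log(fit$w[2]) + dnorm(x, fit$mu[2], fit$sigma[2], log = TRUE))
}

# Search for the crossing point between the two fitted means.
cutoff <- uniroot(log_ratio, interval = sort(fit$mu))$root

cat("fitted means:", round(fit$mu, 2), "\n")
cat("intersection cutoff:", round(cutoff, 2), "\n")
```

A quick visual check is to overlay the two fitted densities and the cutoff on a histogram of the scores. If the fitted means sit close together, or one weight collapses toward zero, the two-component assumption is doubtful and the intersection should not be used as a threshold.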

How to Apply, a Straightforward Workflow

A practical workflow starts by exploring the variance distribution. Plot the density (using the madvar flag `plot_density = TRUE`), confirm whether there is a near-zero peak, and decide whether a MAD-based threshold or a two-component mixture is appropriate. Set a conservative first pass using the default MAD multiplier (`mads = 2`) and adjust if another stringency level is preferred, or, if you prefer the mixture approach, verify the intersection visually before you commit to the cutoff. Preserve domain-critical features by whitelisting known markers or controls that should not be dropped. Rerun the planned clustering or model fitting, compare structure and error rates before and after filtering, and record any change in feature counts and compute time so the impact is transparent to collaborators.
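The whitelisting and record-keeping steps can be sketched in base R as follows. This is an illustration only: `select_features` is a hypothetical stand-in for the package call, its `mads` argument merely mirrors the flag named above, and the gene names and must-keep list are invented for the example.

```r
set.seed(7)
expr <- rbind(
  matrix(rnorm(900 * 24, sd = 0.2), nrow = 900),   # near-invariant features
  matrix(rnorm(100 * 24, sd = 2), nrow = 100)      # variable features
)
rownames(expr) <- paste0("gene_", seq_len(nrow(expr)))

# Hypothetical must-keep markers (names invented for this example).
must_keep <- c("gene_1", "gene_2")

# Stand-in for the package call: MAD-based cutoff plus a whitelist.
select_features <- function(x, mads = 2, must_keep = character()) {
  v <- apply(x, 1, var)
  cutoff <- median(v) + mads * mad(v)
  selected <- names(v)[v > cutoff]
  # Whitelisted features are retained regardless of their variance.
  union(selected, intersect(must_keep, rownames(x)))
}

kept <- select_features(expr, mads = 2, must_keep = must_keep)
filtered <- expr[kept, , drop = FALSE]

# Time the planned downstream step on each version of the matrix.
time_clustering <- function(x) {
  system.time(hclust(dist(t(x)), method = "ward.D"))[["elapsed"]]
}

# Record feature counts and compute time before and after filtering.
report <- data.frame(
  stage = c("before", "after"),
  n_features = c(nrow(expr), nrow(filtered)),
  cluster_seconds = c(time_clustering(expr), time_clustering(filtered))
)
print(report)
```

On a matrix this small the timings will be close to zero. The point of the table is the habit of recording feature counts and runtime alongside the cluster or error metrics, so that collaborators can see what the filter changed.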

Reproducibility and availability

The R implementation and documentation are available on GitHub, as referenced in the paper. The evaluations draw on TCGA gene expression, GTEx proteomics, and CPTAC phosphoproteomics, with figures that show density plots, clustering metrics, and classification results. The article appears as an Application Note in Bioinformatics Advances and is available for open access.

Summary

MADVAR provides two transparent, variance-based rules for feature selection, enabling the removal of near-invariant features from large omics matrices. In the reported benchmarks, these procedures improved or maintained clustering quality and supervised accuracy while substantially reducing feature counts and computational load. The approach is easy to inspect, easy to explain, and simple to integrate into existing R workflows. As with any filter, its final value depends on the analysis goal, so it is worth validating its effects on the endpoints that matter for your study.

Silberberg G. “MADVAR, a lightweight, data driven tool for automated feature selection in omics data.” Bioinformatics Advances 2025, vbaf211. doi: 10.1093/bioadv/vbaf211.

Unveiling the Secrets of PBMC Isolation Tubes: Clinching Clinical Trial Success with the Right Tools

The quality of data received from clinical specialty testing labs is largely dependent on the integrity of the samples tested. One critical aspect of this is the proper isolation of peripheral blood mononuclear cells (PBMCs), often used in high-complexity flow cytometry research. To get the best quality results, researchers must use the optimal tubes to isolate these cells. Three main types of tubes are commonly used: SepMate tubes, CPT tubes, and Manual Ficoll Gradient tubes. Each of these tubes has advantages and disadvantages that researchers must consider carefully. In this blog post, we will look at the pros and cons of each tube type and what you should consider before selecting one for your clinical trial.

SepMate Tubes: SepMate tubes come in different sizes that enable researchers to isolate PBMCs from different numbers of tubes at once. These tubes work by using an inserted plastic separation technology that helps remove unwanted clumps of cells. One of the most significant advantages of these tubes is that they can be used for small to medium-sized samples with ease. The technology also separates cells significantly, making it easy to accurately measure the concentration of cells after the separation procedure. Furthermore, the tube allows for rapid and consistent layering and easy collection of PBMCs, making it ideal for research. However, obtaining a high percentage of neutrophils is not possible, as they would be trapped below the barrier inside the tube. SepMate tubes are disposable, which makes them less financially sensible in the long run.

CPT Tubes: CPT tubes are anticoagulant-based layered tubes that allow samples to go directly from collection to cell separation without any additional processing. CPT tubes simplify the separation process and reduce the risk of contamination. They offer a more robust PBMC isolation process, and specific manufacturers produce four types to suit different research needs regarding testing human cells. The tubes generate high PBMC yields with high cell viability, making them an optimal option where quality is paramount. The primary disadvantage of CPT tubes is that they are quite expensive when compared to other tube types for PBMC isolation.

Manual Ficoll Gradient Tubes: Manual Ficoll Gradient tubes require manual handling and processing, so they aren't automated. Due to this, the individual processing of Ficoll gradient tubes can cause variability in the manual layering steps, leading to variation in the final cell yield. The biggest issue with this type of isolation tube is that it takes longer to obtain the actual PBMCs than using SepMate or CPT tubes, which may create more significant delays in your research. Additionally, due to the manual nature of this type of PBMC isolation, final PBMC recovery rates are usually dependent on the experience of the researcher.

In conclusion, the right tube type will depend on your primary requirements for your research. Regardless of your choice, there is still a need to make sure you select a tube type that meets the technical requirements of your studies and delivers high-quality results. Champions Oncology has expertise in GCLP-compliant PBMC isolation using SepMate, CPT, and manual Ficoll Gradient tube types and can advise the best choice for your upcoming clinical trial.

by Champions Oncology

Clinical Specialty Testing

Navigating Clinical Specialty Testing: Key Insights into Regulatory Compliance

Clinical specialty testing laboratories, like Champions Oncology, are expected to adhere to stringent standards to ensure accuracy and reliability of test results which can have life-altering implications for patients. Regulatory compliance is not a mere bureaucratic hoop but a foundational element that guarantees the integrity of laboratory operations. Navigating through the complex landscape of clinical specialty testing and its regulatory environment is crucial for the success of each clinical trial. Regulatory compliance within each clinical trial is vital to ensure data validity and also ensures each laboratory’s commitment to patients’ safety. In this blog post, we’ll explore the intricacies of adhering to Good Clinical Laboratory Practice (GCLP), Clinical Laboratory Improvement Amendments (CLIA), and College of American Pathologists (CAP) standards, compare regulatory frameworks in the United States (US) versus the European Union (EU), and underscore why meticulous regulatory compliance is non-negotiable for each clinical trial.

Good Clinical Laboratory Practice (GCLP)

GCLP is a quality system that ensures laboratories conducting clinical trial testing provide data of consistent quality. It bridges the gap between the guidelines provided by Good Laboratory Practice (GLP) and Good Clinical Practice (GCP), focusing on pre-analytical, analytical, and post-analytical processes.

Clinical Laboratory Improvement Amendments (CLIA)

In the United States, CLIA regulations pertain to laboratory testing and require labs to be certified by the federal government. They establish standards for test performance, personnel qualifications, quality control, and proficiency testing for each specific assay performed at the specialty testing laboratory.

College of American Pathologists (CAP)

The CAP accreditation is an internationally recognized program that provides a framework for clinical labs to achieve excellence in patient care and ensure compliance with statutory and regulatory requirements. CAP takes a peer-reviewed approach to help maintain the highest standard of care.

While there may be considerable overlap in what these regulations and standards aim to achieve, there are nuanced differences in their requirements and scopes. GCLP is broader and more flexible in its application, potentially accommodating international guidelines. CLIA is prescriptive and specific to the United States, focusing significantly on the analytical phase of testing. CAP, albeit a US-based program, aligns with many international standards and offers a comprehensive accreditation process that encompasses all aspects of lab operations.

Comparatively, the European Union (EU) takes a different approach to laboratory oversight. The EU mandates that each company ensures the quality and safety of its laboratories, but it does not impose a uniform set of standards. Instead of an EU-wide equivalent to CLIA, countries may have their own regulatory frameworks or adhere to international standards like those of the International Organization for Standardization (ISO).

Regulatory standards are the pillars that support the validity of clinical trial data. They are key to ensuring that the specialty tests upon which clinical decisions are based are reliable and reproducible. Compliance ensures patient safety, the validity of data submitted to regulatory authorities, and ultimately the success of a clinical trial. Failures in compliance can lead to serious legal consequences and ethical breaches, undermining public trust.

Every clinical scientist must understand that regulatory compliance is not simply about following rules; it's about upholding the scientific rigor and ethical duty inherent in clinical research. Each standard, whether it be GCLP, CLIA, or CAP, serves as a QA/QC mechanism to this end. By mastering these regulatory frameworks and recognizing their importance in every aspect of a clinical trial, we safeguard the integrity of clinical research, protect patient welfare, and contribute to the greater good of advancing scientific clinical research.

by Champions Oncology

IHC

Immunohistochemistry: a Powerful Tool in Cancer Research

Immunohistochemistry (IHC) is a technique that originates in the early twentieth century but continues to be a valuable method that forms the backbone of molecular pathology. IHC is used for histological examination of tissues and specifically detects the presence of a molecule, such as a tumor antigen. IHC uses antibody-based labeling in which the primary antibody detects the target of interest and the secondary antibody detects the primary antibody which is linked to a molecule for microscopic visualization. Many different secondary antibody labeling modalities exist, including fluorescence, enzyme-mediated reactions and colloidal gold, and different labels are suited to specific microscopy platforms. Consider these five aspects of IHC as you implement this technique in preclinical cancer research:

Quantitative measurements. IHC can be used as a qualitative measurement, but unlike many other visualization techniques, IHC can also be used as a quantitative measurement because antibodies that label specific parts of tissues or cells can be counted by a pathologist or with a computer-aided system. Developing robust validated quantitative IHC staining and visualization methods allows researchers to rely on the accuracy of this data.

Customizable. IHC methods can be adapted to detect any cellular marker, given that a monoclonal antibody exists or can be made that specifically detects this marker. Validation of a new primary or secondary antibody also includes determining any off-target staining caused by these antibodies as this can be a critical determinant in the utility of an antibody. The ability to customize IHC in this manner is crucial to preclinical research that seeks to identify new biomarkers associated with tumor progression or immunotherapy efficacy.

Flexibility. IHC can be used on almost any tissue type so long as it is processed correctly. Tissue samples from model animals as well as clinical biopsies can be fixed and sectioned in advance and stained at later times. Tumor microarrays (TMAs) can also be created and stained for evaluation of novel tumor markers or screening efficacy of drug candidates. This flexibility in sample type and handling highlights the overall utility of IHC.

Automation. Currently, several different systems exist for automated IHC staining, and advances in digital analysis of IHC samples have allowed larger batches of clinical samples to be processed and evaluated. Comparison of automated IHC methods with manual methods has determined that these new approaches are accurate, sensitive, and reproducible.

The big picture. Several methods exist for staining different targets in an IHC sample, and this allows scientists and clinicians to gain critical insights into which cells and molecules are present in the tumor microenvironment (TME), including levels of immune checkpoint molecules or infiltration of critical anti-tumor cell types. Consider how IHC can complement other techniques that look at the tumor microenvironment, including flow cytometry and RNAseq.

IHC will continue to be a powerful tool in preclinical and clinical cancer research. Consider revisiting this classic technique as it has matured with the technological advances of the 21st century. Champions Oncology’s histology and immunohistochemistry services are custom developed and fully optimized to meet your needs in preclinical research. With industry leading pathology expertise and innovative automated technology, Champions provides you with the highest quality endpoints for your in vivo and ex vivo studies.

by Champions Oncology
