Genomics and transcriptomics remain foundational for precision oncology, but they do not fully represent the functional state that determines how tumors respond to therapy. Proteins and phosphoproteins capture activity at the level where drugs actually engage, for example receptor density and localization, complex assembly, and pathway signaling. That distinction is not academic. In a 2024 pan-cancer analysis from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), investigators integrated proteogenomic data from 1,043 patients across 10 tumor types, surveyed 2,863 druggable proteins, and quantified biological factors that weaken mRNA-to-protein correlation, making the case for models that learn directly from protein and phosphoprotein context rather than inferring from transcripts alone.
The practical bottleneck has been access to harmonized, well-annotated cohorts that support training, testing, and independent validation. In August 2023, the National Cancer Institute announced a standardized pan-cancer proteogenomic dataset that aligns genomics, proteomics, imaging, and clinical data for more than 1,000 tumors across 10 cancer types, explicitly to enable reproducible discovery and model benchmarking. The Proteomic Data Commons (PDC) now serves these resources in a way that supports programmatic access and cross-study comparisons, a requirement if machine-learning outputs are going to generalize beyond a single study.
Two Cell papers from 2024 illustrate why adding protein-level information changes conclusions. An immune-landscape analysis derived distinct immune subtypes by integrating genomic, epigenomic, transcriptomic, and proteomic features and connected oncogenic drivers to downstream protein states that influence immune surveillance and evasion. A companion pan-cancer study expanded the landscape of therapeutic opportunities by evaluating thousands of druggable proteins across tissues and documenting where mRNA is a poor proxy for protein, especially in pathways relevant to therapy response. Together, they show that multi-omic modeling, including protein and phosphoprotein features, improves biological interpretability and exposes actionable biology that single-omic approaches overlook.
The trend is not confined to CPTAC. The Pan-Cancer Proteome Atlas (TPCPA), published in Cancer Cell in 2025, quantified 9,670 proteins across 999 primary tumors representing 22 cancer types using DIA-MS. The atlas offers a tissue-based substrate for target nomination, biomarker discovery, and external validation, and has been highlighted in the trade press for its global availability and immediate relevance to oncology research. Such atlases are valuable because they capture proteomic variability directly in clinical material, not only in cell lines, providing realistic distributions for features that ML models attempt to learn.
Integrating proteins and phosphoproteins adds information that is both mechanistic and measurable. First, pathway activity is reflected in phosphorylation states, which function as on–off or rheostat-like controls for signaling. Second, receptor exposure and complex formation at the protein level determine whether a therapy can bind or disrupt a process. Third, protein degradation and post-translational regulation often decouple mRNA abundance from target availability, which explains why transcript-only biomarkers can fail at the bedside. When these features are engineered into models, performance gains are not just numeric; they tend to be more interpretable, mapping to drug-actionable pathways and receptors that clinicians recognize.
The 2024 CPTAC studies provide concrete examples. Immune subtypes defined by proteogenomic features correlate with differences in antigen presentation, cytokine signaling, and interferon responses, features with obvious translational implications. The survey of druggable proteins shows wide variation in abundance and localization across tumors and details the contexts where transcript and protein diverge, arguing for protein-aware rules when nominating targets or stratifying patients.
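One simple way to see where transcript and protein diverge is a per-gene rank correlation across matched samples. The sketch below is illustrative only and is not the method used in the CPTAC studies: the function names, the gene labels, and the toy values are all hypothetical, and it assumes each gene has mRNA and protein measurements aligned sample by sample with no missing values.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

def mrna_protein_concordance(mrna, protein):
    """Per-gene Spearman rho for genes measured in both layers.

    mrna, protein: dict mapping gene -> list of values, one per sample,
    with samples in the same order in both dicts (an assumption).
    """
    return {g: spearman(mrna[g], protein[g]) for g in mrna if g in protein}

# Toy data (hypothetical genes and values, five matched samples).
mrna = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_C": [2.0, 2.5, 3.0, 3.5, 4.0],  # no protein measurement
}
protein = {
    "GENE_A": [2.0, 4.0, 5.0, 8.0, 9.0],  # tracks mRNA
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # inverse of mRNA
}

concordance = mrna_protein_concordance(mrna, protein)
for gene, rho in sorted(concordance.items()):
    print(f"{gene}: rho = {rho:.2f}")
```

Genes with low or negative correlation are candidates for the protein-aware handling described above; in real cohorts, missing values and sample matching would need explicit treatment before a calculation like this is meaningful.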
There is a growing consensus on practical guardrails. Independent validation across cohorts is essential to avoid overfitting, and the infrastructure now exists to support that step through the PDC and related CPTAC resources. Feature construction should prioritize pathway-level signals that aggregate individual phospho-sites into kinase or pathway activity because these are more stable across cohorts and easier to interpret for clinical decision making. Finally, clinically annotated samples, including treatment history and outcomes, are indispensable if models are expected to inform responder enrichment and mechanism-of-resistance hypotheses rather than only classify molecular subtypes.
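As a rough illustration of aggregating phospho-sites into a kinase-level feature, the sketch below z-scores each site across samples and averages the z-scores of a kinase's measured substrates. This is one simple scheme among several, not a description of any specific published pipeline. The kinase-to-substrate map, the site identifiers, the `min_sites` cutoff, and the values are all hypothetical, and the sketch assumes complete data with no missing measurements.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize a site across samples; constant sites map to zeros."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]

def kinase_activity(phospho, kinase_substrates, min_sites=2):
    """Mean substrate z-score per kinase and sample.

    phospho: dict site -> list of abundances, one per sample.
    kinase_substrates: dict kinase -> list of substrate sites
        (a hypothetical mapping supplied by the caller).
    Kinases with fewer than min_sites measured substrates are skipped,
    because a score built on a single site is rarely stable.
    """
    n_samples = len(next(iter(phospho.values())))
    z = {site: zscore(vals) for site, vals in phospho.items()}
    scores = {}
    for kinase, sites in kinase_substrates.items():
        measured = [s for s in sites if s in z]
        if len(measured) < min_sites:
            continue
        scores[kinase] = [
            mean(z[s][i] for s in measured) for i in range(n_samples)
        ]
    return scores

# Toy data: three samples, hypothetical site identifiers.
phospho = {
    "SUB1_S10": [1.0, 2.0, 3.0],
    "SUB2_T20": [2.0, 4.0, 6.0],
    "SUB3_Y30": [5.0, 5.0, 5.0],
}
kinase_substrates = {
    "KINASE_X": ["SUB1_S10", "SUB2_T20"],
    "KINASE_Y": ["SUB3_Y30", "SUB4_S1"],  # only one substrate measured
}

scores = kinase_activity(phospho, kinase_substrates)
for kinase, per_sample in scores.items():
    print(kinase, [round(v, 2) for v in per_sample])
```

Averaging over several substrates is what makes the feature less sensitive to any single noisy site, which is the stability argument made above; the quality of the result still depends on how complete and accurate the kinase-substrate map is.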
When executed with these guardrails, proteogenomic ML offers tangible benefits. Programs can generate earlier responder and non-responder hypotheses and test them prospectively in preclinical systems before committing to costly clinical designs. Resistance pathways inferred from phospho-proteomic features can motivate combination strategies, for example pairing an antibody-drug conjugate (ADC) with a kinase inhibitor when signaling indicates a plausible escape route. Educational content from ASCO has emphasized the centrality of quantitative thresholds and validated assays for patient selection, particularly for ADCs, where surface accessibility and abundance determine benefit. The lesson is consistent across modalities: predictive features must be connected to assays that can be deployed consistently in trials, and thresholds should be defined in a way that anticipates real-world variability.
Several limitations deserve explicit mention. Proteomic and phospho-proteomic data remain technically variable across platforms and laboratories. Although CPTAC and PDC mitigate this through standardization, modelers should evaluate batch effects and apply normalization strategies suited to proteomic data. Coverage of kinase–substrate relationships and post-translational networks is incomplete, which constrains inference. Tumor heterogeneity adds another layer, particularly when bulk tissue averages mask subclonal or microenvironmental signals. These caveats do not negate the value of proteogenomic ML, but they do argue for conservative claims, orthogonal validation, and a bias toward features that can be measured reproducibly in clinical settings.
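A minimal sketch of the kind of normalization and batch check mentioned above, assuming log2-scale intensities and a known batch label per sample. It applies per-sample median centering followed by per-protein mean centering within each batch. The function names, sample and protein labels, and values are hypothetical; dedicated batch-correction methods exist and may be more appropriate, and within-batch centering can erase real biology whenever batch is confounded with the groups being compared.

```python
from statistics import mean, median

def median_center_samples(matrix):
    """Subtract each sample's median so samples share a common center.

    matrix: dict sample -> dict protein -> log2 intensity.
    """
    out = {}
    for sample, profile in matrix.items():
        m = median(profile.values())
        out[sample] = {p: v - m for p, v in profile.items()}
    return out

def center_by_batch(matrix, batch_of):
    """Subtract, per protein, the mean of that protein within each batch.

    batch_of: dict sample -> batch label (assumed known).
    """
    grouped = {}
    for sample, profile in matrix.items():
        for protein, value in profile.items():
            grouped.setdefault((batch_of[sample], protein), []).append(value)
    batch_mean = {key: mean(vals) for key, vals in grouped.items()}
    return {
        sample: {
            p: v - batch_mean[(batch_of[sample], p)]
            for p, v in profile.items()
        }
        for sample, profile in matrix.items()
    }

# Toy data: batch B carries a global +3 offset relative to batch A.
raw = {
    "s1": {"P1": 10.0, "P2": 12.0, "P3": 14.0},
    "s2": {"P1": 11.0, "P2": 12.0, "P3": 13.0},
    "s3": {"P1": 13.0, "P2": 15.0, "P3": 17.0},
    "s4": {"P1": 14.0, "P2": 15.0, "P3": 16.0},
}
batch_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

centered = median_center_samples(raw)
corrected = center_by_batch(centered, batch_of)
for sample in sorted(corrected):
    print(sample, corrected[sample])
```

In this toy case the global offset is already removed by median centering, and the batch step only removes residual per-protein shifts. Comparing results before and after each step is a cheap way to evaluate whether a batch effect was present in the first place.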
The immediate implication is a more disciplined approach to enrichment. If protein-level features identify a subgroup with plausible sensitivity, early designs can incorporate eligibility criteria and stratification based on validated assays rather than exploratory cutpoints. Conversely, if pathway-level features suggest multiple escape routes, it may be more efficient to prioritize combinations earlier instead of iterating single-agent studies. At a portfolio level, proteogenomic evidence can help prioritize programs with a mechanistic rationale supported by functional data, not only by mutation prevalence or gene expression.
Champions Oncology builds models on tumor-derived systems that preserve patient biology and heterogeneity. Our datasets link genomics and transcriptomics with proteomics, phospho-proteomics, and cell surface proteomics, and they are annotated with pharmacologic phenotypes. This combination supports models that tie features to functional biology and drug accessibility, making it possible to move from correlation to causal relationships and from causal relationships to druggable targets.