Genomics and artificial intelligence to empower cancer diagnostics
The lab of Anna Obenauf has just published its findings: the human papillomavirus HPV42 causes Digital Papillary Adenocarcinoma. As part of their study, the scientists used large genomics datasets and machine learning algorithms to identify the genes whose expression patterns distinguish cancers caused by viral infections from other types of cancer. Lukas Leiendecker, PhD student and first author, and Tobias Neumann, bioinformatician, explain their research.
How did you discover the link between HPV42 and Digital Papillary Adenocarcinoma (DPA)?
Lukas Leiendecker: We know that certain viruses can cause cancer; for example, human papillomaviruses (HPVs) are known to cause cervical cancers and head and neck cancers. The skin is constantly exposed to viral pathogens, but whether those viruses are involved in skin cancers is unknown, so we decided to screen rare skin cancers to find out whether any of them had a viral origin. We analysed the genomes of about 200 skin tumours and found that all DPA tumour samples contained DNA sequences originating from HPV42. This virus wasn’t known to cause tumours at all.
Are all HPVs oncogenic?
Lukas Leiendecker: So far, about 15 HPVs are considered to be oncogenic, but there are many more that don’t cause cancer, or whose effects on human health are unknown. Until now, HPV42 had been classified as an HPV type that can’t cause cancer, but our data clearly indicate that HPV42 can be oncogenic.
How did you go about detecting viral sequences in cancer genomes?
Tobias Neumann: We started by sequencing the DNA of these tumours. Once we had these sequences and had mapped all the parts we could recognise as human, we were left with sequences that didn’t belong to the human genome, so we looked for their source in genome databases. The problem is that many viruses are not well covered in standard databases, which meant our mysterious sequences couldn’t be placed anywhere. We decided to create a database with more than 6,000 known viruses that can infect vertebrate hosts, including HPVs. This meant collating existing viral databases into a single umbrella database. That’s when we found a match for HPV42.
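The screening logic described here (set aside reads that look human, then match the leftovers against a viral reference collection) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the pipeline used in the study: it uses exact k-mer matching instead of a real aligner, and all sequences, reference names such as virus_A_toy, and function names are invented for the example.

```python
from collections import Counter

K = 8  # k-mer length for this toy; real tools use alignment or longer k-mers


def kmers(seq, k=K):
    """Return the set of all k-length substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=K):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def screen_reads(reads, human_index, viral_index, k=K, min_shared=3):
    """Count reads per virus after discarding reads that look human."""
    hits = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        # Step 1: drop reads that share any k-mer with the human reference.
        if any(km in human_index for km in read_kmers):
            continue
        # Step 2: tally which viral references the leftover read matches.
        votes = Counter()
        for km in read_kmers:
            for name in viral_index.get(km, ()):
                votes[name] += 1
        if votes:
            best, shared = votes.most_common(1)[0]
            if shared >= min_shared:
                hits[best] += 1
    return hits


# Toy sequences, invented for illustration only (not real genomes).
human = {"human_toy": "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCCGATAGC"}
viruses = {
    "virus_A_toy": "TTGACCATGGGTACCAAGTTTCAGGACCCTTGAAGGTCAA",
    "virus_B_toy": "GGCATTCGAATCGGCTAAGCTTCCGGAATTACGCGCTATA",
}
reads = [
    "ACGTTAGCCGATAGCTAGG",  # human-derived, filtered out in step 1
    "CATGGGTACCAAGTTTCAG",  # taken from virus_A_toy
    "GTACCAAGTTTCAGGACCC",  # taken from virus_A_toy
    "AAAAAAAAAAAAAAAAAAA",  # matches no reference at all
]

hits = screen_reads(reads, build_index(human), build_index(viruses))
print(dict(hits))
```

A read that matches nothing stays unexplained, which is the situation the interview describes when a virus is missing from the reference collection: the match only appears once the right genome has been added.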
Did you look for HPV42 in other tumours, besides DPA?
Tobias Neumann: Most common cancer types haven’t been systematically screened for viral sequences, let alone HPV42. We decided to do just that, using a dataset of about 11,000 cancer patient samples and sifting through it to see if any of the leftover sequences matched a virus in our new database. Bioinformatics-wise, this was a big chunk of work: 11,000 samples represent about 300 terabytes of data. Processing that amount of data on a regular computer would take several years. We ran our analyses on cloud computing infrastructure that complies with data protection policies, which reduced the final runtime to a bit more than a month.
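A quick back-of-the-envelope calculation puts these figures in perspective. The data volume and sample count come from the interview; the values used for "several years" (3 years) and "a bit more than a month" (35 days) are assumptions chosen only to make the arithmetic concrete.

```python
TOTAL_BYTES = 300e12   # about 300 terabytes (from the interview)
N_SAMPLES = 11_000     # about 11,000 samples (from the interview)

# Average amount of sequencing data per sample.
per_sample_gb = TOTAL_BYTES / N_SAMPLES / 1e9
print(f"average data per sample: {per_sample_gb:.1f} GB")

# Assumptions for illustration only: "several years" taken as 3 years,
# "a bit more than a month" taken as 35 days.
single_machine_days = 3 * 365
cloud_days = 35

# How much faster the cloud run is under these assumptions.
speedup = single_machine_days / cloud_days
print(f"implied speedup: about {speedup:.0f}x")

# Sustained rate needed just to read 300 TB once within 35 days.
throughput_mb_s = TOTAL_BYTES / (cloud_days * 24 * 3600) / 1e6
print(f"required sustained throughput: about {throughput_mb_s:.0f} MB/s")
```

Each sample averages roughly 27 GB, so the job splits naturally by sample, and under the assumed durations the speedup corresponds to a few dozen machines working in parallel.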
What did the analysis yield?
Tobias Neumann: We confirmed the presence of viruses in samples that we knew were virus-driven cancers, such as HPV16 and HPV18 in cervical cancers, but we didn’t detect HPV42 sequences in any of the other cancers. This indicates that HPV42 is indeed a DPA-specific driver.
Artificial intelligence was key for the success of your study. When did it come in handy?
Lukas Leiendecker: A big issue in the diagnosis and therapy of virus-linked cancers is our limited ability to distinguish HPV-driven cancers from mutation-driven cancers. The therapeutic options for these two groups are different: therapies for mutation-driven cancers are often more aggressive. Ideally, patients would only be given the aggressive therapy when it’s really needed. So far, medical professionals have looked at the expression of a single gene, called CDKN2A/p16, to sort cancer into one or the other category. This gene is highly expressed in many, but not all HPV-driven cancers, so it isn’t reliable enough. We examined gene expression data across thousands of genes in HPV-linked cancers, including DPA, and looked for a signature of HPV infection.
Tobias Neumann: To start the machine learning process, we fed our algorithm with two sets of gene expression data from our 11,000 samples: one from HPV-driven cancers, and one from mutation-driven cancers. The algorithm then uses those data to learn to distinguish these two groups.
Could the algorithm tell you which genes could represent a signature of HPV-driven cancers?
Tobias Neumann: We used a type of algorithm called ‘Random Forest’, which, on top of classifying samples into HPV-driven and mutation-driven cancers, ranks all the genes in the dataset by how much they contribute to telling the two groups apart. This comes in handy for identifying predictive genes. With that model, we could pinpoint CDKN2A/p16 and 11 other genes that are the most predictive of a cancer’s origin, viral or mutational.
Lukas Leiendecker: We’re really excited about the genes in this signature, as we found that besides their predictive value, their specific expression in HPV-driven cancers could make them good targets for immunotherapies.
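The gene-ranking idea can be illustrated with a small sketch. It is not the authors' pipeline and not a full Random Forest: it bags one-split decision trees (stumps), gives each tree a random subset of genes, and ranks genes by their share of the total Gini impurity decrease. The expression values are synthetic, and names such as marker_gene and noise_1 are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump(X, y, features):
    """Find the (feature, threshold, gain) split with the largest Gini decrease."""
    parent = gini(y)
    best = (None, None, 0.0)
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - child > best[2]:
                best = (f, t, parent - child)
    return best


def stump_forest_importance(X, y, n_trees=200, seed=0):
    """Normalised per-gene importance from bagged stumps on random gene subsets."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    m = max(1, round(n_features ** 0.5))  # genes offered to each tree
    importance = [0.0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        feats = rng.sample(range(n_features), m)
        f, _threshold, gain = best_stump(Xb, yb, feats)
        if f is not None:
            importance[f] += gain
    total = sum(importance) or 1.0
    return [v / total for v in importance]


# Synthetic expression data: one informative gene, five pure-noise genes.
rng = random.Random(42)
genes = ["marker_gene", "noise_1", "noise_2", "noise_3", "noise_4", "noise_5"]
X, y = [], []
for i in range(60):
    label = i % 2                      # 1 = "virus-driven", 0 = "other"
    row = [rng.gauss(0, 1) for _ in genes]
    row[0] += 3.0 * label              # marker_gene is higher in class 1
    X.append(row)
    y.append(label)

scores = stump_forest_importance(X, y)
ranking = sorted(zip(genes, scores), key=lambda g: g[1], reverse=True)
for name, score in ranking:
    print(f"{name:12s} {score:.3f}")
```

The informative gene comes out on top because it produces large impurity decreases whenever a tree is allowed to use it, while the noise genes only ever yield small, chance-level gains. Full implementations grow deeper trees and average the importance over all splits.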
Could that change the way we diagnose and treat HPV-driven cancers in the clinic?
Lukas Leiendecker: The issue is that few medical doctors have the resources to analyse the expression data for 12 genes in every cancer they diagnose. For that reason, we tried to narrow our signature of 12 genes down to the minimal number of genes that could help us distinguish virus-driven cancers from the others. We found that with the expression data of only two genes, CDKN2A/p16 and SYCP2, we can identify virus-driven cancers with high accuracy. This could have important implications for clinicians, cancer patients and even future therapies that utilise those signature genes as targets.
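The reasoning behind shrinking a signature can be sketched as a search for the smallest gene subset that still classifies well. The example below is an assumption-laden toy, not the method from the paper: it uses a nearest-centroid classifier, leave-one-out accuracy, and an exhaustive search over subsets of hypothetical genes (gene_A, gene_B, noise_1 and so on). The synthetic data mimic a marker that is high in many but not all positive samples, with a second gene covering the cases the first one misses.

```python
import random
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    correct = 0
    for i in range(len(X)):
        cents = {}
        for label in (0, 1):
            rows = [[X[j][g] for g in genes]
                    for j in range(len(X)) if j != i and y[j] == label]
            cents[label] = centroid(rows)
        point = [X[i][g] for g in genes]
        dist = {label: sum((a - b) ** 2 for a, b in zip(point, c))
                for label, c in cents.items()}
        if min(dist, key=dist.get) == y[i]:
            correct += 1
    return correct / len(X)


def smallest_good_subset(X, y, n_genes, target):
    """Smallest subset reaching `target`, or the best one seen if none does."""
    best = (0.0, ())
    for size in range(1, n_genes + 1):
        scored = [(loo_accuracy(X, y, combo), combo)
                  for combo in combinations(range(n_genes), size)]
        best = max(best, max(scored))
        if best[0] >= target:
            break
    return best[1], best[0]


# Synthetic data: each marker is high in 70% of positives, but not the same 70%.
rng = random.Random(0)
names = ["gene_A", "gene_B", "noise_1", "noise_2", "noise_3", "noise_4"]
X, y = [], []
for i in range(80):
    label = i % 2
    row = [rng.gauss(0, 0.2) for _ in names]
    if label == 1:
        pattern = (i // 2) % 10
        if pattern < 7:
            row[0] += 3.0   # gene_A high
        if pattern < 4 or pattern >= 7:
            row[1] += 3.0   # gene_B high
    X.append(row)
    y.append(label)

combo, acc = smallest_good_subset(X, y, len(names), target=0.95)
print([names[g] for g in combo], f"accuracy={acc:.2f}")
```

In this toy, neither marker alone can exceed 85% accuracy because each misses 30% of the positive samples, while the pair separates the two groups. Exhaustive search is only practical for a handful of candidate genes, which is one reason to rank genes first and search among the top candidates.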
Original publication
Lukas Leiendecker, Tobias Neumann, Pauline S. Jung, Shona M. Cronin, Thomas L. Steinacker, Alexander Schleiffer, Michael Schutzbier, Karl Mechtler, Thibault Kervarrec, Estelle Laurent, Kamel Bachiri, Etienne Coyaud, Rajmohan Murali, Klaus J. Busam, Babak Itzinger-Monshi, Reinhard Kirnbauer, Lorenzo Cerroni, Eduardo Calonje, Arno Rütten, Frank Stubenrauch, Klaus G. Griewank, Thomas Wiesner, Anna C. Obenauf:
“Human papillomavirus 42 drives digital papillary adenocarcinoma and elicits a germ-cell like program conserved in HPV-positive cancers”. Cancer Discovery (2022).