Disentangling cell mixtures in cancer from omics data and images

Student(s): Julia Hindel, Mehrdad Pouyanfar, Olivia Kirk, Lucy Quirant, Felix Rustemeyer, Marius Sommerfeld;
Supervisor(s): Rachel Cavill, Katerina Staňková, and Esther Baena (Cancer Research UK);
Semester: 2020-2021.

Problem statement and motivation:

Game theory uses mathematical models to study how different agents (in this case, cells) interact. It is applied in many domains and is also used to re-examine cancer treatment. Doctors tend to treat tumours with medication or radiation until the treatment stops working. This can lead to more dangerous tumours that have become resistant to treatment (Cunningham et al., 2018). Using game theory can help improve patient survival and quality of life (Staňková et al., 2019). To predict which treatment will work, the cell types in a tumour and their proportions first need to be identified.

The aim of this project is to identify cells and establish their proportions. To disentangle (identify) cell mixtures in prostate cancer, two types of data are used. Both are publicly available and come from The Cancer Genome Atlas (TCGA) database. The first type of data is images of tumour cuts, which will be analysed with algorithms that extract cell features (CellProfiler) and group these features together (cluster them). The images are stained with chemicals (in this case H&E staining) which dye the centre of the cell (the cell nucleus) purple. The software CellProfiler can detect the nucleus size and shape from the stain in the image (Carpenter et al., 2006). The extracted features can then be used to group cells by similarity, here with hierarchical clustering (Nielsen, 2016). The algorithm places cells with similar features in the same group, e.g. cells with a round nucleus or a nucleus diameter of less than 10 μm. The algorithm itself selects which features drive the clustering, without human interaction. The established groups hopefully correspond to cell types, which would then allow cell identification and cell quantification.
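The clustering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two features (nucleus area and eccentricity), the synthetic values, the Ward linkage, and the choice of two clusters are all hypothetical and are not taken from the project's actual CellProfiler output.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Hypothetical per-nucleus features [area, eccentricity] for two synthetic groups.
small_round = rng.normal(loc=[40.0, 0.2], scale=[4.0, 0.03], size=(60, 2))
large_elongated = rng.normal(loc=[90.0, 0.7], scale=[6.0, 0.05], size=(40, 2))
features = np.vstack([small_round, large_elongated])

# Standardise each feature so that no single feature dominates the distances.
z = (features - features.mean(axis=0)) / features.std(axis=0)

# Agglomerative (hierarchical) clustering with Ward linkage (an assumed choice).
tree = linkage(z, method="ward")

# Cut the tree into at most two clusters; labels start at 1.
labels = fcluster(tree, t=2, criterion="maxclust")

# Cluster sizes and the resulting estimated proportions.
counts = np.bincount(labels)[1:]
proportions = counts / counts.sum()
print(proportions)
```

On real data the number of clusters is not known in advance, so the cut through the tree would have to be chosen, for example by inspecting the dendrogram.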

The second type of data is gene expression and methylation data, also called omics data, which will be analysed with a family of clustering methods based on non-negative matrix factorisation (NMF). Genes can be imagined as an instruction manual for what your body looks like and how it is run; they are found in the nucleus of every cell of the body. Methylation tells us whether genes are switched on or off. If a gene is switched off, it has no effect on the behaviour of the cell. As there is no labelled data available, gene expression and methylation data will first be simulated with known cell types. The NMF algorithm takes a non-negative input matrix V (e.g. gene expression data) and factorises it into two non-negative matrices W and H such that V ≈ WH. In this process, it clusters the columns of the input data V (Lee & Seung, 2001; Brunet et al., 2004). NMF can be extended to use multiple inputs (e.g. gene expression and methylation data) simultaneously, which is then called joint non-negative matrix factorisation (JNMF). After clustering with (J)NMF, hopefully cell types can be identified and cell proportions quantified (Chalise & Fridley, 2017). It remains uncertain whether the clusters found correspond to biologically relevant cell types.
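As a minimal sketch of the factorisation V ≈ WH, the multiplicative update rules of Lee & Seung (2001) for the squared-error objective can be written with NumPy alone. The function name `nmf`, the matrix sizes, the number of components, and the normalisation of the columns of H into proportions are assumptions made for illustration. Because of the scaling ambiguity between W and H, the normalised columns are only a rough proxy for cell proportions.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (features x samples) as W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic rank-3 input: 50 hypothetical genes, 20 hypothetical samples.
rng = np.random.default_rng(1)
V = rng.random((50, 3)) @ rng.dirichlet(np.ones(3), size=20).T

W, H = nmf(V, k=3)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Normalise each column of H to sum to 1 (assumed proxy for proportions).
est_proportions = H / H.sum(axis=0, keepdims=True)
print(round(float(rel_error), 4))
```

Each sample (column of V) can be assigned to the component with the largest entry in its column of H, which is the sense in which NMF clusters the columns.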

Research questions/hypotheses:

  • How effective is an NMF-based approach in determining cell proportions from both gene-expression and methylation data?
  • How effective is the clustering based on the features extracted from the H&E stained tumour-cut data?
  • What hinders and enables effective clustering of cell types within cell mixtures?

Main outcomes:

  • Image data: A pipeline using the biology-oriented computer vision software ‘CellProfiler’ is fitted to produce a dataset of features extracted from H&E stained tumour-cut data. Furthermore, a hierarchical clustering model is fitted to the extracted features to distinguish different cell types. For every cluster, an image containing a high proportion of that cluster is provided to an expert, who assigns the cluster to a ground-truth biological cell type (identification of the clusters).
  • Simulation of gene expression & methylation data: In order to test, verify, and gain confidence in the deconvolution algorithms, we need simulated data with known labels. We therefore aim to deliver a Python script that simulates data which meets our expectations of reality and resembles the real data.
  • Model to identify cell types in gene expression & methylation data: Gene expression and methylation data will be analysed with different variants of non-negative matrix factorisation algorithms. We are going to deliver the method and tuned parameters that performed the best in our experiments.
  • Combination of results (if time allows): If good results are obtained from the two outlined approaches, the two sets of results can be compared and contrasted.
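The simulation outcome listed above could, in its simplest form, look like the sketch below. It is not the project's script: the function name `simulate_mixtures`, the gamma-distributed signatures, the Dirichlet-distributed proportions, and the Gaussian noise level are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_mixtures(n_genes=200, n_samples=30, n_cell_types=3,
                      noise_sd=0.05, seed=0):
    """Simulate bulk expression as a noisy mixture of cell-type signatures."""
    rng = np.random.default_rng(seed)
    # Hypothetical expression signature of each cell type (genes x cell types).
    signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_cell_types))
    # Known cell-type proportions per sample; each column sums to 1.
    proportions = rng.dirichlet(np.ones(n_cell_types), size=n_samples).T
    # Additive Gaussian noise, clipped so the data stay non-negative for NMF.
    noise = rng.normal(0.0, noise_sd, size=(n_genes, n_samples))
    expression = np.clip(signatures @ proportions + noise, 0.0, None)
    return expression, signatures, proportions

expression, signatures, proportions = simulate_mixtures()
print(expression.shape, proportions.sum(axis=0)[:3])
```

Because the proportions are returned alongside the simulated data, the output of a deconvolution method can be compared against a known ground truth.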

Acknowledgement:

The data used for this research is in whole or in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga

References:

Brunet, J.-P., Tamayo, P., Golub, T. R., & Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12), 4164–4169. https://doi.org/10.1073/pnas.0308531101

Carpenter, A. E., Jones, T. R., Lamprecht, M. R., Clarke, C., Kang, I., Friman, O., … Sabatini, D. M. (2006). CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology, 7(10), R100. https://doi.org/10.1186/gb-2006-7-10-r100

Chalise, P., & Fridley, B. L. (2017). Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLOS ONE, 12(5), e0176278. https://doi.org/10.1371/journal.pone.0176278

Cunningham, J. J., Brown, J. S., Gatenby, R. A., & Staňková, K. (2018). Optimal control to develop therapeutic strategies for metastatic castrate resistant prostate cancer. Journal of Theoretical Biology, 459, 67–78. https://doi.org/10.1016/j.jtbi.2018.09.022

Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13 – Proceedings of the 2000 Conference, NIPS 2000 (Advances in Neural Information Processing Systems). Neural information processing systems foundation.

McQuin, C., Goodman, A., Chernyshev, V., Kamentsky, L., Cimini, B. A., Karhohs, K. W., … Carpenter, A. E. (2018). CellProfiler 3.0: Next-generation image processing for biology. PLOS Biology, 16(7), e2005970. https://doi.org/10.1371/journal.pbio.2005970

Nielsen, F. (2016). Hierarchical Clustering. In Introduction to HPC with MPI for Data Science (pp. 195–211). Springer International Publishing. https://doi.org/10.1007/978-3-319-21903-5_8

Staňková, K., Brown, J. S., Dalton, W. S., & Gatenby, R. A. (2019). Optimizing Cancer Treatment Using Game Theory. JAMA Oncology, 5(1), 96. https://doi.org/10.1001/jamaoncol.2018.3395
