Session: MP40: Prostate Cancer: Detection & Screening I
MP40-10: Creating High-Quality Synthetic Genitourinary Tissue Images from Histology Repositories to Overcome Limitations with the Use of Clinical Trial Data in AI Models
Introduction: Timely diagnosis and prognostic assessment remain challenges in prostate cancer (PCa), contributing to mortality and increasing overall disease burden and treatment cost. Although clinical testing strategies are effective, recent advances in machine learning suggest that synthetic data could be generated to reduce reliance on clinical trial data when developing training models. Our study is the first to use generative adversarial network (GAN) technology to generate high-quality synthetic digital histology data from 9 genitourinary organs.

Methods: We downloaded digital pathology images for 9 genitourinary tissues from the GTEx Portal, a repository of 25,713 images spanning 39 unique tissue classes. Downloaded images were color-normalized within each tissue so that all images share the same expected color range and none are too light or too dark, which could cause issues in downstream analysis. Images were then run through HistoQC to establish baseline quality, and then through PyHist for single-patch creation, yielding a data repository of over 10,000 patches on average per tissue. These patches were fed into our custom deep convolutional GAN, implemented in PyTorch and run on a local GPU cluster, to train models from which new synthetic images can be generated.

Results: 2,700 images were obtained from GTEx. These images were segmented into 96x96 blocks, producing 32,411 patches, and into 256x256 blocks, producing 163,916 patches. The patches served as the training database for a standard deep convolutional GAN (DCGAN) used to create synthetic images per tissue. After generation, a color-gradient PCA of the images showed a dropout rate of 21.2%. Images underwent further quality control by random manual inspection, in which images with file sizes outside a set threshold were discarded.
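The abstract does not specify the architecture of the custom DCGAN. As a rough illustration only, a standard PyTorch DCGAN sized for the 96x96 patches described above might look like the following sketch; the latent size and channel widths are assumptions, not details from the study:

```python
import torch
import torch.nn as nn

LATENT = 100  # latent vector size (assumed; a common DCGAN default)

class Generator(nn.Module):
    """Upsamples a latent vector to a 3-channel 96x96 synthetic patch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(LATENT, 512, 6, 1, 0, bias=False),  # -> 6x6
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),     # -> 12x12
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),     # -> 24x24
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),      # -> 48x48
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),        # -> 96x96
            nn.Tanh(),  # outputs in [-1, 1], matching normalized patches
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Mirrors the generator, mapping a 96x96 patch to a real/fake score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1, bias=False),                 # -> 48x48
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),               # -> 24x24
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),              # -> 12x12
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),              # -> 6x6
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 6, 1, 0, bias=False),                # -> 1x1
            nn.Sigmoid(),  # probability that the patch is real
        )

    def forward(self, x):
        return self.net(x)
```

The two networks would be trained adversarially in the usual DCGAN fashion (binary cross-entropy, alternating generator and discriminator updates); after training, sampling `Generator()(torch.randn(n, LATENT, 1, 1))` yields `n` synthetic patches.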
After image inspection, pathologists were also asked to assess image quality; 80% of the images were deemed of sufficient quality. After the entire QC process, the Fréchet inception distance (FID) was calculated, and a simple classification module was created to determine image similarity within a tissue and image distinctness between tissues. The lowest FID, 54.2, was reached at 5,000 synthetic images. The classification module separated the images with an accuracy of 74%, with the most common error being similar tissues classified together.

Conclusions: We have created a deep convolutional GAN that can generate new synthetic images with a reliability rate of 80%. Reliability and image quality need to be improved before a final repository can be created.

Source of Funding: AUA Research Scholar Award to HA
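For context on the reported FID of 54.2: the metric compares Gaussian fits of Inception-network feature embeddings of real versus synthetic images. A minimal NumPy/SciPy sketch of the formula follows; the feature extractor is omitted, and `feats_real`/`feats_fake` are hypothetical precomputed embedding matrices, not artifacts of the study:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Fréchet inception distance between two sets of feature vectors.

    feats_real, feats_fake: (n_samples, n_features) arrays of embeddings.
    Returns ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrtm(S1 @ S2)).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(s1 + s2 - 2.0 * covmean))
```

In practice the embeddings come from a pretrained Inception-v3 network; a lower FID indicates that the synthetic feature distribution lies closer to the real one, which is why the 5,000-image run with FID 54.2 is the best result reported above.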