Cancer & Health Informatics Research


My broad research focus is machine learning methods for surgical pathology reports in oncology.

Machine Learning Methods on Surgical Pathology Reports

Surgical Pathology Reports (SPRs) are critical documents in patient care. Because of the increased emphasis on data-driven healthcare research and treatment, SPRs are becoming available to clinical researchers for secondary use of Electronic Health Records, allowing them to text-mine the reports and learn about diagnoses at a much larger scale than ever before.

Foundational Issues and Barriers in Using SPRs for Secondary Research Analysis

  • “Big Data” 

The increased availability of pathology records has shed light on the heterogeneity, volume, and veracity of SPRs. This has drawn a great deal of interest from the research community in better understanding the structure, format, and content of these reports, not only for patient care but also for mining them to understand disease at the cohort or community level. The sheer volume and heterogeneity of these reports are a foundational barrier to extracting meaningful information efficiently and at scale.

  • Research Access 

As with most data collected in medical records, SPRs are written primarily for patient care; research is a distant second goal. Access to SPRs outside the operational processes of the hospital or clinic faces many barriers. Three key ones are: 1) regulatory barriers associated with HIPAA and IRB requirements; 2) information access and system barriers, that is, transforming data for use (for example, ETL to transfer data into a data warehouse); and 3) information analytics barriers, that is, generating cohorts of reports from data warehouses.

  • Data Pipeline

Another issue with research access to SPRs is the lack of a data pipeline. Most access to SPRs is based on ad hoc processes with limited flexibility for text and data processing. This limitation significantly reduces the sophisticated analyses that can be undertaken at scale.
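To make the contrast with ad hoc processing concrete, here is a minimal sketch of what a composable text-processing pipeline could look like. Everything in it is an illustrative assumption: the step functions, the `run_pipeline` helper, and the sample report are hypothetical, the sample text is synthetic, and the date masking is a toy stand-in for real de-identification.

```python
import re
from typing import Callable, Iterable, List

# A pipeline step is any function that maps report text to report text.
Step = Callable[[str], str]

def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def lowercase(text: str) -> str:
    """Lowercase the report so later matching is case-insensitive."""
    return text.lower()

def mask_dates(text: str) -> str:
    """Replace MM/DD/YYYY-style dates with a placeholder (toy example only)."""
    return re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", text)

def run_pipeline(reports: Iterable[str], steps: List[Step]) -> List[str]:
    """Apply each step, in order, to every report."""
    cleaned = []
    for report in reports:
        for step in steps:
            report = step(report)
        cleaned.append(report)
    return cleaned

# Synthetic example text, not real patient data.
reports = ["FINAL DIAGNOSIS:\n  Invasive ductal carcinoma.   Received 01/02/2020."]
cleaned = run_pipeline(reports, [normalize_whitespace, lowercase, mask_dates])
print(cleaned[0])
```

Because each step is an ordinary function, steps can be reordered, swapped, or extended without rewriting the rest of the process, which is the flexibility that ad hoc scripts tend to lack.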

  • Methods and Tools

Extracting information from SPRs requires a variety of tools and libraries. Many of these are yet to be built and require more research, and for those that do exist there is no inventory of the tools and the contexts in which to use them. This places a tremendous burden on researchers to identify the correct tool for the correct purpose and context. Complex algorithm development faces further barriers owing to the lack of large curated datasets, limited integration of informatics (machine learning and/or deep learning with disease-specific vocabularies), and limited evaluation of algorithms across a heterogeneous set of SPRs and across different institutional datasets.

  • Dissemination

The privacy and security issues surrounding SPRs are part of the reason results are published and shared without the original datasets used in the research and analysis. This makes the results very difficult to reproduce. Many of the tools become arcane and/or are used only by a fragmented set of research groups. As a consequence, knowledge from these tools is lost, or is never used or propagated beyond those small groups.