Machine Interpretable Document Analytics Services

(Needs updates)

Keywords: Text Mining, Data Analysis, Human Genome 


The project aims to identify the most studied genes in the human genome using text mining and data analysis. It revisits research published in a 2017 Nature article, whose authors identified the most popular genes up to that time. Five years have passed since the original study, and with the emergence of new diseases, research attention may have shifted from the originally identified genes to new ones.

Proposed Methodology: 

The first step is to retrieve the genes mentioned in all papers published in the PubMed database up to May 2022 that carry functional or structural information linked to the human genome. The captured data will then be analyzed to determine the most studied genes, based on the number of papers per gene and on other attributes such as timeframe and geography.
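
The counting step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `records` list, its (PubMed ID, gene symbol, publication year) layout, and the function names `papers_per_gene` and `top_genes` are all hypothetical, and the sample values are made up for demonstration only.

```python
from collections import Counter

# Hypothetical (pmid, gene_symbol, publication_year) records. Real input would
# come from PubMed-derived gene-to-paper links, whose format is not specified
# in this project description.
records = [
    (1001, "TP53", 2015),
    (1002, "TP53", 2019),
    (1003, "TNF", 2019),
    (1003, "TP53", 2019),
    (1004, "ACE2", 2021),
    (1005, "ACE2", 2021),
    (1005, "ACE2", 2021),  # duplicate link, counted only once
]


def papers_per_gene(records, start_year=None, end_year=None):
    """Count distinct papers per gene, optionally restricted to a year range."""
    seen = set()        # (pmid, gene) pairs already counted
    counts = Counter()  # gene -> number of distinct papers
    for pmid, gene, year in records:
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        if (pmid, gene) in seen:
            continue
        seen.add((pmid, gene))
        counts[gene] += 1
    return counts


def top_genes(counts, n=10):
    """Return the n genes with the most papers, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


all_time = top_genes(papers_per_gene(records))
recent = top_genes(papers_per_gene(records, start_year=2020))
```

Comparing a ranking over all years with a ranking over a recent window, as `all_time` and `recent` do here, is one simple way to look for the kind of shift the project is interested in. Attributes such as geography could be handled the same way by adding a field to each record and filtering on it.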


The results will help identify past and ongoing trends in biomedical research. The findings can draw attention both to the popular genes and to the less studied ones, and can open a discussion of the reasons behind these patterns. Research funding, societal pressure, and medical relevance all have a major influence on medical research, and observing the direction in which they steer the science will be valuable. In addition, the study may offer insight into how researchers' attention shifts from one gene to another.

Project Supervisor: Dr. Javed Mostafa 
Email: jm@unc.edu 

Project Lead: Vibhor Gupta 
Email: gvibhor@unc.edu