
Tech Stack
Description
In any genomics research project, once sequencing data has passed quality control and been processed, the next critical step is understanding which genes are actually changing between experimental conditions. Raw gene expression data—whether from RNA-seq, microarrays, or other transcriptomic experiments—contains thousands of measurements, but without proper statistical analysis, it's impossible to distinguish meaningful biological signals from technical noise.
This is where a differential gene expression analysis tool becomes essential. It serves as the analytical bridge between raw expression measurements and biological insights. Before researchers can make claims about which genes are upregulated in disease versus healthy tissue, or which pathways respond to a particular treatment, they must first use a tool like this one to rigorously identify statistically significant expression changes and filter out genes whose apparent differences reflect only technical or sampling variation.
A key part of this process is understanding both the size and the reliability of changes in gene expression. The fold change tells us how many times higher or lower a gene’s expression is in one group compared to another, but to make these changes easier to interpret, we use the log2 fold change. On this log scale, a value of 1 represents a twofold increase and -1 a twofold decrease (expression halved), so up- and down-regulation are symmetric around zero and straightforward to compare. However, not all observed changes are meaningful, because random variation can make a gene appear different when it is not. That’s why statistical tests, such as the t-test, are used to calculate a p-value: the probability of seeing a difference at least as large as the observed one if there were no true difference between the groups. By considering both the magnitude of change (log2 fold change) and its statistical significance (p-value), tools like this one can help identify which genes are likely to be truly differentially expressed.
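The two quantities described above can be sketched for a single gene as follows. This is a minimal illustration rather than the application's actual code: the function name, the pseudocount, and the choice of Welch's variant of the t-test (equal_var=False) are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def differential_expression(control, treated, pseudocount=1.0):
    """Return (log2 fold change, p-value) for one gene.

    Hypothetical helper: the pseudocount and Welch's t-test are
    illustrative choices, not taken from the original tool.
    """
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)

    # The pseudocount keeps the ratio defined when a group mean is zero.
    ratio = (treated.mean() + pseudocount) / (control.mean() + pseudocount)
    log2_fc = np.log2(ratio)

    # Two-sample t-test; equal_var=False selects Welch's version,
    # which does not assume equal variances in the two groups.
    _, p_value = stats.ttest_ind(treated, control, equal_var=False)
    return float(log2_fc), float(p_value)


# Treated mean (21) is twice the control mean (10.5), so with no
# pseudocount the log2 fold change is 1.
log2_fc, p_value = differential_expression(
    control=[10, 12, 11, 9], treated=[20, 24, 22, 18], pseudocount=0.0
)
```

Swapping the two groups flips the sign of the log2 fold change and leaves the p-value unchanged, which is the symmetry that makes the log scale convenient.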
The data analysis was implemented entirely in Python using pandas for data manipulation, NumPy for numerical computations, and SciPy for statistical testing. For the user interface, I chose Streamlit to create an intuitive, web-based application where users can easily upload data and explore results interactively. Users can also configure analysis parameters from the sidebar, which provides real-time feedback on data loading and processing status.
The visualizations combine static and interactive plotting libraries. The application generates clustered heatmaps using Seaborn and Matplotlib, while interactive volcano plots and gene-specific boxplots are powered by Plotly. In the interactive plots, users can hover over data points and zoom into regions of interest, and results can be exported for further analysis.
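A volcano plot places each gene at its log2 fold change on the x-axis and -log10 of its p-value on the y-axis, so large and significant changes end up in the upper corners. The sketch below computes those coordinates and a category for one gene. The function name and the cutoffs (a log2 fold change of 1 and a p-value of 0.05) are common conventions assumed here for illustration, not settings confirmed by the application.

```python
import math


def volcano_point(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    Hypothetical helper; the default cutoffs are assumed conventions.
    """
    x = log2_fc
    # Smaller p-values plot higher. The floor avoids log10(0).
    y = -math.log10(max(p_value, 1e-300))

    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return x, y, label


x, y, label = volcano_point(log2_fc=2.0, p_value=0.001)
```

A gene with a large fold change but a p-value above the cutoff is still labeled "not significant", which is why both axes are needed.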
I implemented several quality-of-life functions, including intelligent file format detection for multiple input types (CSV, TSV, Excel, compressed files), error handling for malformed datasets, data caching for improved performance, and a modular code structure that separates data processing, statistical analysis, and visualization components. The result is a tool that makes differential expression analysis accessible while maintaining the statistical rigor required for high-quality results.
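One simple way to approach file format detection is to inspect the filename suffixes, as in the sketch below. It is an assumption-laden illustration rather than the application's actual logic: the function name, the suffix tables, and the decision to rely on extensions instead of file contents are all hypothetical.

```python
from pathlib import Path

# Hypothetical lookup tables; the real tool may support other types.
COMPRESSION_SUFFIXES = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}
TABLE_SUFFIXES = {".csv": "csv", ".tsv": "tsv", ".xlsx": "excel", ".xls": "excel"}


def detect_format(filename):
    """Return (table_format, compression) inferred from the file name.

    compression is None for uncompressed files. Raises ValueError
    when the table type cannot be recognized.
    """
    suffixes = [s.lower() for s in Path(filename).suffixes]

    # A trailing compression suffix (e.g. ".gz") wraps the real type.
    compression = None
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = COMPRESSION_SUFFIXES[suffixes.pop()]

    if not suffixes or suffixes[-1] not in TABLE_SUFFIXES:
        raise ValueError(f"Unsupported file type: {filename}")
    return TABLE_SUFFIXES[suffixes[-1]], compression


table_format, compression = detect_format("counts.tsv.gz")
```

The returned pair could then select the reader: for example, pandas.read_csv accepts sep and compression arguments, and pandas.read_excel handles the Excel case. Raising ValueError on unknown types gives the interface a clear message to show for unsupported uploads.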