DNA QC Toolkit

Bioinformatics
Computational Bioengineering
UI/UX

Tech Stack

Java
Python
Git
HTML5
JavaFX
Biopython

Description

In any DNA sequencing project, the raw data generated by sequencing machines is the foundation for all subsequent discoveries. However, this raw data is rarely perfect. It can contain a range of technical errors and artifacts, such as low-quality reads, leftover adapter sequences from the lab process, and other biases.

This is where a Quality Control (QC) application comes in. It serves as the first and most critical checkpoint in the bioinformatics pipeline. Before any meaningful biological analysis can begin, such as aligning reads to a genome or identifying mutations, researchers must first use a tool like this one to assess the health of their raw data.

By generating a comprehensive summary of data quality, these applications allow scientists to diagnose potential issues. Based on the QC report, a researcher can then make an informed decision: either the data is of high quality and ready for analysis, or it needs to be cleaned (e.g., by trimming adapters and low-quality ends) to ensure that the final scientific conclusions are accurate and reliable. To get a deeper understanding of the quality control process, I made an amateur DNA QC Toolkit, inspired by applications like FastQC.

A key part of this project was developing a deep understanding of the FASTQ file format, including how to parse its four-line structure to extract both the DNA sequences and their corresponding Phred quality scores. I researched and implemented the logic for several essential quality control (QC) metrics, such as Per-Base Sequence Quality, GC Content, Sequence Duplication Levels, and Adapter Contamination, ensuring the application provides a scientifically valid analysis.
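As a rough illustration of the parsing and metric logic described above, here is a minimal Python sketch. It is not the toolkit's actual Java code. It assumes Phred+33 quality encoding and unwrapped four-line records, and the function names (`parse_fastq`, `per_base_mean_quality`, `gc_content`) are hypothetical.

```python
import io
from typing import Iterable, Iterator, List, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ (Phred+33) encoding

Read = Tuple[str, str, List[int]]


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one Phred score as ASCII code - offset.
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:].strip(), sequence, scores


def per_base_mean_quality(reads: Iterable[Read]) -> List[float]:
    """Mean Phred score at each read position, across reads of any length."""
    totals: List[int] = []
    counts: List[int] = []
    for _, _, scores in reads:
        for pos, score in enumerate(scores):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += score
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_content(sequence: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


sample = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nGGCC\n+\n!!!!\n"
reads = list(parse_fastq(io.StringIO(sample)))
mean_quality = per_base_mean_quality(reads)
```

In this sample, `I` decodes to a Phred score of 40 and `!` to 0, so the first four positions average 20.0 and the remaining three stay at 40.0. For `.fastq.gz` input, the same parser could be fed a handle from `gzip.open(path, "rt")`.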

The underlying application logic, including file I/O and data manipulation, was implemented in Java. For the user interface, JavaFX and FXML were used to create a user-friendly, responsive graphical application with a multi-tabbed design. To keep the UI from freezing while large files are analysed, I implemented a multi-threaded system based on Java's ExecutorService, in which all intensive computations run on background threads.

All plots are rendered by a Python backend. The central Java application launches and communicates with external Python processes. This approach uses the versatility of the Matplotlib library to generate the full range of scientific plots and tables required for the final analysis reports.
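To make the Java-to-Python hand-off more concrete, here is a minimal sketch of what the Python side of one plot could look like. The JSON payload shape (a `mean_quality` array) and the function names are assumptions made for illustration; the toolkit's real exchange format is not described here.

```python
import json
from typing import Dict, List


def parse_payload(text: str) -> Dict[str, List[float]]:
    """Decode the JSON document sent by the (hypothetical) Java caller."""
    payload = json.loads(text)
    means = [float(value) for value in payload["mean_quality"]]
    if not means:
        raise ValueError("mean_quality must not be empty")
    return {"mean_quality": means}


def render_per_base_quality(payload: Dict[str, List[float]], out_path: str) -> None:
    """Plot mean Phred quality per read position and save it as an image."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend: a child process has no display
    import matplotlib.pyplot as plt

    means = payload["mean_quality"]
    positions = range(1, len(means) + 1)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(positions, means, marker="o")
    ax.set_xlabel("Position in read (bp)")
    ax.set_ylabel("Mean Phred quality")
    ax.set_ylim(0, max(41, max(means) + 1))
    ax.set_title("Per-base sequence quality")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory once it is written


example = parse_payload('{"mean_quality": [38, 37.5, 36, 30]}')
```

One way to wire this up, again as an assumption, would be for the Java application to start the script as a child process, write the JSON to its standard input, and pass the output path as an argument, so the script ends with something like `render_per_base_quality(parse_payload(sys.stdin.read()), sys.argv[1])`.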

    Features

    Welcome Page

    Select a .fastq or .fastq.gz file for quality analysis, either by dragging it in from the file explorer or by browsing locally.

    Base Statistics

    Get a quality overview of the data in the selected tab, along with its core information and any areas of concern.

    Data Visualisation

    Choose from 10 different plots to learn essential information about the base sequences and the dataset as a whole. See these examples from a good-quality dataset:

    [Figure: Data Visualisation examples]

    Data Trimming

    Remove adapter and poor-quality sequences, giving a dataset of better overall quality.

    Export to HTML

    Export the report as an interactive HTML file so that it can be saved and shared.
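As a rough illustration of the Data Trimming feature, the sketch below removes a 3' adapter and then a low-quality tail from a single read. It is a simplification and not the toolkit's actual implementation: adapter matching is exact (no mismatches allowed), quality trimming simply drops trailing bases below a threshold, and the adapter string, the threshold of 20 and the function names are example assumptions.

```python
from typing import List, Tuple


def trim_adapter(sequence: str, adapter: str, min_overlap: int = 3) -> int:
    """Return how many bases to keep after removing a 3' adapter (exact match)."""
    full = sequence.find(adapter)
    if full != -1:
        return full  # whole adapter found: cut from where it starts
    # Otherwise look for the longest adapter prefix running off the read's end.
    longest = min(len(adapter) - 1, len(sequence))
    for overlap in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:overlap]):
            return len(sequence) - overlap
    return len(sequence)  # no adapter detected


def trim_low_quality_tail(scores: List[int], threshold: int = 20) -> int:
    """Return how many bases to keep after dropping trailing low-quality bases."""
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < threshold:
        keep -= 1
    return keep


def trim_read(sequence: str, scores: List[int], adapter: str,
              threshold: int = 20) -> Tuple[str, List[int]]:
    """Apply adapter trimming first, then quality trimming, to one read."""
    keep = trim_adapter(sequence, adapter)
    keep = trim_low_quality_tail(scores[:keep], threshold)
    return sequence[:keep], scores[:keep]


ADAPTER = "AGATCGGAAGAGC"  # example value only; use the adapter from your library prep
read_seq = "ACGTACGTAGATCGG"  # ends with the first 7 bases of ADAPTER
read_scores = [40] * 6 + [12, 8] + [30] * 7
trimmed_seq, trimmed_scores = trim_read(read_seq, read_scores, ADAPTER)
```

Here the partial adapter is removed first, leaving `ACGTACGT`, and the two trailing bases with scores 12 and 8 are then dropped. A small `min_overlap` catches short adapter fragments but also raises the chance of trimming genuine sequence that happens to match, so the value is a trade-off.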