Autonomous QC and Anomaly Detection for Omics Pipelines
Introduction
I've been working on building an autonomous QC layer for the omics pipelines I work with, and it's been a game-changer. The system continuously monitors pipeline QC metrics, learns what "normal" looks like for each assay and pipeline step, and automatically flags outliers. What I love about it is that it doesn't just detect problems—it traces anomalies back to specific tools, parameters, or batches and actually recommends what to do about them (reruns, parameter tweaks, or sample exclusion). This has cut down debugging time significantly and caught issues that would have otherwise slipped through.
I wanted to share how I built this, focusing on a lightweight v1 that you can actually implement. I'll walk through the core components, show you the code I used, and explain the design decisions that made it work in practice.
System Overview
The system breaks down into three main pieces that work together:
- QC Metrics Collector - Pulls together all the QC metrics from your pipeline runs into one place
- QC Anomaly Detector - Learns what normal looks like and flags when things are off
- Explanation & Remediation Recommender - This is the "agentic" part—it figures out what went wrong and suggests fixes
QC Metrics: What to Collect
For RNA-seq, I focused on the metrics that actually matter. Here's what I'm tracking:
Per Sample Metrics
- Total reads, mapped reads, % mapped
- % duplicates
- % rRNA / % mitochondrial reads
- Insert size mean / std
- GC content
- Number of detected genes
- 5'–3' coverage bias (if available)
- Library complexity estimates (optional)
Per Pipeline Step Metrics
- Runtime, exit status, memory usage
- Tool name + version (e.g., STAR 2.7.10a, fastp 0.23.3)
System Design: Core Components
1. QC Metrics Collector
The first challenge was getting all the QC metrics into one place. The pipeline outputs metrics in different formats (FastQC HTML, Picard text files, STAR logs), so I needed a way to normalize everything.
I ended up with a simple schema that captures what I need:
- sample_id
- batch_id
- assay_type (e.g., rna_seq)
- pipeline_step (e.g., alignment, trimming, qc_summary)
- tool_name, tool_version
- metric_name (e.g., pct_mapped, duplication_rate)
- metric_value (numeric)
- run_timestamp
In practice, I added a small task at the end of each major step in the Airflow DAG that:
- Parses the QC outputs (FastQC, Picard, STAR logs, etc.)
- Writes everything into a central store—I started with a simple CSV file, but you could use Postgres or Parquet files in object storage
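To give a sense of what one of these parsing tasks can look like, here's a rough sketch for STAR's Log.final.out. The field mapping, metric names, and helper name are illustrative rather than my production code, but the rows it emits follow the schema above:

import csv
from datetime import datetime, timezone
from pathlib import Path

# Which lines of STAR's Log.final.out to keep, and the metric names to
# record them under (these metric names are just the convention I use)
STAR_FIELDS = {
    "Number of input reads": "total_reads",
    "Uniquely mapped reads %": "pct_uniquely_mapped",
    "% of reads mapped to multiple loci": "pct_multi_mapped",
}

def collect_star_metrics(log_path, sample_id, batch_id, tool_version, out_csv):
    """Parse STAR's Log.final.out and append rows to the central QC store."""
    rows = []
    for line in Path(log_path).read_text().splitlines():
        if "|" not in line:
            continue
        key, value = [part.strip() for part in line.split("|", 1)]
        if key in STAR_FIELDS:
            rows.append({
                "sample_id": sample_id,
                "batch_id": batch_id,
                "assay_type": "rna_seq",
                "pipeline_step": "alignment",
                "tool_name": "STAR",
                "tool_version": tool_version,
                "metric_name": STAR_FIELDS[key],
                "metric_value": float(value.rstrip("%")),
                "run_timestamp": datetime.now(timezone.utc).isoformat(),
            })
    if not rows:
        return
    # Append to the central CSV, writing the header only when the file is new
    write_header = not Path(out_csv).exists()
    with open(out_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)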
2. QC Anomaly Detector
This is where it gets interesting. The goal is to learn what "normal" looks like for each metric and flag when something's off.
I went with a simple but robust approach using median and MAD (Median Absolute Deviation). Why? Because QC metrics can be messy—outliers in your training data shouldn't break your baseline. Standard deviation gets thrown off by outliers, but median and MAD are much more resilient.
Here's how it works:
- Group metrics by (assay_type, pipeline_step, metric_name), so alignment metrics are compared to other alignment metrics, not trimming metrics
- For each group, compute the median m and the median absolute deviation (MAD)
- For a new metric value x, compute a robust z-score: z = (x - m) / (1.4826 * MAD + ε), where the 1.4826 factor scales MAD to be comparable to a standard deviation for normally distributed data, and ε guards against a zero MAD
- Flag anything where |z| > 3 (or whatever threshold works for you)
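To make the scoring concrete, here's a toy example with made-up numbers for a pct_mapped group:

import numpy as np

# Eight historical pct_mapped values for one (assay, step, metric) group
history = np.array([92.1, 93.4, 91.8, 94.0, 92.7, 93.1, 90.9, 93.8])
m = np.median(history)                  # 92.9
mad = np.median(np.abs(history - m))    # 0.85

x = 78.5                                # new sample's pct_mapped
z = (x - m) / (1.4826 * mad + 1e-6)
print(round(z, 1))                      # ~ -11.4, well past the |z| > 3 cutoff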
You could get fancier with per-batch comparisons or ML-based approaches like IsolationForest, but honestly, this simple method has worked really well for me. Start simple, iterate later.
3. Explanation & Remediation Recommender (Agentic Layer)
This is the part that makes it feel "agentic"—the system doesn't just say "something's wrong," it figures out what probably went wrong and suggests what to do about it.
For v1, I kept it simple with rule-based logic. You could add an LLM later (v1.5?), but honestly, rules work great when you have domain knowledge. The idea is to map (metric_name, direction) to likely causes and suggested actions.
Here are some examples from what I've built:
- pct_mapped very low → Usually means wrong reference genome, contamination, or sample swap. I have it suggest: verify the reference build, check the sample sheet, and re-run alignment with stricter QC if needed.
- duplication_rate very high → Often indicates low input, over-amplification, or library prep issues. The system recommends reviewing the library prep batch and considering excluding the sample or re-sequencing.
- pct_rRNA very high → Classic sign of failed rRNA depletion. It flags the batch and suggests reviewing the wet-lab protocol, possibly dropping or re-running that subset.
Implementation: Code Examples
Baseline Computation and Anomaly Detection
Here's the core code I'm using. It's pretty straightforward:
import pandas as pd
import numpy as np
Z_THRESHOLD = 3.0
def compute_baselines(qc_history: pd.DataFrame):
"""
Compute robust baselines (median and MAD) for each metric group.
Args:
qc_history: DataFrame with columns [assay_type, pipeline_step,
metric_name, metric_value]
Returns:
DataFrame with baselines (median, mad) per group
"""
baselines = []
grouped = qc_history.groupby(["assay_type", "pipeline_step", "metric_name"])
for (assay, step, metric), df in grouped:
values = df["metric_value"].dropna().values
if len(values) < 20:
continue # not enough history yet
median = np.median(values)
mad = np.median(np.abs(values - median)) or 1e-6 # avoid zero
baselines.append({
"assay_type": assay,
"pipeline_step": step,
"metric_name": metric,
"median": median,
"mad": mad
})
return pd.DataFrame(baselines)
def detect_anomalies(qc_new: pd.DataFrame, baselines: pd.DataFrame):
"""
Detect anomalies in new QC metrics using robust z-scores.
Args:
qc_new: New QC metrics to check
baselines: Baseline statistics from compute_baselines()
Returns:
DataFrame of detected anomalies
"""
merged = qc_new.merge(
baselines,
on=["assay_type", "pipeline_step", "metric_name"],
how="left"
)
def score(row):
if pd.isna(row["median"]):
return np.nan
z = (row["metric_value"] - row["median"]) / (1.4826 * row["mad"])
return z
merged["z_score"] = merged.apply(score, axis=1)
merged["is_anomaly"] = merged["z_score"].abs() > Z_THRESHOLD
return merged[merged["is_anomaly"]]
Rule-Based Explanation and Remediation
The agentic part uses simple rules based on domain knowledge. Here's how I set it up:
RULES = [
{
"metric_name": "pct_mapped",
"condition": lambda row: row["metric_value"] < 60, # arbitrary example
"likely_cause": "Low mapping rate; potential reference mismatch or contamination.",
"recommendation": "Verify reference genome/build, check sample sheet integrity, and re-run alignment if misconfiguration is found."
},
{
"metric_name": "duplication_rate",
"condition": lambda row: row["metric_value"] > 0.8,
"likely_cause": "High duplication suggests library complexity issues.",
"recommendation": "Review library prep batch and consider excluding or re-sequencing affected samples."
},
{
"metric_name": "pct_rRNA",
"condition": lambda row: row["metric_value"] > 0.3,
"likely_cause": "High rRNA content suggests failed rRNA depletion.",
"recommendation": "Flag batch, review wet-lab protocol, possibly drop/re-run subset."
},
# Add more rules as needed
]
def explain_and_recommend(anomalies: pd.DataFrame):
"""
Generate explanations and remediation recommendations for anomalies.
Args:
anomalies: DataFrame of detected anomalies
Returns:
DataFrame with explanations and recommendations
"""
explanations = []
for _, row in anomalies.iterrows():
for rule in RULES:
if row["metric_name"] == rule["metric_name"] and rule["condition"](row):
explanations.append({
"sample_id": row["sample_id"],
"batch_id": row.get("batch_id"),
"assay_type": row["assay_type"],
"pipeline_step": row["pipeline_step"],
"metric_name": row["metric_name"],
"metric_value": row["metric_value"],
"z_score": row["z_score"],
"likely_cause": rule["likely_cause"],
"recommendation": rule["recommendation"],
"tool_name": row.get("tool_name"),
"tool_version": row.get("tool_version")
})
break
return pd.DataFrame(explanations)
Putting It All Together
Here's how I wired everything together:
def run_qc_agent(qc_history_path: str, qc_new_path: str, output_path: str):
"""
Main function to run the QC anomaly detection agent.
Args:
qc_history_path: Path to historical QC metrics
qc_new_path: Path to new QC metrics to check
output_path: Path to save anomaly report
"""
# Load data
qc_history = pd.read_csv(qc_history_path)
qc_new = pd.read_csv(qc_new_path)
# Compute baselines
print("Computing baselines from historical data...")
baselines = compute_baselines(qc_history)
# Detect anomalies
print("Detecting anomalies...")
anomalies = detect_anomalies(qc_new, baselines)
# Generate explanations and recommendations
print("Generating recommendations...")
recommendations = explain_and_recommend(anomalies)
# Save report
recommendations.to_csv(output_path, index=False)
# Optionally: generate markdown report
generate_markdown_report(recommendations, output_path.replace('.csv', '.md'))
return recommendations
def generate_markdown_report(recommendations: pd.DataFrame, output_path: str):
"""Generate a human-readable markdown report."""
with open(output_path, 'w') as f:
f.write("# QC Anomaly Detection Report\n\n")
f.write(f"Generated: {pd.Timestamp.now()}\n\n")
f.write(f"Total anomalies detected: {len(recommendations)}\n\n")
for _, rec in recommendations.iterrows():
f.write(f"## Sample: {rec['sample_id']}\n\n")
f.write(f"- **Batch:** {rec.get('batch_id', 'N/A')}\n")
f.write(f"- **Pipeline Step:** {rec['pipeline_step']}\n")
f.write(f"- **Metric:** {rec['metric_name']} = {rec['metric_value']:.2f}\n")
f.write(f"- **Z-Score:** {rec['z_score']:.2f}\n")
f.write(f"- **Tool:** {rec.get('tool_name', 'N/A')} {rec.get('tool_version', '')}\n\n")
f.write(f"**Likely Cause:** {rec['likely_cause']}\n\n")
f.write(f"**Recommendation:** {rec['recommendation']}\n\n")
f.write("---\n\n")
Integration with Airflow
To hook this into the existing pipeline, I added a task at the end of the DAG. Here's what it looks like:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def qc_anomaly_detection_task(**context):
"""Airflow task to run QC anomaly detection."""
from qc_agent import run_qc_agent
# Get paths from context or config
qc_history_path = "/path/to/qc_metrics_history.csv"
qc_new_path = "/path/to/qc_metrics_new.csv"
output_path = f"/path/to/reports/qc_report_{datetime.now().strftime('%Y%m%d')}.csv"
recommendations = run_qc_agent(qc_history_path, qc_new_path, output_path)
# Optionally send alerts if critical anomalies found
if len(recommendations) > 0:
send_alert(recommendations)
return output_path
# Add to your DAG
qc_check = PythonOperator(
task_id='qc_anomaly_detection',
python_callable=qc_anomaly_detection_task,
dag=dag
)
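The send_alert call above isn't defined anywhere in this post. Here's one minimal possibility, assuming a Slack incoming webhook configured through an environment variable (both the helper and the variable name are placeholders, not part of my actual setup):

import os
import requests  # assumes requests is available in the Airflow environment

SLACK_WEBHOOK_URL = os.environ.get("QC_SLACK_WEBHOOK_URL", "")  # placeholder

def send_alert(recommendations):
    """Post a short summary of flagged samples to a Slack channel."""
    if not SLACK_WEBHOOK_URL:
        return  # alerting not configured; the CSV/markdown report still gets written
    lines = [f"QC agent flagged {len(recommendations)} anomalies:"]
    for _, rec in recommendations.head(10).iterrows():  # cap the message length
        lines.append(
            f"- {rec['sample_id']}: {rec['metric_name']}={rec['metric_value']:.2f} "
            f"(z={rec['z_score']:.1f})"
        )
    requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)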
Future Enhancements
The current system is working well, but there's always room to improve. Here are some directions I'm exploring:
- ML-based detection: Experimenting with IsolationForest or autoencoders for more sophisticated anomaly detection. The simple statistical approach works great, but ML might catch more subtle patterns (see the sketch after this list).
- LLM integration: Using an LLM to generate more nuanced explanations and recommendations could be powerful, especially for complex multi-metric anomalies. The rule-based system handles most cases, but there's potential for more contextual reasoning.
- Batch-level analysis: Comparing entire batch distributions against historical patterns could catch systematic issues that individual sample analysis might miss—like batch effects or protocol drift.
- Automated remediation: For well-understood anomaly types, automatically triggering reruns or parameter adjustments could save even more time. This needs careful validation first, but it's a natural next step.
- Multi-assay support: The framework should generalize to other omics types (WGS, ATAC-seq, etc.). The main work is defining the right QC metrics for each assay type and building the corresponding rules.
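As a rough sketch of the IsolationForest idea from the first bullet: nothing here is wired into the pipeline yet, and the pivot-to-wide step and function name are just for illustration. The idea is to score each sample's full metric vector rather than one metric at a time:

import pandas as pd
from sklearn.ensemble import IsolationForest  # assumes scikit-learn is installed

def detect_anomalies_iforest(qc_history, qc_new, contamination=0.05):
    """Multivariate detection: one row per sample, one column per metric."""
    hist = qc_history.pivot_table(index="sample_id", columns="metric_name",
                                  values="metric_value").dropna()
    new = qc_new.pivot_table(index="sample_id", columns="metric_name",
                             values="metric_value")
    new = new.reindex(columns=hist.columns).dropna()

    model = IsolationForest(contamination=contamination, random_state=0)
    model.fit(hist.values)

    flags = model.predict(new.values)   # -1 = anomaly, 1 = normal
    return new[flags == -1].reset_index()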
Conclusion
Building this system has been well worth it: it's cut down on debugging time and caught issues that would otherwise have gone unnoticed. The modular design makes it easy to extend, and I've already added new rules as I've encountered edge cases.
If you're building something similar, I'd recommend starting simple. Focus on one assay type, collect the metrics that actually matter, and build robust baseline models. Once that foundation is working, adding the agentic reasoning layer is pretty straightforward, and you can iterate and improve from there. A few takeaways:
- Robust statistics (median, MAD) handle messy QC data better than mean/std
- Rule-based explanations are often sufficient—you don't need ML/LLM from day one
- Integrating into existing pipelines is usually easier than building from scratch
- The real value is in actionable recommendations, not just flagging problems