Autonomous QC and Anomaly Detection for Omics Pipelines
Introduction
I've been working on building an autonomous QC layer for the omics pipelines I work with, and it's been a game-changer. The system continuously monitors pipeline QC metrics, learns what "normal" looks like for each assay and pipeline step, and automatically flags outliers. What I love about it is that it doesn't just detect problems—it traces anomalies back to specific tools, parameters, or batches and actually recommends what to do about them (reruns, parameter tweaks, or sample exclusion). This has cut down debugging time significantly and caught issues that would have otherwise slipped through.
I wanted to share how I built this, focusing on a lightweight v1 that you can actually implement. I'll walk through the core components, show you the code I used, and explain the design decisions that made it work in practice.
System Overview
The system breaks down into three main pieces that work together:
- QC Metrics Collector - Pulls together all the QC metrics from your pipeline runs into one place
- QC Anomaly Detector - Learns what normal looks like and flags when things are off
- Explanation & Remediation Recommender - This is the "agentic" part—it figures out what went wrong and suggests fixes
QC Metrics: What to Collect
For RNA-seq, I focused on the metrics that actually matter. Here's what I'm tracking:
Per Sample Metrics
- Total reads, mapped reads, % mapped
- % duplicates
- % rRNA / % mitochondrial reads
- Insert size mean / std
- GC content
- Number of detected genes
- 5'–3' coverage bias (if available)
- Library complexity estimates (optional)
Per Pipeline Step Metrics
- Runtime, exit status, memory usage
- Tool name + version (e.g., STAR 2.7.10a, fastp 0.23.3)
System Design: Core Components
1. QC Metrics Collector
The first challenge was getting all the QC metrics into one place. The pipeline outputs metrics in different formats (FastQC HTML, Picard text files, STAR logs), so I needed a way to normalize everything.
I ended up with a simple schema that captures what I need:
- sample_id
- batch_id
- assay_type (e.g., rna_seq)
- pipeline_step (e.g., alignment, trimming, qc_summary)
- tool_name, tool_version
- metric_name (e.g., pct_mapped, duplication_rate)
- metric_value (numeric)
- run_timestamp
In practice, I added a small task at the end of each major step in the Airflow DAG that:
- Parses the QC outputs (FastQC, Picard, STAR logs, etc.)
- Writes everything into a central store—I started with a simple CSV file, but you could use Postgres or Parquet files in object storage
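To give a sense of what one of these parsing tasks can look like, here's a rough sketch for STAR's Log.final.out. The field mapping, metric names, and helper name are illustrative rather than my production code, but the rows it emits follow the schema above:

import csv
from datetime import datetime, timezone
from pathlib import Path

# Which lines of STAR's Log.final.out to keep, and the metric names to
# record them under (these metric names are just the convention I use)
STAR_FIELDS = {
    "Number of input reads": "total_reads",
    "Uniquely mapped reads %": "pct_uniquely_mapped",
    "% of reads mapped to multiple loci": "pct_multi_mapped",
}

def collect_star_metrics(log_path, sample_id, batch_id, tool_version, out_csv):
    """Parse STAR's Log.final.out and append rows to the central QC store."""
    rows = []
    for line in Path(log_path).read_text().splitlines():
        if "|" not in line:
            continue
        key, value = [part.strip() for part in line.split("|", 1)]
        if key in STAR_FIELDS:
            rows.append({
                "sample_id": sample_id,
                "batch_id": batch_id,
                "assay_type": "rna_seq",
                "pipeline_step": "alignment",
                "tool_name": "STAR",
                "tool_version": tool_version,
                "metric_name": STAR_FIELDS[key],
                "metric_value": float(value.rstrip("%")),
                "run_timestamp": datetime.now(timezone.utc).isoformat(),
            })
    if not rows:
        return
    # Append to the central CSV, writing the header only when the file is new
    write_header = not Path(out_csv).exists()
    with open(out_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)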
2. QC Anomaly Detector
This is where it gets interesting. The goal is to learn what "normal" looks like for each metric and flag when something's off.
I went with a simple but robust approach using median and MAD (Median Absolute Deviation). Why? Because QC metrics can be messy—outliers in your training data shouldn't break your baseline. Standard deviation gets thrown off by outliers, but median and MAD are much more resilient.
Here's how it works:
- Group metrics by (assay_type, pipeline_step, metric_name), so alignment metrics are compared to other alignment metrics, not trimming metrics
- For each group, compute the median m and the median absolute deviation (MAD)
- For a new metric value x, compute a robust z-score: z = (x - m) / (1.4826 * MAD + ε), where the 1.4826 factor scales MAD to be comparable to a standard deviation for normally distributed data, and ε guards against a zero MAD
- Flag anything where |z| > 3 (or whatever threshold works for you)
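To make the scoring concrete, here's a toy example with made-up numbers for a pct_mapped group:

import numpy as np

# Eight historical pct_mapped values for one (assay, step, metric) group
history = np.array([92.1, 93.4, 91.8, 94.0, 92.7, 93.1, 90.9, 93.8])
m = np.median(history)                  # 92.9
mad = np.median(np.abs(history - m))    # 0.85

x = 78.5                                # new sample's pct_mapped
z = (x - m) / (1.4826 * mad + 1e-6)
print(round(z, 1))                      # ~ -11.4, well past the |z| > 3 cutoff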
You could get fancier with per-batch comparisons or ML-based approaches like IsolationForest, but honestly, this simple method has worked really well for me. Start simple, iterate later.
3. Explanation & Remediation Recommender (Agentic Layer)
This is the part that makes it feel "agentic"—the system doesn't just say "something's wrong," it figures out what probably went wrong and suggests what to do about it.
For v1, I kept it simple with rule-based logic. You could add an LLM later (v1.5?), but honestly, rules work great when you have domain knowledge. The idea is to map (metric_name, direction) to likely causes and suggested actions.
Here are some examples from what I've built:
- pct_mapped very low → Usually means wrong reference genome, contamination, or sample swap. I have it suggest: verify the reference build, check the sample sheet, and re-run alignment with stricter QC if needed.
- duplication_rate very high → Often indicates low input, over-amplification, or library prep issues. The system recommends reviewing the library prep batch and considering excluding the sample or re-sequencing.
- pct_rRNA very high → Classic sign of failed rRNA depletion. It flags the batch and suggests reviewing the wet-lab protocol, possibly dropping or re-running that subset.
Implementation: Code Examples
Baseline Computation and Anomaly Detection
Here's the core code I'm using. It's pretty straightforward:
import pandas as pd
import numpy as np
Z_THRESHOLD = 3.0
def compute_baselines(qc_history: pd.DataFrame):
"""
Compute robust baselines (median and MAD) for each metric group.
Args:
qc_history: DataFrame with columns [assay_type, pipeline_step,
metric_name, metric_value]
Returns:
DataFrame with baselines (median, mad) per group
"""
baselines = []
grouped = qc_history.groupby(["assay_type", "pipeline_step", "metric_name"])
for (assay, step, metric), df in grouped:
values = df["metric_value"].dropna().values
if len(values) < 20:
continue # not enough history yet
median = np.median(values)
mad = np.median(np.abs(values - median)) or 1e-6 # avoid zero
baselines.append({
"assay_type": assay,
"pipeline_step": step,
"metric_name": metric,
"median": median,
"mad": mad
})
return pd.DataFrame(baselines)
def detect_anomalies(qc_new: pd.DataFrame, baselines: pd.DataFrame):
"""
Detect anomalies in new QC metrics using robust z-scores.
Args:
qc_new: New QC metrics to check
baselines: Baseline statistics from compute_baselines()
Returns:
DataFrame of detected anomalies
"""
merged = qc_new.merge(
baselines,
on=["assay_type", "pipeline_step", "metric_name"],
how="left"
)
def score(row):
if pd.isna(row["median"]):
return np.nan
z = (row["metric_value"] - row["median"]) / (1.4826 * row["mad"])
return z
merged["z_score"] = merged.apply(score, axis=1)
merged["is_anomaly"] = merged["z_score"].abs() > Z_THRESHOLD
return merged[merged["is_anomaly"]]
Rule-Based Explanation and Remediation
The agentic part uses simple rules based on domain knowledge. Here's how I set it up:
RULES = [
{
"metric_name": "pct_mapped",
"condition": lambda row: row["metric_value"] < 60, # arbitrary example
"likely_cause": "Low mapping rate; potential reference mismatch or contamination.",
"recommendation": "Verify reference genome/build, check sample sheet integrity, and re-run alignment if misconfiguration is found."
},
{
"metric_name": "duplication_rate",
"condition": lambda row: row["metric_value"] > 0.8,
"likely_cause": "High duplication suggests library complexity issues.",
"recommendation": "Review library prep batch and consider excluding or re-sequencing affected samples."
},
{
"metric_name": "pct_rRNA",
"condition": lambda row: row["metric_value"] > 0.3,
"likely_cause": "High rRNA content suggests failed rRNA depletion.",
"recommendation": "Flag batch, review wet-lab protocol, possibly drop/re-run subset."
},
# Add more rules as needed
]
def explain_and_recommend(anomalies: pd.DataFrame):
"""
Generate explanations and remediation recommendations for anomalies.
Args:
anomalies: DataFrame of detected anomalies
Returns:
DataFrame with explanations and recommendations
"""
explanations = []
for _, row in anomalies.iterrows():
for rule in RULES:
if row["metric_name"] == rule["metric_name"] and rule["condition"](row):
explanations.append({
"sample_id": row["sample_id"],
"batch_id": row.get("batch_id"),
"assay_type": row["assay_type"],
"pipeline_step": row["pipeline_step"],
"metric_name": row["metric_name"],
"metric_value": row["metric_value"],
"z_score": row["z_score"],
"likely_cause": rule["likely_cause"],
"recommendation": rule["recommendation"],
"tool_name": row.get("tool_name"),
"tool_version": row.get("tool_version")
})
break
return pd.DataFrame(explanations)
Putting It All Together
Here's how I wired everything together:
def run_qc_agent(qc_history_path: str, qc_new_path: str, output_path: str):
"""
Main function to run the QC anomaly detection agent.
Args:
qc_history_path: Path to historical QC metrics
qc_new_path: Path to new QC metrics to check
output_path: Path to save anomaly report
"""
# Load data
qc_history = pd.read_csv(qc_history_path)
qc_new = pd.read_csv(qc_new_path)
# Compute baselines
print("Computing baselines from historical data...")
baselines = compute_baselines(qc_history)
# Detect anomalies
print("Detecting anomalies...")
anomalies = detect_anomalies(qc_new, baselines)
# Generate explanations and recommendations
print("Generating recommendations...")
recommendations = explain_and_recommend(anomalies)
# Save report
recommendations.to_csv(output_path, index=False)
# Optionally: generate markdown report
generate_markdown_report(recommendations, output_path.replace('.csv', '.md'))
return recommendations
def generate_markdown_report(recommendations: pd.DataFrame, output_path: str):
"""Generate a human-readable markdown report."""
with open(output_path, 'w') as f:
f.write("# QC Anomaly Detection Report\n\n")
f.write(f"Generated: {pd.Timestamp.now()}\n\n")
f.write(f"Total anomalies detected: {len(recommendations)}\n\n")
for _, rec in recommendations.iterrows():
f.write(f"## Sample: {rec['sample_id']}\n\n")
f.write(f"- **Batch:** {rec.get('batch_id', 'N/A')}\n")
f.write(f"- **Pipeline Step:** {rec['pipeline_step']}\n")
f.write(f"- **Metric:** {rec['metric_name']} = {rec['metric_value']:.2f}\n")
f.write(f"- **Z-Score:** {rec['z_score']:.2f}\n")
f.write(f"- **Tool:** {rec.get('tool_name', 'N/A')} {rec.get('tool_version', '')}\n\n")
f.write(f"**Likely Cause:** {rec['likely_cause']}\n\n")
f.write(f"**Recommendation:** {rec['recommendation']}\n\n")
f.write("---\n\n")
Integration with Airflow
To hook this into the existing pipeline, I added a task at the end of the DAG. Here's what it looks like:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def qc_anomaly_detection_task(**context):
"""Airflow task to run QC anomaly detection."""
from qc_agent import run_qc_agent
# Get paths from context or config
qc_history_path = "/path/to/qc_metrics_history.csv"
qc_new_path = "/path/to/qc_metrics_new.csv"
output_path = f"/path/to/reports/qc_report_{datetime.now().strftime('%Y%m%d')}.csv"
recommendations = run_qc_agent(qc_history_path, qc_new_path, output_path)
# Optionally send alerts if critical anomalies found
if len(recommendations) > 0:
send_alert(recommendations)
return output_path
# Add to your DAG
qc_check = PythonOperator(
task_id='qc_anomaly_detection',
python_callable=qc_anomaly_detection_task,
dag=dag
)
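The send_alert call above isn't defined anywhere in this post. Here's one minimal possibility, assuming a Slack incoming webhook configured through an environment variable (both the helper and the variable name are placeholders, not part of my actual setup):

import os
import requests  # assumes requests is available in the Airflow environment

SLACK_WEBHOOK_URL = os.environ.get("QC_SLACK_WEBHOOK_URL", "")  # placeholder

def send_alert(recommendations):
    """Post a short summary of flagged samples to a Slack channel."""
    if not SLACK_WEBHOOK_URL:
        return  # alerting not configured; the CSV/markdown report still gets written
    lines = [f"QC agent flagged {len(recommendations)} anomalies:"]
    for _, rec in recommendations.head(10).iterrows():  # cap the message length
        lines.append(
            f"- {rec['sample_id']}: {rec['metric_name']}={rec['metric_value']:.2f} "
            f"(z={rec['z_score']:.1f})"
        )
    requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)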
Future Enhancements
The current system is working well, but there's always room to improve. Here are some directions I'm exploring:
- ML-based detection: Experimenting with IsolationForest or autoencoders for more sophisticated anomaly detection. The simple statistical approach works great, but ML might catch more subtle patterns (see the sketch after this list).
- LLM integration: Using an LLM to generate more nuanced explanations and recommendations could be powerful, especially for complex multi-metric anomalies. The rule-based system handles most cases, but there's potential for more contextual reasoning.
- Batch-level analysis: Comparing entire batch distributions against historical patterns could catch systematic issues that individual sample analysis might miss—like batch effects or protocol drift.
- Automated remediation: For well-understood anomaly types, automatically triggering reruns or parameter adjustments could save even more time. This needs careful validation first, but it's a natural next step.
- Multi-assay support: The framework should generalize to other omics types (WGS, ATAC-seq, etc.). The main work is defining the right QC metrics for each assay type and building the corresponding rules.
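As a rough sketch of the IsolationForest idea from the first bullet: nothing here is wired into the pipeline yet, and the pivot-to-wide step and function name are just for illustration. The idea is to score each sample's full metric vector rather than one metric at a time:

import pandas as pd
from sklearn.ensemble import IsolationForest  # assumes scikit-learn is installed

def detect_anomalies_iforest(qc_history, qc_new, contamination=0.05):
    """Multivariate detection: one row per sample, one column per metric."""
    hist = qc_history.pivot_table(index="sample_id", columns="metric_name",
                                  values="metric_value").dropna()
    new = qc_new.pivot_table(index="sample_id", columns="metric_name",
                             values="metric_value")
    new = new.reindex(columns=hist.columns).dropna()

    model = IsolationForest(contamination=contamination, random_state=0)
    model.fit(hist.values)

    flags = model.predict(new.values)   # -1 = anomaly, 1 = normal
    return new[flags == -1].reset_index()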
Conclusion
Building this system has been well worth it: it's cut down on debugging time and caught issues that would otherwise have gone unnoticed. The modular design makes it easy to extend, and I've already added new rules as I've encountered edge cases.
If you're building something similar, I'd recommend starting simple. Focus on one assay type, collect the metrics that actually matter, and build robust baseline models. Once that foundation is working, adding the agentic reasoning layer is pretty straightforward, and you can iterate and improve from there. A few takeaways:
- Robust statistics (median, MAD) handle messy QC data better than mean/std
- Rule-based explanations are often sufficient—you don't need ML/LLM from day one
- Integrating into existing pipelines is usually easier than building from scratch
- The real value is in actionable recommendations, not just flagging problems