Clustering and Merging Queries ============================== This section covers patterns for clustering overlapping intervals and merging them into unified regions using GIQL's aggregation operators. .. contents:: :local: :depth: 2 Basic Clustering ---------------- Assign Cluster IDs ~~~~~~~~~~~~~~~~~~ Assign unique cluster IDs to groups of overlapping intervals: .. code-block:: python cursor = engine.execute(""" SELECT *, CLUSTER(interval) AS cluster_id FROM features ORDER BY chromosome, start_pos """) **Use case:** Group overlapping peaks or annotations for downstream analysis. View Cluster Assignments ~~~~~~~~~~~~~~~~~~~~~~~~ See which features belong to which cluster: .. code-block:: python cursor = engine.execute(""" SELECT cluster_id, chromosome, name, start_pos, end_pos FROM ( SELECT *, CLUSTER(interval) AS cluster_id FROM features ) ORDER BY cluster_id, start_pos """) **Use case:** Inspect clustering results to understand feature groupings. Distance-Based Clustering ------------------------- Cluster with Gap Tolerance ~~~~~~~~~~~~~~~~~~~~~~~~~~ Cluster intervals that are within a specified distance of each other: .. code-block:: python cursor = engine.execute(""" SELECT *, CLUSTER(interval, 1000) AS cluster_id FROM features ORDER BY chromosome, start_pos """) **Use case:** Group nearby features even if they don't directly overlap (e.g., cluster peaks within 1kb of each other). Variable Distance Thresholds ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Experiment with different clustering distances: .. code-block:: python # Tight clustering (overlapping only) cursor = engine.execute(""" SELECT *, CLUSTER(interval, 0) AS tight_cluster FROM features """) # Medium clustering (within 500bp) cursor = engine.execute(""" SELECT *, CLUSTER(interval, 500) AS medium_cluster FROM features """) # Loose clustering (within 5kb) cursor = engine.execute(""" SELECT *, CLUSTER(interval, 5000) AS loose_cluster FROM features """) **Use case:** Compare clustering at different resolutions for sensitivity analysis. Strand-Specific Clustering -------------------------- Cluster by Strand ~~~~~~~~~~~~~~~~~ Cluster intervals separately for each strand: .. code-block:: python cursor = engine.execute(""" SELECT *, CLUSTER(interval, stranded=true) AS cluster_id FROM features ORDER BY chromosome, strand, start_pos """) **Use case:** Maintain strand separation when clustering transcripts or strand-specific regulatory elements. Strand-Specific with Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Combine strand awareness with distance tolerance: .. code-block:: python cursor = engine.execute(""" SELECT *, CLUSTER(interval, 1000, stranded=true) AS cluster_id FROM features ORDER BY chromosome, strand, start_pos """) **Use case:** Cluster nearby same-strand features while keeping opposite strands separate. Cluster Statistics ------------------ Count Features per Cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~ Calculate how many features are in each cluster: .. code-block:: python cursor = engine.execute(""" WITH clustered AS ( SELECT *, CLUSTER(interval) AS cluster_id FROM features ) SELECT cluster_id, chromosome, COUNT(*) AS feature_count, MIN(start_pos) AS cluster_start, MAX(end_pos) AS cluster_end FROM clustered GROUP BY cluster_id, chromosome ORDER BY chromosome, cluster_start """) **Use case:** Identify cluster sizes and boundaries. Filter by Cluster Size ~~~~~~~~~~~~~~~~~~~~~~ Find clusters with a minimum number of features: .. code-block:: python cursor = engine.execute(""" WITH clustered AS ( SELECT *, CLUSTER(interval) AS cluster_id FROM features ), cluster_sizes AS ( SELECT cluster_id, COUNT(*) AS size FROM clustered GROUP BY cluster_id ) SELECT c.* FROM clustered c JOIN cluster_sizes s ON c.cluster_id = s.cluster_id WHERE s.size >= 3 ORDER BY c.cluster_id, c.start_pos """) **Use case:** Focus on regions with multiple overlapping features (hotspots). Cluster Summary Statistics ~~~~~~~~~~~~~~~~~~~~~~~~~~ Calculate statistics for each cluster: .. code-block:: python cursor = engine.execute(""" WITH clustered AS ( SELECT *, CLUSTER(interval) AS cluster_id FROM features ) SELECT cluster_id, chromosome, COUNT(*) AS feature_count, MIN(start_pos) AS cluster_start, MAX(end_pos) AS cluster_end, MAX(end_pos) - MIN(start_pos) AS cluster_span, AVG(score) AS avg_score, MAX(score) AS max_score FROM clustered GROUP BY cluster_id, chromosome ORDER BY feature_count DESC """) **Use case:** Rank clusters by size, span, or aggregate scores. Basic Merging ------------- Merge Overlapping Intervals ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Combine overlapping intervals into unified regions: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval) FROM features """) **Use case:** Create non-overlapping consensus regions from redundant annotations. Merge with Distance ~~~~~~~~~~~~~~~~~~~ Merge intervals within a specified distance: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval, 1000) FROM features """) **Use case:** Create broader regions by joining nearby features. Strand-Specific Merge ~~~~~~~~~~~~~~~~~~~~~ Merge intervals separately by strand: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval, stranded=true) FROM features """) **Use case:** Create strand-aware consensus regions. Merge with Aggregations ----------------------- Count Merged Features ~~~~~~~~~~~~~~~~~~~~~ Track how many features were merged into each region: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval), COUNT(*) AS feature_count FROM features """) **Use case:** Understand the complexity of each merged region. Aggregate Scores ~~~~~~~~~~~~~~~~ Calculate statistics for merged regions: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval), COUNT(*) AS feature_count, AVG(score) AS avg_score, MAX(score) AS max_score, SUM(score) AS total_score FROM features """) **Use case:** Summarize signal intensity across merged regions. Collect Feature Names ~~~~~~~~~~~~~~~~~~~~~ List the names of features that were merged: .. code-block:: python cursor = engine.execute(""" SELECT MERGE(interval), STRING_AGG(name, ',') AS merged_features FROM features """) **Use case:** Track provenance of merged regions. Coverage Calculations --------------------- Total Base Pair Coverage ~~~~~~~~~~~~~~~~~~~~~~~~ Calculate total genomic coverage after merging: .. code-block:: python cursor = engine.execute(""" WITH merged AS ( SELECT MERGE(interval) FROM features ) SELECT SUM(end_pos - start_pos) AS total_coverage_bp FROM merged """) **Use case:** Calculate the total genome fraction covered by features. Coverage per Chromosome ~~~~~~~~~~~~~~~~~~~~~~~ Calculate coverage for each chromosome: .. code-block:: python cursor = engine.execute(""" WITH merged AS ( SELECT MERGE(interval) FROM features ) SELECT chromosome, COUNT(*) AS region_count, SUM(end_pos - start_pos) AS coverage_bp FROM merged GROUP BY chromosome ORDER BY chromosome """) **Use case:** Compare feature density across chromosomes. Coverage Reduction ~~~~~~~~~~~~~~~~~~ Compare raw vs merged coverage: .. code-block:: python cursor = engine.execute(""" WITH raw_stats AS ( SELECT COUNT(*) AS raw_count, SUM(end_pos - start_pos) AS raw_bp FROM features ), merged_stats AS ( SELECT COUNT(*) AS merged_count, SUM(end_pos - start_pos) AS merged_bp FROM (SELECT MERGE(interval) FROM features) ) SELECT raw_count, merged_count, raw_bp, merged_bp, ROUND(100.0 * merged_bp / raw_bp, 2) AS coverage_retained_pct FROM raw_stats, merged_stats """) **Use case:** Quantify the redundancy in your feature set. Advanced Patterns ----------------- Cluster Then Merge ~~~~~~~~~~~~~~~~~~ First cluster features, then analyze each cluster: .. code-block:: python cursor = engine.execute(""" WITH clustered AS ( SELECT *, CLUSTER(interval) AS cluster_id FROM features ) SELECT cluster_id, MIN(chromosome) AS chromosome, MIN(start_pos) AS start_pos, MAX(end_pos) AS end_pos, COUNT(*) AS feature_count, STRING_AGG(name, ',') AS features FROM clustered GROUP BY cluster_id ORDER BY chromosome, start_pos """) **Use case:** Alternative to MERGE that preserves cluster identifiers. Hierarchical Clustering ~~~~~~~~~~~~~~~~~~~~~~~ Apply multiple clustering levels: .. code-block:: python cursor = engine.execute(""" WITH level1 AS ( SELECT *, CLUSTER(interval, 0) AS cluster_l1 FROM features ), level2 AS ( SELECT *, CLUSTER(interval, 1000) AS cluster_l2 FROM level1 ) SELECT cluster_l1, cluster_l2, chromosome, name, start_pos, end_pos FROM level2 ORDER BY cluster_l2, cluster_l1, start_pos """) **Use case:** Analyze feature relationships at multiple scales.