Clustering and Merging Queries
This section covers patterns for clustering overlapping intervals and merging them into unified regions using GIQL’s aggregation operators.
Basic Clustering
Assign Cluster IDs
Assign unique cluster IDs to groups of overlapping intervals:
cursor = engine.execute("""
SELECT
*,
CLUSTER(interval) AS cluster_id
FROM features
ORDER BY chromosome, start_pos
""")
Use case: Group overlapping peaks or annotations for downstream analysis.
View Cluster Assignments
See which features belong to which cluster:
cursor = engine.execute("""
SELECT
cluster_id,
chromosome,
name,
start_pos,
end_pos
FROM (
SELECT *, CLUSTER(interval) AS cluster_id
FROM features
)
ORDER BY cluster_id, start_pos
""")
Use case: Inspect clustering results to understand feature groupings.
Distance-Based Clustering
Cluster with Gap Tolerance
Cluster intervals that are within a specified distance of each other:
cursor = engine.execute("""
SELECT
*,
CLUSTER(interval, 1000) AS cluster_id
FROM features
ORDER BY chromosome, start_pos
""")
Use case: Group nearby features even if they don’t directly overlap (e.g., cluster peaks within 1kb of each other).
Variable Distance Thresholds
Experiment with different clustering distances:
# Tight clustering (overlapping only)
cursor = engine.execute("""
SELECT *, CLUSTER(interval, 0) AS tight_cluster FROM features
""")
# Medium clustering (within 500bp)
cursor = engine.execute("""
SELECT *, CLUSTER(interval, 500) AS medium_cluster FROM features
""")
# Loose clustering (within 5kb)
cursor = engine.execute("""
SELECT *, CLUSTER(interval, 5000) AS loose_cluster FROM features
""")
Use case: Compare clustering at different resolutions for sensitivity analysis.
Strand-Specific Clustering
Cluster by Strand
Cluster intervals separately for each strand:
cursor = engine.execute("""
SELECT
*,
CLUSTER(interval, stranded=true) AS cluster_id
FROM features
ORDER BY chromosome, strand, start_pos
""")
Use case: Maintain strand separation when clustering transcripts or strand-specific regulatory elements.
Strand-Specific with Distance
Combine strand awareness with distance tolerance:
cursor = engine.execute("""
SELECT
*,
CLUSTER(interval, 1000, stranded=true) AS cluster_id
FROM features
ORDER BY chromosome, strand, start_pos
""")
Use case: Cluster nearby same-strand features while keeping opposite strands separate.
Cluster Statistics
Count Features per Cluster
Calculate how many features are in each cluster:
cursor = engine.execute("""
WITH clustered AS (
SELECT *, CLUSTER(interval) AS cluster_id
FROM features
)
SELECT
cluster_id,
chromosome,
COUNT(*) AS feature_count,
MIN(start_pos) AS cluster_start,
MAX(end_pos) AS cluster_end
FROM clustered
GROUP BY cluster_id, chromosome
ORDER BY chromosome, cluster_start
""")
Use case: Identify cluster sizes and boundaries.
Filter by Cluster Size
Find clusters with a minimum number of features:
cursor = engine.execute("""
WITH clustered AS (
SELECT *, CLUSTER(interval) AS cluster_id
FROM features
),
cluster_sizes AS (
SELECT cluster_id, COUNT(*) AS size
FROM clustered
GROUP BY cluster_id
)
SELECT c.*
FROM clustered c
JOIN cluster_sizes s ON c.cluster_id = s.cluster_id
WHERE s.size >= 3
ORDER BY c.cluster_id, c.start_pos
""")
Use case: Focus on regions with multiple overlapping features (hotspots).
Cluster Summary Statistics
Calculate statistics for each cluster:
cursor = engine.execute("""
WITH clustered AS (
SELECT *, CLUSTER(interval) AS cluster_id
FROM features
)
SELECT
cluster_id,
chromosome,
COUNT(*) AS feature_count,
MIN(start_pos) AS cluster_start,
MAX(end_pos) AS cluster_end,
MAX(end_pos) - MIN(start_pos) AS cluster_span,
AVG(score) AS avg_score,
MAX(score) AS max_score
FROM clustered
GROUP BY cluster_id, chromosome
ORDER BY feature_count DESC
""")
Use case: Rank clusters by size, span, or aggregate scores.
Basic Merging
Merge Overlapping Intervals
Combine overlapping intervals into unified regions:
cursor = engine.execute("""
SELECT MERGE(interval)
FROM features
""")
Use case: Create non-overlapping consensus regions from redundant annotations.
Merge with Distance
Merge intervals within a specified distance:
cursor = engine.execute("""
SELECT MERGE(interval, 1000)
FROM features
""")
Use case: Create broader regions by joining nearby features.
Strand-Specific Merge
Merge intervals separately by strand:
cursor = engine.execute("""
SELECT MERGE(interval, stranded=true)
FROM features
""")
Use case: Create strand-aware consensus regions.
Merge with Aggregations
Count Merged Features
Track how many features were merged into each region:
cursor = engine.execute("""
SELECT
MERGE(interval),
COUNT(*) AS feature_count
FROM features
""")
Use case: Understand the complexity of each merged region.
Aggregate Scores
Calculate statistics for merged regions:
cursor = engine.execute("""
SELECT
MERGE(interval),
COUNT(*) AS feature_count,
AVG(score) AS avg_score,
MAX(score) AS max_score,
SUM(score) AS total_score
FROM features
""")
Use case: Summarize signal intensity across merged regions.
Collect Feature Names
List the names of features that were merged:
cursor = engine.execute("""
SELECT
MERGE(interval),
STRING_AGG(name, ',') AS merged_features
FROM features
""")
Use case: Track provenance of merged regions.
Coverage Calculations
Total Base Pair Coverage
Calculate total genomic coverage after merging:
cursor = engine.execute("""
WITH merged AS (
SELECT MERGE(interval)
FROM features
)
SELECT SUM(end_pos - start_pos) AS total_coverage_bp
FROM merged
""")
Use case: Calculate the total genome fraction covered by features.
Coverage per Chromosome
Calculate coverage for each chromosome:
cursor = engine.execute("""
WITH merged AS (
SELECT MERGE(interval)
FROM features
)
SELECT
chromosome,
COUNT(*) AS region_count,
SUM(end_pos - start_pos) AS coverage_bp
FROM merged
GROUP BY chromosome
ORDER BY chromosome
""")
Use case: Compare feature density across chromosomes.
Coverage Reduction
Compare raw vs merged coverage:
cursor = engine.execute("""
WITH raw_stats AS (
SELECT
COUNT(*) AS raw_count,
SUM(end_pos - start_pos) AS raw_bp
FROM features
),
merged_stats AS (
SELECT
COUNT(*) AS merged_count,
SUM(end_pos - start_pos) AS merged_bp
FROM (SELECT MERGE(interval) FROM features)
)
SELECT
raw_count,
merged_count,
raw_bp,
merged_bp,
ROUND(100.0 * merged_bp / raw_bp, 2) AS coverage_retained_pct
FROM raw_stats, merged_stats
""")
Use case: Quantify the redundancy in your feature set.
Advanced Patterns
Cluster Then Merge
First cluster features, then analyze each cluster:
cursor = engine.execute("""
WITH clustered AS (
SELECT *, CLUSTER(interval) AS cluster_id
FROM features
)
SELECT
cluster_id,
MIN(chromosome) AS chromosome,
MIN(start_pos) AS start_pos,
MAX(end_pos) AS end_pos,
COUNT(*) AS feature_count,
STRING_AGG(name, ',') AS features
FROM clustered
GROUP BY cluster_id
ORDER BY chromosome, start_pos
""")
Use case: Alternative to MERGE that preserves cluster identifiers.
Hierarchical Clustering
Apply multiple clustering levels:
cursor = engine.execute("""
WITH level1 AS (
SELECT *, CLUSTER(interval, 0) AS cluster_l1
FROM features
),
level2 AS (
SELECT *, CLUSTER(interval, 1000) AS cluster_l2
FROM level1
)
SELECT
cluster_l1,
cluster_l2,
chromosome,
name,
start_pos,
end_pos
FROM level2
ORDER BY cluster_l2, cluster_l1, start_pos
""")
Use case: Analyze feature relationships at multiple scales.