Clustering and Merging

Clustering and Merging#

This section covers patterns for clustering overlapping intervals and merging them into unified regions using GIQL’s aggregation operators.

Basic Clustering#

Assign Cluster IDs#

Assign unique cluster IDs to groups of overlapping intervals:

SELECT
    *,
    CLUSTER(interval) AS cluster_id
FROM features
ORDER BY chrom, start

Use case: Group overlapping peaks or annotations for downstream analysis.

View Cluster Assignments#

See which features belong to which cluster:

SELECT
    cluster_id,
    chrom,
    name,
    start,
    end
FROM (
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM features
)
ORDER BY cluster_id, start

Use case: Inspect clustering results to understand feature groupings.

Distance-Based Clustering#

Cluster with Gap Tolerance#

Cluster intervals that are within a specified distance of each other:

SELECT
    *,
    CLUSTER(interval, 1000) AS cluster_id
FROM features
ORDER BY chrom, start

Use case: Group nearby features even if they don’t directly overlap (e.g., cluster peaks within 1kb of each other).

Variable Distance Thresholds#

Experiment with different clustering distances:

-- Tight clustering (overlapping only)
SELECT *, CLUSTER(interval, 0) AS tight_cluster FROM features

-- Medium clustering (within 500bp)
SELECT *, CLUSTER(interval, 500) AS medium_cluster FROM features

-- Loose clustering (within 5kb)
SELECT *, CLUSTER(interval, 5000) AS loose_cluster FROM features

Use case: Compare clustering at different resolutions for sensitivity analysis.

Strand-Specific Clustering#

Cluster by Strand#

Cluster intervals separately for each strand:

SELECT
    *,
    CLUSTER(interval, stranded=true) AS cluster_id
FROM features
ORDER BY chrom, strand, start

Use case: Maintain strand separation when clustering transcripts or strand-specific regulatory elements.

Strand-Specific with Distance#

Combine strand awareness with distance tolerance:

SELECT
    *,
    CLUSTER(interval, 1000, stranded=true) AS cluster_id
FROM features
ORDER BY chrom, strand, start

Use case: Cluster nearby same-strand features while keeping opposite strands separate.

Cluster Statistics#

Count Features per Cluster#

Calculate how many features are in each cluster:

WITH clustered AS (
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM features
)
SELECT
    cluster_id,
    chrom,
    COUNT(*) AS feature_count,
    MIN(start) AS cluster_start,
    MAX(end) AS cluster_end
FROM clustered
GROUP BY cluster_id, chrom
ORDER BY chrom, cluster_start

Use case: Identify cluster sizes and boundaries.

Filter by Cluster Size#

Find clusters with a minimum number of features:

WITH clustered AS (
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM features
),
cluster_sizes AS (
    SELECT cluster_id, COUNT(*) AS size
    FROM clustered
    GROUP BY cluster_id
)
SELECT c.*
FROM clustered c
JOIN cluster_sizes s ON c.cluster_id = s.cluster_id
WHERE s.size >= 3
ORDER BY c.cluster_id, c.start

Use case: Focus on regions with multiple overlapping features (hotspots).

Cluster Summary Statistics#

Calculate statistics for each cluster:

WITH clustered AS (
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM features
)
SELECT
    cluster_id,
    chrom,
    COUNT(*) AS feature_count,
    MIN(start) AS cluster_start,
    MAX(end) AS cluster_end,
    MAX(end) - MIN(start) AS cluster_span,
    AVG(score) AS avg_score,
    MAX(score) AS max_score
FROM clustered
GROUP BY cluster_id, chrom
ORDER BY feature_count DESC

Use case: Rank clusters by size, span, or aggregate scores.

Basic Merging#

Merge Overlapping Intervals#

Combine overlapping intervals into unified regions:

SELECT MERGE(interval)
FROM features

Use case: Create non-overlapping consensus regions from redundant annotations.

Merge with Distance#

Merge intervals within a specified distance:

SELECT MERGE(interval, 1000)
FROM features

Use case: Create broader regions by joining nearby features.

Strand-Specific Merge#

Merge intervals separately by strand:

SELECT MERGE(interval, stranded=true)
FROM features

Use case: Create strand-aware consensus regions.

Merge with Aggregations#

Count Merged Features#

Track how many features were merged into each region:

SELECT
    MERGE(interval),
    COUNT(*) AS feature_count
FROM features

Use case: Understand the complexity of each merged region.

Aggregate Scores#

Calculate statistics for merged regions:

SELECT
    MERGE(interval),
    COUNT(*) AS feature_count,
    AVG(score) AS avg_score,
    MAX(score) AS max_score,
    SUM(score) AS total_score
FROM features

Use case: Summarize signal intensity across merged regions.

Collect Feature Names#

List the names of features that were merged:

SELECT
    MERGE(interval),
    STRING_AGG(name, ',') AS merged_features
FROM features

Use case: Track provenance of merged regions.

Coverage Calculations#

Total Base Pair Coverage#

Calculate total genomic coverage after merging:

WITH merged AS (
    SELECT MERGE(interval)
    FROM features
)
SELECT SUM(end - start) AS total_coverage_bp
FROM merged

Use case: Calculate the total genome fraction covered by features.

Coverage per Chromosome#

Calculate coverage for each chromosome:

WITH merged AS (
    SELECT MERGE(interval)
    FROM features
)
SELECT
    chrom,
    COUNT(*) AS region_count,
    SUM(end - start) AS coverage_bp
FROM merged
GROUP BY chrom
ORDER BY chrom

Use case: Compare feature density across chromosomes.

Coverage Reduction#

Compare raw vs merged coverage:

WITH raw_stats AS (
    SELECT
        COUNT(*) AS raw_count,
        SUM(end - start) AS raw_bp
    FROM features
),
merged_stats AS (
    SELECT
        COUNT(*) AS merged_count,
        SUM(end - start) AS merged_bp
    FROM (SELECT MERGE(interval) FROM features)
)
SELECT
    raw_count,
    merged_count,
    raw_bp,
    merged_bp,
    ROUND(100.0 * merged_bp / raw_bp, 2) AS coverage_retained_pct
FROM raw_stats, merged_stats

Use case: Quantify the redundancy in your feature set.

Advanced Patterns#

Cluster Then Merge#

First cluster features, then analyze each cluster:

WITH clustered AS (
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM features
)
SELECT
    cluster_id,
    MIN(chrom) AS chrom,
    MIN(start) AS start,
    MAX(end) AS end,
    COUNT(*) AS feature_count,
    STRING_AGG(name, ',') AS features
FROM clustered
GROUP BY cluster_id
ORDER BY chrom, start

Use case: Alternative to MERGE that preserves cluster identifiers.

Hierarchical Clustering#

Apply multiple clustering levels:

WITH level1 AS (
    SELECT *, CLUSTER(interval, 0) AS cluster_l1
    FROM features
),
level2 AS (
    SELECT *, CLUSTER(interval, 1000) AS cluster_l2
    FROM level1
)
SELECT
    cluster_l1,
    cluster_l2,
    chrom,
    name,
    start,
    end
FROM level2
ORDER BY cluster_l2, cluster_l1, start

Use case: Analyze feature relationships at multiple scales.