Advanced Queries

This section covers advanced query patterns including multi-range matching, complex filtering, aggregate statistics, and multi-table workflows.

Multi-Range Matching 

Match Any of Multiple Regions 

Find features overlapping any of several regions of interest:

cursor = engine.execute("""
    SELECT * FROM variants
    WHERE interval INTERSECTS ANY(
        'chr1:1000000-2000000',
        'chr1:5000000-6000000',
        'chr2:1000000-3000000'
    )
""")

Use case: Query multiple regions of interest in a single statement.

Match All of Multiple Points 

Find features containing all specified positions:

cursor = engine.execute("""
    SELECT * FROM genes
    WHERE interval CONTAINS ALL(
        'chr1:1500',
        'chr1:1600',
        'chr1:1700'
    )
""")

Use case: Find genes spanning a set of SNP positions.

Exclude Multiple Regions 

Find features that don’t overlap any blacklisted region:

cursor = engine.execute("""
    SELECT * FROM peaks
    WHERE NOT interval INTERSECTS ANY(
        'chr1:120000000-125000000',   -- Centromere region
        'chr1:140000000-142000000',   -- Known artifact
        'chrM:1-16569'                -- Mitochondrial
    )
""")

Use case: Filter out features in problematic genomic regions.

Combine ANY and ALL 

Complex multi-range logic:

cursor = engine.execute("""
    SELECT * FROM features
    WHERE interval INTERSECTS ANY('chr1:1000-2000', 'chr1:5000-6000')
      AND interval CONTAINS ALL('chr1:1100', 'chr1:1200')
""")

Use case: Find features matching complex spatial criteria.

Complex Filtering 

Multi-Attribute Filtering 

Combine spatial and attribute filters:

cursor = engine.execute("""
    SELECT v.*, g.name AS gene_name, g.biotype
    FROM variants v
    INNER JOIN genes g ON v.interval INTERSECTS g.interval
    WHERE v.quality >= 30
      AND v.filter = 'PASS'
      AND v.allele_frequency > 0.01
      AND g.biotype = 'protein_coding'
    ORDER BY v.chromosome, v.start_pos
""")

Use case: Extract high-quality variants in protein-coding genes.

Target Gene Lists 

Filter to specific genes of interest:

cursor = engine.execute("""
    SELECT v.*, g.name AS gene_name
    FROM variants v
    INNER JOIN genes g ON v.interval INTERSECTS g.interval
    WHERE g.name IN (
        'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS',
        'BRAF', 'PIK3CA', 'PTEN', 'APC', 'ATM'
    )
    ORDER BY g.name, v.start_pos
""")

Use case: Extract variants in clinically actionable genes.

Conditional Logic 

Apply different criteria based on feature type:

cursor = engine.execute("""
    SELECT v.*, g.name, g.biotype,
        CASE
            WHEN g.biotype = 'protein_coding' THEN 'coding'
            WHEN g.biotype LIKE '%RNA%' THEN 'noncoding_RNA'
            ELSE 'other'
        END AS gene_category
    FROM variants v
    INNER JOIN genes g ON v.interval INTERSECTS g.interval
    WHERE CASE
        WHEN g.biotype = 'protein_coding' THEN v.quality >= 30
        ELSE v.quality >= 20
    END
""")

Use case: Apply different quality thresholds based on genomic context.

Aggregate Statistics 

Per-Chromosome Statistics 

Calculate summary statistics by chromosome:

cursor = engine.execute("""
    SELECT
        a.chromosome,
        COUNT(DISTINCT a.name) AS total_features,
        COUNT(b.name) AS total_overlaps,
        COUNT(DISTINCT CASE WHEN b.name IS NOT NULL THEN a.name END) AS features_with_overlap
    FROM features_a a
    LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
    GROUP BY a.chromosome
    ORDER BY a.chromosome
""")

Use case: Compare feature distribution across chromosomes.

Overlap Statistics 

Calculate overlap metrics:

cursor = engine.execute("""
    SELECT
        a.chromosome,
        COUNT(*) AS overlap_count,
        AVG(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS avg_overlap_bp,
        SUM(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS total_overlap_bp
    FROM features_a a
    INNER JOIN features_b b ON a.interval INTERSECTS b.interval
    GROUP BY a.chromosome
    ORDER BY a.chromosome
""")

Use case: Quantify overlap patterns across the genome.

Feature Size Distribution 

Analyze feature sizes by category:

cursor = engine.execute("""
    SELECT
        biotype,
        COUNT(*) AS count,
        AVG(end_pos - start_pos) AS avg_length,
        MIN(end_pos - start_pos) AS min_length,
        MAX(end_pos - start_pos) AS max_length
    FROM genes
    GROUP BY biotype
    ORDER BY count DESC
""")

Use case: Compare size distributions across feature types.

Multi-Table Workflows 

Three-Way Intersection 

Find features overlapping in all three tables:

cursor = engine.execute("""
    SELECT DISTINCT a.*
    FROM features_a a
    INNER JOIN features_b b ON a.interval INTERSECTS b.interval
    INNER JOIN features_c c ON a.interval INTERSECTS c.interval
""")

Use case: Find consensus regions across multiple datasets.

Hierarchical Annotations 

Join multiple annotation levels:

cursor = engine.execute("""
    SELECT
        v.name AS variant,
        e.name AS exon,
        t.name AS transcript,
        g.name AS gene
    FROM variants v
    INNER JOIN exons e ON v.interval INTERSECTS e.interval
    INNER JOIN transcripts t ON e.interval WITHIN t.interval
    INNER JOIN genes g ON t.interval WITHIN g.interval
""")

Use case: Build hierarchical annotations for variants.

Union with Deduplication 

Combine features from multiple sources:

cursor = engine.execute("""
    WITH all_peaks AS (
        SELECT *, 'chip_seq' AS source FROM chip_peaks
        UNION ALL
        SELECT *, 'atac_seq' AS source FROM atac_peaks
        UNION ALL
        SELECT *, 'dnase_seq' AS source FROM dnase_peaks
    )
    SELECT
        chromosome,
        start_pos,
        end_pos,
        STRING_AGG(DISTINCT source, ',') AS sources,
        COUNT(DISTINCT source) AS source_count
    FROM all_peaks
    GROUP BY chromosome, start_pos, end_pos
    HAVING COUNT(DISTINCT source) >= 2
""")

Use case: Find regulatory regions supported by multiple assays.

Subqueries and CTEs 

Filtered Subquery 

Use subqueries to pre-filter data:

cursor = engine.execute("""
    SELECT v.*
    FROM variants v
    WHERE v.interval INTERSECTS ANY(
        SELECT position FROM genes WHERE biotype = 'protein_coding'
    )
""")

Use case: Intersect with dynamically filtered reference data.

Note

Subquery support depends on the target database backend.

Chained CTEs 

Build complex analyses with Common Table Expressions:

cursor = engine.execute("""
    WITH
    -- Step 1: Find high-quality variants
    hq_variants AS (
        SELECT * FROM variants
        WHERE quality >= 30 AND filter = 'PASS'
    ),
    -- Step 2: Annotate with genes
    annotated AS (
        SELECT v.*, g.name AS gene_name, g.biotype
        FROM hq_variants v
        LEFT JOIN genes g ON v.interval INTERSECTS g.interval
    ),
    -- Step 3: Summarize by gene
    gene_summary AS (
        SELECT
            gene_name,
            biotype,
            COUNT(*) AS variant_count
        FROM annotated
        WHERE gene_name IS NOT NULL
        GROUP BY gene_name, biotype
    )
    SELECT * FROM gene_summary
    ORDER BY variant_count DESC
    LIMIT 20
""")

Use case: Build multi-step analysis pipelines in a single query.

Window Functions 

Rank Overlaps 

Rank features by their overlap characteristics:

cursor = engine.execute("""
    SELECT
        a.name,
        a.chromosome,
        a.start_pos,
        overlap_count,
        RANK() OVER (ORDER BY overlap_count DESC) AS rank
    FROM (
        SELECT a.*, COUNT(b.name) AS overlap_count
        FROM features_a a
        LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
        GROUP BY a.chromosome, a.start_pos, a.end_pos, a.name, a.score, a.strand
    ) a
""")

Use case: Identify features with the most overlaps.

Running Totals 

Calculate cumulative coverage:

cursor = engine.execute("""
    SELECT
        chromosome,
        start_pos,
        end_pos,
        end_pos - start_pos AS length,
        SUM(end_pos - start_pos) OVER (
            PARTITION BY chromosome
            ORDER BY start_pos
        ) AS cumulative_bp
    FROM features
    ORDER BY chromosome, start_pos
""")

Use case: Track cumulative coverage along each chromosome.

Debugging and Optimization 

View Generated SQL 

Use transpile() to see the SQL GIQL generates:

sql = engine.transpile("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")
print(sql)
# See the actual SQL that will be executed

Use case: Debug queries or understand GIQL’s translation.

Verbose Mode 

Enable detailed logging:

with GIQLEngine(target_dialect="duckdb", verbose=True) as engine:
    # All queries will print transpilation details
    cursor = engine.execute("""
        SELECT * FROM variants
        WHERE interval INTERSECTS 'chr1:1000-2000'
    """)

Use case: Diagnose query translation issues.

Explain Query Plan 

Analyze query execution:

# First transpile to get the SQL
sql = engine.transpile("""
    SELECT v.*, g.name
    FROM variants v
    JOIN genes g ON v.interval INTERSECTS g.interval
""")

# Then use database-native EXPLAIN
cursor = engine.execute(f"EXPLAIN {sql}")
for row in cursor:
    print(row)

Use case: Optimize slow queries by examining execution plans.