Advanced Queries

Advanced Queries#

This section covers advanced query patterns including multi-range matching, complex filtering, aggregate statistics, and multi-table workflows.

Multi-Range Matching#

Match Any of Multiple Regions#

Find features overlapping any of several regions of interest:

SELECT * FROM variants
WHERE interval INTERSECTS ANY(
    'chr1:1000000-2000000',
    'chr1:5000000-6000000',
    'chr2:1000000-3000000'
)

Use case: Query multiple regions of interest in a single statement.

Match All of Multiple Points#

Find features containing all specified positions:

SELECT * FROM genes
WHERE interval CONTAINS ALL(
    'chr1:1500',
    'chr1:1600',
    'chr1:1700'
)

Use case: Find genes spanning a set of SNP positions.

Exclude Multiple Regions#

Find features that don’t overlap any blacklisted region:

SELECT * FROM peaks
WHERE NOT interval INTERSECTS ANY(
    'chr1:120000000-125000000',   -- Centromere region
    'chr1:140000000-142000000',   -- Known artifact
    'chrM:1-16569'                -- Mitochondrial
)

Use case: Filter out features in problematic genomic regions.

Combine ANY and ALL#

Complex multi-range logic:

SELECT * FROM features
WHERE interval INTERSECTS ANY('chr1:1000-2000', 'chr1:5000-6000')
  AND interval CONTAINS ALL('chr1:1100', 'chr1:1200')

Use case: Find features matching complex spatial criteria.

Complex Filtering#

Multi-Attribute Filtering#

Combine spatial and attribute filters:

SELECT v.*, g.name AS gene_name, g.biotype
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE v.quality >= 30
  AND v.filter = 'PASS'
  AND v.allele_frequency > 0.01
  AND g.biotype = 'protein_coding'
ORDER BY v.chrom, v.start

Use case: Extract high-quality variants in protein-coding genes.

Target Gene Lists#

Filter to specific genes of interest:

SELECT v.*, g.name AS gene_name
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE g.name IN (
    'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS',
    'BRAF', 'PIK3CA', 'PTEN', 'APC', 'ATM'
)
ORDER BY g.name, v.start

Use case: Extract variants in clinically actionable genes.

Conditional Logic#

Apply different criteria based on feature type:

SELECT v.*, g.name, g.biotype,
    CASE
        WHEN g.biotype = 'protein_coding' THEN 'coding'
        WHEN g.biotype LIKE '%RNA%' THEN 'noncoding_RNA'
        ELSE 'other'
    END AS gene_category
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE CASE
    WHEN g.biotype = 'protein_coding' THEN v.quality >= 30
    ELSE v.quality >= 20
END

Use case: Apply different quality thresholds based on genomic context.

Aggregate Statistics#

Per-Chromosome Statistics#

Calculate summary statistics by chromosome:

SELECT
    a.chrom,
    COUNT(DISTINCT a.name) AS total_features,
    COUNT(b.name) AS total_overlaps,
    COUNT(DISTINCT CASE WHEN b.name IS NOT NULL THEN a.name END) AS features_with_overlap
FROM features_a a
LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chrom
ORDER BY a.chrom

Use case: Compare feature distribution across chromosomes.

Overlap Statistics#

Calculate overlap metrics:

SELECT
    a.chrom,
    COUNT(*) AS overlap_count,
    AVG(LEAST(a.end, b.end) - GREATEST(a.start, b.start)) AS avg_overlap_bp,
    SUM(LEAST(a.end, b.end) - GREATEST(a.start, b.start)) AS total_overlap_bp
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chrom
ORDER BY a.chrom

Use case: Quantify overlap patterns across the genome.

Feature Size Distribution#

Analyze feature sizes by category:

SELECT
    biotype,
    COUNT(*) AS count,
    AVG(end - start) AS avg_length,
    MIN(end - start) AS min_length,
    MAX(end - start) AS max_length
FROM genes
GROUP BY biotype
ORDER BY count DESC

Use case: Compare size distributions across feature types.

Multi-Table Workflows#

Three-Way Intersection#

Find features overlapping in all three tables:

SELECT DISTINCT a.*
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
INNER JOIN features_c c ON a.interval INTERSECTS c.interval

Use case: Find consensus regions across multiple datasets.

Hierarchical Annotations#

Join multiple annotation levels:

SELECT
    v.name AS variant,
    e.name AS exon,
    t.name AS transcript,
    g.name AS gene
FROM variants v
INNER JOIN exons e ON v.interval INTERSECTS e.interval
INNER JOIN transcripts t ON e.interval WITHIN t.interval
INNER JOIN genes g ON t.interval WITHIN g.interval

Use case: Build hierarchical annotations for variants.

Union with Deduplication#

Combine features from multiple sources:

WITH all_peaks AS (
    SELECT *, 'chip_seq' AS source FROM chip_peaks
    UNION ALL
    SELECT *, 'atac_seq' AS source FROM atac_peaks
    UNION ALL
    SELECT *, 'dnase_seq' AS source FROM dnase_peaks
)
SELECT
    chrom,
    start,
    end,
    STRING_AGG(DISTINCT source, ',') AS sources,
    COUNT(DISTINCT source) AS source_count
FROM all_peaks
GROUP BY chrom, start, end
HAVING COUNT(DISTINCT source) >= 2

Use case: Find regulatory regions supported by multiple assays.

Subqueries and CTEs#

Filtered Subquery#

Use subqueries to pre-filter data:

SELECT v.*
FROM variants v
WHERE v.interval INTERSECTS ANY(
    SELECT position FROM genes WHERE biotype = 'protein_coding'
)

Use case: Intersect with dynamically filtered reference data.

Note

Subquery support depends on the target database backend.

Chained CTEs#

Build complex analyses with Common Table Expressions:

WITH
-- Step 1: Find high-quality variants
hq_variants AS (
    SELECT * FROM variants
    WHERE quality >= 30 AND filter = 'PASS'
),
-- Step 2: Annotate with genes
annotated AS (
    SELECT v.*, g.name AS gene_name, g.biotype
    FROM hq_variants v
    LEFT JOIN genes g ON v.interval INTERSECTS g.interval
),
-- Step 3: Summarize by gene
gene_summary AS (
    SELECT
        gene_name,
        biotype,
        COUNT(*) AS variant_count
    FROM annotated
    WHERE gene_name IS NOT NULL
    GROUP BY gene_name, biotype
)
SELECT * FROM gene_summary
ORDER BY variant_count DESC
LIMIT 20

Use case: Build multi-step analysis pipelines in a single query.

Window Functions#

Rank Overlaps#

Rank features by their overlap characteristics:

SELECT
    a.name,
    a.chrom,
    a.start,
    overlap_count,
    RANK() OVER (ORDER BY overlap_count DESC) AS rank
FROM (
    SELECT a.*, COUNT(b.name) AS overlap_count
    FROM features_a a
    LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
    GROUP BY a.chrom, a.start, a.end, a.name, a.score, a.strand
) a

Use case: Identify features with the most overlaps.

Running Totals#

Calculate cumulative coverage:

SELECT
    chrom,
    start,
    end,
    end - start AS length,
    SUM(end - start) OVER (
        PARTITION BY chrom
        ORDER BY start
    ) AS cumulative_bp
FROM features
ORDER BY chrom, start

Use case: Track cumulative coverage along each chromosome.

Debugging and Optimization#

View Generated SQL#

Use transpile() to see the SQL GIQL generates:

from giql import transpile

sql = transpile(
    "SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000'",
    tables=["variants"],
)
print(sql)

Use case: Debug queries or understand GIQL’s translation.

Explain Query Plan#

Analyze query execution:

from giql import transpile

sql = transpile(
    """
    SELECT v.*, g.name
    FROM variants v
    JOIN genes g ON v.interval INTERSECTS g.interval
    """,
    tables=["variants", "genes"],
)

# Then use database-native EXPLAIN
# e.g., conn.execute(f"EXPLAIN {sql}")

Use case: Optimize slow queries by examining execution plans.