Advanced Queries#
This section covers advanced query patterns including multi-range matching, complex filtering, aggregate statistics, and multi-table workflows.
Multi-Range Matching#
Match Any of Multiple Regions#
Find features overlapping any of several regions of interest:
SELECT * FROM variants
WHERE interval INTERSECTS ANY(
'chr1:1000000-2000000',
'chr1:5000000-6000000',
'chr2:1000000-3000000'
)
Use case: Query multiple regions of interest in a single statement.
Match All of Multiple Points#
Find features containing all specified positions:
SELECT * FROM genes
WHERE interval CONTAINS ALL(
'chr1:1500',
'chr1:1600',
'chr1:1700'
)
Use case: Find genes spanning a set of SNP positions.
Exclude Multiple Regions#
Find features that don’t overlap any blacklisted region:
SELECT * FROM peaks
WHERE NOT interval INTERSECTS ANY(
'chr1:120000000-125000000', -- Centromere region
'chr1:140000000-142000000', -- Known artifact
'chrM:1-16569' -- Mitochondrial
)
Use case: Filter out features in problematic genomic regions.
Combine ANY and ALL#
Complex multi-range logic:
SELECT * FROM features
WHERE interval INTERSECTS ANY('chr1:1000-2000', 'chr1:5000-6000')
AND interval CONTAINS ALL('chr1:1100', 'chr1:1200')
Use case: Find features matching complex spatial criteria.
Complex Filtering#
Multi-Attribute Filtering#
Combine spatial and attribute filters:
SELECT v.*, g.name AS gene_name, g.biotype
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE v.quality >= 30
AND v.filter = 'PASS'
AND v.allele_frequency > 0.01
AND g.biotype = 'protein_coding'
ORDER BY v.chrom, v.start
Use case: Extract high-quality variants in protein-coding genes.
Target Gene Lists#
Filter to specific genes of interest:
SELECT v.*, g.name AS gene_name
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE g.name IN (
'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS',
'BRAF', 'PIK3CA', 'PTEN', 'APC', 'ATM'
)
ORDER BY g.name, v.start
Use case: Extract variants in clinically actionable genes.
Conditional Logic#
Apply different criteria based on feature type:
SELECT v.*, g.name, g.biotype,
CASE
WHEN g.biotype = 'protein_coding' THEN 'coding'
WHEN g.biotype LIKE '%RNA%' THEN 'noncoding_RNA'
ELSE 'other'
END AS gene_category
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE CASE
WHEN g.biotype = 'protein_coding' THEN v.quality >= 30
ELSE v.quality >= 20
END
Use case: Apply different quality thresholds based on genomic context.
Aggregate Statistics#
Per-Chromosome Statistics#
Calculate summary statistics by chromosome:
SELECT
a.chrom,
COUNT(DISTINCT a.name) AS total_features,
COUNT(b.name) AS total_overlaps,
COUNT(DISTINCT CASE WHEN b.name IS NOT NULL THEN a.name END) AS features_with_overlap
FROM features_a a
LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chrom
ORDER BY a.chrom
Use case: Compare feature distribution across chromosomes.
Overlap Statistics#
Calculate overlap metrics:
SELECT
a.chrom,
COUNT(*) AS overlap_count,
AVG(LEAST(a.end, b.end) - GREATEST(a.start, b.start)) AS avg_overlap_bp,
SUM(LEAST(a.end, b.end) - GREATEST(a.start, b.start)) AS total_overlap_bp
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chrom
ORDER BY a.chrom
Use case: Quantify overlap patterns across the genome.
Feature Size Distribution#
Analyze feature sizes by category:
SELECT
biotype,
COUNT(*) AS count,
AVG(end - start) AS avg_length,
MIN(end - start) AS min_length,
MAX(end - start) AS max_length
FROM genes
GROUP BY biotype
ORDER BY count DESC
Use case: Compare size distributions across feature types.
Multi-Table Workflows#
Three-Way Intersection#
Find features overlapping in all three tables:
SELECT DISTINCT a.*
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
INNER JOIN features_c c ON a.interval INTERSECTS c.interval
Use case: Find consensus regions across multiple datasets.
Hierarchical Annotations#
Join multiple annotation levels:
SELECT
v.name AS variant,
e.name AS exon,
t.name AS transcript,
g.name AS gene
FROM variants v
INNER JOIN exons e ON v.interval INTERSECTS e.interval
INNER JOIN transcripts t ON e.interval WITHIN t.interval
INNER JOIN genes g ON t.interval WITHIN g.interval
Use case: Build hierarchical annotations for variants.
Union with Deduplication#
Combine features from multiple sources:
WITH all_peaks AS (
SELECT *, 'chip_seq' AS source FROM chip_peaks
UNION ALL
SELECT *, 'atac_seq' AS source FROM atac_peaks
UNION ALL
SELECT *, 'dnase_seq' AS source FROM dnase_peaks
)
SELECT
chrom,
start,
end,
STRING_AGG(DISTINCT source, ',') AS sources,
COUNT(DISTINCT source) AS source_count
FROM all_peaks
GROUP BY chrom, start, end
HAVING COUNT(DISTINCT source) >= 2
Use case: Find regulatory regions supported by multiple assays.
Subqueries and CTEs#
Filtered Subquery#
Use subqueries to pre-filter data:
SELECT v.*
FROM variants v
WHERE v.interval INTERSECTS ANY(
SELECT position FROM genes WHERE biotype = 'protein_coding'
)
Use case: Intersect with dynamically filtered reference data.
Note
Subquery support depends on the target database backend.
Chained CTEs#
Build complex analyses with Common Table Expressions:
WITH
-- Step 1: Find high-quality variants
hq_variants AS (
SELECT * FROM variants
WHERE quality >= 30 AND filter = 'PASS'
),
-- Step 2: Annotate with genes
annotated AS (
SELECT v.*, g.name AS gene_name, g.biotype
FROM hq_variants v
LEFT JOIN genes g ON v.interval INTERSECTS g.interval
),
-- Step 3: Summarize by gene
gene_summary AS (
SELECT
gene_name,
biotype,
COUNT(*) AS variant_count
FROM annotated
WHERE gene_name IS NOT NULL
GROUP BY gene_name, biotype
)
SELECT * FROM gene_summary
ORDER BY variant_count DESC
LIMIT 20
Use case: Build multi-step analysis pipelines in a single query.
Window Functions#
Rank Overlaps#
Rank features by their overlap characteristics:
SELECT
a.name,
a.chrom,
a.start,
overlap_count,
RANK() OVER (ORDER BY overlap_count DESC) AS rank
FROM (
SELECT a.*, COUNT(b.name) AS overlap_count
FROM features_a a
LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chrom, a.start, a.end, a.name, a.score, a.strand
) a
Use case: Identify features with the most overlaps.
Running Totals#
Calculate cumulative coverage:
SELECT
chrom,
start,
end,
end - start AS length,
SUM(end - start) OVER (
PARTITION BY chrom
ORDER BY start
) AS cumulative_bp
FROM features
ORDER BY chrom, start
Use case: Track cumulative coverage along each chromosome.
Debugging and Optimization#
View Generated SQL#
Use transpile() to see the SQL GIQL generates:
from giql import transpile
sql = transpile(
"SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000'",
tables=["variants"],
)
print(sql)
Use case: Debug queries or understand GIQL’s translation.
Explain Query Plan#
Analyze query execution:
from giql import transpile
sql = transpile(
"""
SELECT v.*, g.name
FROM variants v
JOIN genes g ON v.interval INTERSECTS g.interval
""",
tables=["variants", "genes"],
)
# Then use database-native EXPLAIN
# e.g., conn.execute(f"EXPLAIN {sql}")
Use case: Optimize slow queries by examining execution plans.