Advanced Queries
This section covers advanced query patterns including multi-range matching, complex filtering, aggregate statistics, and multi-table workflows.
Multi-Range Matching
Match Any of Multiple Regions
Find features overlapping any of several regions of interest:
cursor = engine.execute("""
SELECT * FROM variants
WHERE interval INTERSECTS ANY(
'chr1:1000000-2000000',
'chr1:5000000-6000000',
'chr2:1000000-3000000'
)
""")
Use case: Query multiple regions of interest in a single statement.
Match All of Multiple Points
Find features containing all specified positions:
cursor = engine.execute("""
SELECT * FROM genes
WHERE interval CONTAINS ALL(
'chr1:1500',
'chr1:1600',
'chr1:1700'
)
""")
Use case: Find genes spanning a set of SNP positions.
Exclude Multiple Regions
Find features that don’t overlap any blacklisted region:
cursor = engine.execute("""
SELECT * FROM peaks
WHERE NOT interval INTERSECTS ANY(
'chr1:120000000-125000000', -- Centromere region
'chr1:140000000-142000000', -- Known artifact
'chrM:1-16569' -- Mitochondrial
)
""")
Use case: Filter out features in problematic genomic regions.
Combine ANY and ALL
Complex multi-range logic:
cursor = engine.execute("""
SELECT * FROM features
WHERE interval INTERSECTS ANY('chr1:1000-2000', 'chr1:5000-6000')
AND interval CONTAINS ALL('chr1:1100', 'chr1:1200')
""")
Use case: Find features matching complex spatial criteria.
Complex Filtering
Multi-Attribute Filtering
Combine spatial and attribute filters:
cursor = engine.execute("""
SELECT v.*, g.name AS gene_name, g.biotype
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE v.quality >= 30
AND v.filter = 'PASS'
AND v.allele_frequency > 0.01
AND g.biotype = 'protein_coding'
ORDER BY v.chromosome, v.start_pos
""")
Use case: Extract high-quality variants in protein-coding genes.
Target Gene Lists
Filter to specific genes of interest:
cursor = engine.execute("""
SELECT v.*, g.name AS gene_name
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE g.name IN (
'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS',
'BRAF', 'PIK3CA', 'PTEN', 'APC', 'ATM'
)
ORDER BY g.name, v.start_pos
""")
Use case: Extract variants in clinically actionable genes.
Conditional Logic
Apply different criteria based on feature type:
cursor = engine.execute("""
SELECT v.*, g.name, g.biotype,
CASE
WHEN g.biotype = 'protein_coding' THEN 'coding'
WHEN g.biotype LIKE '%RNA%' THEN 'noncoding_RNA'
ELSE 'other'
END AS gene_category
FROM variants v
INNER JOIN genes g ON v.interval INTERSECTS g.interval
WHERE CASE
WHEN g.biotype = 'protein_coding' THEN v.quality >= 30
ELSE v.quality >= 20
END
""")
Use case: Apply different quality thresholds based on genomic context.
Aggregate Statistics
Per-Chromosome Statistics
Calculate summary statistics by chromosome:
cursor = engine.execute("""
SELECT
a.chromosome,
COUNT(DISTINCT a.name) AS total_features,
COUNT(b.name) AS total_overlaps,
COUNT(DISTINCT CASE WHEN b.name IS NOT NULL THEN a.name END) AS features_with_overlap
FROM features_a a
LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chromosome
ORDER BY a.chromosome
""")
Use case: Compare feature distribution across chromosomes.
Overlap Statistics
Calculate overlap metrics:
cursor = engine.execute("""
SELECT
a.chromosome,
COUNT(*) AS overlap_count,
AVG(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS avg_overlap_bp,
SUM(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS total_overlap_bp
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chromosome
ORDER BY a.chromosome
""")
Use case: Quantify overlap patterns across the genome.
Feature Size Distribution
Analyze feature sizes by category:
cursor = engine.execute("""
SELECT
biotype,
COUNT(*) AS count,
AVG(end_pos - start_pos) AS avg_length,
MIN(end_pos - start_pos) AS min_length,
MAX(end_pos - start_pos) AS max_length
FROM genes
GROUP BY biotype
ORDER BY count DESC
""")
Use case: Compare size distributions across feature types.
Multi-Table Workflows
Three-Way Intersection
Find features overlapping in all three tables:
cursor = engine.execute("""
SELECT DISTINCT a.*
FROM features_a a
INNER JOIN features_b b ON a.interval INTERSECTS b.interval
INNER JOIN features_c c ON a.interval INTERSECTS c.interval
""")
Use case: Find consensus regions across multiple datasets.
Hierarchical Annotations
Join multiple annotation levels:
cursor = engine.execute("""
SELECT
v.name AS variant,
e.name AS exon,
t.name AS transcript,
g.name AS gene
FROM variants v
INNER JOIN exons e ON v.interval INTERSECTS e.interval
INNER JOIN transcripts t ON e.interval WITHIN t.interval
INNER JOIN genes g ON t.interval WITHIN g.interval
""")
Use case: Build hierarchical annotations for variants.
Union with Deduplication
Combine features from multiple sources:
cursor = engine.execute("""
WITH all_peaks AS (
SELECT *, 'chip_seq' AS source FROM chip_peaks
UNION ALL
SELECT *, 'atac_seq' AS source FROM atac_peaks
UNION ALL
SELECT *, 'dnase_seq' AS source FROM dnase_peaks
)
SELECT
chromosome,
start_pos,
end_pos,
STRING_AGG(DISTINCT source, ',') AS sources,
COUNT(DISTINCT source) AS source_count
FROM all_peaks
GROUP BY chromosome, start_pos, end_pos
HAVING COUNT(DISTINCT source) >= 2
""")
Use case: Find regulatory regions supported by multiple assays.
Subqueries and CTEs
Filtered Subquery
Use subqueries to pre-filter data:
cursor = engine.execute("""
SELECT v.*
FROM variants v
WHERE v.interval INTERSECTS ANY(
SELECT position FROM genes WHERE biotype = 'protein_coding'
)
""")
Use case: Intersect with dynamically filtered reference data.
Note
Subquery support depends on the target database backend.
Chained CTEs
Build complex analyses with Common Table Expressions:
cursor = engine.execute("""
WITH
-- Step 1: Find high-quality variants
hq_variants AS (
SELECT * FROM variants
WHERE quality >= 30 AND filter = 'PASS'
),
-- Step 2: Annotate with genes
annotated AS (
SELECT v.*, g.name AS gene_name, g.biotype
FROM hq_variants v
LEFT JOIN genes g ON v.interval INTERSECTS g.interval
),
-- Step 3: Summarize by gene
gene_summary AS (
SELECT
gene_name,
biotype,
COUNT(*) AS variant_count
FROM annotated
WHERE gene_name IS NOT NULL
GROUP BY gene_name, biotype
)
SELECT * FROM gene_summary
ORDER BY variant_count DESC
LIMIT 20
""")
Use case: Build multi-step analysis pipelines in a single query.
Window Functions
Rank Overlaps
Rank features by their overlap characteristics:
cursor = engine.execute("""
SELECT
a.name,
a.chromosome,
a.start_pos,
overlap_count,
RANK() OVER (ORDER BY overlap_count DESC) AS rank
FROM (
SELECT a.*, COUNT(b.name) AS overlap_count
FROM features_a a
LEFT JOIN features_b b ON a.interval INTERSECTS b.interval
GROUP BY a.chromosome, a.start_pos, a.end_pos, a.name, a.score, a.strand
) a
""")
Use case: Identify features with the most overlaps.
Running Totals
Calculate cumulative coverage:
cursor = engine.execute("""
SELECT
chromosome,
start_pos,
end_pos,
end_pos - start_pos AS length,
SUM(end_pos - start_pos) OVER (
PARTITION BY chromosome
ORDER BY start_pos
) AS cumulative_bp
FROM features
ORDER BY chromosome, start_pos
""")
Use case: Track cumulative coverage along each chromosome.
Debugging and Optimization
View Generated SQL
Use transpile() to see the SQL GIQL generates:
sql = engine.transpile("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
print(sql)
# See the actual SQL that will be executed
Use case: Debug queries or understand GIQL’s translation.
Verbose Mode
Enable detailed logging:
with GIQLEngine(target_dialect="duckdb", verbose=True) as engine:
# All queries will print transpilation details
cursor = engine.execute("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
Use case: Diagnose query translation issues.
Explain Query Plan
Analyze query execution:
# First transpile to get the SQL
sql = engine.transpile("""
SELECT v.*, g.name
FROM variants v
JOIN genes g ON v.interval INTERSECTS g.interval
""")
# Then use database-native EXPLAIN
cursor = engine.execute(f"EXPLAIN {sql}")
for row in cursor:
print(row)
Use case: Optimize slow queries by examining execution plans.