Advanced Queries ================ This section covers advanced query patterns including multi-range matching, complex filtering, aggregate statistics, and multi-table workflows. .. contents:: :local: :depth: 2 Multi-Range Matching -------------------- Match Any of Multiple Regions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Find features overlapping any of several regions of interest: .. code-block:: python cursor = engine.execute(""" SELECT * FROM variants WHERE interval INTERSECTS ANY( 'chr1:1000000-2000000', 'chr1:5000000-6000000', 'chr2:1000000-3000000' ) """) **Use case:** Query multiple regions of interest in a single statement. Match All of Multiple Points ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Find features containing all specified positions: .. code-block:: python cursor = engine.execute(""" SELECT * FROM genes WHERE interval CONTAINS ALL( 'chr1:1500', 'chr1:1600', 'chr1:1700' ) """) **Use case:** Find genes spanning a set of SNP positions. Exclude Multiple Regions ~~~~~~~~~~~~~~~~~~~~~~~~ Find features that don't overlap any blacklisted region: .. code-block:: python cursor = engine.execute(""" SELECT * FROM peaks WHERE NOT interval INTERSECTS ANY( 'chr1:120000000-125000000', -- Centromere region 'chr1:140000000-142000000', -- Known artifact 'chrM:1-16569' -- Mitochondrial ) """) **Use case:** Filter out features in problematic genomic regions. Combine ANY and ALL ~~~~~~~~~~~~~~~~~~~ Complex multi-range logic: .. code-block:: python cursor = engine.execute(""" SELECT * FROM features WHERE interval INTERSECTS ANY('chr1:1000-2000', 'chr1:5000-6000') AND interval CONTAINS ALL('chr1:1100', 'chr1:1200') """) **Use case:** Find features matching complex spatial criteria. Complex Filtering ----------------- Multi-Attribute Filtering ~~~~~~~~~~~~~~~~~~~~~~~~~ Combine spatial and attribute filters: .. code-block:: python cursor = engine.execute(""" SELECT v.*, g.name AS gene_name, g.biotype FROM variants v INNER JOIN genes g ON v.interval INTERSECTS g.interval WHERE v.quality >= 30 AND v.filter = 'PASS' AND v.allele_frequency > 0.01 AND g.biotype = 'protein_coding' ORDER BY v.chromosome, v.start_pos """) **Use case:** Extract high-quality variants in protein-coding genes. Target Gene Lists ~~~~~~~~~~~~~~~~~ Filter to specific genes of interest: .. code-block:: python cursor = engine.execute(""" SELECT v.*, g.name AS gene_name FROM variants v INNER JOIN genes g ON v.interval INTERSECTS g.interval WHERE g.name IN ( 'BRCA1', 'BRCA2', 'TP53', 'EGFR', 'KRAS', 'BRAF', 'PIK3CA', 'PTEN', 'APC', 'ATM' ) ORDER BY g.name, v.start_pos """) **Use case:** Extract variants in clinically actionable genes. Conditional Logic ~~~~~~~~~~~~~~~~~ Apply different criteria based on feature type: .. code-block:: python cursor = engine.execute(""" SELECT v.*, g.name, g.biotype, CASE WHEN g.biotype = 'protein_coding' THEN 'coding' WHEN g.biotype LIKE '%RNA%' THEN 'noncoding_RNA' ELSE 'other' END AS gene_category FROM variants v INNER JOIN genes g ON v.interval INTERSECTS g.interval WHERE CASE WHEN g.biotype = 'protein_coding' THEN v.quality >= 30 ELSE v.quality >= 20 END """) **Use case:** Apply different quality thresholds based on genomic context. Aggregate Statistics -------------------- Per-Chromosome Statistics ~~~~~~~~~~~~~~~~~~~~~~~~~ Calculate summary statistics by chromosome: .. code-block:: python cursor = engine.execute(""" SELECT a.chromosome, COUNT(DISTINCT a.name) AS total_features, COUNT(b.name) AS total_overlaps, COUNT(DISTINCT CASE WHEN b.name IS NOT NULL THEN a.name END) AS features_with_overlap FROM features_a a LEFT JOIN features_b b ON a.interval INTERSECTS b.interval GROUP BY a.chromosome ORDER BY a.chromosome """) **Use case:** Compare feature distribution across chromosomes. Overlap Statistics ~~~~~~~~~~~~~~~~~~ Calculate overlap metrics: .. code-block:: python cursor = engine.execute(""" SELECT a.chromosome, COUNT(*) AS overlap_count, AVG(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS avg_overlap_bp, SUM(LEAST(a.end_pos, b.end_pos) - GREATEST(a.start_pos, b.start_pos)) AS total_overlap_bp FROM features_a a INNER JOIN features_b b ON a.interval INTERSECTS b.interval GROUP BY a.chromosome ORDER BY a.chromosome """) **Use case:** Quantify overlap patterns across the genome. Feature Size Distribution ~~~~~~~~~~~~~~~~~~~~~~~~~ Analyze feature sizes by category: .. code-block:: python cursor = engine.execute(""" SELECT biotype, COUNT(*) AS count, AVG(end_pos - start_pos) AS avg_length, MIN(end_pos - start_pos) AS min_length, MAX(end_pos - start_pos) AS max_length FROM genes GROUP BY biotype ORDER BY count DESC """) **Use case:** Compare size distributions across feature types. Multi-Table Workflows --------------------- Three-Way Intersection ~~~~~~~~~~~~~~~~~~~~~~ Find features overlapping in all three tables: .. code-block:: python cursor = engine.execute(""" SELECT DISTINCT a.* FROM features_a a INNER JOIN features_b b ON a.interval INTERSECTS b.interval INNER JOIN features_c c ON a.interval INTERSECTS c.interval """) **Use case:** Find consensus regions across multiple datasets. Hierarchical Annotations ~~~~~~~~~~~~~~~~~~~~~~~~ Join multiple annotation levels: .. code-block:: python cursor = engine.execute(""" SELECT v.name AS variant, e.name AS exon, t.name AS transcript, g.name AS gene FROM variants v INNER JOIN exons e ON v.interval INTERSECTS e.interval INNER JOIN transcripts t ON e.interval WITHIN t.interval INNER JOIN genes g ON t.interval WITHIN g.interval """) **Use case:** Build hierarchical annotations for variants. Union with Deduplication ~~~~~~~~~~~~~~~~~~~~~~~~ Combine features from multiple sources: .. code-block:: python cursor = engine.execute(""" WITH all_peaks AS ( SELECT *, 'chip_seq' AS source FROM chip_peaks UNION ALL SELECT *, 'atac_seq' AS source FROM atac_peaks UNION ALL SELECT *, 'dnase_seq' AS source FROM dnase_peaks ) SELECT chromosome, start_pos, end_pos, STRING_AGG(DISTINCT source, ',') AS sources, COUNT(DISTINCT source) AS source_count FROM all_peaks GROUP BY chromosome, start_pos, end_pos HAVING COUNT(DISTINCT source) >= 2 """) **Use case:** Find regulatory regions supported by multiple assays. Subqueries and CTEs ------------------- Filtered Subquery ~~~~~~~~~~~~~~~~~ Use subqueries to pre-filter data: .. code-block:: python cursor = engine.execute(""" SELECT v.* FROM variants v WHERE v.interval INTERSECTS ANY( SELECT position FROM genes WHERE biotype = 'protein_coding' ) """) **Use case:** Intersect with dynamically filtered reference data. .. note:: Subquery support depends on the target database backend. Chained CTEs ~~~~~~~~~~~~ Build complex analyses with Common Table Expressions: .. code-block:: python cursor = engine.execute(""" WITH -- Step 1: Find high-quality variants hq_variants AS ( SELECT * FROM variants WHERE quality >= 30 AND filter = 'PASS' ), -- Step 2: Annotate with genes annotated AS ( SELECT v.*, g.name AS gene_name, g.biotype FROM hq_variants v LEFT JOIN genes g ON v.interval INTERSECTS g.interval ), -- Step 3: Summarize by gene gene_summary AS ( SELECT gene_name, biotype, COUNT(*) AS variant_count FROM annotated WHERE gene_name IS NOT NULL GROUP BY gene_name, biotype ) SELECT * FROM gene_summary ORDER BY variant_count DESC LIMIT 20 """) **Use case:** Build multi-step analysis pipelines in a single query. Window Functions ---------------- Rank Overlaps ~~~~~~~~~~~~~ Rank features by their overlap characteristics: .. code-block:: python cursor = engine.execute(""" SELECT a.name, a.chromosome, a.start_pos, overlap_count, RANK() OVER (ORDER BY overlap_count DESC) AS rank FROM ( SELECT a.*, COUNT(b.name) AS overlap_count FROM features_a a LEFT JOIN features_b b ON a.interval INTERSECTS b.interval GROUP BY a.chromosome, a.start_pos, a.end_pos, a.name, a.score, a.strand ) a """) **Use case:** Identify features with the most overlaps. Running Totals ~~~~~~~~~~~~~~ Calculate cumulative coverage: .. code-block:: python cursor = engine.execute(""" SELECT chromosome, start_pos, end_pos, end_pos - start_pos AS length, SUM(end_pos - start_pos) OVER ( PARTITION BY chromosome ORDER BY start_pos ) AS cumulative_bp FROM features ORDER BY chromosome, start_pos """) **Use case:** Track cumulative coverage along each chromosome. Debugging and Optimization -------------------------- View Generated SQL ~~~~~~~~~~~~~~~~~~ Use transpile() to see the SQL GIQL generates: .. code-block:: python sql = engine.transpile(""" SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000' """) print(sql) # See the actual SQL that will be executed **Use case:** Debug queries or understand GIQL's translation. Verbose Mode ~~~~~~~~~~~~ Enable detailed logging: .. code-block:: python with GIQLEngine(target_dialect="duckdb", verbose=True) as engine: # All queries will print transpilation details cursor = engine.execute(""" SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000' """) **Use case:** Diagnose query translation issues. Explain Query Plan ~~~~~~~~~~~~~~~~~~ Analyze query execution: .. code-block:: python # First transpile to get the SQL sql = engine.transpile(""" SELECT v.*, g.name FROM variants v JOIN genes g ON v.interval INTERSECTS g.interval """) # Then use database-native EXPLAIN cursor = engine.execute(f"EXPLAIN {sql}") for row in cursor: print(row) **Use case:** Optimize slow queries by examining execution plans.