Performance Guide
This guide covers strategies for optimizing GIQL query performance, including indexing, query patterns, and backend-specific optimizations.
Understanding Query Performance
How GIQL Queries Execute
When you execute a GIQL query:
GIQL parses the query and identifies genomic operators
Operators are expanded into standard SQL predicates
The SQL is sent to the database backend
The database executes the query using its optimizer
Performance depends on both the generated SQL and how the database executes it.
Common Performance Bottlenecks
Full table scans: No indexes to speed up filtering
Cartesian products: Large cross joins without early filtering
Missing chromosome filters: Comparing features across all chromosomes
Inefficient join order: Small tables should drive joins
Indexing Strategies
Creating Indexes
Create indexes on genomic columns for faster queries:
# DuckDB
engine.conn.execute("""
CREATE INDEX idx_features_position
ON features (chromosome, start_pos, end_pos)
""")
# SQLite
engine.conn.execute("""
CREATE INDEX idx_features_position
ON features (chromosome, start_pos, end_pos)
""")
Recommended Index Patterns
For single-table queries (filtering):
CREATE INDEX idx_table_position ON table_name (chromosome, start_pos, end_pos)
For join queries:
-- Index both tables involved in joins
CREATE INDEX idx_variants_position ON variants (chromosome, start_pos, end_pos)
CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)
For strand-specific queries:
CREATE INDEX idx_features_strand ON features (chromosome, strand, start_pos, end_pos)
When to Create Indexes
Create indexes when:
Tables have more than ~10,000 rows
You’re running repeated queries on the same tables
Join queries are slow
Filtering by genomic position is common
Skip indexes when:
Tables are small
You’re doing one-time analysis
Full table scans are acceptable
Query Optimization Patterns
Pre-filter by Chromosome
Always include chromosome filtering when joining tables:
# Good: Explicit chromosome filter
cursor = engine.execute("""
SELECT a.*, b.name
FROM features_a a
JOIN features_b b ON a.interval INTERSECTS b.interval
WHERE a.chromosome = 'chr1'
""")
# Also good: Cross-chromosome join with implicit filtering
# GIQL handles this, but explicit is clearer
cursor = engine.execute("""
SELECT a.*, b.name
FROM features_a a
JOIN features_b b ON a.interval INTERSECTS b.interval
AND a.chromosome = b.chromosome
""")
Use Selective Filters Early
Apply selective filters before joins:
# Good: Filter before joining
cursor = engine.execute("""
WITH filtered_variants AS (
SELECT * FROM variants
WHERE quality >= 30 AND filter = 'PASS'
)
SELECT f.*, g.name
FROM filtered_variants f
JOIN genes g ON f.interval INTERSECTS g.interval
""")
# Less efficient: Filter after joining
cursor = engine.execute("""
SELECT v.*, g.name
FROM variants v
JOIN genes g ON v.interval INTERSECTS g.interval
WHERE v.quality >= 30 AND v.filter = 'PASS'
""")
Limit Result Sets
Use LIMIT for exploratory queries:
# Good: Limit results during exploration
cursor = engine.execute("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000000-2000000'
LIMIT 100
""")
Use DISTINCT Wisely
DISTINCT can be expensive. Only use when necessary:
# Only use DISTINCT when you actually need unique rows
cursor = engine.execute("""
SELECT DISTINCT a.*
FROM features_a a
JOIN features_b b ON a.interval INTERSECTS b.interval
""")
# If you just need to check existence, use EXISTS instead
cursor = engine.execute("""
SELECT a.*
FROM features_a a
WHERE EXISTS (
SELECT 1 FROM features_b b
WHERE a.interval INTERSECTS b.interval
)
""")
NEAREST Query Optimization
Optimizing K-NN Queries
The NEAREST operator can be expensive for large datasets. Optimize with:
1. Use max_distance to limit search space:
# Good: Constrained search
cursor = engine.execute("""
SELECT peaks.name, nearest.name, nearest.distance
FROM peaks
CROSS JOIN LATERAL NEAREST(
genes,
reference=peaks.interval,
k=5,
max_distance=100000 -- Only search within 100kb
) AS nearest
""")
2. Request only the k you need:
# Good: Request exactly what you need
NEAREST(genes, reference=peaks.interval, k=3)
# Wasteful: Request more than needed
NEAREST(genes, reference=peaks.interval, k=100)
3. Index the target table:
CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)
Merge and Cluster Optimization
Efficient Clustering
For large datasets, consider pre-sorting:
# Pre-sort data for clustering
cursor = engine.execute("""
WITH sorted AS (
SELECT * FROM features
ORDER BY chromosome, start_pos
)
SELECT *, CLUSTER(interval) AS cluster_id
FROM sorted
""")
Efficient Merging
Filter before merging to reduce data volume:
# Good: Filter first, then merge
cursor = engine.execute("""
WITH filtered AS (
SELECT * FROM features
WHERE score >= 10
)
SELECT MERGE(interval), COUNT(*) AS count
FROM filtered
""")
Analyzing Query Performance
Using EXPLAIN
Analyze query execution plans:
# Get the transpiled SQL
sql = engine.transpile("""
SELECT a.*, b.name
FROM variants a
JOIN genes b ON a.interval INTERSECTS b.interval
""")
# Analyze the execution plan
cursor = engine.execute(f"EXPLAIN {sql}")
for row in cursor:
print(row)
# DuckDB also supports EXPLAIN ANALYZE for actual timing
cursor = engine.execute(f"EXPLAIN ANALYZE {sql}")
Timing Queries
Measure query execution time:
import time
start = time.time()
cursor = engine.execute("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000000-2000000'
""")
results = cursor.fetchall()
elapsed = time.time() - start
print(f"Query returned {len(results)} rows in {elapsed:.2f} seconds")
Backend-Specific Tips
DuckDB Optimizations
Use columnar strengths:
DuckDB is columnar, so queries that select few columns are faster:
# Faster: Select only needed columns
cursor = engine.execute("""
SELECT chromosome, start_pos, end_pos, name
FROM features
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
# Slower: Select all columns
cursor = engine.execute("""
SELECT *
FROM features
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
Parallel execution:
DuckDB automatically parallelizes queries. For very large datasets, ensure you’re not limiting parallelism.
SQLite Optimizations
Use covering indexes:
-- Include commonly selected columns in the index
CREATE INDEX idx_features_covering
ON features (chromosome, start_pos, end_pos, name, score)
Analyze tables:
# Help SQLite's query planner
engine.conn.execute("ANALYZE features")
Memory Management
Streaming Results
For large result sets, iterate instead of fetching all:
# Good: Stream results
cursor = engine.execute("SELECT * FROM large_table")
for row in cursor:
process(row)
# Memory-intensive: Fetch all at once
cursor = engine.execute("SELECT * FROM large_table")
all_rows = cursor.fetchall() # Loads everything into memory
Batch Processing
Process large datasets in batches:
chromosomes = ['chr1', 'chr2', 'chr3', ...] # All chromosomes
for chrom in chromosomes:
cursor = engine.execute(f"""
SELECT * FROM features
WHERE chromosome = '{chrom}'
AND interval INTERSECTS '{chrom}:1-1000000'
""")
process_chromosome(cursor)
Performance Checklist
Before running large queries, check:
□ Indexes created on genomic columns
□ Chromosome filtering included in joins
□ Selective filters applied early
□ LIMIT used for exploration
□ Only necessary columns selected
□ NEAREST queries use max_distance
□ Results streamed instead of fetched all at once
Quick Wins
Add indexes - Usually the biggest performance improvement
Filter by chromosome - Reduces join complexity significantly
Use max_distance with NEAREST - Limits search space
Stream results - Reduces memory pressure
Use DuckDB - Generally faster for analytical queries