Performance Guide

This guide covers strategies for optimizing GIQL query performance, including indexing, query patterns, and backend-specific optimizations.

Understanding Query Performance 

How GIQL Queries Execute 

When you execute a GIQL query:

GIQL parses the query and identifies genomic operators
Operators are expanded into standard SQL predicates
The SQL is sent to the database backend
The database executes the query using its optimizer

Performance depends on both the generated SQL and how the database executes it.

Common Performance Bottlenecks 

Full table scans: No indexes to speed up filtering
Cartesian products: Large cross joins without early filtering
Missing chromosome filters: Comparing features across all chromosomes
Inefficient join order: Small tables should drive joins

Indexing Strategies 

Creating Indexes 

Create indexes on genomic columns for faster queries:

# DuckDB
engine.conn.execute("""
    CREATE INDEX idx_features_position
    ON features (chromosome, start_pos, end_pos)
""")

# SQLite
engine.conn.execute("""
    CREATE INDEX idx_features_position
    ON features (chromosome, start_pos, end_pos)
""")

Recommended Index Patterns 

For single-table queries (filtering):

CREATE INDEX idx_table_position ON table_name (chromosome, start_pos, end_pos)

For join queries:

-- Index both tables involved in joins
CREATE INDEX idx_variants_position ON variants (chromosome, start_pos, end_pos)
CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)

For strand-specific queries:

CREATE INDEX idx_features_strand ON features (chromosome, strand, start_pos, end_pos)

When to Create Indexes 

Create indexes when:

Tables have more than ~10,000 rows
You’re running repeated queries on the same tables
Join queries are slow
Filtering by genomic position is common

Skip indexes when:

Tables are small
You’re doing one-time analysis
Full table scans are acceptable

Query Optimization Patterns 

Pre-filter by Chromosome 

Always include chromosome filtering when joining tables:

# Good: Explicit chromosome filter
cursor = engine.execute("""
    SELECT a.*, b.name
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
    WHERE a.chromosome = 'chr1'
""")

# Also good: Cross-chromosome join with implicit filtering
# GIQL handles this, but explicit is clearer
cursor = engine.execute("""
    SELECT a.*, b.name
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
      AND a.chromosome = b.chromosome
""")

Use Selective Filters Early 

Apply selective filters before joins:

# Good: Filter before joining
cursor = engine.execute("""
    WITH filtered_variants AS (
        SELECT * FROM variants
        WHERE quality >= 30 AND filter = 'PASS'
    )
    SELECT f.*, g.name
    FROM filtered_variants f
    JOIN genes g ON f.interval INTERSECTS g.interval
""")

# Less efficient: Filter after joining
cursor = engine.execute("""
    SELECT v.*, g.name
    FROM variants v
    JOIN genes g ON v.interval INTERSECTS g.interval
    WHERE v.quality >= 30 AND v.filter = 'PASS'
""")

Limit Result Sets 

Use LIMIT for exploratory queries:

# Good: Limit results during exploration
cursor = engine.execute("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000000-2000000'
    LIMIT 100
""")

Use DISTINCT Wisely 

DISTINCT can be expensive. Only use when necessary:

# Only use DISTINCT when you actually need unique rows
cursor = engine.execute("""
    SELECT DISTINCT a.*
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
""")

# If you just need to check existence, use EXISTS instead
cursor = engine.execute("""
    SELECT a.*
    FROM features_a a
    WHERE EXISTS (
        SELECT 1 FROM features_b b
        WHERE a.interval INTERSECTS b.interval
    )
""")

NEAREST Query Optimization 

Optimizing K-NN Queries 

The NEAREST operator can be expensive for large datasets. Optimize with:

1. Use max_distance to limit search space:

# Good: Constrained search
cursor = engine.execute("""
    SELECT peaks.name, nearest.name, nearest.distance
    FROM peaks
    CROSS JOIN LATERAL NEAREST(
        genes,
        reference=peaks.interval,
        k=5,
        max_distance=100000   -- Only search within 100kb
    ) AS nearest
""")

2. Request only the k you need:

# Good: Request exactly what you need
NEAREST(genes, reference=peaks.interval, k=3)

# Wasteful: Request more than needed
NEAREST(genes, reference=peaks.interval, k=100)

3. Index the target table:

CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)

Merge and Cluster Optimization 

Efficient Clustering 

For large datasets, consider pre-sorting:

# Pre-sort data for clustering
cursor = engine.execute("""
    WITH sorted AS (
        SELECT * FROM features
        ORDER BY chromosome, start_pos
    )
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM sorted
""")

Efficient Merging 

Filter before merging to reduce data volume:

# Good: Filter first, then merge
cursor = engine.execute("""
    WITH filtered AS (
        SELECT * FROM features
        WHERE score >= 10
    )
    SELECT MERGE(interval), COUNT(*) AS count
    FROM filtered
""")

Analyzing Query Performance 

Using EXPLAIN 

Analyze query execution plans:

# Get the transpiled SQL
sql = engine.transpile("""
    SELECT a.*, b.name
    FROM variants a
    JOIN genes b ON a.interval INTERSECTS b.interval
""")

# Analyze the execution plan
cursor = engine.execute(f"EXPLAIN {sql}")
for row in cursor:
    print(row)

# DuckDB also supports EXPLAIN ANALYZE for actual timing
cursor = engine.execute(f"EXPLAIN ANALYZE {sql}")

Timing Queries 

Measure query execution time:

import time

start = time.time()
cursor = engine.execute("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000000-2000000'
""")
results = cursor.fetchall()
elapsed = time.time() - start

print(f"Query returned {len(results)} rows in {elapsed:.2f} seconds")

Backend-Specific Tips 

DuckDB Optimizations 

Use columnar strengths:

DuckDB is columnar, so queries that select few columns are faster:

# Faster: Select only needed columns
cursor = engine.execute("""
    SELECT chromosome, start_pos, end_pos, name
    FROM features
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")

# Slower: Select all columns
cursor = engine.execute("""
    SELECT *
    FROM features
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")

Parallel execution:

DuckDB automatically parallelizes queries. For very large datasets, ensure you’re not limiting parallelism.

SQLite Optimizations 

Use covering indexes:

-- Include commonly selected columns in the index
CREATE INDEX idx_features_covering
ON features (chromosome, start_pos, end_pos, name, score)

Analyze tables:

# Help SQLite's query planner
engine.conn.execute("ANALYZE features")

Memory Management 

Streaming Results 

For large result sets, iterate instead of fetching all:

# Good: Stream results
cursor = engine.execute("SELECT * FROM large_table")
for row in cursor:
    process(row)

# Memory-intensive: Fetch all at once
cursor = engine.execute("SELECT * FROM large_table")
all_rows = cursor.fetchall()  # Loads everything into memory

Batch Processing 

Process large datasets in batches:

chromosomes = ['chr1', 'chr2', 'chr3', ...]  # All chromosomes

for chrom in chromosomes:
    cursor = engine.execute(f"""
        SELECT * FROM features
        WHERE chromosome = '{chrom}'
          AND interval INTERSECTS '{chrom}:1-1000000'
    """)
    process_chromosome(cursor)

Performance Checklist 

Before running large queries, check:

□ Indexes created on genomic columns
□ Chromosome filtering included in joins
□ Selective filters applied early
□ LIMIT used for exploration
□ Only necessary columns selected
□ NEAREST queries use max_distance
□ Results streamed instead of fetched all at once

Quick Wins 

Add indexes - Usually the biggest performance improvement
Filter by chromosome - Reduces join complexity significantly
Use max_distance with NEAREST - Limits search space
Stream results - Reduces memory pressure
Use DuckDB - Generally faster for analytical queries