Performance Guide

This guide covers strategies for optimizing GIQL query performance, including indexing, query patterns, and backend-specific optimizations.

Understanding Query Performance

How GIQL Queries Execute

When you execute a GIQL query:

  1. GIQL parses the query and identifies genomic operators

  2. Operators are expanded into standard SQL predicates

  3. The SQL is sent to the database backend

  4. The database executes the query using its optimizer

Performance depends on both the generated SQL and how the database executes it.

Common Performance Bottlenecks

  • Full table scans: No indexes to speed up filtering

  • Cartesian products: Large cross joins without early filtering

  • Missing chromosome filters: Comparing features across all chromosomes

  • Inefficient join order: Small tables should drive joins

Indexing Strategies

Creating Indexes

Create indexes on genomic columns for faster queries:

# DuckDB
engine.conn.execute("""
    CREATE INDEX idx_features_position
    ON features (chromosome, start_pos, end_pos)
""")

# SQLite
engine.conn.execute("""
    CREATE INDEX idx_features_position
    ON features (chromosome, start_pos, end_pos)
""")

When to Create Indexes

Create indexes when:

  • Tables have more than ~10,000 rows

  • You’re running repeated queries on the same tables

  • Join queries are slow

  • Filtering by genomic position is common

Skip indexes when:

  • Tables are small

  • You’re doing one-time analysis

  • Full table scans are acceptable

Query Optimization Patterns

Pre-filter by Chromosome

Always include chromosome filtering when joining tables:

# Good: Explicit chromosome filter
cursor = engine.execute("""
    SELECT a.*, b.name
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
    WHERE a.chromosome = 'chr1'
""")

# Also good: Cross-chromosome join with implicit filtering
# GIQL handles this, but explicit is clearer
cursor = engine.execute("""
    SELECT a.*, b.name
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
      AND a.chromosome = b.chromosome
""")

Use Selective Filters Early

Apply selective filters before joins:

# Good: Filter before joining
cursor = engine.execute("""
    WITH filtered_variants AS (
        SELECT * FROM variants
        WHERE quality >= 30 AND filter = 'PASS'
    )
    SELECT f.*, g.name
    FROM filtered_variants f
    JOIN genes g ON f.interval INTERSECTS g.interval
""")

# Less efficient: Filter after joining
cursor = engine.execute("""
    SELECT v.*, g.name
    FROM variants v
    JOIN genes g ON v.interval INTERSECTS g.interval
    WHERE v.quality >= 30 AND v.filter = 'PASS'
""")

Limit Result Sets

Use LIMIT for exploratory queries:

# Good: Limit results during exploration
cursor = engine.execute("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000000-2000000'
    LIMIT 100
""")

Use DISTINCT Wisely

DISTINCT can be expensive. Only use when necessary:

# Only use DISTINCT when you actually need unique rows
cursor = engine.execute("""
    SELECT DISTINCT a.*
    FROM features_a a
    JOIN features_b b ON a.interval INTERSECTS b.interval
""")

# If you just need to check existence, use EXISTS instead
cursor = engine.execute("""
    SELECT a.*
    FROM features_a a
    WHERE EXISTS (
        SELECT 1 FROM features_b b
        WHERE a.interval INTERSECTS b.interval
    )
""")

NEAREST Query Optimization

Optimizing K-NN Queries

The NEAREST operator can be expensive for large datasets. Optimize with:

1. Use max_distance to limit search space:

# Good: Constrained search
cursor = engine.execute("""
    SELECT peaks.name, nearest.name, nearest.distance
    FROM peaks
    CROSS JOIN LATERAL NEAREST(
        genes,
        reference=peaks.interval,
        k=5,
        max_distance=100000   -- Only search within 100kb
    ) AS nearest
""")

2. Request only the k you need:

# Good: Request exactly what you need
NEAREST(genes, reference=peaks.interval, k=3)

# Wasteful: Request more than needed
NEAREST(genes, reference=peaks.interval, k=100)

3. Index the target table:

CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)

Merge and Cluster Optimization

Efficient Clustering

For large datasets, consider pre-sorting:

# Pre-sort data for clustering
cursor = engine.execute("""
    WITH sorted AS (
        SELECT * FROM features
        ORDER BY chromosome, start_pos
    )
    SELECT *, CLUSTER(interval) AS cluster_id
    FROM sorted
""")

Efficient Merging

Filter before merging to reduce data volume:

# Good: Filter first, then merge
cursor = engine.execute("""
    WITH filtered AS (
        SELECT * FROM features
        WHERE score >= 10
    )
    SELECT MERGE(interval), COUNT(*) AS count
    FROM filtered
""")

Analyzing Query Performance

Using EXPLAIN

Analyze query execution plans:

# Get the transpiled SQL
sql = engine.transpile("""
    SELECT a.*, b.name
    FROM variants a
    JOIN genes b ON a.interval INTERSECTS b.interval
""")

# Analyze the execution plan
cursor = engine.execute(f"EXPLAIN {sql}")
for row in cursor:
    print(row)

# DuckDB also supports EXPLAIN ANALYZE for actual timing
cursor = engine.execute(f"EXPLAIN ANALYZE {sql}")

Timing Queries

Measure query execution time:

import time

start = time.time()
cursor = engine.execute("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000000-2000000'
""")
results = cursor.fetchall()
elapsed = time.time() - start

print(f"Query returned {len(results)} rows in {elapsed:.2f} seconds")

Backend-Specific Tips

DuckDB Optimizations

Use columnar strengths:

DuckDB is columnar, so queries that select few columns are faster:

# Faster: Select only needed columns
cursor = engine.execute("""
    SELECT chromosome, start_pos, end_pos, name
    FROM features
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")

# Slower: Select all columns
cursor = engine.execute("""
    SELECT *
    FROM features
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")

Parallel execution:

DuckDB automatically parallelizes queries. For very large datasets, ensure you’re not limiting parallelism.

SQLite Optimizations

Use covering indexes:

-- Include commonly selected columns in the index
CREATE INDEX idx_features_covering
ON features (chromosome, start_pos, end_pos, name, score)

Analyze tables:

# Help SQLite's query planner
engine.conn.execute("ANALYZE features")

Memory Management

Streaming Results

For large result sets, iterate instead of fetching all:

# Good: Stream results
cursor = engine.execute("SELECT * FROM large_table")
for row in cursor:
    process(row)

# Memory-intensive: Fetch all at once
cursor = engine.execute("SELECT * FROM large_table")
all_rows = cursor.fetchall()  # Loads everything into memory

Batch Processing

Process large datasets in batches:

chromosomes = ['chr1', 'chr2', 'chr3', ...]  # All chromosomes

for chrom in chromosomes:
    cursor = engine.execute(f"""
        SELECT * FROM features
        WHERE chromosome = '{chrom}'
          AND interval INTERSECTS '{chrom}:1-1000000'
    """)
    process_chromosome(cursor)

Performance Checklist

Before running large queries, check:

□ Indexes created on genomic columns
□ Chromosome filtering included in joins
□ Selective filters applied early
□ LIMIT used for exploration
□ Only necessary columns selected
□ NEAREST queries use max_distance
□ Results streamed instead of fetched all at once

Quick Wins

  1. Add indexes - Usually the biggest performance improvement

  2. Filter by chromosome - Reduces join complexity significantly

  3. Use max_distance with NEAREST - Limits search space

  4. Stream results - Reduces memory pressure

  5. Use DuckDB - Generally faster for analytical queries