Performance Guide
=================

This guide covers strategies for optimizing GIQL query performance, including
indexing, query patterns, and backend-specific optimizations.

.. contents::
   :local:
   :depth: 2

Understanding Query Performance
-------------------------------

How GIQL Queries Execute
~~~~~~~~~~~~~~~~~~~~~~~~

When you execute a GIQL query:

1. GIQL parses the query and identifies genomic operators
2. Operators are expanded into standard SQL predicates
3. The SQL is sent to the database backend
4. The database executes the query using its optimizer

Performance depends on both the generated SQL and how the database executes it.

Common Performance Bottlenecks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Full table scans**: No indexes to speed up filtering
- **Cartesian products**: Large cross joins without early filtering
- **Missing chromosome filters**: Comparing features across all chromosomes
- **Inefficient join order**: Small tables should drive joins

Indexing Strategies
-------------------

Creating Indexes
~~~~~~~~~~~~~~~~

Create indexes on genomic columns for faster queries:

.. code-block:: python

   # DuckDB
   engine.conn.execute("""
       CREATE INDEX idx_features_position
       ON features (chromosome, start_pos, end_pos)
   """)

   # SQLite
   engine.conn.execute("""
       CREATE INDEX idx_features_position
       ON features (chromosome, start_pos, end_pos)
   """)

Recommended Index Patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~

**For single-table queries (filtering):**

.. code-block:: sql

   CREATE INDEX idx_table_position ON table_name (chromosome, start_pos, end_pos)

**For join queries:**

.. code-block:: sql

   -- Index both tables involved in joins
   CREATE INDEX idx_variants_position ON variants (chromosome, start_pos, end_pos)
   CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)

**For strand-specific queries:**

.. code-block:: sql

   CREATE INDEX idx_features_strand ON features (chromosome, strand, start_pos, end_pos)

When to Create Indexes
~~~~~~~~~~~~~~~~~~~~~~

Create indexes when:

- Tables have more than ~10,000 rows
- You're running repeated queries on the same tables
- Join queries are slow
- Filtering by genomic position is common

Skip indexes when:

- Tables are small
- You're doing one-time analysis
- Full table scans are acceptable

Query Optimization Patterns
---------------------------

Pre-filter by Chromosome
~~~~~~~~~~~~~~~~~~~~~~~~

Always include chromosome filtering when joining tables:

.. code-block:: python

   # Good: Explicit chromosome filter
   cursor = engine.execute("""
       SELECT a.*, b.name
       FROM features_a a
       JOIN features_b b ON a.interval INTERSECTS b.interval
       WHERE a.chromosome = 'chr1'
   """)

   # Also good: Cross-chromosome join with implicit filtering
   # GIQL handles this, but explicit is clearer
   cursor = engine.execute("""
       SELECT a.*, b.name
       FROM features_a a
       JOIN features_b b ON a.interval INTERSECTS b.interval
         AND a.chromosome = b.chromosome
   """)

Use Selective Filters Early
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apply selective filters before joins:

.. code-block:: python

   # Good: Filter before joining
   cursor = engine.execute("""
       WITH filtered_variants AS (
           SELECT * FROM variants
           WHERE quality >= 30 AND filter = 'PASS'
       )
       SELECT f.*, g.name
       FROM filtered_variants f
       JOIN genes g ON f.interval INTERSECTS g.interval
   """)

   # Less efficient: Filter after joining
   cursor = engine.execute("""
       SELECT v.*, g.name
       FROM variants v
       JOIN genes g ON v.interval INTERSECTS g.interval
       WHERE v.quality >= 30 AND v.filter = 'PASS'
   """)

Limit Result Sets
~~~~~~~~~~~~~~~~~

Use LIMIT for exploratory queries:

.. code-block:: python

   # Good: Limit results during exploration
   cursor = engine.execute("""
       SELECT * FROM variants
       WHERE interval INTERSECTS 'chr1:1000000-2000000'
       LIMIT 100
   """)

Use DISTINCT Wisely
~~~~~~~~~~~~~~~~~~~

DISTINCT can be expensive. Only use when necessary:

.. code-block:: python

   # Only use DISTINCT when you actually need unique rows
   cursor = engine.execute("""
       SELECT DISTINCT a.*
       FROM features_a a
       JOIN features_b b ON a.interval INTERSECTS b.interval
   """)

   # If you just need to check existence, use EXISTS instead
   cursor = engine.execute("""
       SELECT a.*
       FROM features_a a
       WHERE EXISTS (
           SELECT 1 FROM features_b b
           WHERE a.interval INTERSECTS b.interval
       )
   """)

NEAREST Query Optimization
--------------------------

Optimizing K-NN Queries
~~~~~~~~~~~~~~~~~~~~~~~

The NEAREST operator can be expensive for large datasets. Optimize with:

**1. Use max_distance to limit search space:**

.. code-block:: python

   # Good: Constrained search
   cursor = engine.execute("""
       SELECT peaks.name, nearest.name, nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=5,
           max_distance=100000   -- Only search within 100kb
       ) AS nearest
   """)

**2. Request only the k you need:**

.. code-block:: python

   # Good: Request exactly what you need
   NEAREST(genes, reference=peaks.interval, k=3)

   # Wasteful: Request more than needed
   NEAREST(genes, reference=peaks.interval, k=100)

**3. Index the target table:**

.. code-block:: sql

   CREATE INDEX idx_genes_position ON genes (chromosome, start_pos, end_pos)

Merge and Cluster Optimization
------------------------------

Efficient Clustering
~~~~~~~~~~~~~~~~~~~~

For large datasets, consider pre-sorting:

.. code-block:: python

   # Pre-sort data for clustering
   cursor = engine.execute("""
       WITH sorted AS (
           SELECT * FROM features
           ORDER BY chromosome, start_pos
       )
       SELECT *, CLUSTER(interval) AS cluster_id
       FROM sorted
   """)

Efficient Merging
~~~~~~~~~~~~~~~~~

Filter before merging to reduce data volume:

.. code-block:: python

   # Good: Filter first, then merge
   cursor = engine.execute("""
       WITH filtered AS (
           SELECT * FROM features
           WHERE score >= 10
       )
       SELECT MERGE(interval), COUNT(*) AS count
       FROM filtered
   """)

Analyzing Query Performance
---------------------------

Using EXPLAIN
~~~~~~~~~~~~~

Analyze query execution plans:

.. code-block:: python

   # Get the transpiled SQL
   sql = engine.transpile("""
       SELECT a.*, b.name
       FROM variants a
       JOIN genes b ON a.interval INTERSECTS b.interval
   """)

   # Analyze the execution plan
   cursor = engine.execute(f"EXPLAIN {sql}")
   for row in cursor:
       print(row)

   # DuckDB also supports EXPLAIN ANALYZE for actual timing
   cursor = engine.execute(f"EXPLAIN ANALYZE {sql}")

Timing Queries
~~~~~~~~~~~~~~

Measure query execution time:

.. code-block:: python

   import time

   start = time.time()
   cursor = engine.execute("""
       SELECT * FROM variants
       WHERE interval INTERSECTS 'chr1:1000000-2000000'
   """)
   results = cursor.fetchall()
   elapsed = time.time() - start

   print(f"Query returned {len(results)} rows in {elapsed:.2f} seconds")

Backend-Specific Tips
---------------------

DuckDB Optimizations
~~~~~~~~~~~~~~~~~~~~

**Use columnar strengths:**

DuckDB is columnar, so queries that select few columns are faster:

.. code-block:: python

   # Faster: Select only needed columns
   cursor = engine.execute("""
       SELECT chromosome, start_pos, end_pos, name
       FROM features
       WHERE interval INTERSECTS 'chr1:1000-2000'
   """)

   # Slower: Select all columns
   cursor = engine.execute("""
       SELECT *
       FROM features
       WHERE interval INTERSECTS 'chr1:1000-2000'
   """)

**Parallel execution:**

DuckDB automatically parallelizes queries. For very large datasets,
ensure you're not limiting parallelism.

SQLite Optimizations
~~~~~~~~~~~~~~~~~~~~

**Use covering indexes:**

.. code-block:: sql

   -- Include commonly selected columns in the index
   CREATE INDEX idx_features_covering
   ON features (chromosome, start_pos, end_pos, name, score)

**Analyze tables:**

.. code-block:: python

   # Help SQLite's query planner
   engine.conn.execute("ANALYZE features")

Memory Management
-----------------

Streaming Results
~~~~~~~~~~~~~~~~~

For large result sets, iterate instead of fetching all:

.. code-block:: python

   # Good: Stream results
   cursor = engine.execute("SELECT * FROM large_table")
   for row in cursor:
       process(row)

   # Memory-intensive: Fetch all at once
   cursor = engine.execute("SELECT * FROM large_table")
   all_rows = cursor.fetchall()  # Loads everything into memory

Batch Processing
~~~~~~~~~~~~~~~~

Process large datasets in batches:

.. code-block:: python

   chromosomes = ['chr1', 'chr2', 'chr3', ...]  # All chromosomes

   for chrom in chromosomes:
       cursor = engine.execute(f"""
           SELECT * FROM features
           WHERE chromosome = '{chrom}'
             AND interval INTERSECTS '{chrom}:1-1000000'
       """)
       process_chromosome(cursor)

Performance Checklist
---------------------

Before running large queries, check:

.. code-block:: text

   □ Indexes created on genomic columns
   □ Chromosome filtering included in joins
   □ Selective filters applied early
   □ LIMIT used for exploration
   □ Only necessary columns selected
   □ NEAREST queries use max_distance
   □ Results streamed instead of fetched all at once

Quick Wins
~~~~~~~~~~

1. **Add indexes** - Usually the biggest performance improvement
2. **Filter by chromosome** - Reduces join complexity significantly
3. **Use max_distance with NEAREST** - Limits search space
4. **Stream results** - Reduces memory pressure
5. **Use DuckDB** - Generally faster for analytical queries