Distance and Proximity Queries
==============================

This section covers patterns for calculating genomic distances and finding
nearest features using GIQL's distance operators.

.. contents::
   :local:
   :depth: 2

Calculating Distances
---------------------

Distance Between Feature Pairs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate the distance between features in two tables:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           a.name AS feature_a,
           b.name AS feature_b,
           DISTANCE(a.interval, b.interval) AS distance
       FROM features_a a
       CROSS JOIN features_b b
       WHERE a.chromosome = b.chromosome
       ORDER BY a.name, distance
   """)

**Use case:** Generate a distance matrix between regulatory elements and genes.

.. note::

   Always include ``WHERE a.chromosome = b.chromosome`` to avoid comparing
   features on different chromosomes (which returns NULL for distance).

Identify Overlapping vs Proximal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Classify relationships based on distance:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           p.name AS peak,
           g.name AS gene,
           DISTANCE(p.interval, g.interval) AS dist,
           CASE
               WHEN DISTANCE(p.interval, g.interval) = 0 THEN 'overlapping'
               WHEN DISTANCE(p.interval, g.interval) <= 1000 THEN 'proximal (<1kb)'
               WHEN DISTANCE(p.interval, g.interval) <= 10000 THEN 'nearby (<10kb)'
               ELSE 'distant'
           END AS relationship
       FROM peaks p
       CROSS JOIN genes g
       WHERE p.chromosome = g.chromosome
   """)

**Use case:** Categorize peak-gene relationships for enhancer analysis.

Filter by Maximum Distance
~~~~~~~~~~~~~~~~~~~~~~~~~~

Find feature pairs within a distance threshold:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           a.name,
           b.name,
           DISTANCE(a.interval, b.interval) AS dist
       FROM features_a a
       CROSS JOIN features_b b
       WHERE a.chromosome = b.chromosome
         AND DISTANCE(a.interval, b.interval) <= 50000
       ORDER BY dist
   """)

**Use case:** Find regulatory elements within 50kb of genes.

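
The distance rules these queries rely on (0 for overlapping features, NULL
across chromosomes, and a gap in base pairs otherwise) can be sketched in plain
Python. This is an illustration only, not GIQL's implementation: the
``(chromosome, start, end)`` tuple layout, the half-open coordinate convention,
and the helper names ``interval_distance`` and ``classify_distance`` are all
assumptions made for this example.

.. code-block:: python

   from typing import Optional, Tuple

   # Hypothetical interval layout: (chromosome, start, end), assumed half-open.
   Interval = Tuple[str, int, int]


   def interval_distance(a: Interval, b: Interval) -> Optional[int]:
       """Return the gap in bp between two intervals, or None across chromosomes."""
       if a[0] != b[0]:
           return None  # mirrors the NULL result described in the note above
       # One of these differences is the gap when the intervals are disjoint;
       # both are negative when the intervals overlap.
       gap = max(b[1] - a[2], a[1] - b[2])
       return max(0, gap)


   def classify_distance(dist: Optional[int]) -> Optional[str]:
       """Apply the same thresholds as the CASE expression above."""
       if dist is None:
           return None
       if dist == 0:
           return "overlapping"
       if dist <= 1000:
           return "proximal (<1kb)"
       if dist <= 10000:
           return "nearby (<10kb)"
       return "distant"


   peak = ("chr1", 1000, 1500)
   gene = ("chr1", 2000, 3000)
   dist = interval_distance(peak, gene)
   print(dist, classify_distance(dist))

Under the half-open assumption, directly adjacent intervals also get a distance
of 0 and are therefore labelled ``overlapping``; GIQL's own convention may
differ, so check its reference documentation before relying on this edge case.
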
K-Nearest Neighbor Queries
--------------------------

Find K Nearest Features
~~~~~~~~~~~~~~~~~~~~~~~

For each peak, find the 3 nearest genes:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(genes, reference=peaks.interval, k=3) AS nearest
       ORDER BY peaks.name, nearest.distance
   """)

**Use case:** Annotate ChIP-seq peaks with nearby genes.

Nearest Feature to a Specific Location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find the 5 nearest genes to a specific genomic coordinate:

.. code-block:: python

   cursor = engine.execute("""
       SELECT name, distance
       FROM NEAREST(genes, reference='chr1:1000000-1001000', k=5)
       ORDER BY distance
   """)

**Use case:** Explore the genomic neighborhood of a position of interest.

Nearest with Distance Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find nearest features within a maximum distance:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=5,
           max_distance=100000
       ) AS nearest
       ORDER BY peaks.name, nearest.distance
   """)

**Use case:** Find regulatory targets within 100kb, ignoring distant genes.

Strand-Specific Queries
-----------------------

Same-Strand Nearest Neighbors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find nearest features on the same strand only:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.strand,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=3,
           stranded=true
       ) AS nearest
       ORDER BY peaks.name, nearest.distance
   """)

**Use case:** Find same-strand genes for strand-specific regulatory analysis.

Directional Queries
-------------------

Upstream Features
~~~~~~~~~~~~~~~~~

Find features upstream (5') of reference positions using signed distances:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=10,
           signed=true
       ) AS nearest
       WHERE nearest.distance < 0
       ORDER BY peaks.name, nearest.distance DESC
   """)

**Use case:** Find genes upstream of regulatory elements.

.. note::

   With ``signed=true``, negative distances indicate upstream features and
   positive distances indicate downstream features.

Downstream Features
~~~~~~~~~~~~~~~~~~~

Find features downstream (3') of reference positions:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=10,
           signed=true
       ) AS nearest
       WHERE nearest.distance > 0
       ORDER BY peaks.name, nearest.distance
   """)

**Use case:** Identify downstream targets of promoter elements.

Promoter-Proximal Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~

Find features within a specific distance window around the reference:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=10,
           signed=true
       ) AS nearest
       WHERE nearest.distance BETWEEN -2000 AND 500
       ORDER BY peaks.name, ABS(nearest.distance)
   """)

**Use case:** Find genes with peaks in their promoter regions (-2kb to +500bp from TSS).

Combined Parameters
-------------------

Strand-Specific with Distance Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find nearby same-strand features:

.. code-block:: python

   cursor = engine.execute("""
       SELECT
           peaks.name AS peak,
           nearest.name AS gene,
           nearest.distance
       FROM peaks
       CROSS JOIN LATERAL NEAREST(
           genes,
           reference=peaks.interval,
           k=5,
           max_distance=50000,
           stranded=true,
           signed=true
       ) AS nearest
       WHERE nearest.distance BETWEEN -10000 AND 10000
       ORDER BY peaks.name, ABS(nearest.distance)
   """)

**Use case:** Find same-strand genes within ±10kb for promoter-enhancer analysis.

Distance Statistics
-------------------

Average Distance to Nearest Gene
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate the average distance from peaks to their nearest gene:

.. code-block:: python

   cursor = engine.execute("""
       WITH nearest_genes AS (
           SELECT
               peaks.name AS peak,
               nearest.distance
           FROM peaks
           CROSS JOIN LATERAL NEAREST(genes, reference=peaks.interval, k=1) AS nearest
       )
       SELECT
           COUNT(*) AS peak_count,
           AVG(distance) AS avg_distance,
           MIN(distance) AS min_distance,
           MAX(distance) AS max_distance
       FROM nearest_genes
   """)

**Use case:** Characterize the genomic distribution of peaks relative to genes.

Distance Distribution by Chromosome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Analyze distance patterns per chromosome:

.. code-block:: python

   cursor = engine.execute("""
       WITH nearest_genes AS (
           SELECT
               peaks.chromosome,
               peaks.name AS peak,
               nearest.distance
           FROM peaks
           CROSS JOIN LATERAL NEAREST(genes, reference=peaks.interval, k=1) AS nearest
       )
       SELECT
           chromosome,
           COUNT(*) AS peak_count,
           AVG(distance) AS avg_distance
       FROM nearest_genes
       GROUP BY chromosome
       ORDER BY chromosome
   """)

**Use case:** Compare regulatory element distribution across chromosomes.

Window Expansion Patterns
-------------------------

Expand Search Window
~~~~~~~~~~~~~~~~~~~~

Find features within an expanded window around each feature:

.. code-block:: python

   cursor = engine.execute("""
       WITH expanded AS (
           SELECT
               name,
               chromosome,
               start_pos - 5000 AS search_start,
               end_pos + 5000 AS search_end
           FROM peaks
       )
       SELECT
           e.name AS peak,
           b.*
       FROM expanded e
       JOIN features_b b
           ON b.chromosome = e.chromosome
           AND b.start_pos < e.search_end
           AND b.end_pos > e.search_start
   """)

**Use case:** Find all features within 5kb flanking regions.

.. note::

   This pattern uses raw coordinate manipulation rather than the NEAREST
   operator, which is useful when you need custom window shapes.
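
Because the window-expansion pattern uses only plain column arithmetic, it can
be tried outside GIQL. The sketch below runs the same overlap test against an
in-memory SQLite database from Python's standard library. The table contents
are made-up illustration data, and the ``MAX(start_pos - 5000, 0)`` clamp is an
addition not present in the query above; it simply keeps the window start from
going negative near the beginning of a chromosome.

.. code-block:: python

   import sqlite3

   # In-memory database with illustrative (made-up) rows.
   conn = sqlite3.connect(":memory:")
   conn.executescript("""
       CREATE TABLE peaks (
           name TEXT, chromosome TEXT, start_pos INTEGER, end_pos INTEGER
       );
       CREATE TABLE features_b (
           name TEXT, chromosome TEXT, start_pos INTEGER, end_pos INTEGER
       );
       INSERT INTO peaks VALUES ('peak1', 'chr1', 10000, 10500);
       INSERT INTO features_b VALUES
           ('inside_window', 'chr1', 14000, 14200),
           ('outside_window', 'chr1', 20000, 20100),
           ('other_chrom', 'chr2', 10100, 10200);
   """)

   # Expand each peak by 5 kb on both sides, then keep features that
   # overlap the expanded window on the same chromosome.
   rows = conn.execute("""
       WITH expanded AS (
           SELECT
               name,
               chromosome,
               MAX(start_pos - 5000, 0) AS search_start,
               end_pos + 5000 AS search_end
           FROM peaks
       )
       SELECT e.name AS peak, b.name AS feature
       FROM expanded e
       JOIN features_b b
           ON b.chromosome = e.chromosome
           AND b.start_pos < e.search_end
           AND b.end_pos > e.search_start
   """).fetchall()

   print(rows)

The strict ``<`` and ``>`` comparisons treat coordinates as half-open, so a
feature that merely touches the edge of the expanded window is not returned.
Asymmetric windows follow the same shape: use different offsets for
``search_start`` and ``search_end``.
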