Schema Mapping Guide ==================== This guide explains how to configure GIQL to work with your genomic data by registering table schemas and mapping logical genomic columns. .. contents:: :local: :depth: 2 Understanding Schema Mapping ---------------------------- GIQL needs to know how your genomic data is structured in order to translate genomic operators into SQL. This is done through schema registration, which maps a logical "genomic column" (used in your queries) to the physical columns in your database tables. The Core Concept ~~~~~~~~~~~~~~~~ In GIQL queries, you use a logical genomic column name like ``interval``: .. code-block:: sql SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000' Behind the scenes, GIQL expands this to actual column comparisons: .. code-block:: sql SELECT * FROM variants WHERE chromosome = 'chr1' AND start_pos < 2000 AND end_pos > 1000 Schema registration tells GIQL which physical columns (``chromosome``, ``start_pos``, ``end_pos``) correspond to the logical ``interval`` column. Registering Table Schemas ------------------------- Basic Registration ~~~~~~~~~~~~~~~~~~ Register a table schema using ``register_table_schema()``: .. code-block:: python from giql import GIQLEngine with GIQLEngine(target_dialect="duckdb") as engine: # Load data engine.load_csv("variants", "variants.csv") # Register schema engine.register_table_schema( "variants", # Table name { "id": "INTEGER", "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "name": "VARCHAR", "quality": "FLOAT", }, genomic_column="interval", # Logical column name for queries ) # Now you can use 'interval' in queries cursor = engine.execute(""" SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000' """) Required Columns ~~~~~~~~~~~~~~~~ For schema registration, your table must have columns that map to: - **chromosome**: The chromosome/contig identifier (e.g., 'chr1', 'chrX') - **start_pos**: The start position of the genomic interval (0-based, inclusive) - **end_pos**: The end position of the genomic interval (0-based, exclusive) GIQL looks for these column names by default. If your columns have different names, see :ref:`custom-column-names`. Optional Strand Column ~~~~~~~~~~~~~~~~~~~~~~ If your data includes strand information, include it in the schema: .. code-block:: python engine.register_table_schema( "features", { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "strand": "VARCHAR", # '+', '-', or '.' "name": "VARCHAR", }, genomic_column="interval", ) The strand column enables strand-specific operations in operators like CLUSTER and NEAREST. .. _custom-column-names: Custom Column Names ~~~~~~~~~~~~~~~~~~~ If your table uses different column names for genomic coordinates, specify the mapping explicitly: .. code-block:: python engine.register_table_schema( "my_table", { "chrom": "VARCHAR", # Your chromosome column "chromStart": "BIGINT", # Your start column (UCSC-style) "chromEnd": "BIGINT", # Your end column "name": "VARCHAR", }, genomic_column="interval", chromosome_column="chrom", # Map to your column name start_column="chromStart", # Map to your column name end_column="chromEnd", # Map to your column name ) Multiple Tables --------------- Register Multiple Tables ~~~~~~~~~~~~~~~~~~~~~~~~ Register all tables that will participate in genomic queries: .. code-block:: python with GIQLEngine(target_dialect="duckdb") as engine: # Load data files engine.load_csv("variants", "variants.bed") engine.load_csv("genes", "genes.bed") engine.load_csv("regulatory", "regulatory.bed") # Define common schema bed_schema = { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "name": "VARCHAR", "score": "FLOAT", "strand": "VARCHAR", } # Register each table for table in ["variants", "genes", "regulatory"]: engine.register_table_schema( table, bed_schema, genomic_column="interval", ) # Now you can join tables using genomic operators cursor = engine.execute(""" SELECT v.*, g.name AS gene_name FROM variants v JOIN genes g ON v.interval INTERSECTS g.interval """) Different Schemas Per Table ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tables can have different schemas and even different genomic column names: .. code-block:: python # Variants table with VCF-style columns engine.register_table_schema( "variants", { "CHROM": "VARCHAR", "POS": "BIGINT", "END": "BIGINT", "ID": "VARCHAR", "QUAL": "FLOAT", }, genomic_column="var_interval", chromosome_column="CHROM", start_column="POS", end_column="END", ) # Genes table with BED-style columns engine.register_table_schema( "genes", { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "gene_name": "VARCHAR", "strand": "VARCHAR", }, genomic_column="gene_interval", ) # Query using different genomic column names cursor = engine.execute(""" SELECT v.ID, g.gene_name FROM variants v JOIN genes g ON v.var_interval INTERSECTS g.gene_interval """) Coordinate Systems ------------------ Understanding BED Coordinates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GIQL uses the BED coordinate convention: - **0-based start**: The first base of a chromosome is position 0 - **Half-open intervals**: Start is inclusive, end is exclusive - **Interval [start, end)**: Contains positions from start to end-1 Example: An interval ``chr1:100-200`` covers bases 100 through 199 (100 bases total). Converting from 1-Based Coordinates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your data uses 1-based coordinates (like VCF or GFF), convert when loading: .. code-block:: python import pandas as pd # Load 1-based data df = pd.read_csv("variants.vcf", sep="\t") # Convert to 0-based df['start_pos'] = df['POS'] - 1 # Convert 1-based to 0-based df['end_pos'] = df['POS'] # For SNPs, end = start + 1 # Load into engine engine.conn.execute("CREATE TABLE variants AS SELECT * FROM df") # Register schema engine.register_table_schema( "variants", { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", # ... other columns }, genomic_column="interval", ) Working with Point Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~ For point features (like SNPs), create an interval of length 1: .. code-block:: python # For a SNP at position 1000 (1-based) # 0-based interval: [999, 1000) start_pos = 999 end_pos = 1000 Data Types ---------- Recommended Column Types ~~~~~~~~~~~~~~~~~~~~~~~~ For optimal performance, use appropriate data types: .. list-table:: :header-rows: 1 :widths: 25 25 50 * - Column - Recommended Type - Notes * - chromosome - VARCHAR - String type for chromosome names * - start_pos - BIGINT - 64-bit integer for large genomes * - end_pos - BIGINT - 64-bit integer for large genomes * - strand - VARCHAR(1) or CHAR(1) - Single character: '+', '-', '.' * - score - FLOAT or DOUBLE - Numeric scores * - name - VARCHAR - Feature identifiers Type Compatibility ~~~~~~~~~~~~~~~~~~ GIQL schemas use SQL type names. Common mappings: .. list-table:: :header-rows: 1 :widths: 30 35 35 * - GIQL Schema Type - DuckDB Type - SQLite Type * - INTEGER - INTEGER - INTEGER * - BIGINT - BIGINT - INTEGER * - VARCHAR - VARCHAR - TEXT * - FLOAT - FLOAT - REAL * - DOUBLE - DOUBLE - REAL Loading Data ------------ From CSV Files ~~~~~~~~~~~~~~ Load CSV files directly: .. code-block:: python engine.load_csv("features", "features.csv") # With custom options engine.load_csv( "features", "features.tsv", delimiter="\t", header=True, ) From Pandas DataFrames ~~~~~~~~~~~~~~~~~~~~~~ Load data from pandas: .. code-block:: python import pandas as pd df = pd.read_csv("features.bed", sep="\t", header=None, names=["chromosome", "start_pos", "end_pos", "name"]) # Register the DataFrame as a table engine.conn.execute("CREATE TABLE features AS SELECT * FROM df") # Then register the schema engine.register_table_schema( "features", { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "name": "VARCHAR", }, genomic_column="interval", ) From Existing Database Tables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If tables already exist in your database, just register their schemas: .. code-block:: python # Connect to existing database with GIQLEngine(target_dialect="duckdb", db_path="my_database.duckdb") as engine: # Register schemas for existing tables engine.register_table_schema( "existing_table", { "chromosome": "VARCHAR", "start_pos": "BIGINT", "end_pos": "BIGINT", "name": "VARCHAR", }, genomic_column="interval", ) # Query existing data cursor = engine.execute(""" SELECT * FROM existing_table WHERE interval INTERSECTS 'chr1:1000-2000' """) Troubleshooting --------------- Common Issues ~~~~~~~~~~~~~ **"Unknown column" errors:** - Ensure the table schema is registered before querying - Check that the genomic column name in your query matches the registered name - Verify column names in the schema match actual table columns **Incorrect results:** - Verify your coordinate system (0-based vs 1-based) - Check that start_pos < end_pos for all intervals - Ensure chromosome names match between tables (e.g., 'chr1' vs '1') **Performance issues:** - See the :doc:`performance` guide for optimization tips - Consider adding indexes on genomic columns Verifying Schema Registration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Check that schemas are registered correctly: .. code-block:: python # After registration, test with a simple query sql = engine.transpile(""" SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000' """) print(sql) # Should show expanded SQL with chromosome, start_pos, end_pos comparisons