Schema Mapping Guide

This guide explains how to configure GIQL to work with your genomic data by registering table schemas and mapping logical genomic columns.

Understanding Schema Mapping

GIQL needs to know how your genomic data is structured in order to translate genomic operators into SQL. This is done through schema registration, which maps a logical “genomic column” (used in your queries) to the physical columns in your database tables.

The Core Concept

In GIQL queries, you use a logical genomic column name like interval:

SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000'

Behind the scenes, GIQL expands this to actual column comparisons:

SELECT * FROM variants
WHERE chromosome = 'chr1' AND start_pos < 2000 AND end_pos > 1000

Schema registration tells GIQL which physical columns (chromosome, start_pos, end_pos) correspond to the logical interval column.

Registering Table Schemas

Basic Registration

Register a table schema using register_table_schema():

from giql import GIQLEngine

with GIQLEngine(target_dialect="duckdb") as engine:
    # Load data
    engine.load_csv("variants", "variants.csv")

    # Register schema
    engine.register_table_schema(
        "variants",                    # Table name
        {
            "id": "INTEGER",
            "chromosome": "VARCHAR",
            "start_pos": "BIGINT",
            "end_pos": "BIGINT",
            "name": "VARCHAR",
            "quality": "FLOAT",
        },
        genomic_column="interval",     # Logical column name for queries
    )

    # Now you can use 'interval' in queries
    cursor = engine.execute("""
        SELECT * FROM variants
        WHERE interval INTERSECTS 'chr1:1000-2000'
    """)

Required Columns

For schema registration, your table must have columns that map to:

  • chromosome: The chromosome/contig identifier (e.g., ‘chr1’, ‘chrX’)

  • start_pos: The start position of the genomic interval (0-based, inclusive)

  • end_pos: The end position of the genomic interval (0-based, exclusive)

GIQL looks for these column names by default. If your columns have different names, see Custom Column Names.

Optional Strand Column

If your data includes strand information, include it in the schema:

engine.register_table_schema(
    "features",
    {
        "chromosome": "VARCHAR",
        "start_pos": "BIGINT",
        "end_pos": "BIGINT",
        "strand": "VARCHAR",       # '+', '-', or '.'
        "name": "VARCHAR",
    },
    genomic_column="interval",
)

The strand column enables strand-specific operations in operators like CLUSTER and NEAREST.

Custom Column Names

If your table uses different column names for genomic coordinates, specify the mapping explicitly:

engine.register_table_schema(
    "my_table",
    {
        "chrom": "VARCHAR",        # Your chromosome column
        "chromStart": "BIGINT",    # Your start column (UCSC-style)
        "chromEnd": "BIGINT",      # Your end column
        "name": "VARCHAR",
    },
    genomic_column="interval",
    chromosome_column="chrom",      # Map to your column name
    start_column="chromStart",      # Map to your column name
    end_column="chromEnd",          # Map to your column name
)

Multiple Tables

Register Multiple Tables

Register all tables that will participate in genomic queries:

with GIQLEngine(target_dialect="duckdb") as engine:
    # Load data files
    engine.load_csv("variants", "variants.bed")
    engine.load_csv("genes", "genes.bed")
    engine.load_csv("regulatory", "regulatory.bed")

    # Define common schema
    bed_schema = {
        "chromosome": "VARCHAR",
        "start_pos": "BIGINT",
        "end_pos": "BIGINT",
        "name": "VARCHAR",
        "score": "FLOAT",
        "strand": "VARCHAR",
    }

    # Register each table
    for table in ["variants", "genes", "regulatory"]:
        engine.register_table_schema(
            table,
            bed_schema,
            genomic_column="interval",
        )

    # Now you can join tables using genomic operators
    cursor = engine.execute("""
        SELECT v.*, g.name AS gene_name
        FROM variants v
        JOIN genes g ON v.interval INTERSECTS g.interval
    """)

Different Schemas Per Table

Tables can have different schemas and even different genomic column names:

# Variants table with VCF-style columns
engine.register_table_schema(
    "variants",
    {
        "CHROM": "VARCHAR",
        "POS": "BIGINT",
        "END": "BIGINT",
        "ID": "VARCHAR",
        "QUAL": "FLOAT",
    },
    genomic_column="var_interval",
    chromosome_column="CHROM",
    start_column="POS",
    end_column="END",
)

# Genes table with BED-style columns
engine.register_table_schema(
    "genes",
    {
        "chromosome": "VARCHAR",
        "start_pos": "BIGINT",
        "end_pos": "BIGINT",
        "gene_name": "VARCHAR",
        "strand": "VARCHAR",
    },
    genomic_column="gene_interval",
)

# Query using different genomic column names
cursor = engine.execute("""
    SELECT v.ID, g.gene_name
    FROM variants v
    JOIN genes g ON v.var_interval INTERSECTS g.gene_interval
""")

Coordinate Systems

Understanding BED Coordinates

GIQL uses the BED coordinate convention:

  • 0-based start: The first base of a chromosome is position 0

  • Half-open intervals: Start is inclusive, end is exclusive

  • Interval [start, end): Contains positions from start to end-1

Example: An interval chr1:100-200 covers bases 100 through 199 (100 bases total).

Converting from 1-Based Coordinates

If your data uses 1-based coordinates (like VCF or GFF), convert when loading:

import pandas as pd

# Load 1-based data
df = pd.read_csv("variants.vcf", sep="\t")

# Convert to 0-based
df['start_pos'] = df['POS'] - 1  # Convert 1-based to 0-based
df['end_pos'] = df['POS']        # For SNPs, end = start + 1

# Load into engine
engine.conn.execute("CREATE TABLE variants AS SELECT * FROM df")

# Register schema
engine.register_table_schema(
    "variants",
    {
        "chromosome": "VARCHAR",
        "start_pos": "BIGINT",
        "end_pos": "BIGINT",
        # ... other columns
    },
    genomic_column="interval",
)

Working with Point Features

For point features (like SNPs), create an interval of length 1:

# For a SNP at position 1000 (1-based)
# 0-based interval: [999, 1000)
start_pos = 999
end_pos = 1000

Data Types

Type Compatibility

GIQL schemas use SQL type names. Common mappings:

GIQL Schema Type

DuckDB Type

SQLite Type

INTEGER

INTEGER

INTEGER

BIGINT

BIGINT

INTEGER

VARCHAR

VARCHAR

TEXT

FLOAT

FLOAT

REAL

DOUBLE

DOUBLE

REAL

Loading Data

From CSV Files

Load CSV files directly:

engine.load_csv("features", "features.csv")

# With custom options
engine.load_csv(
    "features",
    "features.tsv",
    delimiter="\t",
    header=True,
)

From Pandas DataFrames

Load data from pandas:

import pandas as pd

df = pd.read_csv("features.bed", sep="\t", header=None,
                 names=["chromosome", "start_pos", "end_pos", "name"])

# Register the DataFrame as a table
engine.conn.execute("CREATE TABLE features AS SELECT * FROM df")

# Then register the schema
engine.register_table_schema(
    "features",
    {
        "chromosome": "VARCHAR",
        "start_pos": "BIGINT",
        "end_pos": "BIGINT",
        "name": "VARCHAR",
    },
    genomic_column="interval",
)

From Existing Database Tables

If tables already exist in your database, just register their schemas:

# Connect to existing database
with GIQLEngine(target_dialect="duckdb", db_path="my_database.duckdb") as engine:
    # Register schemas for existing tables
    engine.register_table_schema(
        "existing_table",
        {
            "chromosome": "VARCHAR",
            "start_pos": "BIGINT",
            "end_pos": "BIGINT",
            "name": "VARCHAR",
        },
        genomic_column="interval",
    )

    # Query existing data
    cursor = engine.execute("""
        SELECT * FROM existing_table
        WHERE interval INTERSECTS 'chr1:1000-2000'
    """)

Troubleshooting

Common Issues

“Unknown column” errors:

  • Ensure the table schema is registered before querying

  • Check that the genomic column name in your query matches the registered name

  • Verify column names in the schema match actual table columns

Incorrect results:

  • Verify your coordinate system (0-based vs 1-based)

  • Check that start_pos < end_pos for all intervals

  • Ensure chromosome names match between tables (e.g., ‘chr1’ vs ‘1’)

Performance issues:

  • See the Performance Guide guide for optimization tips

  • Consider adding indexes on genomic columns

Verifying Schema Registration

Check that schemas are registered correctly:

# After registration, test with a simple query
sql = engine.transpile("""
    SELECT * FROM variants
    WHERE interval INTERSECTS 'chr1:1000-2000'
""")
print(sql)
# Should show expanded SQL with chromosome, start_pos, end_pos comparisons