Schema Mapping Guide
This guide explains how to configure GIQL to work with your genomic data by registering table schemas and mapping logical genomic columns.
Understanding Schema Mapping
GIQL needs to know how your genomic data is structured in order to translate genomic operators into SQL. This is done through schema registration, which maps a logical “genomic column” (used in your queries) to the physical columns in your database tables.
The Core Concept
In GIQL queries, you use a logical genomic column name like interval:
SELECT * FROM variants WHERE interval INTERSECTS 'chr1:1000-2000'
Behind the scenes, GIQL expands this to actual column comparisons:
SELECT * FROM variants
WHERE chromosome = 'chr1' AND start_pos < 2000 AND end_pos > 1000
Schema registration tells GIQL which physical columns (chromosome,
start_pos, end_pos) correspond to the logical interval column.
Registering Table Schemas
Basic Registration
Register a table schema using register_table_schema():
from giql import GIQLEngine
with GIQLEngine(target_dialect="duckdb") as engine:
# Load data
engine.load_csv("variants", "variants.csv")
# Register schema
engine.register_table_schema(
"variants", # Table name
{
"id": "INTEGER",
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"name": "VARCHAR",
"quality": "FLOAT",
},
genomic_column="interval", # Logical column name for queries
)
# Now you can use 'interval' in queries
cursor = engine.execute("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
Required Columns
For schema registration, your table must have columns that map to:
chromosome: The chromosome/contig identifier (e.g., ‘chr1’, ‘chrX’)
start_pos: The start position of the genomic interval (0-based, inclusive)
end_pos: The end position of the genomic interval (0-based, exclusive)
GIQL looks for these column names by default. If your columns have different names, see Custom Column Names.
Optional Strand Column
If your data includes strand information, include it in the schema:
engine.register_table_schema(
"features",
{
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"strand": "VARCHAR", # '+', '-', or '.'
"name": "VARCHAR",
},
genomic_column="interval",
)
The strand column enables strand-specific operations in operators like CLUSTER and NEAREST.
Custom Column Names
If your table uses different column names for genomic coordinates, specify the mapping explicitly:
engine.register_table_schema(
"my_table",
{
"chrom": "VARCHAR", # Your chromosome column
"chromStart": "BIGINT", # Your start column (UCSC-style)
"chromEnd": "BIGINT", # Your end column
"name": "VARCHAR",
},
genomic_column="interval",
chromosome_column="chrom", # Map to your column name
start_column="chromStart", # Map to your column name
end_column="chromEnd", # Map to your column name
)
Multiple Tables
Register Multiple Tables
Register all tables that will participate in genomic queries:
with GIQLEngine(target_dialect="duckdb") as engine:
# Load data files
engine.load_csv("variants", "variants.bed")
engine.load_csv("genes", "genes.bed")
engine.load_csv("regulatory", "regulatory.bed")
# Define common schema
bed_schema = {
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"name": "VARCHAR",
"score": "FLOAT",
"strand": "VARCHAR",
}
# Register each table
for table in ["variants", "genes", "regulatory"]:
engine.register_table_schema(
table,
bed_schema,
genomic_column="interval",
)
# Now you can join tables using genomic operators
cursor = engine.execute("""
SELECT v.*, g.name AS gene_name
FROM variants v
JOIN genes g ON v.interval INTERSECTS g.interval
""")
Different Schemas Per Table
Tables can have different schemas and even different genomic column names:
# Variants table with VCF-style columns
engine.register_table_schema(
"variants",
{
"CHROM": "VARCHAR",
"POS": "BIGINT",
"END": "BIGINT",
"ID": "VARCHAR",
"QUAL": "FLOAT",
},
genomic_column="var_interval",
chromosome_column="CHROM",
start_column="POS",
end_column="END",
)
# Genes table with BED-style columns
engine.register_table_schema(
"genes",
{
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"gene_name": "VARCHAR",
"strand": "VARCHAR",
},
genomic_column="gene_interval",
)
# Query using different genomic column names
cursor = engine.execute("""
SELECT v.ID, g.gene_name
FROM variants v
JOIN genes g ON v.var_interval INTERSECTS g.gene_interval
""")
Coordinate Systems
Understanding BED Coordinates
GIQL uses the BED coordinate convention:
0-based start: The first base of a chromosome is position 0
Half-open intervals: Start is inclusive, end is exclusive
Interval [start, end): Contains positions from start to end-1
Example: An interval chr1:100-200 covers bases 100 through 199 (100 bases total).
Converting from 1-Based Coordinates
If your data uses 1-based coordinates (like VCF or GFF), convert when loading:
import pandas as pd
# Load 1-based data
df = pd.read_csv("variants.vcf", sep="\t")
# Convert to 0-based
df['start_pos'] = df['POS'] - 1 # Convert 1-based to 0-based
df['end_pos'] = df['POS'] # For SNPs, end = start + 1
# Load into engine
engine.conn.execute("CREATE TABLE variants AS SELECT * FROM df")
# Register schema
engine.register_table_schema(
"variants",
{
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
# ... other columns
},
genomic_column="interval",
)
Working with Point Features
For point features (like SNPs), create an interval of length 1:
# For a SNP at position 1000 (1-based)
# 0-based interval: [999, 1000)
start_pos = 999
end_pos = 1000
Data Types
Recommended Column Types
For optimal performance, use appropriate data types:
Column |
Recommended Type |
Notes |
|---|---|---|
chromosome |
VARCHAR |
String type for chromosome names |
start_pos |
BIGINT |
64-bit integer for large genomes |
end_pos |
BIGINT |
64-bit integer for large genomes |
strand |
VARCHAR(1) or CHAR(1) |
Single character: ‘+’, ‘-’, ‘.’ |
score |
FLOAT or DOUBLE |
Numeric scores |
name |
VARCHAR |
Feature identifiers |
Type Compatibility
GIQL schemas use SQL type names. Common mappings:
GIQL Schema Type |
DuckDB Type |
SQLite Type |
|---|---|---|
INTEGER |
INTEGER |
INTEGER |
BIGINT |
BIGINT |
INTEGER |
VARCHAR |
VARCHAR |
TEXT |
FLOAT |
FLOAT |
REAL |
DOUBLE |
DOUBLE |
REAL |
Loading Data
From CSV Files
Load CSV files directly:
engine.load_csv("features", "features.csv")
# With custom options
engine.load_csv(
"features",
"features.tsv",
delimiter="\t",
header=True,
)
From Pandas DataFrames
Load data from pandas:
import pandas as pd
df = pd.read_csv("features.bed", sep="\t", header=None,
names=["chromosome", "start_pos", "end_pos", "name"])
# Register the DataFrame as a table
engine.conn.execute("CREATE TABLE features AS SELECT * FROM df")
# Then register the schema
engine.register_table_schema(
"features",
{
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"name": "VARCHAR",
},
genomic_column="interval",
)
From Existing Database Tables
If tables already exist in your database, just register their schemas:
# Connect to existing database
with GIQLEngine(target_dialect="duckdb", db_path="my_database.duckdb") as engine:
# Register schemas for existing tables
engine.register_table_schema(
"existing_table",
{
"chromosome": "VARCHAR",
"start_pos": "BIGINT",
"end_pos": "BIGINT",
"name": "VARCHAR",
},
genomic_column="interval",
)
# Query existing data
cursor = engine.execute("""
SELECT * FROM existing_table
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
Troubleshooting
Common Issues
“Unknown column” errors:
Ensure the table schema is registered before querying
Check that the genomic column name in your query matches the registered name
Verify column names in the schema match actual table columns
Incorrect results:
Verify your coordinate system (0-based vs 1-based)
Check that start_pos < end_pos for all intervals
Ensure chromosome names match between tables (e.g., ‘chr1’ vs ‘1’)
Performance issues:
See the Performance Guide guide for optimization tips
Consider adding indexes on genomic columns
Verifying Schema Registration
Check that schemas are registered correctly:
# After registration, test with a simple query
sql = engine.transpile("""
SELECT * FROM variants
WHERE interval INTERSECTS 'chr1:1000-2000'
""")
print(sql)
# Should show expanded SQL with chromosome, start_pos, end_pos comparisons