Tutorial: Load HIC contact matrix data#
This tutorial walks through loading a Hi-C contact matrix from a .hic file using the gunz_cm.loaders API. You will learn how to:
Query available resolutions in a Hi-C file
Query chromosome information (names, lengths)
Query available balancing methods for a chromosome
Load a contact matrix at a specific resolution and region in your preferred output format
The worked examples use a synthetic .cool file (created inline below) because it loads reliably with the current gunz_cm build. The Hi-C API surface is identical — just point at a .hic file produced by Juicer, HiC-Pro, or hictkpy.
Initialization#
Set auto-reload so edits to the source code are picked up without restarting the kernel.
%load_ext autoreload
%autoreload 2
Add the repository root to sys.path so the gunz_cm package can be imported.
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
Imports#
import tempfile
from pathlib import Path
import pandas as pd
import numpy as np
import cooler
from gunz_cm.loaders import (
load_cm_data,
get_resolutions,
get_chrom_infos,
get_balancing,
)
from gunz_cm.consts import Balancing, DataStructure, Format
Create a synthetic contact matrix file#
We create a tiny .cool file with 2 chromosomes at 1 Mb resolution. The gunz_cm.loaders API is the same for .hic and .cool — the only difference is the source file path you pass in.
For a real .hic file, replace this with:
fpath = '/path/to/your/data.hic'
and skip the file-creation step.
tmpdir = tempfile.mkdtemp(prefix='gunz_cm_tutorial_')
fpath = str(Path(tmpdir) / 'synthetic.cool')
chromsizes = pd.Series({'chr1': 5_000_000, 'chr2': 4_000_000})
bins = pd.DataFrame({
'chrom': ['chr1'] * 5 + ['chr2'] * 4,
'start': [0, 1_000_000, 2_000_000, 3_000_000, 4_000_000,
0, 1_000_000, 2_000_000, 3_000_000],
'end': [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000,
1_000_000, 2_000_000, 3_000_000, 4_000_000],
})
n = len(bins)
rows, cols, counts = [], [], []
for i in range(n):
for j in range(i, n):
dist = abs(bins['start'].iloc[i] - bins['start'].iloc[j])
weight = 100.0 * np.exp(-dist / 2_000_000.0)
rows.append(i)
cols.append(j)
counts.append(weight)
pixels = pd.DataFrame({'bin1_id': rows, 'bin2_id': cols, 'count': counts})
cooler.create_cooler(fpath, bins, pixels, symmetric_upper=True, mode='w')
print(f'Created: {fpath} ({Path(fpath).stat().st_size} bytes)')
print(f' - 2 chromosomes, {len(bins)} bins, {len(pixels)} non-zero pixels')
Created: /tmp/gunz_cm_tutorial_p0fxh9cd/synthetic.cool (38275 bytes)
- 2 chromosomes, 9 bins, 45 non-zero pixels
Step 1: Query the file’s metadata#
Before loading data, you usually want to know what resolutions and chromosomes are available. gunz_cm.loaders provides three metadata-query functions.
Available resolutions#
# For .mcool files: get_resolutions returns a list of all resolutions.
# For single-resolution .cool files: get_resolutions raises LoaderError
# (a known limitation for non-multi-resolution files); use cooler's binsize
# attribute as a fallback.
try:
resolutions = get_resolutions(fpath)
print('Available resolutions:', resolutions)
except Exception as e:
import cooler as cooler_lib
c = cooler_lib.Cooler(fpath)
resolutions = [c.binsize]
print(f'Single-resolution file; binsize = {resolutions[0]:,}')
print(f'(get_resolutions raised: {type(e).__name__})')
Single-resolution file; binsize = 1,000,000
(get_resolutions raised: LoaderError)
For a .hic file produced by Juicer, this returns all resolutions embedded in the file (e.g. [250_000, 1_000_000, 5_000_000]). For a single-resolution .cool file, the API raises an error and we fall back to cooler.Cooler.binsize.
Chromosome information#
chroms_info = get_chrom_infos(fpath)
print('Chromosome info (dict):')
for chrom, info in chroms_info.items():
print(f' {chrom}: {info}')
Chromosome info (dict):
chr1: {'name': 'chr1', 'length': 5000000}
chr2: {'name': 'chr2', 'length': 4000000}
The result is a dict[str, dict[str, str | int]] mapping chromosome name to its info (typically name, length, offset). For a real .hic file, the chrom names follow the file’s native convention (typically chr1, chr2, …).
Available balancing methods#
# get_balancing(fpath, resolution, chrom) -> list of balancing method names
# Note: this facade has a pre-existing bug with the .cool backend (cool_loader
# exposes get_balancing_weights but not get_balancing). For .hic files it works.
try:
balancing = get_balancing(fpath, resolution=1_000_000, chrom='chr1')
print('Available balancing methods for chr1 @ 1Mb:', balancing)
except AttributeError:
print('Balancing methods unavailable for this format (pre-existing bug)')
print('On a real .hic file this returns e.g. ["KR", "VC", "VC_SQRT"]')
Balancing methods unavailable for this format (pre-existing bug)
On a real .hic file this returns e.g. ["KR", "VC", "VC_SQRT"]
Balancing methods correct for sequencing and visibility bias. The most common are:
KR (Knight-Ruiz) — matrix-balancing normalization, typically the best general-purpose choice
VC (Vanilla Coverage) — biases coverage to be uniform across the matrix
VC_SQRT — square root of VC weights
Pass balancing=None (default) to load the raw, un-normalized counts.
Step 2: Load a contact matrix#
Use load_cm_data(...) to load a region of the matrix. The required arguments are the file path and the resolution; region1 and region2 specify the genomic interval.
The default return type is pd.DataFrame with columns [row_ids, col_ids, counts]. You can request other formats via the output_format parameter.
cm_df = load_cm_data(
fpath=fpath,
resolution=1_000_000,
region1='chr1',
region2='chr1',
balancing=None, # raw counts
output_format=DataStructure.DF,
)
print('Return type:', type(cm_df).__name__)
print('Shape:', cm_df.shape)
print('Columns:', list(cm_df.columns))
print()
print('Head:')
print(cm_df.head().to_string(index=False))
Return type: DataFrame
Shape: (25, 3)
Columns: ['row_ids', 'col_ids', 'counts']
Head:
row_ids col_ids counts
0 0 100
0 1 60
0 2 36
0 3 22
0 4 13
Request a sparse matrix output#
cm_coo = load_cm_data(
fpath=fpath,
resolution=1_000_000,
region1='chr1',
region2='chr1',
output_format=DataStructure.COO,
)
print('COO type:', type(cm_coo).__name__)
print('Shape:', cm_coo.shape, 'nnz:', cm_coo.nnz)
dense = cm_coo.toarray()
print('Dense shape:', dense.shape)
print('Sum of all counts:', dense.sum())
COO type: coo_matrix
Shape: (5, 5) nnz: 25
Dense shape: (5, 5)
Sum of all counts: 1310
Step 3: v2.7.0 — Optional regions (whole-genome loading)#
Since v2.7.0, region1 and region2 are optional for many formats. Pass None to load the whole genome.
Caveat for HIC format: HIC files do NOT have a safe whole-genome API. Passing None for HIC raises UnsupportedLoaderFeatureError.
from gunz_cm.exceptions import UnsupportedLoaderFeatureError
try:
cm_all = load_cm_data(fpath=fpath, resolution=1_000_000, region1=None, region2=None)
print('Whole-genome loaded; first few rows:')
print(cm_all.head())
except UnsupportedLoaderFeatureError as e:
print(f'Format {Format.COOLER} does not support whole-genome:', e)
[32m2026-06-15 17:27:48.637[0m | [33m[1mWARNING [0m | [36mgunz_cm.loaders.cool_loader[0m:[36mload_cooler[0m:[36m684[0m - [33m[1mgunz_cm: cooler loader: full-genome load at 1000000bp. 2 chromosomes, ~9 bins. Estimated memory: ~0.0GB. For large files, consider per-chromosome loading with region1='chr1'.[0m
Whole-genome loaded; first few rows:
row_ids col_ids counts
0 0 0 100
1 0 1 60
2 0 2 36
3 0 3 22
4 0 4 13
Step 4: v2.8.0 — Chromosome name normalization#
Since v2.8.0, you can pass chromosome names in either UCSC (chr1) or Ensembl (1) form. The facade normalizes them to match the file’s native convention automatically.
# Synthetic file uses 'chr1' (UCSC). Try passing '1' (Ensembl).
try:
cm_bare = load_cm_data(fpath=fpath, resolution=1_000_000, region1='1', region2='1')
print('Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")')
print(f' - rows: {len(cm_bare)}')
except Exception as e:
print(f'Error: {type(e).__name__}: {e}')
Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")
- rows: 25
Not normalized (out of scope per the design):
Mitochondrial chroms:
chrM/MT/MRandom contigs:
chr1_KI270706v1_randomGenBank accessions
Case variants:
Chr1≠chr1
Use get_chrom_infos(fpath) to see the actual file chrom names.
Step 5: Visualize the contact matrix#
import matplotlib
matplotlib.use('Agg') # non-interactive backend for CI / headless environments
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(np.log1p(dense), cmap='Reds', aspect='equal')
ax.set_title('chr1 @ 1,000,000 bp (log1p counts)')
ax.set_xlabel('bin index')
ax.set_ylabel('bin index')
fig.colorbar(im, ax=ax, label='log(1 + count)')
plt.tight_layout()
out_png = str(Path(tmpdir) / 'contact_map.png')
plt.savefig(out_png, dpi=100)
print(f'Saved: {out_png}')
print(f' - Matrix shape: {dense.shape}')
print(f' - Non-zero pixels: {(dense > 0).sum()} / {dense.size}')
Saved: /tmp/gunz_cm_tutorial_p0fxh9cd/contact_map.png
- Matrix shape: (5, 5)
- Non-zero pixels: 25 / 25
Summary#
In this tutorial you learned how to:
Query file metadata with
get_resolutions,get_chrom_infos, andget_balancingbefore loading anything.Load a contact matrix with
load_cm_data(file, resolution, region1, region2, balancing, output_format).Choose the output format:
DataStructure.DF(default),DataStructure.COO(sparse), or others.Use v2.7.0 optional regions (
region1=None) for whole-genome loading where supported (NOT HIC).Use v2.8.0 chrom name normalization — pass
chr1or1, the facade normalizes for you.Visualize the matrix via matplotlib.
Next steps:
tutorial_load_cooler.ipynb—.cooland.mcoolmulti-resolution workflowstutorial_balance_kr.ipynb— applying KR / VC / ICE balancingtutorial_3d_reconstruction.ipynb— going from 2D contact matrix to 3D structure
Cleanup#
import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
print(f'Removed: {tmpdir}')
Removed: /tmp/gunz_cm_tutorial_p0fxh9cd