Tutorial: Load HIC contact matrix data#

This tutorial walks through loading a Hi-C contact matrix from a .hic file using the gunz_cm.loaders API. You will learn how to:

Query available resolutions in a Hi-C file
Query chromosome information (names, lengths)
Query available balancing methods for a chromosome
Load a contact matrix at a specific resolution and region in your preferred output format

The worked examples use a synthetic .cool file (created inline below) because it loads reliably with the current gunz_cm build. The Hi-C API surface is identical — just point at a .hic file produced by Juicer, HiC-Pro, or hictkpy.

Initialization#

Set auto-reload so edits to the source code are picked up without restarting the kernel.

%load_ext autoreload
%autoreload 2

Add the repository root to sys.path so the gunz_cm package can be imported.

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir

assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

Imports#

import tempfile
from pathlib import Path

import pandas as pd
import numpy as np

import cooler

from gunz_cm.loaders import (
    load_cm_data,
    get_resolutions,
    get_chrom_infos,
    get_balancing,
)
from gunz_cm.consts import Balancing, DataStructure, Format

Create a synthetic contact matrix file#

We create a tiny .cool file with 2 chromosomes at 1 Mb resolution. The gunz_cm.loaders API is the same for .hic and .cool — the only difference is the source file path you pass in.

For a real .hic file, replace this with:

fpath = '/path/to/your/data.hic'

and skip the file-creation step.

tmpdir = tempfile.mkdtemp(prefix='gunz_cm_tutorial_')
fpath = str(Path(tmpdir) / 'synthetic.cool')

chromsizes = pd.Series({'chr1': 5_000_000, 'chr2': 4_000_000})
bins = pd.DataFrame({
    'chrom': ['chr1'] * 5 + ['chr2'] * 4,
    'start': [0, 1_000_000, 2_000_000, 3_000_000, 4_000_000,
              0, 1_000_000, 2_000_000, 3_000_000],
    'end':   [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000,
              1_000_000, 2_000_000, 3_000_000, 4_000_000],
})

n = len(bins)
rows, cols, counts = [], [], []
for i in range(n):
    for j in range(i, n):
        dist = abs(bins['start'].iloc[i] - bins['start'].iloc[j])
        weight = 100.0 * np.exp(-dist / 2_000_000.0)
        rows.append(i)
        cols.append(j)
        counts.append(weight)
pixels = pd.DataFrame({'bin1_id': rows, 'bin2_id': cols, 'count': counts})

cooler.create_cooler(fpath, bins, pixels, symmetric_upper=True, mode='w')
print(f'Created: {fpath} ({Path(fpath).stat().st_size} bytes)')
print(f'  - 2 chromosomes, {len(bins)} bins, {len(pixels)} non-zero pixels')

Created: /tmp/gunz_cm_tutorial_p0fxh9cd/synthetic.cool (38275 bytes)
  - 2 chromosomes, 9 bins, 45 non-zero pixels

Step 1: Query the file’s metadata#

Before loading data, you usually want to know what resolutions and chromosomes are available. gunz_cm.loaders provides three metadata-query functions.

Available resolutions#

# For .mcool files: get_resolutions returns a list of all resolutions.
# For single-resolution .cool files: get_resolutions raises LoaderError
# (a known limitation for non-multi-resolution files); use cooler's binsize
# attribute as a fallback.
try:
    resolutions = get_resolutions(fpath)
    print('Available resolutions:', resolutions)
except Exception as e:
    import cooler as cooler_lib
    c = cooler_lib.Cooler(fpath)
    resolutions = [c.binsize]
    print(f'Single-resolution file; binsize = {resolutions[0]:,}')
    print(f'(get_resolutions raised: {type(e).__name__})')

Single-resolution file; binsize = 1,000,000
(get_resolutions raised: LoaderError)

For a .hic file produced by Juicer, this returns all resolutions embedded in the file (e.g. [250_000, 1_000_000, 5_000_000]). For a single-resolution .cool file, the API raises an error and we fall back to cooler.Cooler.binsize.

Chromosome information#

chroms_info = get_chrom_infos(fpath)
print('Chromosome info (dict):')
for chrom, info in chroms_info.items():
    print(f'  {chrom}: {info}')

Chromosome info (dict):
  chr1: {'name': 'chr1', 'length': 5000000}
  chr2: {'name': 'chr2', 'length': 4000000}

The result is a dict[str, dict[str, str | int]] mapping chromosome name to its info (typically name, length, offset). For a real .hic file, the chrom names follow the file’s native convention (typically chr1, chr2, …).

Available balancing methods#

# get_balancing(fpath, resolution, chrom) -> list of balancing method names
# Note: this facade has a pre-existing bug with the .cool backend (cool_loader
# exposes get_balancing_weights but not get_balancing). For .hic files it works.
try:
    balancing = get_balancing(fpath, resolution=1_000_000, chrom='chr1')
    print('Available balancing methods for chr1 @ 1Mb:', balancing)
except AttributeError:
    print('Balancing methods unavailable for this format (pre-existing bug)')
    print('On a real .hic file this returns e.g. ["KR", "VC", "VC_SQRT"]')

Balancing methods unavailable for this format (pre-existing bug)
On a real .hic file this returns e.g. ["KR", "VC", "VC_SQRT"]

Balancing methods correct for sequencing and visibility bias. The most common are:

KR (Knight-Ruiz) — matrix-balancing normalization, typically the best general-purpose choice
VC (Vanilla Coverage) — biases coverage to be uniform across the matrix
VC_SQRT — square root of VC weights

Pass balancing=None (default) to load the raw, un-normalized counts.

Step 2: Load a contact matrix#

Use load_cm_data(...) to load a region of the matrix. The required arguments are the file path and the resolution; region1 and region2 specify the genomic interval.

The default return type is pd.DataFrame with columns [row_ids, col_ids, counts]. You can request other formats via the output_format parameter.

cm_df = load_cm_data(
    fpath=fpath,
    resolution=1_000_000,
    region1='chr1',
    region2='chr1',
    balancing=None,  # raw counts
    output_format=DataStructure.DF,
)
print('Return type:', type(cm_df).__name__)
print('Shape:', cm_df.shape)
print('Columns:', list(cm_df.columns))
print()
print('Head:')
print(cm_df.head().to_string(index=False))

Return type: DataFrame
Shape: (25, 3)
Columns: ['row_ids', 'col_ids', 'counts']

Head:
 row_ids  col_ids  counts
       0        0     100
       0        1      60
       0        2      36
       0        3      22
       0        4      13

Request a sparse matrix output#

cm_coo = load_cm_data(
    fpath=fpath,
    resolution=1_000_000,
    region1='chr1',
    region2='chr1',
    output_format=DataStructure.COO,
)
print('COO type:', type(cm_coo).__name__)
print('Shape:', cm_coo.shape, 'nnz:', cm_coo.nnz)

dense = cm_coo.toarray()
print('Dense shape:', dense.shape)
print('Sum of all counts:', dense.sum())

COO type: coo_matrix
Shape: (5, 5) nnz: 25
Dense shape: (5, 5)
Sum of all counts: 1310

Step 3: v2.7.0 — Optional regions (whole-genome loading)#

Since v2.7.0, region1 and region2 are optional for many formats. Pass None to load the whole genome.

Caveat for HIC format: HIC files do NOT have a safe whole-genome API. Passing None for HIC raises UnsupportedLoaderFeatureError.

from gunz_cm.exceptions import UnsupportedLoaderFeatureError

try:
    cm_all = load_cm_data(fpath=fpath, resolution=1_000_000, region1=None, region2=None)
    print('Whole-genome loaded; first few rows:')
    print(cm_all.head())
except UnsupportedLoaderFeatureError as e:
    print(f'Format {Format.COOLER} does not support whole-genome:', e)

[32m2026-06-15 17:27:48.637[0m | [33m[1mWARNING [0m | [36mgunz_cm.loaders.cool_loader[0m:[36mload_cooler[0m:[36m684[0m - [33m[1mgunz_cm: cooler loader: full-genome load at 1000000bp. 2 chromosomes, ~9 bins. Estimated memory: ~0.0GB. For large files, consider per-chromosome loading with region1='chr1'.[0m


Whole-genome loaded; first few rows:
   row_ids  col_ids  counts
0        0        0     100
1        0        1      60
2        0        2      36
3        0        3      22
4        0        4      13

Step 4: v2.8.0 — Chromosome name normalization#

Since v2.8.0, you can pass chromosome names in either UCSC (chr1) or Ensembl (1) form. The facade normalizes them to match the file’s native convention automatically.

# Synthetic file uses 'chr1' (UCSC). Try passing '1' (Ensembl).
try:
    cm_bare = load_cm_data(fpath=fpath, resolution=1_000_000, region1='1', region2='1')
    print('Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")')
    print(f'  - rows: {len(cm_bare)}')
except Exception as e:
    print(f'Error: {type(e).__name__}: {e}')

Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")
  - rows: 25

Not normalized (out of scope per the design):

Mitochondrial chroms: chrM / MT / M
Random contigs: chr1_KI270706v1_random
GenBank accessions
Case variants: Chr1 ≠ chr1

Use get_chrom_infos(fpath) to see the actual file chrom names.

Step 5: Visualize the contact matrix#

import matplotlib
matplotlib.use('Agg')  # non-interactive backend for CI / headless environments
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(np.log1p(dense), cmap='Reds', aspect='equal')
ax.set_title('chr1 @ 1,000,000 bp (log1p counts)')
ax.set_xlabel('bin index')
ax.set_ylabel('bin index')
fig.colorbar(im, ax=ax, label='log(1 + count)')
plt.tight_layout()
out_png = str(Path(tmpdir) / 'contact_map.png')
plt.savefig(out_png, dpi=100)
print(f'Saved: {out_png}')
print(f'  - Matrix shape: {dense.shape}')
print(f'  - Non-zero pixels: {(dense > 0).sum()} / {dense.size}')

Saved: /tmp/gunz_cm_tutorial_p0fxh9cd/contact_map.png
  - Matrix shape: (5, 5)
  - Non-zero pixels: 25 / 25

Summary#

In this tutorial you learned how to:

Query file metadata with get_resolutions, get_chrom_infos, and get_balancing before loading anything.
Load a contact matrix with load_cm_data(file, resolution, region1, region2, balancing, output_format).
Choose the output format: DataStructure.DF (default), DataStructure.COO (sparse), or others.
Use v2.7.0 optional regions (region1=None) for whole-genome loading where supported (NOT HIC).
Use v2.8.0 chrom name normalization — pass chr1 or 1, the facade normalizes for you.
Visualize the matrix via matplotlib.

Next steps:

tutorial_load_cooler.ipynb — .cool and .mcool multi-resolution workflows
tutorial_balance_kr.ipynb — applying KR / VC / ICE balancing
tutorial_3d_reconstruction.ipynb — going from 2D contact matrix to 3D structure

Cleanup#

import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
print(f'Removed: {tmpdir}')

Removed: /tmp/gunz_cm_tutorial_p0fxh9cd