Tutorial: Load COOL / MCOOL contact matrix data#

This tutorial walks through loading a Hi-C contact matrix from a .cool or .mcool (multi-resolution) file using the gunz_cm.loaders API. You will learn how to:

Create a synthetic multi-resolution .mcool file inline with cooler.create_cooler + cooler.zoomify_cooler
Query available resolutions in a multi-resolution cooler
Query chromosome information (names, lengths)
Load a contact matrix at a specific resolution and region in your preferred output format
Iterate over all available resolutions in a .mcool and compare shapes

The worked examples use a synthetic .mcool file (created inline below) because it loads reliably with the current gunz_cm build. The API surface for .cool and .mcool is identical at the load_cm_data(...) call site — the difference is just the set of resolutions you can choose from.

Initialization#

Set auto-reload so edits to the source code are picked up without restarting the kernel.

%load_ext autoreload
%autoreload 2

Add the repository root to sys.path so the gunz_cm package can be imported.

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir

assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

Imports#

import tempfile
from pathlib import Path

import pandas as pd
import numpy as np

import cooler

from gunz_cm.loaders import (
    load_cm_data,
    get_resolutions,
    get_chrom_infos,
)
from gunz_cm.consts import DataStructure

Create a synthetic multi-resolution .mcool file#

We build a synthetic .mcool (multi-resolution cooler) in two steps:

cooler.create_cooler writes the base resolution (finest bin size).
cooler.zoomify_cooler recursively aggregates bins to produce coarser resolutions, all stored inside the same .mcool file under resolutions/<r>.

Zoomify is constrained: every target resolution must be an integer multiple of the base (finest) resolution. We start from a 100 kb base and generate three resolutions: 100 kb, 1 Mb, and 5 Mb.

For a real .mcool file, replace this with:

fpath = '/path/to/your/data.mcool'

and skip the file-creation step. The load_cm_data(...) API is identical.

tmpdir = tempfile.mkdtemp(prefix='gunz_cm_tutorial_')
fpath = str(Path(tmpdir) / 'synthetic.mcool')

# Base resolution: 100 kb. chr1 -> 50 bins, chr2 -> 40 bins.
base_res = 100_000
n1, n2 = 50, 40
starts1 = np.arange(n1) * base_res
starts2 = np.arange(n2) * base_res
bins = pd.DataFrame({
    'chrom': ['chr1'] * n1 + ['chr2'] * n2,
    'start': list(starts1) + list(starts2),
    'end':   list(starts1 + base_res) + list(starts2 + base_res),
})

# Synthetic pixel weights: exponentially decaying with genomic distance.
n = len(bins)
rows, cols, counts = [], [], []
for i in range(n):
    for j in range(i, n):
        dist = abs(bins['start'].iloc[i] - bins['start'].iloc[j])
        weight = 100.0 * np.exp(-dist / 2_000_000.0)
        rows.append(i)
        cols.append(j)
        counts.append(weight)
pixels = pd.DataFrame({'bin1_id': rows, 'bin2_id': cols, 'count': counts})

# Step 1: write base 100 kb cooler.
base_fpath = str(Path(tmpdir) / 'base_100kb.cool')
cooler.create_cooler(base_fpath, bins, pixels, symmetric_upper=True, mode='w')
print(f'Base resolution: {base_res:,} bp ({len(bins)} bins, {len(pixels)} non-zero pixels)')

# Step 2: zoomify into a multi-resolution .mcool.
# Target resolutions must be integer multiples of the base resolution (100 kb).
target_resolutions = [100_000, 1_000_000, 5_000_000]
cooler.zoomify_cooler(
    base_uris=base_fpath,
    outfile=fpath,
    resolutions=target_resolutions,
    chunksize=100,
    nproc=1,
)
print(f'Zoomified .mcool: {fpath} ({Path(fpath).stat().st_size} bytes)')
print(f'  Resolutions stored: {target_resolutions}')

Base resolution: 100,000 bp (90 bins, 4095 non-zero pixels)
Zoomified .mcool: /tmp/gunz_cm_tutorial_2ijeg_yq/synthetic.mcool (122281 bytes)
  Resolutions stored: [100000, 1000000, 5000000]

Step 1: Query the file’s metadata#

Before loading data, you usually want to know what resolutions and chromosomes are available. gunz_cm.loaders exposes metadata-query functions that work for both .cool and .mcool. The key difference: for a .mcool file, get_resolutions returns a list of all available resolutions; for a single-resolution .cool file, it raises LoaderError (covered in tutorial_load_hic.ipynb).

Available resolutions#

resolutions = get_resolutions(fpath)
print(f'Available resolutions: {sorted(resolutions)}')
print(f'Sorted ascending: {sorted(resolutions)}')
print(f'  - finest (base):   {min(resolutions):,} bp')
print(f'  - coarsest:        {max(resolutions):,} bp')
print(f'  - number of zoom levels: {len(resolutions)}')

Available resolutions: [100000, 1000000, 5000000]
Sorted ascending: [100000, 1000000, 5000000]
  - finest (base):   100,000 bp
  - coarsest:        5,000,000 bp
  - number of zoom levels: 3


/tmp/ipykernel_2697605/2254348678.py:1: DeprecationWarning: get_resolutions() is deprecated and will be removed in v2.13.0. Use get_bin_size_bps() instead.
  resolutions = get_resolutions(fpath)

get_resolutions is the canonical way to discover which zoom levels are embedded in a .mcool. Internally it reads the resolutions/ group of the HDF5 file.

Chromosome information#

chroms_info = get_chrom_infos(fpath)
print('Chromosome info (dict):')
for chrom, info in chroms_info.items():
    print(f'  {chrom}: {info}')

Chromosome info (dict):
  chr1: {'name': 'chr1', 'length': 5000000}
  chr2: {'name': 'chr2', 'length': 4000000}

The result is a dict[str, dict[str, str | int]] mapping chromosome name to its info (typically name, length, offset). For real .mcool files generated by cooler, the chrom names follow the source assembly’s native convention (typically chr1, chr2, …).

Step 2: Load a contact matrix at a specific resolution#

Use load_cm_data(...) to load a region of the matrix. The required arguments are the file path and the resolution; region1 and region2 specify the genomic interval.

The default return type is pd.DataFrame with columns [row_ids, col_ids, counts]. You can request other formats via the output_format parameter.

# Pick a resolution from the list returned by get_resolutions().
resolutions = sorted(get_resolutions(fpath))
chosen = 1_000_000
assert chosen in resolutions, f'{chosen} not in {resolutions}'

cm_df = load_cm_data(
    fpath=fpath,
    bin_size_bp=chosen,
    region1='chr1',
    region2='chr1',
    balancing=None,  # raw counts
    output_format=DataStructure.DF,
)
print(f'Loading at {chosen:,} bp (chr1:chr1):')
print('  Return type:', type(cm_df).__name__)
print('  Shape:', cm_df.shape)
print('  Columns:', list(cm_df.columns))
print()
print('Head:')
print(cm_df.head().to_string(index=False))

Loading at 1,000,000 bp (chr1:chr1):
  Return type: DataFrame
  Shape: (25, 3)
  Columns: ['row_ids', 'col_ids', 'counts']

Head:
 row_ids  col_ids  counts
       0        0    4751
       0        1    6137
       0        2    3693
       0        3    2235
       0        4    1334


/tmp/ipykernel_2697605/2338234655.py:2: DeprecationWarning: get_resolutions() is deprecated and will be removed in v2.13.0. Use get_bin_size_bps() instead.
  resolutions = sorted(get_resolutions(fpath))

Request a sparse matrix output#

cm_coo = load_cm_data(
    fpath=fpath,
    bin_size_bp=1_000_000,
    region1='chr1',
    region2='chr1',
    output_format=DataStructure.COO,
)
print('COO type:', type(cm_coo).__name__)
print('Shape:', cm_coo.shape, 'nnz:', cm_coo.nnz)

dense_1mb = cm_coo.toarray()
print('Dense shape:', dense_1mb.shape)
print('Sum of all counts:', dense_1mb.sum())

COO type: coo_matrix
Shape: (5, 5) nnz: 25
Dense shape: (5, 5)
Sum of all counts: 106617

Switch to a different resolution#

# Load the same region at the finest (base) resolution. The shape is 10x larger per axis.
cm_fine = load_cm_data(
    fpath=fpath,
    bin_size_bp=100_000,
    region1='chr1',
    region2='chr1',
    output_format=DataStructure.COO,
)
dense_100kb = cm_fine.toarray()
print(f'Loading the same region at 100,000 bp:')
print(f'  Shape: {cm_fine.toarray().sum().item()}...') if False else None  # noqa
print(f'  Shape: {dense_100kb.shape[0]*dense_100kb.shape[1]} cells = {cm_fine.nnz} non-zero, dense {dense_100kb.shape}')
print(f'  First few rows of the COO coordinate DataFrame view:')
cm_fine_df = load_cm_data(
    fpath=fpath,
    bin_size_bp=100_000,
    region1='chr1',
    region2='chr1',
    output_format=DataStructure.DF,
)
print(cm_fine_df.head().to_string(index=False))

Loading the same region at 100,000 bp:
  Shape: 2500 cells = 2500 non-zero, dense (50, 50)
  First few rows of the COO coordinate DataFrame view:
 row_ids  col_ids  counts
       0        0     100
       0        1      95
       0        2      90
       0        3      86
       0        4      81

Step 3: Iterate over all resolutions#

A common workflow is to load the same region at every available resolution and compare shapes / total counts. This is the recommended pattern for picking the right resolution for downstream analysis (e.g. TAD calling, compartment analysis).

print('chr1:chr1 across all resolutions:')
print(f'  {"resolution":>13}  {"shape":>7}  {"nnz":>6}  {"sum(counts)":>12}')
print(f'  {"-"*13}  {"-"*7}  {"-"*6}  {"-"*12}')
for res in sorted(get_resolutions(fpath)):
    cm = load_cm_data(
        fpath=fpath,
        bin_size_bp=res,
        region1='chr1',
        region2='chr1',
        output_format=DataStructure.COO,
    )
    dense = cm.toarray()
    print(f'  {res:>13,}  {cm.shape[0]}x{cm.shape[1]:<3}  {cm.nnz:>6}  {dense.sum():>12.2e}')

chr1:chr1 across all resolutions:
     resolution    shape     nnz   sum(counts)
  -------------  -------  ------  ------------
        100,000  50x50     2500      1.25e+05
      1,000,000  5x5        25      1.07e+05
      5,000,000  1x1         1      6.52e+04


/tmp/ipykernel_2697605/2282641961.py:4: DeprecationWarning: get_resolutions() is deprecated and will be removed in v2.13.0. Use get_bin_size_bps() instead.
  for res in sorted(get_resolutions(fpath)):

Notice how the matrix shrinks as resolution coarsens: 5 Mb resolution collapses the entire 5 Mb chr1 into a single bin. The total sum(counts) decreases as well because the diagonal and near-diagonal short-range interactions get merged into single bins.

Step 4: v2.8.0 — Chromosome name normalization#

Since v2.8.0, you can pass chromosome names in either UCSC (chr1) or Ensembl (1) form. The facade normalizes them to match the file’s native convention automatically.

# Synthetic file uses 'chr1' (UCSC). Try passing '1' (Ensembl).
try:
    cm_bare = load_cm_data(fpath=fpath, bin_size_bp=1_000_000, region1='1', region2='1')
    print('Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")')
    print(f'  - rows: {len(cm_bare)}')
except Exception as e:
    print(f'Error: {type(e).__name__}: {e}')

Loaded with Ensembl-style input ("1") even though file uses UCSC ("chr1")
  - rows: 25

Not normalized (out of scope per the design):

Mitochondrial chroms: chrM / MT / M
Random contigs: chr1_KI270706v1_random
GenBank accessions
Case variants: Chr1 ≠ chr1

Use get_chrom_infos(fpath) to see the actual file chrom names.

Step 5: Visualize side-by-side at two resolutions#

import matplotlib
from IPython.display import display as ipydisplay
matplotlib.use('Agg')  # non-interactive backend for CI / headless environments
import matplotlib.pyplot as plt
from IPython.display import Image as ipyimage, display as ipydisplay

fig, axes = plt.subplots(1, 2, figsize=(11, 5))
for ax, res, dense in [
    (axes[0], 100_000, dense_100kb),
    (axes[1], 1_000_000, dense_1mb),
]:
    im = ax.imshow(np.log1p(dense), cmap='Reds', aspect='equal')
    ax.set_title(f'chr1 @ {res:,} bp (log1p counts)')
    ax.set_xlabel('bin index')
    ax.set_ylabel('bin index')
    fig.colorbar(im, ax=ax, label='log(1 + count)')
plt.tight_layout()
out_png = str(Path(tmpdir) / 'contact_map_two_resolutions.png')
plt.savefig(out_png, dpi=100)
print(f'Saved: {out_png}')
print(f'  - 100,000 bp panel: {dense_100kb.shape} ({int((dense_100kb > 0).sum())} non-zero)')
print(f'  - 1,000,000 bp panel: {dense_1mb.shape} ({int((dense_1mb > 0).sum())} non-zero)')
ipydisplay(ipyimage(filename=out_png))

Saved: /tmp/gunz_cm_tutorial_2ijeg_yq/contact_map_two_resolutions.png
  - 100,000 bp panel: (50, 50) (2500 non-zero)
  - 1,000,000 bp panel: (5, 5) (25 non-zero)

png

Summary#

In this tutorial you learned how to:

Build a synthetic multi-resolution .mcool file by chaining cooler.create_cooler (base resolution) with cooler.zoomify_cooler (recursive coarsening).
Query file metadata with get_resolutions (returns the full list of zoom levels) and get_chrom_infos (chrom names + lengths).
Load a contact matrix at a chosen resolution with load_cm_data(file, resolution, region1, region2, balancing, output_format).
Iterate over all resolutions in a .mcool and compare matrix shapes / total counts to pick the right zoom level for downstream analysis.
Choose the output format: DataStructure.DF (default), DataStructure.COO (sparse), or others.
Use v2.8.0 chrom name normalization — pass chr1 or 1, the facade normalizes for you.
Visualize two resolutions side-by-side via matplotlib.

Next steps:

tutorial_load_hic.ipynb — .hic file loading (Juicer / HiC-Pro / hictkpy)
tutorial_balance_kr.ipynb — applying KR / VC / ICE balancing
tutorial_3d_reconstruction.ipynb — going from 2D contact matrix to 3D structure

Cleanup#

import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
print(f'Removed: {tmpdir}')

Removed: /tmp/gunz_cm_tutorial_2ijeg_yq