# Tutorial 31: End-to-End In-Memory Hi-C Pipeline (synth -> model)

Hi-C training often starts in memory: a coalesced high-resolution matrix, a simulated contact map, or an in-RAM assembly of contact pixels from an upstream pipeline. v2.28.0 supports the full in-memory case via three composable pieces:

1. A synthetic contact-matrix generator (notebooks/_synthetic_data.py).
2. SparseCODataset.from_coo(...) to turn the generator's output into a sliding-window dataset.
3. sparse_collate_fn to batch into a MinkowskiEngine-compatible dict.

This tutorial stitches them together. We then feed the batches into a tiny MinkowskiEngine-style model and verify that the forward pass runs end-to-end without I/O.

## Learning Objectives

* Use the synthetic generator to produce a realistic Hi-C contact map
* Convert the generator's output to COO and wrap it in a SparseCODataset
* Wire a DataLoader + sparse_collate_fn + sparse-conv head end-to-end
* Verify forward-pass tensor shapes match Hi-C super-resolution conventions

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

The synthetic generator uses a power-law decay for contact intensity (mimicking distance-dependent decay in Hi-C) plus compartment-like block structure. No external files.

Expected runtime: ~1 minute.

---
# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GZCMReader, GZCMWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE
def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts
import torch
from torch.utils.data import DataLoader
from gunz_cm.datasets.sparse_coo import SparseCODataset
from gunz_cm.datasets.sparse_collate import sparse_collate_fn
rng = np.random.default_rng(42)
## 1. Generate a synthetic Hi-C matrix

The generator lives in notebooks/_synthetic_data.py. It returns a ContactMatrix (or a plain np.ndarray, depending on version) with realistic block structure and a power-law decay.
n_bins = 256
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
mat_dense = mat.data if hasattr(mat, 'data') else mat
row_ids, col_ids = np.nonzero(mat_dense)
counts = mat_dense[row_ids, col_ids].astype(np.float32)
print(f'Total contacts: {len(row_ids):,}')
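# Note: this ratio is the fraction of nonzero pixels (the density),
# so 100% means every pixel of the matrix is populated.
print(f'Sparsity: {len(row_ids) / (n_bins * n_bins):.2%}')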
print(f'Max count: {counts.max():.0f}')
Total contacts: 65,536
Sparsity: 100.00%
Max count: 4
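To make the generator's two ingredients concrete, here is a minimal NumPy sketch of a power-law distance decay combined with compartment-like blocks. It is not the code in notebooks/_synthetic_data.py: the function name `sketch_synthetic_hic` and every parameter (`block_size`, `decay_exponent`, `base_rate`, `block_boost`) are hypothetical choices made for illustration only.

```python
import numpy as np

def sketch_synthetic_hic(n_bins=64, block_size=16, decay_exponent=1.0,
                         base_rate=20.0, block_boost=2.0, seed=0):
    """Hypothetical stand-in for a synthetic Hi-C generator (not the gunz-cm helper)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_bins)
    # Genomic distance in bins; +1 avoids a zero base on the diagonal.
    dist = np.abs(idx[:, None] - idx[None, :]) + 1
    # Power-law decay: expected contacts fall off as distance ** (-exponent).
    expected = base_rate * dist.astype(np.float64) ** (-decay_exponent)
    # Alternate compartment labels every `block_size` bins; boost same-label pairs.
    labels = (idx // block_size) % 2
    same = labels[:, None] == labels[None, :]
    expected = np.where(same, expected * block_boost, expected)
    # Sample the upper triangle and mirror it so the matrix is symmetric.
    upper = np.triu(rng.poisson(expected))
    return upper + np.triu(upper, k=1).T

mat = sketch_synthetic_hic()
row_ids, col_ids = np.nonzero(mat)   # COO coordinates of populated pixels
counts = mat[row_ids, col_ids]       # matching contact counts
```

The resulting `row_ids`, `col_ids`, and `counts` arrays have the same shape as the COO triplet extracted from the real generator above, so they could be used for quick experiments when the notebook helper is not on the path.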
## 2. Single-patch vs sliding-window

Calling from_coo without a window_size wraps the whole matrix as a single patch. For training, sliding windows are usually preferable so the model sees overlapping context. The default stride == window_size yields non-overlapping tiles; set stride explicitly to a smaller value to enable overlap.
patch = SparseCODataset.from_coo(row_ids, col_ids, counts, n_bins=n_bins)
nonoverlap = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins, window_size=32, stride=32,
)
overlap = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins, window_size=32, stride=16,
)
print(f'single patch: len={len(patch)}')
print(f'non-overlapping: len={len(nonoverlap)} '
      f'(={n_bins // 32} patches)')
print(f'overlap (stride=16): len={len(overlap)} '
      f'(n_overlap={(n_bins - 32) // 16 + 1} patches)')
single patch: len=1
non-overlapping: len=8 (=8 patches)
overlap (stride=16): len=15 (n_overlap=15 patches)
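The patch counts printed above follow from simple window arithmetic. The sketch below reproduces them under the assumption, inferred from the lengths 8 and 15, that windows slide along the main diagonal. The helpers `n_diagonal_windows` and `extract_window` are hypothetical names, not part of gunz-cm, and the real dataset may select or index patches differently.

```python
import numpy as np

def n_diagonal_windows(n_bins, window_size, stride):
    """Count window starts 0, stride, 2*stride, ... that fit inside n_bins."""
    if window_size > n_bins:
        return 0
    return (n_bins - window_size) // stride + 1

def extract_window(row_ids, col_ids, counts, start, window_size):
    """Keep COO entries inside the square [start, start + window_size)
    on both axes and shift them to patch-local coordinates."""
    stop = start + window_size
    mask = ((row_ids >= start) & (row_ids < stop)
            & (col_ids >= start) & (col_ids < stop))
    return row_ids[mask] - start, col_ids[mask] - start, counts[mask]

print(n_diagonal_windows(256, 32, 32))  # 8
print(n_diagonal_windows(256, 32, 16))  # 15
```

Halving the stride roughly doubles the number of patches, which is the usual trade-off between context overlap and epoch length.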
## 3. Downsample-replay for super-resolution

SparseCODataset's downsample path follows the standard Hi-C super-resolution contract: the model receives features (the sparse, downsampled input view) and reconstructs target (the full counts). We set downsample_ratio to a fixed value for this tutorial; passing a tuple (min, max) instead draws a random ratio (alpha) per item.
ds_sr = SparseCODataset.from_coo(
row_ids, col_ids, counts, n_bins=n_bins,
window_size=64, stride=64, downsample_ratio=0.1,
)
np.random.seed(0)
item = ds_sr[0]
print('patch 0 (post-downsample):')
print(f' input shape: {tuple(item["features"].shape)}, sum={item["features"].sum().item():.0f}')
print(f' target shape: {tuple(item["target"].shape)}, sum={item["target"].sum().item():.0f}')
patch 0 (post-downsample):
input shape: (74, 1), sum=75
target shape: (74,), sum=75
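One common way to simulate a lower-coverage input from full counts is binomial thinning, where each contact is kept independently with probability equal to the ratio. The sketch below shows that technique with NumPy. It is an assumption that this resembles what downsample_ratio does; the source does not state the sampling scheme SparseCODataset uses, and `resolve_ratio` and `binomial_downsample` are hypothetical helper names.

```python
import numpy as np

def resolve_ratio(downsample_ratio, rng):
    """Return a fixed ratio, or draw one uniformly from a (min, max) tuple."""
    if isinstance(downsample_ratio, tuple):
        lo, hi = downsample_ratio
        return float(rng.uniform(lo, hi))
    return float(downsample_ratio)

def binomial_downsample(counts, ratio, rng):
    """Thin integer counts: each contact survives with probability `ratio`."""
    counts = np.asarray(counts, dtype=np.int64)
    return rng.binomial(counts, ratio)

rng = np.random.default_rng(0)
target = np.array([40, 12, 7, 3, 1, 0, 25])       # full-coverage counts
alpha = resolve_ratio((0.05, 0.2), rng)           # random ratio per item
features = binomial_downsample(target, alpha, rng)  # low-coverage input view
```

Under this scheme the input never exceeds the target in any pixel, and its expected sum is the ratio times the target sum.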
## 4. The end-to-end forward pass

Build a tiny sparse-conv head as a stand-in for the actual model. The head takes the MinkowskiEngine-style batch dict (coords with a batch-index column) and emits a per-contact logit tensor. For full Hi-C super-resolution, swap this for a MinkowskiEngine or torch_geometric model; the data contract from sparse_collate_fn stays the same.
import torch.nn as nn
class TinySparseHead(nn.Module):
    """Per-contact MLP applied to batch['features'].

    The batch dict also carries a (sum_N, 3) coords tensor: the first
    column is the batch index; columns 1-2 are the (row_bin, col_bin)
    within the local patch. This stand-in head does not use coords.
    """
    def __init__(self, in_features=1, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, batch):
        # batch['features'] is (sum_N, 1); one channel per contact.
        return self.mlp(batch['features']).squeeze(-1)
head = TinySparseHead()
dl = DataLoader(ds_sr, batch_size=4, shuffle=False,
                collate_fn=sparse_collate_fn)
batch = next(iter(dl))
logits = head(batch)
print(f'batch["coords"].shape: {tuple(batch["coords"].shape)}')
print(f'batch["features"].shape: {tuple(batch["features"].shape)}')
print(f'batch["target"].shape: {tuple(batch["target"].shape)}')
print(f'logits.shape: {tuple(logits.shape)}')
assert logits.shape == batch['target'].shape
batch["coords"].shape: (443, 3)
batch["features"].shape: (443, 1)
batch["target"].shape: (443,)
logits.shape: (443,)
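The shapes above reflect the usual sparse-batch layout: per-item tensors are concatenated along the first axis, and a batch-index column is prepended to the coordinates. The NumPy sketch below illustrates that layout. It is not the implementation of sparse_collate_fn; the function name `sketch_sparse_collate` is hypothetical, and the assumption that each item carries a (N, 2) "coords" entry is inferred from the (sum_N, 3) batch shape rather than stated in the source.

```python
import numpy as np

def sketch_sparse_collate(items):
    """Concatenate variable-length sparse items into one batch dict."""
    coords, feats, targets = [], [], []
    for batch_idx, item in enumerate(items):
        n = item["coords"].shape[0]
        # Column 0 records which item of the batch each contact belongs to.
        batch_col = np.full((n, 1), batch_idx, dtype=item["coords"].dtype)
        coords.append(np.hstack([batch_col, item["coords"]]))
        feats.append(item["features"])
        targets.append(item["target"])
    return {
        "coords": np.concatenate(coords, axis=0),      # (sum_N, 3)
        "features": np.concatenate(feats, axis=0),     # (sum_N, 1)
        "target": np.concatenate(targets, axis=0),     # (sum_N,)
    }

# Two toy items with 2 and 3 contacts respectively.
item_a = {"coords": np.array([[0, 1], [2, 2]]),
          "features": np.array([[1.0], [3.0]]),
          "target": np.array([4.0, 9.0])}
item_b = {"coords": np.array([[0, 0], [1, 3], [5, 5]]),
          "features": np.array([[2.0], [0.0], [1.0]]),
          "target": np.array([7.0, 2.0, 6.0])}
toy_batch = sketch_sparse_collate([item_a, item_b])
```

Because items have different numbers of contacts, this layout avoids padding: the batch index in column 0 is enough to recover which rows belong to which patch.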
## 5. End-to-end loss + backward

Verify that the gradient graph connects. We use a simple L1 loss between the head's logits and the per-contact target, a common Hi-C super-resolution objective.
optim = torch.optim.Adam(head.parameters(), lr=1e-3)
losses = []
torch.manual_seed(0)
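# ds_sr has only a few patches (256 bins / window 64), so the loader
# can run out of batches before step 5.
for step, batch in enumerate(dl):
    if step >= 5:
        break
    optim.zero_grad()
    pred = head(batch)
    loss = torch.nn.functional.l1_loss(pred, batch['target'])
    loss.backward()
    optim.step()
    losses.append(loss.item())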
print(f'L1 losses over 5 steps: {[f"{x:.2f}" for x in losses]}')
L1 losses over 5 steps: ['1.26']
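For reference, the mean L1 loss and the gradient that backward() propagates to the predictions can be written out by hand. The NumPy sketch below uses made-up numbers, and the helper name `l1_loss_and_grad` is hypothetical; it only illustrates the formula, not the tutorial's actual values.

```python
import numpy as np

def l1_loss_and_grad(pred, target):
    """Mean absolute error and its (sub)gradient with respect to pred."""
    diff = pred - target
    loss = np.abs(diff).mean()
    # d|x|/dx = sign(x); the mean contributes a 1/N factor.
    grad = np.sign(diff) / diff.size
    return loss, grad

pred = np.array([0.5, 2.0, 1.0])
target = np.array([1.0, 1.0, 1.0])
loss, grad = l1_loss_and_grad(pred, target)
print(loss)  # 0.5
```

The gradient magnitude does not depend on the size of the error, which is one reason L1 is less sensitive than L2 to the few very large counts near the diagonal.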
## 6. From in-memory to file-backed (seamless)

If you outgrow the in-memory case (matrix too large for RAM), convert the same COO arrays to a v4 .gzcm file and switch to GzcmDataset (Tutorial 29). The 4-key contract and sparse_collate_fn are identical, so no upstream retraining is needed: the dataset abstraction is the same.
# Illustrate the conversion (writes to a tempfile).
import tempfile, pathlib
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'pipe.gzcm'
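df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts.astype(np.uint32),
})
# convert_to_gzcm expects an input file, so create an empty placeholder.
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
# Temporarily stub the loader so the converter receives the in-memory
# DataFrame instead of parsing fake.hic. This relies on convert_to_gzcm
# resolving load_cm_data through the loaders module at call time.
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    # Always restore the real loader, even if the conversion fails.
    loaders.load_cm_data = original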
print(f'wrote {out.stat().st_size:,} bytes ({out})')
wrote 8,192 bytes (/tmp/tmpt2qoirva/pipe.gzcm)