# Tutorial 31: End-to-End In-Memory Hi-C Pipeline (synth -> model)

Hi-C training often starts in memory: a coalesced high-resolution matrix, a simulated contact map, or an in-RAM assembly of contact pixels from an upstream pipeline. v2.28.0 supports the full in-memory case via three composable pieces:

1. A synthetic contact-matrix generator (notebooks/_synthetic_data.py).
2. SparseCODataset.from_coo(...) to turn the generator's output into a sliding-window dataset.
3. sparse_collate_fn to batch into a MinkowskiEngine-compatible dict.

This tutorial stitches them together. We then feed the batches into a tiny MinkowskiEngine-style model and verify that the forward pass runs end-to-end without I/O.

## Learning Objectives

* Use the synthetic generator to produce a realistic Hi-C contact map
* Convert the generator's output to COO and wrap it in a SparseCODataset
* Wire a DataLoader + sparse_collate_fn + sparse-conv head end-to-end
* Verify forward-pass tensor shapes match Hi-C super-resolution conventions

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

The synthetic generator uses a power-law decay for contact intensity (mimicking distance-dependent decay in Hi-C) plus compartment-like block structure. No external files.

Expected runtime: ~1 minute.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GZCMReader, GZCMWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

import torch
from torch.utils.data import DataLoader

from gunz_cm.datasets.sparse_coo import SparseCODataset
from gunz_cm.datasets.sparse_collate import sparse_collate_fn

rng = np.random.default_rng(42)

## 1. Generate a synthetic Hi-C matrix

The generator lives in notebooks/_synthetic_data.py. It returns a ContactMatrix (or a plain np.ndarray, depending on version) with realistic block structure and a power-law decay.

n_bins = 256
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
mat_dense = mat.data if hasattr(mat, 'data') else mat
row_ids, col_ids = np.nonzero(mat_dense)
counts = mat_dense[row_ids, col_ids].astype(np.float32)
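# len(row_ids) counts nonzero pixels, not summed contacts.
print(f'Nonzero pixels: {len(row_ids):,}')
# Fraction of pixels that are nonzero; this is the density, not the sparsity.
print(f'Density: {len(row_ids) / (n_bins * n_bins):.2%}')
print(f'Max count: {counts.max():.0f}')
Nonzero pixels: 65,536
Density: 100.00%
Max count: 4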
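To make the "power-law decay plus compartment-like blocks" description concrete, here is a minimal NumPy sketch of such a generator, followed by the same dense-to-COO extraction used above. It is a hypothetical stand-in, not the code in notebooks/_synthetic_data.py: the function name toy_hic and every parameter (decay, block, boost, scale) are assumptions chosen for illustration.

```python
import numpy as np

def toy_hic(n_bins=64, decay=1.0, block=16, boost=2.0, scale=20.0, seed=0):
    """Hypothetical toy generator; NOT the notebook's make_synthetic_hic."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_bins)
    dist = np.abs(idx[:, None] - idx[None, :])
    # Power-law decay in bin distance; the +1 avoids dividing by zero on the diagonal.
    expected = scale / (dist + 1.0) ** decay
    # Compartment-like structure: bins in alternating blocks share a label,
    # and pairs with the same label get a multiplicative boost.
    label = (idx // block) % 2
    same = label[:, None] == label[None, :]
    expected = expected * np.where(same, boost, 1.0)
    # Sample Poisson counts on the upper triangle, then mirror for symmetry.
    upper = np.triu(rng.poisson(expected))
    return upper + np.triu(upper, k=1).T

mat = toy_hic()
# Dense -> COO: keep only the nonzero pixels.
rows, cols = np.nonzero(mat)
counts = mat[rows, cols].astype(np.float32)
```

The three arrays (rows, cols, counts) have the same layout as the row_ids, col_ids, and counts arrays built above.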

## 2. Single-patch vs sliding-window

Omitting window_size yields a single patch. For training, sliding windows are usually preferable so the model sees overlapping context. The default stride == window_size yields non-overlapping tiles; set stride smaller than window_size to enable overlap.

patch = SparseCODataset.from_coo(row_ids, col_ids, counts, n_bins=n_bins)
nonoverlap = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins, window_size=32, stride=32,
)
overlap = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins, window_size=32, stride=16,
)
print(f'single patch:        len={len(patch)}')
print(f'non-overlapping:     len={len(nonoverlap)} '
      f'(={n_bins // 32} patches)')
print(f'overlap (stride=16): len={len(overlap)} '
      f'(n_overlap={(n_bins - 32) // 16 + 1} patches)')
single patch:        len=1
non-overlapping:     len=8 (=8 patches)
overlap (stride=16): len=15 (n_overlap=15 patches)
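The lengths printed above follow from simple window arithmetic. The sketch below reproduces them under two assumptions that the output suggests but that are not confirmed against the library source: windows slide along the diagonal only, and a trailing window that does not fit entirely is dropped. The helper name window_starts is hypothetical.

```python
def window_starts(n_bins, window_size, stride):
    """Start bins of all windows that fit fully inside [0, n_bins).

    Hypothetical helper; assumes diagonal-only windows and no padding.
    """
    if window_size > n_bins:
        return []
    return list(range(0, n_bins - window_size + 1, stride))

single = window_starts(256, 256, 256)      # one window covering everything
nonoverlap = window_starts(256, 32, 32)    # 8 windows: 0, 32, ..., 224
overlap = window_starts(256, 32, 16)       # 15 windows: 0, 16, ..., 224
```

The count is (n_bins - window_size) // stride + 1, which is the same expression used in the print statement above.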

## 3. Downsample-replay for super-resolution

SparseCODataset's downsample path follows the standard Hi-C super-resolution contract: the model sees features as the sparse-input view and reconstructs target (the full counts). We set downsample_ratio to a fixed value for this tutorial; a tuple (min, max) enables a random alpha per item.

ds_sr = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=64, stride=64, downsample_ratio=0.1,
)
np.random.seed(0)
item = ds_sr[0]
print('patch 0 (post-downsample):')
print(f'  input  shape: {tuple(item["features"].shape)}, sum={item["features"].sum().item():.0f}')
print(f'  target shape: {tuple(item["target"].shape)}, sum={item["target"].sum().item():.0f}')
patch 0 (post-downsample):
  input  shape: (74, 1), sum=75
  target shape: (74,), sum=75
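One common way to produce a low-coverage input from full counts is binomial thinning: each contact is kept independently with probability alpha. The sketch below illustrates that technique only; the tutorial does not show how SparseCODataset implements its downsample path, so the helper names and the sampling scheme here are assumptions.

```python
import numpy as np

def resolve_ratio(ratio, rng):
    """Return a fixed alpha, or draw one uniformly from a (min, max) tuple."""
    if isinstance(ratio, tuple):
        lo, hi = ratio
        return float(rng.uniform(lo, hi))
    return float(ratio)

def binomial_downsample(counts, ratio, rng):
    """Keep each individual contact with probability alpha (binomial thinning)."""
    alpha = resolve_ratio(ratio, rng)
    counts = np.asarray(counts).astype(np.int64)
    return rng.binomial(counts, alpha)

rng = np.random.default_rng(0)
target = np.array([4, 1, 0, 3, 2, 10])             # full counts
features = binomial_downsample(target, 0.5, rng)   # fixed alpha
features_rand = binomial_downsample(target, (0.1, 0.9), rng)  # random alpha
```

Thinned counts can never exceed the originals, and an alpha of 1.0 returns the target unchanged.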

## 4. The end-to-end forward pass

Build a tiny head (a per-contact MLP) as a stand-in for the actual sparse-conv model. The head takes the MinkowskiEngine-style batch dict (coords with a batch-index column) and emits a per-contact logit tensor. For full Hi-C super-resolution, swap this for a MinkowskiEngine or torch_geometric model; the data contract from sparse_collate_fn stays the same.

import torch.nn as nn

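class TinySparseHead(nn.Module):
    """Per-contact MLP over batch['features'], shape (sum_N, 1).

    batch['coords'] is (sum_N, 3): the first column is the batch index and
    columns 1-2 are the (row_bin, col_bin) within the local patch. This
    stand-in ignores coords; a real sparse-conv model would consume them.
    """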
    def __init__(self, in_features=1, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, batch):
        # batch['features'] is (sum_N, 1); one channel per contact.
        return self.mlp(batch['features']).squeeze(-1)

head = TinySparseHead()
dl = DataLoader(ds_sr, batch_size=4, shuffle=False,
                collate_fn=sparse_collate_fn)
batch = next(iter(dl))
logits = head(batch)
print(f'batch["coords"].shape:   {tuple(batch["coords"].shape)}')
print(f'batch["features"].shape: {tuple(batch["features"].shape)}')
print(f'batch["target"].shape:   {tuple(batch["target"].shape)}')
print(f'logits.shape:            {tuple(logits.shape)}')
assert logits.shape == batch['target'].shape
batch["coords"].shape:   (443, 3)
batch["features"].shape: (443, 1)
batch["target"].shape:   (443,)
logits.shape:            (443,)
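The shapes above reflect the batching contract: per-item arrays are concatenated along the first axis, and coords gain a leading batch-index column. A NumPy-only sketch of that idea follows. It is not the sparse_collate_fn implementation; the function name toy_collate is hypothetical, it assumes each item carries a (N, 2) coords array, and it covers only the three keys printed above.

```python
import numpy as np

def toy_collate(items):
    """Hypothetical sketch: concatenate items and prepend a batch-index column."""
    coords, feats, targets = [], [], []
    for b, item in enumerate(items):
        n = len(item["coords"])
        # Column of the item's position in the batch, shape (n, 1).
        batch_col = np.full((n, 1), b, dtype=np.int64)
        coords.append(np.hstack([batch_col, item["coords"]]))
        feats.append(item["features"])
        targets.append(item["target"])
    return {
        "coords": np.concatenate(coords, axis=0),      # (sum_N, 3)
        "features": np.concatenate(feats, axis=0),     # (sum_N, 1)
        "target": np.concatenate(targets, axis=0),     # (sum_N,)
    }

items = [
    {"coords": np.array([[0, 1], [2, 3]]),
     "features": np.array([[1.0], [2.0]]),
     "target": np.array([3.0, 4.0])},
    {"coords": np.array([[5, 5]]),
     "features": np.array([[7.0]]),
     "target": np.array([9.0])},
]
batch = toy_collate(items)
```

Because items can hold different numbers of contacts, the batch index column is what lets a downstream model tell which rows belong to which patch.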

## 5. End-to-end loss + backward

Verify the gradient graph connects. We use a simple L1 loss between the head's logits and the full-count target, a common Hi-C super-resolution objective.

optim = torch.optim.Adam(head.parameters(), lr=1e-3)
losses = []
torch.manual_seed(0)
for step, batch in enumerate(dl):
    if step >= 5:
        break
    optim.zero_grad()
    pred = head(batch)
    loss = torch.nn.functional.l1_loss(pred, batch['target'])
    loss.backward()
    optim.step()
    losses.append(loss.item())
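# The loop stops early when the loader runs out of batches, so report the
# number of steps that actually ran rather than the cap of 5.
print(f'L1 losses over {len(losses)} step(s): {[f"{x:.2f}" for x in losses]}')
L1 losses over 1 step(s): ['1.26']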
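Only one loss value is printed because the loop can run no more steps than the loader yields batches. A minimal sketch of that arithmetic, assuming ds_sr holds four windows (256 bins, window 64, stride 64, sliding along the diagonal; this is inferred from the lengths printed in section 2, not confirmed):

```python
import math

def n_batches(n_items, batch_size, drop_last=False):
    """Number of batches a PyTorch DataLoader yields for a sized dataset."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)

max_steps = 5
# Four windows with batch_size=4 give a single batch, so a single step.
steps = min(max_steps, n_batches(4, 4))
# With the 15-window overlapping dataset, the same loader would yield 4 batches.
steps_overlap = min(max_steps, n_batches(15, 4))
```

To observe several steps, use a smaller batch_size, a dataset with more windows, or wrap the loop in multiple epochs.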

## 6. From in-memory to file-backed (seamless)

If you outgrow the in-memory case (matrix too large for RAM), convert the same COO arrays to a v4 .gzcm file and switch to GzcmDataset (Tutorial 29); the 4-key contract and sparse_collate_fn are identical. No changes to the model code are needed because the dataset abstraction is the same.

# Illustrate the conversion (writes to a temporary directory).
# tempfile and pathlib were already imported in the setup cell.
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'pipe.gzcm'
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts.astype(np.uint32),
})
# convert_to_gzcm expects an input file, so create an empty placeholder
# and temporarily patch the loader to return the in-memory DataFrame.
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes ({out})')
wrote 8,192 bytes (/tmp/tmpt2qoirva/pipe.gzcm)
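The manual save-and-restore of loaders.load_cm_data above can also be written with the standard library's unittest.mock.patch.object, which restores the original attribute even if the body raises. The sketch below demonstrates the pattern on a stand-in namespace; the loaders object and convert function here are hypothetical placeholders, not the gunz_cm ones.

```python
import types
from unittest import mock

# Hypothetical stand-ins for a loader module and a converter that calls it.
loaders = types.SimpleNamespace(load_cm_data=lambda path: f"real:{path}")

def convert(path):
    # Looks the loader up on the namespace at call time, so patching works.
    return loaders.load_cm_data(path)

# Inside the block the stub is active; on exit the original is restored.
with mock.patch.object(loaders, "load_cm_data", lambda *a, **kw: "stub"):
    inside = convert("fake.hic")
after = convert("fake.hic")
```

This only works when the caller resolves the attribute through the patched object at call time; a function that bound the loader with a from-import would keep calling the original.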

## 7. Summary

* The synthetic generator -> SparseCODataset -> DataLoader path is the end-to-end in-memory Hi-C training pipeline.
* The base class owns downsample, dense output, and the 4-key contract; subclasses only contribute the file-specific _load_patch body.
* sparse_collate_fn makes the batch dict drop-in compatible with MinkowskiEngine and torch_geometric sparse-conv layers.
* Switching to the file-backed case (GzcmDataset, v4 .gzcm) is a single-line change; the upstream model code is unchanged.

## Where to go from here

* Tutorial 29: the file-backed DataLoader path via GzcmDataset and the v2.26.0 thread-safe LRU tile cache.
* Tutorial 30: the in-memory sparse path (the focus here) in isolation.