# Tutorial 30: Sparse-COO DataLoader with SparseCODataset.from_coo

For small-to-medium matrices (single-chromosome patches, up to a few
tens of thousands of bins), there is no need to round-trip through a
file on disk. v2.28.0 introduces `SparseCODataset.from_coo(...)`,
which builds a sparse 4-key dataset directly from in-memory COO arrays.
Combined with `sparse_collate_fn`, this is the fastest way to feed a
PyTorch `DataLoader` with sparse Hi-C training data.

This tutorial builds a sliding-window dataset from synthetic COO
arrays, demonstrates `stride < window_size` for overlapping patches,
shows the downsample-replay pattern that `SparseCODataset` provides,
and concludes by wiring a `DataLoader` with the project's
`sparse_collate_fn`. The collate function prepends a batch-index
column to `coords` (MinkowskiEngine convention), making the output
directly usable by sparse-conv layers.

## Learning Objectives

* Build a `SparseCODataset` from in-memory COO arrays via `from_coo(...)`
* Configure sliding-window iteration with `stride < window_size`
* Use `sparse_collate_fn` to batch items with a batch-index column
* Wire everything into a torch `DataLoader` with `num_workers > 0`

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 12 minutes

## Data

This tutorial generates a synthetic 128x128 contact matrix inline
via `np.random.default_rng(42)`. The full matrix is loaded into
memory as COO arrays; the `SparseCODataset` then yields sliding
windows over it.

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import numpy as np
import torch
from torch.utils.data import DataLoader

from gunz_cm.datasets.sparse_coo import SparseCODataset
from gunz_cm.datasets.sparse_collate import sparse_collate_fn

rng = np.random.default_rng(42)

## 1. Synthesize a contact matrix as COO

For a controlled in-memory test, build a small symmetric matrix
with a realistic sparsity profile: most off-diagonal entries are
zero, and the non-zero values decay as a power law with distance
from the diagonal.

n_bins = 128
mat = np.zeros((n_bins, n_bins), dtype=np.float32)
for i in range(n_bins):
    for j in range(i, n_bins):
        d = abs(i - j)
        if d == 0:
            v = 100.0
        else:
            v = 50.0 / (1.0 + d)
            if rng.random() < 0.6:
                v = 0.0
        mat[i, j] = mat[j, i] = v
row_ids, col_ids = np.nonzero(mat)
counts = mat[row_ids, col_ids]
print(f'Total contacts: {len(row_ids):,}')
print(f'Density (non-zero fraction): {len(row_ids) / (n_bins * n_bins):.2%}')

Total contacts: 6,516
Density (non-zero fraction): 39.77%
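To make the COO layout concrete, here is a minimal sketch on a hypothetical 3x3 matrix (the name `tiny` is illustrative and not part of the tutorial). It shows that `np.nonzero` returns coordinates in row-major order and that the three arrays are enough to rebuild the dense matrix.

```python
import numpy as np

# Hypothetical 3x3 symmetric matrix, used only for illustration.
tiny = np.array(
    [[5.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [2.0, 0.0, 4.0]],
    dtype=np.float32,
)

# COO triplet: one (row, col, value) entry per non-zero cell, in row-major order.
rows, cols = np.nonzero(tiny)
vals = tiny[rows, cols]
print(rows.tolist())  # [0, 0, 1, 2, 2]
print(cols.tolist())  # [0, 2, 1, 0, 2]
print(vals.tolist())  # [5.0, 2.0, 3.0, 2.0, 4.0]

# Scatter the triplet back into a dense array to confirm nothing was lost.
rebuilt = np.zeros_like(tiny)
rebuilt[rows, cols] = vals
```

Because the matrix is symmetric, both `(0, 2)` and `(2, 0)` appear in the triplet; the tutorial's `row_ids`/`col_ids` likewise store both triangles.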

## 2. Build a SparseCODataset from COO

`SparseCODataset.from_coo(...)` is a classmethod that returns a
ready-to-iterate `SparseCODataset`. It has two modes:

* `window_size=None`: the whole matrix is a single patch.
* `window_size=int`: patches are `window_size`-bin windows, spaced by `stride` bins. With `stride < window_size` patches overlap; with `stride == window_size` they do not.

# Single-patch dataset (whole matrix as one item).
single = SparseCODataset.from_coo(row_ids, col_ids, counts, n_bins=n_bins)
print('single patch:', len(single), 'item(s)')
item = single[0]
print('  keys:', sorted(item.keys()))
print('  coords.shape:', tuple(item['coords'].shape))
print('  features.shape:', tuple(item['features'].shape))
print('  info:', item['info'])

single patch: 1 item(s)
  keys: ['coords', 'features', 'info', 'target']
  coords.shape: (6516, 2)
  features.shape: (6516, 1)
  info: {'chrom': None, 'start': 0, 'end': 128}

# Sliding-window dataset (window_size=32, stride=16 -> 50% overlap).
slide = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
)
print(f'sliding-window patches: {len(slide)}')
item0 = slide[0]
item1 = slide[1]
print(f'  patch 0 coords.shape: {tuple(item0["coords"].shape)}, '
      f'info: {item0["info"]}')
print(f'  patch 1 coords.shape: {tuple(item1["coords"].shape)}, '
      f'info: {item1["info"]}')
assert item0['info']['end'] > item1['info']['start'], 'patches must overlap'

sliding-window patches: 7
  patch 0 coords.shape: (404, 2), info: {'chrom': None, 'start': 0, 'end': 32}
  patch 1 coords.shape: (394, 2), info: {'chrom': None, 'start': 16, 'end': 48}
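The patch count reported above is consistent with counting only full windows. The sketch below is one way to reproduce that arithmetic; `window_starts` is a hypothetical helper, not part of gunz-cm, and dropping a trailing partial window is an assumption rather than documented behaviour.

```python
def window_starts(n_bins: int, window_size: int, stride: int) -> list[int]:
    """Start bins of every full window (assumption: partial trailing windows are dropped)."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if window_size > n_bins:
        return []
    # The last valid start is n_bins - window_size, hence the +1 on the range end.
    return list(range(0, n_bins - window_size + 1, stride))


overlapping = window_starts(n_bins=128, window_size=32, stride=16)
print(len(overlapping), overlapping)  # 7 [0, 16, 32, 48, 64, 80, 96]

tiled = window_starts(n_bins=128, window_size=32, stride=32)
print(len(tiled), tiled)  # 4 [0, 32, 64, 96]
```

Equivalently, the count is `(n_bins - window_size) // stride + 1`, which gives 7 for the overlapping configuration used in this section.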

## 3. The downsampling contract

`SparseCODataset` owns the binomial downsampling logic in one place.
Set `downsample_ratio=0.1` (or a tuple `(0.05, 0.2)` for a random
per-item alpha). The returned item always carries a `target` tensor,
intended to hold the unsampled counts. Today it is an alias for
`features` (which is why the two sums printed below match), since
`SparseCODataset` has no ground-truth source distinct from the input.
Keeping the key makes the contract stable for Hi-C super-resolution
training pipelines.

ds_ds = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
    downsample_ratio=0.25,
)
# Seed NumPy's global RNG so the downsampled patch is reproducible
# (this assumes the dataset samples from the global generator).
np.random.seed(42)
ds_ds_item = ds_ds[0]
n_kept = ds_ds_item['coords'].shape[0]
print(f'patch 0 after downsample=0.25: {n_kept} contacts kept')
print(f'  features sum: {ds_ds_item["features"].sum().item():.1f}')
print(f'  target sum:   {ds_ds_item["target"].sum().item():.1f}')

patch 0 after downsample=0.25: 293 contacts kept
  features sum: 1404.0
  target sum:   1404.0
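For readers unfamiliar with binomial downsampling, the following is a minimal NumPy sketch of the idea, not the gunz-cm implementation. `binomial_downsample` is a hypothetical helper; it assumes integer counts (the synthetic matrix in this tutorial holds non-integer floats, and how the library handles those is not shown here), and it assumes that a tuple ratio means a uniform draw between the two bounds.

```python
import numpy as np


def binomial_downsample(coords, counts, ratio, rng):
    """Thin integer contact counts; `ratio` is a float or a (low, high) tuple."""
    counts = np.asarray(counts)
    if not np.issubdtype(counts.dtype, np.integer):
        raise TypeError("binomial thinning needs integer counts")
    # Assumption: a tuple means "draw alpha uniformly from [low, high)".
    alpha = rng.uniform(*ratio) if isinstance(ratio, tuple) else float(ratio)
    # Each of the n reads in a cell survives independently with probability alpha.
    sampled = rng.binomial(counts, alpha)
    keep = sampled > 0  # cells that lost every read drop out of the sparse set
    return np.asarray(coords)[keep], sampled[keep]


rng = np.random.default_rng(0)
coords = np.array([[0, 0], [0, 1], [1, 1], [2, 5]])
counts = np.array([100, 7, 80, 1])

sub_coords, sub_counts = binomial_downsample(coords, counts, 0.25, rng)
full_coords, full_counts = binomial_downsample(coords, counts, 1.0, rng)
none_coords, none_counts = binomial_downsample(coords, counts, 0.0, rng)
rand_coords, rand_counts = binomial_downsample(coords, counts, (0.05, 0.2), rng)
print(len(sub_counts), "of", len(counts), "cells survive at ratio 0.25")
```

Dropping emptied cells is why a downsampled patch has fewer rows in `coords` than the original, as seen in the output above.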

## 4. The dense output path

Pass `output_type='dense'` to get a `torch.Tensor` of shape
`(1, n_bins_per_patch, n_bins_per_patch)` instead of the 4-key dict.
Useful for direct-input convolutional architectures (no sparse-conv
stack to set up).

ds_dense = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=32,
    output_type='dense',
)
t = ds_dense[0]
print(f'dense patch shape: {tuple(t.shape)}')
print(f'  dtype: {t.dtype}, non-zero: {(t != 0).sum().item()}')

dense patch shape: (1, 32, 32)
  dtype: float32, non-zero: 404
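Conceptually, the dense path scatters the COO entries that fall inside a window into a zero-filled array. The sketch below illustrates that step in NumPy under stated assumptions: `coo_window_to_dense` is a hypothetical helper, coordinates are re-based to window-local indices, and each `(row, col)` pair is assumed to occur at most once. The library's own conversion may differ.

```python
import numpy as np


def coo_window_to_dense(row_ids, col_ids, counts, start, end):
    """Scatter the COO entries inside [start, end) into a (1, size, size) array."""
    size = end - start
    inside = (
        (row_ids >= start) & (row_ids < end)
        & (col_ids >= start) & (col_ids < end)
    )
    dense = np.zeros((1, size, size), dtype=np.float32)
    # Re-base to window-local indices; assumes each (row, col) pair occurs once.
    dense[0, row_ids[inside] - start, col_ids[inside] - start] = counts[inside]
    return dense


row_ids = np.array([0, 1, 1, 3, 5])
col_ids = np.array([0, 1, 3, 1, 5])
counts = np.array([9.0, 8.0, 2.0, 2.0, 7.0], dtype=np.float32)

patch = coo_window_to_dense(row_ids, col_ids, counts, start=0, end=4)
print(patch.shape, int((patch != 0).sum()))  # (1, 4, 4) 4

shifted = coo_window_to_dense(row_ids, col_ids, counts, start=2, end=6)
print(float(shifted[0, 3, 3]))  # 7.0
```

If duplicate coordinates were possible, `np.add.at` would be the safer scatter because plain fancy-index assignment keeps only the last value written.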

## 5. sparse_collate_fn: the 4-key contract

The default `DataLoader` collate function would try to stack the
4-key dicts along dim 0, which fails here: `coords` has shape `(N, 2)`
with `N` varying per item. `sparse_collate_fn` instead prepends a
batch-index column to `coords` and concatenates the items, so the
output has shape `(sum_N, 3)`, the MinkowskiEngine standard.
`features` and `target` are concatenated; `infos` is returned as a
Python list of per-item dicts (note the plural key `infos`).

batch = [slide[i] for i in range(4)]
out = sparse_collate_fn(batch)
print('output keys:', sorted(out.keys()))
print(f'coords.shape:   {tuple(out["coords"].shape)}')
print(f'features.shape: {tuple(out["features"].shape)}')
print(f'target.shape:   {tuple(out["target"].shape)}')
print(f'infos:          {len(out["infos"])} dict(s)')
print(f'  unique batch indices (first col): {np.unique(out["coords"][:, 0]).tolist()}')

output keys: ['coords', 'features', 'infos', 'target']
coords.shape:   (1620, 3)
features.shape: (1620, 1)
target.shape:   (1620,)
infos:          4 dict(s)
  unique batch indices (first col): [0, 1, 2, 3]
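The batch-index convention described in this section can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the project's `sparse_collate_fn`: `collate_sparse` and `make_item` are hypothetical names, and the real function works on the dataset's own tensors.

```python
import numpy as np


def collate_sparse(items):
    """Merge variable-length sparse items into one batch dict."""
    coords_parts, feature_parts, target_parts, infos = [], [], [], []
    for batch_idx, item in enumerate(items):
        coords = np.asarray(item["coords"])
        # Column 0 records which item each row came from.
        batch_col = np.full((coords.shape[0], 1), batch_idx, dtype=coords.dtype)
        coords_parts.append(np.hstack([batch_col, coords]))
        feature_parts.append(np.asarray(item["features"]))
        target_parts.append(np.asarray(item["target"]))
        infos.append(item["info"])
    return {
        "coords": np.concatenate(coords_parts, axis=0),
        "features": np.concatenate(feature_parts, axis=0),
        "target": np.concatenate(target_parts, axis=0),
        "infos": infos,
    }


def make_item(n, start):
    """Hypothetical item with n diagonal contacts, mimicking the 4-key dict."""
    coords = np.stack([np.arange(n), np.arange(n)], axis=1)
    features = np.ones((n, 1), dtype=np.float32)
    return {"coords": coords, "features": features,
            "target": features, "info": {"start": start}}


toy_batch = collate_sparse([make_item(2, 0), make_item(3, 16)])
print(toy_batch["coords"].shape)           # (5, 3)
print(toy_batch["coords"][:, 0].tolist())  # [0, 0, 1, 1, 1]
```

Concatenating instead of stacking is what lets items with different `N` share a batch; the first column is the only place the item boundary survives.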

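## 6. Wire it into a DataLoader

With `collate_fn=sparse_collate_fn`, every batch is the standard
MinkowskiEngine sparse-tensor dict. `num_workers > 0` is safe:
each worker re-imports `SparseCODataset` independently and
constructs its own in-memory copy of the COO arrays. There is
no shared cache to race on; the dataset is purely functional
(NumPy slicing) and the only Python state is `self._row_ids`
etc., which is read-only after `__init__`.

Safe is not the same as faster: with only 7 small patches, the cost
of starting worker processes on every pass dominates, which is why
the two-worker run in the timing below is slower than the
single-process run.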

import time

def time_loader(ds, num_workers, n_iter=20, **kw):
    """Return the average wall-clock time in ms for one full pass over `ds`."""
    dl = DataLoader(
        ds, batch_size=4, shuffle=False,
        num_workers=num_workers,
        collate_fn=sparse_collate_fn,
        **kw,
    )
    t0 = time.perf_counter()
    for _ in range(n_iter):
        for batch in dl:
            pass  # consume
    return (time.perf_counter() - t0) * 1000 / n_iter

single_t = time_loader(slide, num_workers=0)
multi_t = time_loader(slide, num_workers=2)
print(f'single-worker avg iter: {single_t:.2f} ms')
print(f'multi-worker (2) avg iter: {multi_t:.2f} ms')
if multi_t > 0:
    print(f'speedup: {single_t / multi_t:.2f}x')

single-worker avg iter: 0.65 ms
multi-worker (2) avg iter: 34.09 ms
speedup: 0.02x

## 7. Summary

* `SparseCODataset.from_coo(...)` builds a sparse-COO dataset directly from in-memory arrays; no file loader required.
* Sliding-window iteration: `window_size=None` -> single patch; `window_size=int` -> `stride`-spaced windows with optional overlap.
* The base class owns the downsample-replay contract: `features` is the binomial-subsampled view, and `target` is reserved for the unsampled counts (today it aliases `features`).
* `sparse_collate_fn` enforces the 4-key contract and produces a MinkowskiEngine-compatible sparse-tensor dict with a batch-index column prepended to `coords`.
* The dataset is purely functional; `num_workers > 0` is safe.

## Where to go from here

* Tutorial 29: the file-backed GzcmDataset DataLoader path.
* Tutorial 31 (tutorial 07 in this index): full in-memory Hi-C pipeline, from synthetic generator to PyTorch batch, in one notebook.