# Tutorial 30: Sparse-COO DataLoader with SparseCODataset.from_coo

For small-to-medium matrices (single-chromosome patches, up to a few
tens of thousands of bins), there is no need to round-trip through a
file on disk. v2.28.0 introduces SparseCODataset.from_coo(...),
which builds a sparse 4-key dataset directly from in-memory COO arrays.
Combined with sparse_collate_fn, this is the fastest way to feed a
PyTorch DataLoader with sparse Hi-C training data.

This tutorial builds a sliding-window dataset from synthetic COO
arrays, demonstrates stride < window_size for overlapping patches,
shows the downsample-replay pattern that SparseCODataset provides,
and concludes by wiring a DataLoader with the project's
sparse_collate_fn. The collate function prepends a batch-index
column to coords (MinkowskiEngine convention), making the output
directly usable by sparse-conv layers.

## Learning Objectives

* Build a SparseCODataset from in-memory COO arrays via from_coo(...)
* Configure sliding-window iteration with stride < window_size
* Use sparse_collate_fn to batch items with a batch-index column
* Wire everything into a torch DataLoader with num_workers > 0

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 12 minutes

## Data

This tutorial generates a synthetic 128x128 contact matrix inline
via np.random.default_rng(42). The full matrix is loaded into
memory as COO arrays; the SparseCODataset then yields sliding
windows over it.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import torch
from torch.utils.data import DataLoader

from gunz_cm.datasets.sparse_coo import SparseCODataset
from gunz_cm.datasets.sparse_collate import sparse_collate_fn

rng = np.random.default_rng(42)

## 1. Synthesize a contact matrix as COO

For a controlled in-memory test, build a small symmetric matrix
with a realistic sparsity profile: most off-diagonal entries are
zero, and the surviving values follow a power-law decay with distance
from the diagonal.

n_bins = 128
mat = np.zeros((n_bins, n_bins), dtype=np.float32)
for i in range(n_bins):
    for j in range(i, n_bins):
        d = abs(i - j)
        if d == 0:
            v = 100.0
        else:
            v = 50.0 / (1.0 + d)
            if rng.random() < 0.6:
                v = 0.0
        mat[i, j] = mat[j, i] = v
row_ids, col_ids = np.nonzero(mat)
counts = mat[row_ids, col_ids]
print(f'Total contacts: {len(row_ids):,}')
# Fraction of non-zero entries (density); sparsity is one minus this value.
print(f'Density: {len(row_ids) / (n_bins * n_bins):.2%}')
Total contacts: 6,516
Density: 39.77%
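
The nested loop above is easy to read but runs in pure Python. As a rough sketch, the same profile (100 on the diagonal, 50 / (1 + d) off the diagonal, each symmetric pair zeroed with probability 0.6) can be built with vectorized NumPy. The helper name synth_contact_matrix is hypothetical and not part of gunz-cm, and the random draws are not guaranteed to match the loop version entry for entry.

```python
import numpy as np

def synth_contact_matrix(n_bins: int, zero_prob: float, seed: int) -> np.ndarray:
    """Symmetric matrix: 100 on the diagonal, 50 / (1 + d) elsewhere, randomly zeroed."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_bins)
    dist = np.abs(idx[:, None] - idx[None, :])      # |i - j| for every pair
    mat = (50.0 / (1.0 + dist)).astype(np.float32)
    # One keep/drop decision per upper-triangle pair, mirrored to the lower triangle.
    iu, ju = np.triu_indices(n_bins, k=1)
    dropped = rng.random(iu.size) < zero_prob
    mat[iu[dropped], ju[dropped]] = 0.0
    mat[ju, iu] = mat[iu, ju]                       # enforce symmetry
    np.fill_diagonal(mat, 100.0)
    return mat

mat_vec = synth_contact_matrix(n_bins=128, zero_prob=0.6, seed=42)
rows_vec, cols_vec = np.nonzero(mat_vec)
density = rows_vec.size / mat_vec.size
```

With zero_prob=0.6 the expected density is roughly 40%, in line with the figure printed above.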

## 2. Build a SparseCODataset from COO

SparseCODataset.from_coo(...) is a classmethod that returns a
ready-to-iterate SparseCODataset. Two modes:

* window_size is None: the whole matrix is a single patch.
* window_size is int: patches are window_size-bin windows, spaced by stride bins. With stride < window_size patches overlap; with stride == window_size they do not.

# Single-patch dataset (whole matrix as one item).
single = SparseCODataset.from_coo(row_ids, col_ids, counts, n_bins=n_bins)
print('single patch:', len(single), 'item(s)')
item = single[0]
print('  keys:', sorted(item.keys()))
print('  coords.shape:', tuple(item['coords'].shape))
print('  features.shape:', tuple(item['features'].shape))
print('  info:', item['info'])
single patch: 1 item(s)
  keys: ['coords', 'features', 'info', 'target']
  coords.shape: (6516, 2)
  features.shape: (6516, 1)
  info: {'chrom': None, 'start': 0, 'end': 128}
# Sliding-window dataset (window_size=32, stride=16 -> 50% overlap).
slide = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
)
print(f'sliding-window patches: {len(slide)}')
item0 = slide[0]
item1 = slide[1]
print(f'  patch 0 coords.shape: {tuple(item0["coords"].shape)}, '
      f'info: {item0["info"]}')
print(f'  patch 1 coords.shape: {tuple(item1["coords"].shape)}, '
      f'info: {item1["info"]}')
assert item0['info']['end'] > item1['info']['start'], 'patches must overlap'
sliding-window patches: 7
  patch 0 coords.shape: (404, 2), info: {'chrom': None, 'start': 0, 'end': 32}
  patch 1 coords.shape: (394, 2), info: {'chrom': None, 'start': 16, 'end': 48}
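
The patch count of 7 follows from simple arithmetic on n_bins, window_size, and stride. The sketch below reproduces the window spans under one assumption: windows that would run past the last bin are dropped. The tutorial does not show how from_coo treats a trailing partial window, so the helper window_starts is hypothetical and only illustrates the arithmetic.

```python
def window_starts(n_bins: int, window_size: int, stride: int) -> list[int]:
    """Start bins of every window that fits entirely inside [0, n_bins)."""
    if window_size > n_bins:
        return []
    return list(range(0, n_bins - window_size + 1, stride))

starts = window_starts(n_bins=128, window_size=32, stride=16)
spans = [(s, s + 32) for s in starts]
print(len(spans), spans[:2])   # 7 [(0, 32), (16, 48)]

# stride == window_size gives non-overlapping tiles.
tiles = window_starts(n_bins=128, window_size=32, stride=32)
```

Equivalently, the count is (n_bins - window_size) // stride + 1, which gives (128 - 32) // 16 + 1 = 7 here.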

## 3. The downsampling contract

SparseCODataset owns the binomial downsampling logic in one place.
Set downsample_ratio=0.1 (or a tuple (0.05, 0.2) for a random
per-item alpha). The returned item also carries a target tensor,
intended to hold the unsampled counts. Today target is an alias for
features, since SparseCODataset has no ground-truth source distinct
from the input, which is why the two sums printed below are identical.
Keeping the key makes the contract stable for Hi-C super-resolution
training pipelines.

ds_ds = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
    downsample_ratio=0.25,
)
# Seed NumPy's global RNG before sampling the downsampled patch.
np.random.seed(42)
ds_ds_item = ds_ds[0]
n_kept = ds_ds_item['coords'].shape[0]
print(f'patch 0 after downsample=0.25: {n_kept} contacts kept')
print(f'  features sum: {ds_ds_item["features"].sum().item():.1f}')
print(f'  target sum:   {ds_ds_item["target"].sum().item():.1f}')
patch 0 after downsample=0.25: 293 contacts kept
  features sum: 1404.0
  target sum:   1404.0
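
To make the idea of binomial downsampling concrete, here is a minimal NumPy sketch. It is not the SparseCODataset implementation, whose internals are not shown in this tutorial. It assumes integer counts, thins each one with Binomial(n=count, p=ratio), and drops entries that fall to zero, which is why fewer coordinates survive than went in. The helper name binomial_downsample and the demo arrays are hypothetical.

```python
import numpy as np

def binomial_downsample(rows, cols, counts, ratio, rng):
    """Thin integer counts with Binomial(n=count, p=ratio); drop entries that fall to zero."""
    counts = np.asarray(counts, dtype=np.int64)
    thinned = rng.binomial(counts, ratio)
    keep = thinned > 0
    return rows[keep], cols[keep], thinned[keep]

demo_rng = np.random.default_rng(0)
demo_rows = np.array([0, 0, 1, 2, 3])
demo_cols = np.array([0, 1, 1, 3, 3])
demo_counts = np.array([100, 25, 100, 1, 100])

r_ds, c_ds, v_ds = binomial_downsample(demo_rows, demo_cols, demo_counts, 0.25, demo_rng)
print(f'{len(v_ds)} of {len(demo_counts)} entries kept, '
      f'sum {v_ds.sum()} of {demo_counts.sum()}')
```

Entries with small counts are the most likely to vanish, so downsampling changes the sparsity pattern as well as the values.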

## 4. The dense output path

Pass output_type='dense' to get a torch.Tensor of shape
(1, n_bins_per_patch, n_bins_per_patch) instead of the 4-key dict.
Useful for direct-input convolutional architectures (no sparse-conv
stack to set up).

ds_dense = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=32,
    output_type='dense',
)
t = ds_dense[0]
print(f'dense patch shape: {tuple(t.shape)}')
print(f'  dtype: {t.dtype}, non-zero: {(t != 0).sum().item()}')
dense patch shape: (1, 32, 32)
  dtype: float32, non-zero: 404
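
For intuition, the dense patch can be thought of as a scatter of the COO entries that fall inside the window. The NumPy sketch below assumes a patch is the diagonal block [start, end) x [start, end), which is consistent with patch 0 reporting 404 coordinates in sparse mode and 404 non-zero cells in dense mode. The helper coo_to_dense_patch and the toy arrays are hypothetical and use NumPy rather than torch.

```python
import numpy as np

def coo_to_dense_patch(rows, cols, vals, start, end):
    """Scatter the COO entries with both indices in [start, end) into a (1, w, w) array."""
    inside = (rows >= start) & (rows < end) & (cols >= start) & (cols < end)
    w = end - start
    patch = np.zeros((1, w, w), dtype=np.float32)
    # Shift to patch-local coordinates before writing.
    patch[0, rows[inside] - start, cols[inside] - start] = vals[inside]
    return patch

toy_rows = np.array([0, 1, 1, 5, 6])
toy_cols = np.array([0, 1, 5, 1, 6])
toy_vals = np.array([100.0, 100.0, 10.0, 10.0, 100.0], dtype=np.float32)

toy_patch = coo_to_dense_patch(toy_rows, toy_cols, toy_vals, start=0, end=4)
print(toy_patch.shape, int((toy_patch != 0).sum()))   # (1, 4, 4) 2
```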

## 5. sparse_collate_fn: the 4-key contract

The default DataLoader collate function would try to stack the 4-key
dicts along dim 0, which cannot work: coords has shape (N, 2)
with N varying per item. sparse_collate_fn prepends a
batch-index column to coords so the output has shape
(sum_N, 3), the MinkowskiEngine standard. features and
target are concatenated; infos is returned as a Python list
of per-item dicts (note the plural key infos).

batch = [slide[i] for i in range(4)]
out = sparse_collate_fn(batch)
print('output keys:', sorted(out.keys()))
print(f'coords.shape:   {tuple(out["coords"].shape)}')
print(f'features.shape: {tuple(out["features"].shape)}')
print(f'target.shape:   {tuple(out["target"].shape)}')
print(f'infos:          {len(out["infos"])} dict(s)')
print(f'  batch indices (first col): {out["coords"][:, 0].tolist()}')
output keys: ['coords', 'features', 'infos', 'target']
coords.shape:   (1620, 3)
features.shape: (1620, 1)
target.shape:   (1620,)
infos:          4 dict(s)
  batch indices (first col): [0, 0, 0, ..., 3, 3, 3]  (1620 entries, values 0 through 3; long output truncated)
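
The batch-index convention is easy to see on a toy input. The sketch below is a NumPy stand-in, not the project's sparse_collate_fn: it handles only coords, features, and info, and leaves target out for brevity. The function name collate_with_batch_index and the toy items are hypothetical.

```python
import numpy as np

def collate_with_batch_index(items):
    """Prepend a batch-index column to each item's coords, then concatenate everything."""
    coords = [
        np.column_stack([np.full(len(it["coords"]), b, dtype=np.int64), it["coords"]])
        for b, it in enumerate(items)
    ]
    return {
        "coords": np.concatenate(coords, axis=0),
        "features": np.concatenate([it["features"] for it in items], axis=0),
        "infos": [it["info"] for it in items],
    }

toy_items = [
    {"coords": np.array([[0, 0], [0, 1]]), "features": np.array([[5.0], [2.0]]), "info": {"start": 0}},
    {"coords": np.array([[3, 3]]), "features": np.array([[7.0]]), "info": {"start": 16}},
]
toy_batch = collate_with_batch_index(toy_items)
print(toy_batch["coords"].tolist())   # [[0, 0, 0], [0, 0, 1], [1, 3, 3]]
```

Because every row carries its own batch index, items with different numbers of contacts can share one flat tensor with no padding.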

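## 6. Wire it into a DataLoader

With collate_fn=sparse_collate_fn, every batch is the standard
MinkowskiEngine sparse-tensor dict. num_workers > 0 is safe:
each worker re-imports SparseCODataset independently and
constructs its own in-memory copy of the COO arrays. There is
no shared cache to race on; the dataset is purely functional
(numpy slicing) and the only Python state is self._row_ids
etc., which is read-only after __init__.

Safe does not mean faster. On a dataset this small, the timing below
shows the two-worker loader running far slower than the single-process
one, because starting worker processes and passing batches between
processes costs more than the slicing itself.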

import time

def time_loader(ds, num_workers, n_iter=20, **kw):
    dl = DataLoader(
        ds, batch_size=4, shuffle=False,
        num_workers=num_workers,
        collate_fn=sparse_collate_fn,
        **kw,
    )
    t0 = time.perf_counter()
    for _ in range(n_iter):
        for batch in dl:
            pass  # consume
    return (time.perf_counter() - t0) * 1000 / n_iter

single_t = time_loader(slide, num_workers=0)
multi_t = time_loader(slide, num_workers=2)
print(f'single-worker avg iter: {single_t:.2f} ms')
print(f'multi-worker (2) avg iter: {multi_t:.2f} ms')
if multi_t > 0:
    print(f'speedup: {single_t / multi_t:.2f}x')
single-worker avg iter: 0.65 ms
multi-worker (2) avg iter: 34.09 ms
speedup: 0.02x

## 7. Summary

* SparseCODataset.from_coo(...) builds a sparse-COO dataset directly from in-memory arrays; no file loader required.
* Sliding-window iteration: window_size=None -> single patch; window_size=int -> stride-spaced windows with optional overlap.
* The base class owns the downsample-replay contract: features is the binomial-subsampled view, and target is reserved for the unsampled counts (today it is an alias for features).
* sparse_collate_fn enforces the 4-key contract and produces a MinkowskiEngine-compatible sparse-tensor dict with a batch-index column prepended to coords.
* The dataset is purely functional; num_workers > 0 is safe, though not necessarily faster on small inputs.

## Where to go from here

* Tutorial 29: the file-backed GzcmDataset DataLoader path.
* Tutorial 31 (tutorial 07 in this index): full in-memory Hi-C pipeline from synthetic generator to PyTorch batch in one notebook.