# Tutorial 30: Sparse-COO DataLoader with SparseCODataset.from_coo

For small-to-medium matrices (single-chromosome patches, up to a few tens of thousands of bins), there is no need to round-trip through a file on disk. v2.28.0 introduces `SparseCODataset.from_coo(...)`, which builds a sparse 4-key dataset directly from in-memory COO arrays. Combined with `sparse_collate_fn`, this is the fastest way to feed a PyTorch DataLoader with sparse Hi-C training data.

This tutorial builds a sliding-window dataset from synthetic COO arrays, demonstrates `stride < window_size` for overlapping patches, shows the downsample-replay pattern that `SparseCODataset` provides, and concludes by wiring a DataLoader with the project's `sparse_collate_fn`. The collate function prepends a batch-index column to `coords` (MinkowskiEngine convention), making the output directly usable by sparse-conv layers.

## Learning Objectives

* Build a `SparseCODataset` from in-memory COO arrays via `from_coo(...)`
* Configure sliding-window iteration with `stride < window_size`
* Use `sparse_collate_fn` to batch items with a batch-index column
* Wire everything into a torch DataLoader with `num_workers > 0`

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 12 minutes

## Data

This tutorial generates a synthetic 128x128 contact matrix inline via `np.random.default_rng(42)`. The full matrix is loaded into memory as COO arrays; the `SparseCODataset` then yields sliding windows over it.

---
# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import torch
from torch.utils.data import DataLoader
from gunz_cm.datasets.sparse_coo import SparseCODataset
from gunz_cm.datasets.sparse_collate import sparse_collate_fn
rng = np.random.default_rng(42)
## 1. Synthesize a contact matrix as COO

For a controlled in-memory test, build a small symmetric matrix with a realistic sparsity profile: most off-diagonal entries are zero, and the non-zero values decay as a power law with distance from the diagonal.
n_bins = 128
mat = np.zeros((n_bins, n_bins), dtype=np.float32)
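for i in range(n_bins):
    for j in range(i, n_bins):
        d = abs(i - j)  # genomic distance in bins
        if d == 0:
            v = 100.0  # strong signal on the diagonal
        else:
            v = 50.0 / (1.0 + d)  # decay with distance from the diagonal
            if rng.random() < 0.6:
                v = 0.0  # zero out about 60% of off-diagonal entries
        mat[i, j] = mat[j, i] = v  # keep the matrix symmetric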
row_ids, col_ids = np.nonzero(mat)
counts = mat[row_ids, col_ids]
print(f'Total contacts: {len(row_ids):,}')
print(f'Density: {len(row_ids) / (n_bins * n_bins):.2%}')
Total contacts: 6,516
Density: 39.77%
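The dense-to-COO step above can be checked on a toy example. The following is a minimal NumPy-only sketch, not part of gunz-cm: it extracts `(row, col, value)` triplets from a small symmetric matrix, scatters them back, and reports the fraction of non-zero entries. The 3x3 matrix and all variable names are made up for illustration.

```python
import numpy as np

# Tiny symmetric matrix standing in for the 128x128 one (illustrative values).
dense = np.array(
    [[4.0, 0.0, 1.5],
     [0.0, 3.0, 0.0],
     [1.5, 0.0, 2.0]],
    dtype=np.float32,
)

# Dense -> COO: one (row, col, value) triplet per non-zero entry.
rows, cols = np.nonzero(dense)
vals = dense[rows, cols]

# COO -> dense: scatter the triplets back into an empty matrix.
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = vals

# Fraction of cells that hold a non-zero value.
density = len(vals) / dense.size
print(f"non-zero entries: {len(vals)}, density: {density:.2%}")
```

Because `np.nonzero` walks the matrix in row-major order, both halves of each symmetric pair appear in the triplet list, which is why the count in the tutorial cell includes upper and lower triangles.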
## 2. Build a SparseCODataset from COO

`SparseCODataset.from_coo(...)` is a classmethod that returns a ready-to-iterate `SparseCODataset`. It has two modes:

* `window_size is None`: the whole matrix is a single patch.
* `window_size` is an `int`: patches are `window_size`-bin windows, spaced `stride` bins apart. With `stride < window_size` patches overlap; with `stride == window_size` they do not.
# Single-patch dataset (whole matrix as one item).
single = SparseCODataset.from_coo(row_ids, col_ids, counts, n_bins=n_bins)
print('single patch:', len(single), 'item(s)')
item = single[0]
print(' keys:', sorted(item.keys()))
print(' coords.shape:', tuple(item['coords'].shape))
print(' features.shape:', tuple(item['features'].shape))
print(' info:', item['info'])
single patch: 1 item(s)
keys: ['coords', 'features', 'info', 'target']
coords.shape: (6516, 2)
features.shape: (6516, 1)
info: {'chrom': None, 'start': 0, 'end': 128}
# Sliding-window dataset (window_size=32, stride=16 -> 50% overlap).
slide = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
)
print(f'sliding-window patches: {len(slide)}')
item0 = slide[0]
item1 = slide[1]
print(f' patch 0 coords.shape: {tuple(item0["coords"].shape)}, '
      f'info: {item0["info"]}')
print(f' patch 1 coords.shape: {tuple(item1["coords"].shape)}, '
      f'info: {item1["info"]}')
assert item0['info']['end'] > item1['info']['start'], 'patches must overlap'
sliding-window patches: 7
patch 0 coords.shape: (404, 2), info: {'chrom': None, 'start': 0, 'end': 32}
patch 1 coords.shape: (394, 2), info: {'chrom': None, 'start': 16, 'end': 48}
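The patch count and the `start`/`end` values printed above follow from simple window arithmetic. The sketch below is a NumPy-only illustration under two assumptions: that windows slide along the diagonal and must fit entirely inside the matrix, and that entries are shifted to window-local coordinates. `window_starts` and `crop_coo` are hypothetical helpers, not gunz-cm functions, and the library's actual cropping may differ.

```python
import numpy as np

def window_starts(n_bins: int, window_size: int, stride: int) -> list[int]:
    """Start bins of all windows that fit entirely inside [0, n_bins)."""
    return list(range(0, n_bins - window_size + 1, stride))

def crop_coo(rows, cols, vals, start, end):
    """Keep entries with both indices in [start, end), shifted to local coordinates."""
    keep = (rows >= start) & (rows < end) & (cols >= start) & (cols < end)
    return rows[keep] - start, cols[keep] - start, vals[keep]

# Same geometry as the tutorial cell: 128 bins, 32-bin windows, stride 16.
starts = window_starts(n_bins=128, window_size=32, stride=16)
print(len(starts), starts)  # 7 windows, starting at 0, 16, ..., 96

# Toy COO triplets (illustrative values) cropped to the window [16, 48).
rows = np.array([0, 5, 20, 40, 100])
cols = np.array([0, 30, 20, 41, 3])
vals = np.array([9.0, 1.0, 7.0, 2.0, 4.0])
r, c, v = crop_coo(rows, cols, vals, start=16, end=48)
print(r.tolist(), c.tolist(), v.tolist())
```

With `stride=16` and `window_size=32`, consecutive windows share 16 bins, which is the 50% overlap the assertion in the tutorial cell checks.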
## 3. The downsampling contract

`SparseCODataset` owns the binomial downsampling logic in one place. Set `downsample_ratio=0.1` (or a tuple `(0.05, 0.2)` for a random per-item alpha). The returned item keeps a `target` key intended for the unsampled tensor. Today `target` is an alias for `features`, since `SparseCODataset` has no ground-truth source distinct from the input, which is why the two sums printed by the next cell are identical. Keeping the key in place means the contract is stable for Hi-C super-resolution training pipelines.
ds_ds = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=16,
    downsample_ratio=0.25,
)
rng2 = np.random.default_rng(42)
np.random.seed(42)
ds_ds_item = ds_ds[0]
n_kept = ds_ds_item['coords'].shape[0]
print(f'patch 0 after downsample=0.25: {n_kept} contacts kept')
print(f' features sum: {ds_ds_item["features"].sum().item():.1f}')
print(f' target sum: {ds_ds_item["target"].sum().item():.1f}')
patch 0 after downsample=0.25: 293 contacts kept
features sum: 1404.0
target sum: 1404.0
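For readers unfamiliar with binomial downsampling, the idea can be sketched in a few lines of NumPy. This is an illustration of the general technique, not the gunz-cm implementation: it assumes integer-valued counts, treats each count as that many reads, and keeps each read independently with probability `alpha`. The function name `binomial_downsample` and the sample values are hypothetical.

```python
import numpy as np

def binomial_downsample(vals, alpha, rng):
    """Thin integer counts: each read survives independently with probability alpha."""
    counts = np.asarray(vals).astype(np.int64)
    kept = rng.binomial(counts, alpha)  # one binomial draw per COO entry
    mask = kept > 0                     # entries that still have at least one read
    return kept, mask

rng = np.random.default_rng(0)
vals = np.array([100, 25, 12, 1, 1, 3])  # illustrative contact counts
kept, mask = binomial_downsample(vals, alpha=0.25, rng=rng)

print("kept counts:", kept[mask].tolist())
print("surviving entries:", int(mask.sum()), "of", len(vals))
```

Entries whose draw comes out as zero disappear from the sparse representation, which matches the drop from 404 to 293 contacts seen in the output above.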
## 4. The dense output path

Pass `output_type='dense'` to get a `torch.Tensor` of shape `(1, n_bins_per_patch, n_bins_per_patch)` instead of the 4-key dict. This is useful for direct-input convolutional architectures, where there is no sparse-conv stack to set up.
ds_dense = SparseCODataset.from_coo(
    row_ids, col_ids, counts, n_bins=n_bins,
    window_size=32, stride=32,
    output_type='dense',
)
t = ds_dense[0]
print(f'dense patch shape: {tuple(t.shape)}')
print(f' dtype: {t.dtype}, non-zero: {(t != 0).sum().item()}')
dense patch shape: (1, 32, 32)
dtype: float32, non-zero: 404
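Conceptually, the dense path scatters a patch's COO triplets into a zero-filled square with a leading channel axis. The sketch below shows one way to do that in NumPy; `coo_to_dense` is a hypothetical helper, and summing duplicate coordinates is an assumption of this sketch rather than documented gunz-cm behavior.

```python
import numpy as np

def coo_to_dense(rows, cols, vals, size):
    """Scatter window-local COO triplets into a (1, size, size) float32 array."""
    dense = np.zeros((1, size, size), dtype=np.float32)
    # np.add.at accumulates in place, so repeated (row, col) pairs are summed.
    np.add.at(dense[0], (rows, cols), vals)
    return dense

# Toy triplets (illustrative values); (1, 2) appears twice on purpose.
rows = np.array([0, 1, 1, 3])
cols = np.array([0, 2, 2, 3])
vals = np.array([5.0, 1.0, 2.0, 4.0], dtype=np.float32)

patch = coo_to_dense(rows, cols, vals, size=4)
print(patch.shape, int((patch != 0).sum()))
```

The leading axis of size 1 plays the role of a channel dimension, so a batch of such patches stacks into the `(B, 1, H, W)` layout that 2D convolution layers expect.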
## 5. sparse_collate_fn: the 4-key contract

A default DataLoader collate function would try to stack the 4-key dicts along dim 0, which does not work: `coords` has shape `(N, 2)` with `N` varying per item. `sparse_collate_fn` prepends a batch-index column to `coords` so the output has shape `(sum_N, 3)`, the MinkowskiEngine convention. `features` and `target` are concatenated; `infos` is returned as a Python list of per-item dicts (note the plural key name `infos`).
batch = [slide[i] for i in range(4)]
out = sparse_collate_fn(batch)
print('output keys:', sorted(out.keys()))
print(f'coords.shape: {tuple(out["coords"].shape)}')
print(f'features.shape: {tuple(out["features"].shape)}')
print(f'target.shape: {tuple(out["target"].shape)}')
print(f'infos: {len(out["infos"])} dict(s)')
print(f' batch indices (unique): {sorted(set(out["coords"][:, 0].tolist()))}')
output keys: ['coords', 'features', 'infos', 'target']
coords.shape: (1620, 3)
features.shape: (1620, 1)
target.shape: (1620,)
infos: 4 dict(s)
 batch indices (unique): [0, 1, 2, 3]
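The batch-index convention described in this section can be reproduced with plain NumPy. The following is a minimal sketch of the idea, not the source of `sparse_collate_fn`: `collate_sparse` is a hypothetical name, the items are toy dicts with the same four keys, and arrays stand in for the tensors the real function returns.

```python
import numpy as np

def collate_sparse(batch):
    """Concatenate variable-length sparse items, tagging each row with its batch index."""
    coords, feats, targets, infos = [], [], [], []
    for b, item in enumerate(batch):
        c = np.asarray(item["coords"])
        batch_col = np.full((len(c), 1), b, dtype=c.dtype)  # column of b's
        coords.append(np.hstack([batch_col, c]))            # (N, 2) -> (N, 3)
        feats.append(np.asarray(item["features"]))
        targets.append(np.asarray(item["target"]))
        infos.append(item["info"])                          # kept as a plain list
    return {
        "coords": np.concatenate(coords, axis=0),
        "features": np.concatenate(feats, axis=0),
        "target": np.concatenate(targets, axis=0),
        "infos": infos,
    }

# Two toy items with 3 and 2 non-zero entries (illustrative values).
item_a = {"coords": np.array([[0, 0], [1, 2], [3, 3]]),
          "features": np.array([[5.0], [1.0], [4.0]]),
          "target": np.array([[5.0], [1.0], [4.0]]),
          "info": {"chrom": None, "start": 0, "end": 4}}
item_b = {"coords": np.array([[0, 1], [2, 2]]),
          "features": np.array([[2.0], [7.0]]),
          "target": np.array([[2.0], [7.0]]),
          "info": {"chrom": None, "start": 4, "end": 8}}

out_demo = collate_sparse([item_a, item_b])
print(out_demo["coords"].shape, out_demo["coords"][:, 0].tolist())
```

Tagging each row with its item index is what lets a sparse-conv layer keep the items apart after they have been concatenated into one long coordinate list.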