# Tutorial 29: Using GzcmDataset with PyTorch DataLoader

GZCM v4 is designed for Hi-C NN training. The GzcmDataset exposes a standard PyTorch Dataset interface (__len__ and __getitem__ returning a sparse dict of {coords, features, target, info}) so it drops into a torch.utils.data.DataLoader directly. The v2.26.0+ thread-safe LRU tile cache makes num_workers > 0 safe: each worker process has its own cache, and within one worker an in-process RLock protects that cache from races.

This tutorial wires GZCM v4 into a DataLoader, demonstrates multi-worker fan-out, and compares throughput against a single-worker baseline. The lesson is that the v2.26.0 cache lets workers amortize the cost of tile decode across the iterations of one epoch; without it, every fetch re-decodes the same tiles.

## Learning Objectives

* Wrap GzcmDataset in a torch DataLoader with num_workers > 0
* Confirm the v2.26.0 cache is safe under multi-worker fan-out
* Compare single-worker vs multi-worker throughput
* Use persistent_workers=True to amortize GzcmDataset init cost

## Prerequisites

* gunz-cm installed: pip install gunz-cm
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial builds a small v4 file with synthetic data, then iterates it through a DataLoader with num_workers=0 and num_workers=2 to compare throughput. No external data files.
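To make the cache behaviour described in the overview concrete, here is a minimal standard-library sketch of an LRU cache guarded by a re-entrant lock. It is not the gunz-cm implementation: the class name `TileCache`, the `fake_decode` function, and the `maxsize` parameter are all hypothetical stand-ins chosen for illustration.

```python
import threading
from collections import OrderedDict


class TileCache:
    """Tiny LRU cache guarded by a re-entrant lock (illustrative only)."""

    def __init__(self, decode, maxsize=4):
        self._decode = decode          # callable: tile_id -> decoded tile
        self._maxsize = maxsize
        self._tiles = OrderedDict()    # key order doubles as recency order
        self._lock = threading.RLock()

    def get(self, tile_id):
        with self._lock:
            if tile_id in self._tiles:
                self._tiles.move_to_end(tile_id)   # mark as most recently used
                return self._tiles[tile_id]
            tile = self._decode(tile_id)           # cache miss: pay the decode cost
            self._tiles[tile_id] = tile
            if len(self._tiles) > self._maxsize:
                self._tiles.popitem(last=False)    # evict the least recently used tile
            return tile


decode_calls = []


def fake_decode(tile_id):
    """Hypothetical decoder that records every call it receives."""
    decode_calls.append(tile_id)
    return [tile_id] * 3   # stand-in for a decoded tile


cache = TileCache(fake_decode, maxsize=2)
for tile_id in [0, 0, 1, 0, 1]:
    cache.get(tile_id)

print(f"fetches: 5, decodes: {len(decode_calls)}")
```

Five fetches trigger only two decodes because repeated tile ids are served from the cache. With multiple DataLoader workers, each worker process would hold its own instance of such a cache, so the lock only has to arbitrate between threads inside one process.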
# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE
def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts
rng = np.random.default_rng(42)
## 1. Build a v4 file to iterate

Mock the loader to keep the tutorial self-contained. The GzcmDataset works with any .gzcm file the writer produced.
n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined above instead of repeating its body inline.
row_ids, col_ids, counts = _row_col_counts(mat)
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'tutorial_dataloader.gzcm'
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes')
wrote 8,192 bytes
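The manual save, replace, and restore of loaders.load_cm_data above can also be written with the standard library's `unittest.mock.patch.object`, which restores the original attribute even if the body raises. The sketch below uses a hypothetical stand-in namespace and a hypothetical `convert` function rather than the real gunz_cm modules, so it runs without the package installed.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module that exposes a loader function.
loaders = SimpleNamespace(load_cm_data=lambda path: f"real data from {path}")


def convert(path):
    """Stand-in converter that looks the loader up through the module attribute."""
    return loaders.load_cm_data(path)


# Inside the with-block the loader is replaced; on exit it is restored.
with mock.patch.object(loaders, "load_cm_data", lambda *a, **kw: "synthetic frame"):
    patched_result = convert("fake.hic")

restored_result = convert("fake.hic")

print(patched_result)
print(restored_result)
```

As with the try/finally version, the patch only takes effect if the code under test resolves the loader through the patched attribute at call time, not through a name it imported earlier.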
## 2. The GzcmDataset interface

__getitem__ returns a sparse dict suitable for Hi-C graph-network training:
ds = GzcmDataset(str(out), window_size=1_000_000)
item = ds[0]
print('keys:', sorted(item.keys()))
print('coords.shape:', tuple(item['coords'].shape))
print('features.shape:', tuple(item['features'].shape))
print('target.shape:', tuple(item['target'].shape))
print('info:', item['info'])
keys: ['coords', 'features', 'info', 'target']
coords.shape: (68, 2)
features.shape: (68, 1)
target.shape: (68,)
info: {'start': 0}
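For readers new to the map-style dataset protocol, the following toy class mimics the shape of the items printed above using plain Python lists. It is a sketch only: the class name `WindowedSparseDataset`, the window-relative coordinates, the use of the raw count as both feature and target, and the bin-based `start` value are assumptions for illustration, not a description of how GzcmDataset builds its items.

```python
class WindowedSparseDataset:
    """Map-style dataset stand-in: one item per diagonal window (illustrative only)."""

    def __init__(self, contacts, n_bins, bins_per_window):
        self._contacts = contacts              # list of (row, col, count) triples
        self._n_bins = n_bins
        self._bins_per_window = bins_per_window

    def __len__(self):
        # Number of windows along the diagonal, rounding up.
        return -(-self._n_bins // self._bins_per_window)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        lo = idx * self._bins_per_window
        hi = lo + self._bins_per_window
        # Keep only contacts whose row and column both fall inside the window.
        inside = [(r, c, v) for r, c, v in self._contacts
                  if lo <= r < hi and lo <= c < hi]
        return {
            "coords": [(r - lo, c - lo) for r, c, _ in inside],   # N x 2
            "features": [[float(v)] for _, _, v in inside],       # N x 1
            "target": [float(v) for _, _, v in inside],           # N
            "info": {"start": lo},
        }


contacts = [(0, 0, 5), (0, 1, 2), (3, 3, 7), (2, 5, 1)]
toy = WindowedSparseDataset(contacts, n_bins=6, bins_per_window=3)
item = toy[1]
print("len:", len(toy))
print("keys:", sorted(item.keys()))
```

The number of non-zero entries differs from window to window, which is why each item carries variable-length coords, features, and target.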
## 3. Single-worker DataLoader

The num_workers=0 path runs in the main process. The v2.26.0 cache helps even here: the same tile is decoded once per iteration, not once per __getitem__ call.
import time
from torch.utils.data import DataLoader
def time_loader(num_workers, n_iter=None):
    """Return the average wall-clock time (ms) of one full pass over ds."""
    if n_iter is None:
        n_iter = len(ds) * 2  # number of full passes to average over
    dl = DataLoader(ds, batch_size=1, shuffle=False,
                    num_workers=num_workers,
                    persistent_workers=(num_workers > 0))
    t0 = time.perf_counter()
    for _ in range(n_iter):  # each outer iteration is one full pass
        for batch in dl:
            pass  # consume
    return (time.perf_counter() - t0) * 1000 / n_iter
single = time_loader(num_workers=0)
print(f'single-worker avg iter: {single:.1f} ms')
single-worker avg iter: 2.3 ms
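Note that time_loader divides the elapsed time by the number of full passes, so its result is a per-pass figure. To report a per-batch figure as well, count the batches while iterating. The helper below is a hypothetical sketch that works with any factory returning an iterable; `range(10)` stands in for a DataLoader so the block runs without PyTorch.

```python
import time


def time_passes(iterable_factory, n_passes=2):
    """Time n_passes full passes and report per-pass and per-batch averages in ms."""
    n_batches = 0
    t0 = time.perf_counter()
    for _ in range(n_passes):
        for _batch in iterable_factory():   # one full pass over the data
            n_batches += 1
    elapsed_ms = (time.perf_counter() - t0) * 1000
    return {
        "n_batches": n_batches,
        "ms_per_pass": elapsed_ms / n_passes,
        "ms_per_batch": elapsed_ms / max(n_batches, 1),   # guard against empty input
    }


# Stand-in for a DataLoader: any callable that returns a fresh iterable.
stats = time_passes(lambda: range(10), n_passes=2)
print(stats["n_batches"])
```

With a real loader, the factory could simply be `lambda: dl`, since iterating a DataLoader again starts a new pass.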
## 5. pin_memory and async overlap

On a CUDA host, pin_memory=True makes the DataLoader copy each batch into page-locked (pinned) host memory. Transfers from pinned memory to the GPU can run asynchronously (for example with tensor.to(device, non_blocking=True)), so the copy overlaps with a worker preparing the next batch. This is the standard PyTorch optimization and works transparently with GzcmDataset because __getitem__ returns plain tensors in a dict.
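# pin_memory is left at False (the DataLoader default, which time_loader also
# uses), so this run measures the unpinned baseline. On a CUDA host, pass
# pin_memory=True when constructing the loader.
dl_unpinned = DataLoader(ds, batch_size=1, shuffle=False,
                         num_workers=0, pin_memory=False)
# time_loader builds its own loader; dl_unpinned only illustrates the flag.
unpinned = time_loader(num_workers=0)
print(f'no pin_memory: {unpinned:.1f} ms')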
no pin_memory: 2.6 ms