# Tutorial 29: Using GzcmDataset with PyTorch DataLoader

GZCM v4 is designed for Hi-C NN training. The `GzcmDataset` exposes a standard PyTorch Dataset interface (`__len__` and `__getitem__` returning a sparse dict of `{coords, features, target, info}`), so it drops into a `torch.utils.data.DataLoader` directly. The v2.26.0+ thread-safe LRU tile cache makes `num_workers > 0` safe: each worker process has its own cache, but the same in-process `RLock` protects the cache from races within one worker.

This tutorial wires GZCM v4 into a DataLoader, demonstrates multi-worker fan-out, and compares throughput against a single-worker baseline. The lesson is that the v2.26.0 cache lets workers amortize the cost of tile decode across the iterations of one epoch; without it, every fetch re-decodes the same tiles.

## Learning Objectives

* Wrap `GzcmDataset` in a torch DataLoader with `num_workers > 0`
* Confirm the v2.26.0 cache is safe under multi-worker fan-out
* Compare single-worker vs multi-worker throughput
* Use `persistent_workers=True` to amortize `GzcmDataset` init cost

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial builds a small v4 file with synthetic data, then iterates it through a DataLoader with `num_workers=0` and `num_workers=2` to compare throughput. No external data files.

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts


rng = np.random.default_rng(42)

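## 1. Build a v4 file to iterate

To keep the tutorial self-contained, the cell below temporarily replaces `loaders.load_cm_data` with a stub that returns the synthetic DataFrame, then restores the original in a `finally` block. The `GzcmDataset` works with any `.gzcm` file the writer produced.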

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined above instead of repeating the unwrap logic inline.
row_ids, col_ids, counts = _row_col_counts(mat)
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'tutorial_dataloader.gzcm'
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes')
wrote 8,192 bytes
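The manual save, replace, and restore pattern used above can also be written with the standard library's `unittest.mock.patch.object`, which restores the original attribute automatically when the `with` block exits. The sketch below is a minimal illustration only: `fake_loaders` is a hypothetical stand-in namespace, not the real `gunz_cm.loaders` module, and the string return values are placeholders for the DataFrame.

```python
import types
from unittest.mock import patch

# Hypothetical stand-in for a module that exposes a loader function.
fake_loaders = types.SimpleNamespace(
    load_cm_data=lambda path: f"real:{path}",
)

# Inside the block every call returns the synthetic payload.
with patch.object(fake_loaders, "load_cm_data",
                  new=lambda *a, **kw: "synthetic"):
    inside = fake_loaders.load_cm_data("fake.hic")

# After the block the original function is back, even if the body raised.
after = fake_loaders.load_cm_data("fake.hic")
print(inside, after)
```

Whether this works for a real converter depends on how that converter looks up the loader; if it imported the function by name into its own module, the patch would have to target that module instead.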

## 2. The GzcmDataset interface

`__getitem__` returns a sparse dict suitable for Hi-C graph-network training:

ds = GzcmDataset(str(out), window_size=1_000_000)
item = ds[0]
print('keys:', sorted(item.keys()))
print('coords.shape:', tuple(item['coords'].shape))
print('features.shape:', tuple(item['features'].shape))
print('target.shape:', tuple(item['target'].shape))
print('info:', item['info'])
keys: ['coords', 'features', 'info', 'target']
coords.shape: (68, 2)
features.shape: (68, 1)
target.shape: (68,)
info: {'start': 0}
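The loaders in this tutorial use `batch_size=1`. If windows hold different numbers of contacts, PyTorch's default collation cannot stack them for larger batches, and a custom `collate_fn` is the usual workaround. The sketch below is one possible approach, assuming each item holds torch tensors with the shapes printed above; `collate_sparse` is a hypothetical helper, not part of gunz-cm, and the two samples are hand-made stand-ins.

```python
import torch

def collate_sparse(samples):
    """Concatenate variable-length sparse samples into one batch dict.

    A batch-index column is prepended to coords so each point can be
    traced back to the sample it came from.
    """
    coords, features, targets, infos = [], [], [], []
    for i, s in enumerate(samples):
        n = s["coords"].shape[0]
        batch_col = torch.full((n, 1), i, dtype=s["coords"].dtype)
        coords.append(torch.cat([batch_col, s["coords"]], dim=1))
        features.append(s["features"])
        targets.append(s["target"])
        infos.append(s["info"])
    return {
        "coords": torch.cat(coords),
        "features": torch.cat(features),
        "target": torch.cat(targets),
        "info": infos,  # kept as a plain list of dicts
    }

# Two hand-made samples with 3 and 2 points (stand-ins for ds[i]).
s0 = {"coords": torch.tensor([[0, 1], [2, 3], [4, 5]]),
      "features": torch.ones(3, 1), "target": torch.zeros(3),
      "info": {"start": 0}}
s1 = {"coords": torch.tensor([[6, 7], [8, 9]]),
      "features": torch.ones(2, 1), "target": torch.zeros(2),
      "info": {"start": 1_000_000}}
batch = collate_sparse([s0, s1])
print(tuple(batch["coords"].shape))
```

Such a function would be passed as `DataLoader(ds, batch_size=4, collate_fn=collate_sparse)`.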

## 3. Single-worker DataLoader

The `num_workers=0` path runs in the main process. The v2.26.0 cache helps even here: the same tile is decoded once per iteration, not once per `__getitem__` call.

import time
from torch.utils.data import DataLoader

def time_loader(num_workers, n_iter=None):
    """Return the average wall-clock time (ms) of one full pass over the DataLoader."""
    if n_iter is None:
        n_iter = len(ds) * 2  # number of full passes to average over
    dl = DataLoader(ds, batch_size=1, shuffle=False,
                    num_workers=num_workers,
                    persistent_workers=(num_workers > 0))
    t0 = time.perf_counter()
    for _ in range(n_iter):
        for batch in dl:
            pass  # consume the batch; only the loading cost is measured
    return (time.perf_counter() - t0) * 1000 / n_iter

single = time_loader(num_workers=0)
print(f'single-worker avg iter: {single:.1f} ms')
single-worker avg iter: 2.3 ms
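The cache behavior described in this section can be illustrated with a small LRU cache guarded by a re-entrant lock. This is a minimal sketch of the general technique, not the gunz-cm implementation: `TileCache`, its `decode` callback, and the `misses` counter are hypothetical names chosen for the example.

```python
import threading
from collections import OrderedDict

class TileCache:
    """Hypothetical LRU cache guarded by an RLock (not the gunz-cm class)."""

    def __init__(self, max_size, decode):
        self._max_size = max_size
        self._decode = decode          # called only on a cache miss
        self._tiles = OrderedDict()    # oldest entry first
        self._lock = threading.RLock()
        self.misses = 0

    def get(self, tile_id):
        with self._lock:
            if tile_id in self._tiles:
                # Hit: mark as most recently used and return.
                self._tiles.move_to_end(tile_id)
                return self._tiles[tile_id]
            # Miss: decode, store, and evict the least recently used tile.
            self.misses += 1
            tile = self._decode(tile_id)
            self._tiles[tile_id] = tile
            if len(self._tiles) > self._max_size:
                self._tiles.popitem(last=False)
            return tile

cache = TileCache(max_size=2, decode=lambda tile_id: f"decoded-{tile_id}")
for tile_id in [0, 0, 1, 0, 2, 1]:
    cache.get(tile_id)
print(cache.misses)  # 4 decodes for 6 requests
```

Holding the lock across the decode keeps the sketch simple at the price of serializing decodes within a process; a real cache may make a different trade-off.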

## 4. Multi-worker DataLoader

Each worker process has its own `GzcmDataset` instance and its own LRU cache. Workers share the OS page cache for the `.gzcm` file, so cold-start cost is dominated by mmap page faults (not by the worker startup). With `persistent_workers=True`, the workers stay alive between epochs, so the dataset init runs once per worker for the lifetime of the DataLoader rather than once per epoch.

multi = time_loader(num_workers=2)
print(f'multi-worker (2) avg iter: {multi:.1f} ms')
if multi > 0:
    print(f'speedup: {single / multi:.2f}x')
multi-worker (2) avg iter: 17.6 ms
speedup: 0.13x
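In the recorded run above, the two-worker loader is slower than the baseline (0.13x). A likely reason, not measured here, is that with a file this small the process start-up and inter-process transfer costs outweigh the tile decode work. One way to check is to time the first batch separately from the rest. The helper below is a hypothetical sketch that works on any iterable of batches; the `slow_start` generator simulates start-up cost and stands in for a real DataLoader.

```python
import time

def time_batches(batches, warmup=1):
    """Time each batch; return (warmup_ms, mean_steady_ms, n_batches)."""
    durations = []
    t0 = time.perf_counter()
    it = iter(batches)  # for a multi-worker DataLoader, workers typically start here
    while True:
        try:
            next(it)
        except StopIteration:
            break
        now = time.perf_counter()
        durations.append((now - t0) * 1000)
        t0 = now
    warm, steady = durations[:warmup], durations[warmup:]
    mean_steady = sum(steady) / len(steady) if steady else float("nan")
    return sum(warm), mean_steady, len(durations)

def slow_start(n):
    """Simulated loader: a one-off start-up delay, then cheap batches."""
    time.sleep(0.05)
    for i in range(n):
        yield i

warm_ms, steady_ms, n_batches = time_batches(slow_start(5), warmup=1)
print(f"warm-up: {warm_ms:.1f} ms, steady: {steady_ms:.3f} ms/batch")
```

With a real loader, `time_batches(DataLoader(ds, num_workers=2))` would report the same split.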

## 5. pin_memory and async overlap

On a CUDA host, `pin_memory=True` makes the DataLoader place each batch in page-locked (pinned) host memory, so the copy to the GPU can run asynchronously while the next batch is being prepared by a worker. This is the standard PyTorch optimization and works transparently with `GzcmDataset` because `__getitem__` returns plain tensors.

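# pin_memory is left at False here; on a CUDA host, set pin_memory=True.
dl_pinned = DataLoader(ds, batch_size=1, shuffle=False,
                       num_workers=0, pin_memory=False)
# time_loader builds its own DataLoader and does not use dl_pinned,
# so the timing below is the unpinned single-worker baseline.
pinned = time_loader(num_workers=0)
print(f'no pin_memory: {pinned:.1f} ms')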
no pin_memory: 2.6 ms
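The cell above only times the unpinned path. The sketch below shows how pinned batches are usually consumed: pinning is enabled only when CUDA is available, and the transfer uses `non_blocking=True` so it can overlap with other work. The `TensorDataset` is a toy stand-in for `GzcmDataset`, and the summation is a placeholder for a training step.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy dataset standing in for GzcmDataset: eight one-element samples.
data = TensorDataset(torch.arange(8, dtype=torch.float32).reshape(8, 1))

# Pin host memory only when there is a GPU to copy to.
loader = DataLoader(data, batch_size=4, pin_memory=use_cuda)

total = 0.0
for (x,) in loader:
    # With a pinned source tensor, non_blocking=True lets the copy run
    # asynchronously; on a CPU-only host this is an ordinary no-op move.
    x = x.to(device, non_blocking=True)
    total += x.sum().item()
print(total)
```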

## 6. Summary

* `GzcmDataset` is a standard PyTorch Dataset: drop it into a DataLoader without any adapter.
* The v2.26.0+ LRU cache + `threading.RLock` is safe under `num_workers > 0`; each worker has its own cache, with no shared state between workers.
* `persistent_workers=True` amortizes the dataset init cost (header parse, tile index build) across epochs.
* The same cache design also gives `__getitem__` cache hits in the main process when `num_workers=0` (Tutorial 28).

## Where to go from here

* Tutorial 28: v4 read path + cache design.
* Tutorial 26: codec picker (per-region adaptive codec).