# Tutorial 29: Using GzcmDataset with PyTorch DataLoader

GZCM v4 is designed for Hi-C NN training. The `GzcmDataset` exposes a standard PyTorch `Dataset` interface (`__len__` and `__getitem__` returning a sparse dict of `{coords, features, target, info}`) so it drops into a `torch.utils.data.DataLoader` directly. The v2.26.0+ thread-safe LRU tile cache makes `num_workers > 0` safe: each worker process has its own cache, and within a worker an in-process RLock protects that cache from races.

This tutorial wires GZCM v4 into a DataLoader, demonstrates multi-worker fan-out, and compares throughput against a single-worker baseline. The lesson is that the v2.26.0 cache lets workers amortize the cost of tile decode across the iterations of one epoch; without it, every fetch re-decodes the same tiles.

## Learning Objectives

* Wrap GzcmDataset in a torch DataLoader with num_workers > 0
* Confirm the v2.26.0 cache is safe under multi-worker fan-out
* Compare single-worker vs multi-worker throughput
* Use `persistent_workers=True` to amortize GzcmDataset init cost

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial builds a small v4 file with synthetic data, then iterates it through a DataLoader with `num_workers=0` and `num_workers=2` to compare throughput. No external data files.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts


rng = np.random.default_rng(42)

## 1. Build a v4 file to iterate

Mock the loader to keep the tutorial self-contained. The `GzcmDataset` works with any `.gzcm` file the writer produced.

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined above instead of repeating its body inline.
row_ids, col_ids, counts = _row_col_counts(mat)
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'tutorial_dataloader.gzcm'
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes')

wrote 8,192 bytes
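The manual save-and-restore of `loaders.load_cm_data` above can also be written with `unittest.mock.patch.object`, which puts the original attribute back even if the body raises. The sketch below is only an illustration of that pattern: `fake_loaders` and `convert` are hypothetical stand-ins, not the real `gunz_cm.loaders` module or `convert_to_gzcm`, so the block runs on its own.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module that exposes a loader function.
fake_loaders = types.SimpleNamespace(
    load_cm_data=lambda *args, **kwargs: "real data"
)

def convert(path):
    """Toy converter that calls whatever loader is currently installed."""
    return fake_loaders.load_cm_data(path)

# Inside the with-block the loader returns the canned value; on exit the
# original function is restored automatically.
with mock.patch.object(fake_loaders, "load_cm_data",
                       new=lambda *args, **kwargs: "synthetic data"):
    patched_result = convert("fake.hic")

restored_result = convert("fake.hic")
print(patched_result, "|", restored_result)
```

The context manager replaces the `try`/`finally` pair, which keeps the patched region visually obvious in a notebook cell.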

## 2. The `GzcmDataset` interface

`__getitem__` returns a sparse dict suitable for Hi-C graph-network training:

ds = GzcmDataset(str(out), window_size=1_000_000)
item = ds[0]
print('keys:', sorted(item.keys()))
print('coords.shape:', tuple(item['coords'].shape))
print('features.shape:', tuple(item['features'].shape))
print('target.shape:', tuple(item['target'].shape))
print('info:', item['info'])

keys: ['coords', 'features', 'info', 'target']
coords.shape: (68, 2)
features.shape: (68, 1)
target.shape: (68,)
info: {'start': 0}
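If windows hold different numbers of nonzero contacts (68 in the item above), PyTorch's default collation cannot stack them once `batch_size > 1`, because it requires equal shapes. One common workaround is a custom `collate_fn` that concatenates the rows and records which item each row came from. The sketch below assumes items are dicts of torch tensors with the shapes printed above; `sparse_collate` and `toy_item` are hypothetical names and not part of gunz-cm.

```python
import torch

def sparse_collate(items):
    """Concatenate variable-length sparse items and tag rows with a batch index."""
    # batch_index[i] records which item row i came from.
    batch_index = torch.cat([
        torch.full((it["coords"].shape[0],), i, dtype=torch.long)
        for i, it in enumerate(items)
    ])
    return {
        "coords": torch.cat([it["coords"] for it in items], dim=0),
        "features": torch.cat([it["features"] for it in items], dim=0),
        "target": torch.cat([it["target"] for it in items], dim=0),
        "batch_index": batch_index,
        "info": [it["info"] for it in items],
    }

def toy_item(n, start):
    """Build a fake item with n contacts, mimicking the dict layout above."""
    return {
        "coords": torch.zeros((n, 2), dtype=torch.long),
        "features": torch.ones((n, 1)),
        "target": torch.ones(n),
        "info": {"start": start},
    }

# Two toy items with different numbers of contacts (3 and 5).
batch = sparse_collate([toy_item(3, 0), toy_item(5, 1_000_000)])
print(tuple(batch["coords"].shape), batch["batch_index"].tolist())
```

Under these assumptions it would be passed as `DataLoader(ds, batch_size=4, collate_fn=sparse_collate)`. The rest of this tutorial keeps `batch_size=1`, where the default collation is sufficient.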

## 3. Single-worker DataLoader

The `num_workers=0` path runs in the main process. The v2.26.0 cache helps even here: the same tile is decoded once per iteration, not once per `__getitem__` call.

import time
from torch.utils.data import DataLoader

def time_loader(num_workers, n_iter=None):
    """Return the mean wall-clock time in ms of one full pass over the loader."""
    if n_iter is None:
        n_iter = len(ds) * 2  # number of full passes to average over
    dl = DataLoader(ds, batch_size=1, shuffle=False,
                    num_workers=num_workers,
                    persistent_workers=(num_workers > 0))
    t0 = time.perf_counter()
    for _ in range(n_iter):
        for batch in dl:
            pass  # consume the batch; only loading time is measured
    return (time.perf_counter() - t0) * 1000 / n_iter

single = time_loader(num_workers=0)
print(f'single-worker avg iter: {single:.1f} ms')

single-worker avg iter: 2.3 ms
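To make the caching claim concrete, here is a minimal sketch of an LRU tile cache guarded by a `threading.RLock`. It is a toy written for this tutorial, not the gunz-cm implementation: the class name `TileCache`, the `decode` callback, and the `max_size` parameter are all assumptions made for illustration.

```python
import threading
from collections import OrderedDict

class TileCache:
    """Toy LRU cache guarded by a re-entrant lock (illustrative only)."""

    def __init__(self, decode, max_size=2):
        self._decode = decode          # function: tile_id -> decoded tile
        self._max_size = max_size
        self._tiles = OrderedDict()    # insertion order doubles as recency order
        self._lock = threading.RLock()
        self.misses = 0

    def get(self, tile_id):
        with self._lock:
            if tile_id in self._tiles:
                self._tiles.move_to_end(tile_id)   # mark as most recently used
                return self._tiles[tile_id]
            self.misses += 1
            tile = self._decode(tile_id)
            self._tiles[tile_id] = tile
            if len(self._tiles) > self._max_size:
                self._tiles.popitem(last=False)    # evict least recently used
            return tile

cache = TileCache(decode=lambda tile_id: f"decoded-{tile_id}", max_size=2)
for tile_id in [0, 0, 1, 0, 2, 1]:
    cache.get(tile_id)
print("misses:", cache.misses)
```

Six requests trigger four decodes here: the repeated requests for tile 0 are served from the cache, while tile 1 is evicted before its second request and must be decoded again. With a DataLoader, each worker process would hold its own instance of such a cache, so the lock only has to handle threads inside one process.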

## 5. `pin_memory` and async overlap

On a CUDA host, `pin_memory=True` makes the DataLoader place each batch in pinned (page-locked) host memory, so the copy to the GPU can run asynchronously while a worker prepares the next batch. This is the standard PyTorch optimization and works transparently with `GzcmDataset` because `__getitem__` returns plain tensors.

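# pin_memory only matters on a CUDA host, so this cell keeps it at False.
# time_loader builds its own DataLoader (pin_memory defaults to False), so the
# number below is a second no-pin_memory baseline, not a pinned measurement.
dl_unpinned = DataLoader(ds, batch_size=1, shuffle=False,
                         num_workers=0, pin_memory=False)
unpinned = time_loader(num_workers=0)
print(f'no pin_memory: {unpinned:.1f} ms')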

no pin_memory: 2.6 ms
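For reference, the usual pinned-memory pattern pairs `pin_memory=True` on the loader with `non_blocking=True` on the device copy. The sketch below uses a small `TensorDataset` as a stand-in for `GzcmDataset` (an assumption, so the block runs without a `.gzcm` file) and falls back to the CPU when no CUDA device is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy dataset standing in for GzcmDataset: 8 samples of 4 features each.
toy_ds = TensorDataset(torch.arange(32, dtype=torch.float32).reshape(8, 4))

# Only request pinned memory when a CUDA device is present.
loader = DataLoader(toy_ds, batch_size=2, num_workers=0, pin_memory=use_cuda)

total = 0.0
for (batch,) in loader:
    # With a pinned source tensor, non_blocking=True lets the host-to-device
    # copy overlap with other work; on a CPU-only host nothing is copied.
    batch = batch.to(device, non_blocking=True)
    total += batch.sum().item()
print(total)
```

Whether pinning gives a measurable speedup depends on batch size and on how much GPU work there is to overlap with; for the tiny batches in this tutorial the difference is likely to be small.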

## 6. Summary

* `GzcmDataset` is a standard PyTorch `Dataset`: drop it into a `DataLoader` without any adapter.
* The v2.26.0+ LRU cache + `threading.RLock` is safe under `num_workers > 0`; each worker has its own cache, no shared state between workers.
* `persistent_workers=True` keeps worker processes alive between epochs, so the dataset init cost (header parse, tile index build) is paid once per worker rather than once per epoch.
* The same cache design also gives `__getitem__` cache hits in the main process when `num_workers=0` (Tutorial 28).

## Where to go from here

* Tutorial 28: v4 read path + cache design.
* Tutorial 26: codec picker (per-region adaptive codec).