# Tutorial 28: Reading GZCM v4 Files with the LRU Tile Cache

`GzcmDataset` is the high-level reader for GZCM v3 and v4 files. It builds a tile index from the file’s metadata, decodes individual tiles on demand, and caches the most recently used ones in an in-memory LRU (`cachetools.LRUCache`) protected by a `threading.RLock`, so the cache is safe under PyTorch `DataLoader(num_workers > 0)`.

GZCM v4’s read path differs from v3 only in the metadata location (`meta["regions"][0]["tile_bboxes"]` vs v3’s `meta["tiles"]`). The v4 dispatch in `GzcmDataset._init_compressed` normalizes both shapes into the same internal representation. Without that normalization (Bug 0.1), every v4 file would have an empty tile index and `__getitem__` would return an empty sparse dict regardless of the underlying matrix.

This tutorial walks through reading v4 files end-to-end, demonstrates the v2.26.0+ thread-safe LRU cache, and shows how to tune the `tile_cache_size` parameter for high-concurrency NN training.

## Learning Objectives

* Read a v4 file with `GzcmDataset` and confirm the Bug 0.1 metadata fix
* Inspect `_DEFAULT_TILE_CACHE_SIZE` and the v2.26.0 cache design
* Set `tile_cache_size=0` to disable caching
* Measure cache hit-rate under repeated `__getitem__` calls

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial builds a v4 file in a tempdir using synthetic data from `notebooks/_synthetic_data.py`, then reads it back with `GzcmDataset`. No external data files are needed.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

rng = np.random.default_rng(42)

## 1. The v2.26.0 tile cache at a glance

Before v2.26.0, the cache was a bare `OrderedDict` (maxsize=32, no locking). It was the documented bottleneck for high-concurrency NN training. v2.26.0 replaced it with `cachetools.LRUCache` (default maxsize=256) guarded by a `threading.RLock`.
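To make the design concrete, here is a minimal sketch of the same idea using only the standard library: an `OrderedDict` for recency ordering plus a `threading.RLock` around every access. `LockedLRU`, `get_or_compute`, and `fake_decode` are hypothetical names for illustration; gunz-cm itself uses `cachetools.LRUCache`, and its internals may differ.

```python
import threading
from collections import OrderedDict


class LockedLRU:
    """Tiny LRU cache guarded by a re-entrant lock (illustrative only)."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self._data: OrderedDict = OrderedDict()
        self._lock = threading.RLock()

    def get_or_compute(self, key, compute):
        """Return the cached value for key, computing and storing it on a miss."""
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # mark as most recently used
                return self._data[key]
            value = compute(key)
            self._data[key] = value
            if len(self._data) > self.maxsize:
                self._data.popitem(last=False)  # evict least recently used
            return value

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)


cache = LockedLRU(maxsize=2)
decode_calls = []


def fake_decode(tile_id):
    # Stand-in for a real tile decode; records each call.
    decode_calls.append(tile_id)
    return f"tile-{tile_id}"


def worker():
    for tile_id in (0, 1, 0, 2, 0):
        cache.get_or_compute(tile_id, fake_decode)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"entries kept: {len(cache)}, decodes run: {len(decode_calls)}")
```

Holding the lock while decoding keeps the sketch simple but serializes decodes across threads; a production cache might decode outside the lock and only lock the dictionary update.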

print(f'_DEFAULT_TILE_CACHE_SIZE = {_DEFAULT_TILE_CACHE_SIZE}')
print('(raised from 32 to 256 in v2.26.0; configurable via the')
print(' tile_cache_size constructor parameter on GzcmDataset)')
_DEFAULT_TILE_CACHE_SIZE = 256
(raised from 32 to 256 in v2.26.0; configurable via the
 tile_cache_size constructor parameter on GzcmDataset)

## 2. Build a v4 file to read

The reader reads whatever the writer wrote. We mock the loader to keep the tutorial self-contained, then write a small v4 file with `adaptive_codec=False, compression='zstd'` for determinism.

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Unwrap the ContactMatrix and pull out the nonzero entries with the helper defined earlier.
row_ids, col_ids, counts = _row_col_counts(mat)
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})

tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'tutorial_v4_read.gzcm'
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes at {out}')
wrote 8,192 bytes at /tmp/tmp32k52ek3/tutorial_v4_read.gzcm

## 3. Read the v4 file with GzcmDataset

The v4 dispatch in `_init_compressed` reads `meta["regions"][0]["tile_bboxes"]` and normalizes it into the same shape as v3’s `meta["tiles"]` (the Bug 0.1 fix). The resulting `_tile_index` should be non-empty.
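The normalization step can be pictured as a small function that accepts either metadata shape and returns one list. This is a sketch under stated assumptions: only the two key paths come from the text above, while the helper name `normalize_tile_entries` and the bounding-box tuple layout are hypothetical and not taken from the gunz-cm source.

```python
from typing import Any


def normalize_tile_entries(meta: dict[str, Any]) -> list[Any]:
    """Return tile entries from either a v3- or a v4-shaped metadata dict."""
    regions = meta.get("regions")
    if regions:
        # v4 shape: entries live under the first region.
        return list(regions[0].get("tile_bboxes", []))
    # v3 shape: entries live at the top level.
    return list(meta.get("tiles", []))


# Illustrative bounding boxes as (row_start, row_end, col_start, col_end).
v3_meta = {"tiles": [(0, 256, 0, 256)]}
v4_meta = {"regions": [{"tile_bboxes": [(0, 256, 0, 256)]}]}

# Both shapes normalize to the same list.
assert normalize_tile_entries(v3_meta) == normalize_tile_entries(v4_meta)

# Reading only the v3 key on v4 metadata reproduces the empty-index symptom.
assert v4_meta.get("tiles", []) == []
```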

ds = GzcmDataset(str(out), window_size=1_000_000)
print(f'layout:           {ds.layout}')
print(f'codec:            {ds.codec}')
print(f'tile_cache_size:  {ds.tile_cache_size}')
print(f'_tile_index size: {len(ds._build_tile_index())}')
assert ds.layout == 'sparse-tiled-intra', 'v4 layout not detected'
assert len(ds._build_tile_index()) > 0, 'Bug 0.1: tile index empty'
layout:           sparse-tiled-intra
codec:            zstd
tile_cache_size:  256
_tile_index size: 1

## 4. Roundtrip a few items to verify the cache works

Each call to `ds[i]` walks the tile index, finds the tiles that overlap the requested window, decodes them via the codec registry, and caches the result. Repeated calls to the same `i` should hit the cache after the first decode.
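The "find the overlapping tiles" step amounts to a rectangle-intersection test. A minimal sketch, assuming each tile is described by a half-open bounding box `(row_start, row_end, col_start, col_end)` in bin coordinates; that layout and the helper name `tiles_overlapping` are illustrative assumptions, not the format `GzcmDataset` stores internally.

```python
def tiles_overlapping(window, bboxes):
    """Return the indices of bboxes that intersect a half-open window.

    window and each bbox are (row_start, row_end, col_start, col_end).
    """
    r0, r1, c0, c1 = window
    hits = []
    for idx, (tr0, tr1, tc0, tc1) in enumerate(bboxes):
        rows_overlap = tr0 < r1 and r0 < tr1
        cols_overlap = tc0 < c1 and c0 < tc1
        if rows_overlap and cols_overlap:
            hits.append(idx)
    return hits


# Three hypothetical 256-bin tiles of an upper-triangular layout.
bboxes = [(0, 256, 0, 256), (0, 256, 256, 512), (256, 512, 256, 512)]

print(tiles_overlapping((0, 100, 0, 100), bboxes))      # [0]
print(tiles_overlapping((200, 300, 200, 300), bboxes))  # [0, 1, 2]
print(tiles_overlapping((600, 700, 600, 700), bboxes))  # []
```

A linear scan is enough for a handful of tiles; a large index would more likely compute the candidate tile range directly from the window coordinates and the tile size.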

import time
for i in (0, len(ds) // 2, len(ds) - 1):
    item = ds[i]
    n_coords = item['coords'].shape[0]
    print(f'  ds[{i}]: n_coords={n_coords}, target.shape={tuple(item["target"].shape)}')
print()
print('Cache after 3 distinct fetches:')
print(f'  cache size: {len(ds._tile_cache)}')
print(f'  cache max:  {ds._tile_cache.maxsize}')
  ds[0]: n_coords=68, target.shape=(68,)
  ds[5]: n_coords=179, target.shape=(179,)
  ds[9]: n_coords=148, target.shape=(148,)

Cache after 3 distinct fetches:
  cache size: 1
  cache max:  256

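## 5. Measure cold vs warm fetch time

The v2.26.0 cache design is meant to make repeated reads fast. The first call decodes each tile; subsequent calls hit the cache. Note that section 4 already fetched `ds[0]` on this dataset, so the tile is in the cache before the "cold" call below; construct a fresh `GzcmDataset` first for a truly cold measurement.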

def time_one(i):
    t0 = time.perf_counter()
    _ = ds[i]
    return (time.perf_counter() - t0) * 1000

cold = time_one(0)
warm = [time_one(0) for _ in range(5)]
print(f'cold fetch ds[0]: {cold:.3f} ms')
print(f'warm fetch ds[0]: min={min(warm):.3f} ms, mean={sum(warm)/len(warm):.3f} ms')
print(f'speedup: {cold / max(0.001, min(warm)):.1f}x')
cold fetch ds[0]: 0.336 ms
warm fetch ds[0]: min=0.117 ms, mean=0.134 ms
speedup: 2.9x
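Timing alone does not show the hit rate named in the learning objectives. The tutorial does not show hit or miss counters on `GzcmDataset`, so the sketch below uses `functools.lru_cache` from the standard library as a stand-in and reads its `cache_info()` counters; `decode_tile` is a hypothetical placeholder for a real tile decode.

```python
from functools import lru_cache


@lru_cache(maxsize=256)
def decode_tile(tile_id: int) -> bytes:
    # Placeholder payload; a real decoder would decompress the tile here.
    return bytes(16)


decode_tile.cache_clear()  # start from an empty cache

decode_tile(0)             # first access: a miss
for _ in range(5):
    decode_tile(0)         # repeated accesses: hits

info = decode_tile.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hits={info.hits}, misses={info.misses}, hit rate={hit_rate:.2f}")
```

The same pattern of one miss followed by repeated hits mirrors the cold and warm fetches timed above, with the counters making the cache behavior explicit.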

## 6. Disable caching for memory-constrained workloads

For very large matrices, the LRU cache can use significant memory (256 tiles × 256 KB each ≈ 64 MB at tile_size=256). Set `tile_cache_size=0` to disable caching entirely; each call re-decodes the tiles it needs.
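The memory estimate is a product of three factors. A short sketch of the arithmetic, assuming each cached tile is held as a dense `tile_size × tile_size` array of 4-byte values (an assumption that reproduces the 256 KB per tile figure; real decoded tiles may be sparse and smaller). The helper names are hypothetical.

```python
MIB = 1024 ** 2


def tile_bytes(tile_size: int, bytes_per_value: int = 4) -> int:
    """Bytes for one dense square tile (assumed layout)."""
    return tile_size * tile_size * bytes_per_value


def cache_bytes(n_tiles: int, tile_size: int, bytes_per_value: int = 4) -> int:
    """Worst-case bytes held by a full cache of n_tiles tiles."""
    return n_tiles * tile_bytes(tile_size, bytes_per_value)


def tiles_for_budget(budget_bytes: int, tile_size: int, bytes_per_value: int = 4) -> int:
    """Largest tile count whose worst-case footprint fits the budget."""
    return budget_bytes // tile_bytes(tile_size, bytes_per_value)


print(cache_bytes(256, 256) / MIB)      # 64.0
print(tiles_for_budget(16 * MIB, 256))  # 64
```

The result of `tiles_for_budget` could be passed as `tile_cache_size` when constructing the dataset, with 0 remaining the choice when no memory can be spared.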

ds_no_cache = GzcmDataset(str(out), window_size=1_000_000, tile_cache_size=0)
print(f'no-cache dataset._tile_cache: {ds_no_cache._tile_cache}')
print('no-cache reads still work:')
for i in (0, len(ds_no_cache) // 2):
    n = ds_no_cache[i]['coords'].shape[0]
    print(f'  ds_no_cache[{i}]: n_coords={n}')
no-cache dataset._tile_cache: None
no-cache reads still work:
  ds_no_cache[0]: n_coords=68
  ds_no_cache[5]: n_coords=179

## 7. Summary

* `GzcmDataset` reads v3 and v4 files transparently: the v4 dispatch normalizes the metadata shape (Bug 0.1 fix).
* The v2.26.0+ cache is `cachetools.LRUCache` + `threading.RLock` (thread-safe under PyTorch DataLoader workers).
* `tile_cache_size` defaults to `_DEFAULT_TILE_CACHE_SIZE = 256` (raised from 32 in v2.26.0); set to 0 to disable caching.
* The Codec registry (Tutorial 25) is used in `_get_decoder`, so adding a new codec doesn’t require reader changes.

## Where to go from here

* Tutorial 29: PyTorch DataLoader integration with v4.
* Tutorial 25: codec registry and the wire-format contract.