# Tutorial 28: Reading GZCM v4 Files with the LRU Tile Cache

`GzcmDataset` is the high-level reader for GZCM v3 and v4 files.
It builds a tile index from the file's metadata, decodes individual
tiles on demand, and caches the most recently used ones in an
in-memory LRU (`cachetools.LRUCache`) protected by a `threading.RLock`,
so the cache is safe under PyTorch `DataLoader(num_workers > 0)`.

GZCM v4's read path differs from v3 only in the metadata location
(`meta["regions"][0]["tile_bboxes"]` vs v3's `meta["tiles"]`). The
v4 dispatch in `GzcmDataset._init_compressed` normalizes both shapes
into the same internal representation. Without that normalization
(Bug 0.1), every v4 file would have an empty tile index and
`__getitem__` would return an empty sparse dict regardless of the
underlying matrix.

This tutorial walks through reading v4 files end-to-end, demonstrates
the v2.26.0+ thread-safe LRU cache, and shows how to tune the
`tile_cache_size` parameter for high-concurrency NN training.

## Learning Objectives

* Read a v4 file with `GzcmDataset` and confirm the Bug 0.1 metadata fix
* Inspect `_DEFAULT_TILE_CACHE_SIZE` and the v2.26.0 cache design
* Set `tile_cache_size=0` to disable caching
* Measure cache hit-rate under repeated `__getitem__` calls

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial builds a v4 file in a tempdir using synthetic data from
`notebooks/_synthetic_data.py`, then reads it back with `GzcmDataset`.
No external data files are needed.

---
# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE
def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data unless mat is already an ndarray.
    """
    # np.ndarray also has a .data attribute (a memoryview), so check the
    # type instead of relying on hasattr(mat, 'data').
    arr = mat if isinstance(mat, np.ndarray) else mat.data
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts
rng = np.random.default_rng(42)
## 1. The v2.26.0 tile cache at a glance

Before v2.26.0, the cache was a bare `OrderedDict` (maxsize=32, no
locking). It was the documented bottleneck for high-concurrency
NN training. v2.26.0 replaced it with `cachetools.LRUCache` (default
maxsize=256) guarded by `threading.RLock`.
print(f'_DEFAULT_TILE_CACHE_SIZE = {_DEFAULT_TILE_CACHE_SIZE}')
print('(raised from 32 to 256 in v2.26.0; configurable via the')
print(' tile_cache_size constructor parameter on GzcmDataset)')
_DEFAULT_TILE_CACHE_SIZE = 256
(raised from 32 to 256 in v2.26.0; configurable via the
tile_cache_size constructor parameter on GzcmDataset)
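To make the design above concrete, here is a minimal sketch of an LRU guarded by a re-entrant lock. It uses only the standard library (`OrderedDict` plus `threading.RLock`) as a stand-in for `cachetools.LRUCache`; the class name `LockedLRU` and its `get_or_compute` method are hypothetical and are not part of gunz-cm.

```python
import threading
from collections import OrderedDict


class LockedLRU:
    """Illustrative LRU cache guarded by an RLock (not the gunz-cm class)."""

    def __init__(self, maxsize=256):
        self.maxsize = maxsize
        self._data = OrderedDict()
        self._lock = threading.RLock()

    def get_or_compute(self, key, compute):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # mark as most recently used
                return self._data[key]
        value = compute(key)  # run the expensive decode outside the lock
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)  # evict least recently used
        return value

    def __len__(self):
        with self._lock:
            return len(self._data)


cache = LockedLRU(maxsize=2)
for tile_id in (0, 1, 0, 2):
    cache.get_or_compute(tile_id, lambda k: f"tile-{k}")

# Tile 1 was the least recently used entry when tile 2 arrived.
remaining = sorted(cache._data)
print(remaining)  # [0, 2]
```

Computing the value outside the lock keeps slow decodes from serializing all readers, at the cost of occasionally decoding the same tile twice when two threads miss at once.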
## 2. Build a v4 file to read

The reader reads whatever the writer wrote. We mock the loader
to keep the tutorial self-contained, then write a small v4 file
with `adaptive_codec=False, compression='zstd'` for determinism.
n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined during setup instead of repeating its logic inline.
row_ids, col_ids, counts = _row_col_counts(mat)
df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})
tmpdir = tempfile.mkdtemp()
out = pathlib.Path(tmpdir) / 'tutorial_v4_read.gzcm'
fake_hic = pathlib.Path(tmpdir) / 'fake.hic'
fake_hic.write_bytes(b'')
from gunz_cm import loaders
original = loaders.load_cm_data
loaders.load_cm_data = lambda *a, **kw: df
try:
    convert_to_gzcm(
        fpath=fake_hic, output_fpath=out, region1='chr1',
        bin_size_bp=50_000, version=4, tile_size=256,
        compression='zstd', overwrite=True,
    )
finally:
    # Always restore the real loader, even if the conversion fails.
    loaders.load_cm_data = original
print(f'wrote {out.stat().st_size:,} bytes at {out}')
wrote 8,192 bytes at /tmp/tmp32k52ek3/tutorial_v4_read.gzcm
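The try/finally swap of `loaders.load_cm_data` used above can also be written with `unittest.mock.patch.object`, which restores the original attribute automatically. The sketch below runs on a stand-in namespace rather than the real `gunz_cm.loaders` module; `fake_loaders` and `convert` are hypothetical names used only for illustration.

```python
import types
from unittest import mock

# Hypothetical stand-in for a loaders module; not the real gunz_cm.loaders.
fake_loaders = types.SimpleNamespace(
    load_cm_data=lambda path: f"real data from {path}"
)


def convert(path):
    """Hypothetical converter that looks up the loader at call time."""
    return fake_loaders.load_cm_data(path)


# Inside the block the loader returns synthetic data; on exit it is restored.
with mock.patch.object(fake_loaders, "load_cm_data", lambda *a, **kw: "synthetic frame"):
    patched_result = convert("fake.hic")

restored_result = convert("fake.hic")
print(patched_result)   # synthetic frame
print(restored_result)  # real data from fake.hic
```

This only works when the code under test resolves the loader through the patched module attribute at call time, which is the same assumption the manual swap relies on.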
## 3. Read the v4 file with GzcmDataset

The v4 dispatch in `_init_compressed` reads
`meta["regions"][0]["tile_bboxes"]` and normalizes it into the
same shape as v3's `meta["tiles"]` (the Bug 0.1 fix). The
resulting `_tile_index` should be non-empty.
ds = GzcmDataset(str(out), window_size=1_000_000)
print(f'layout: {ds.layout}')
print(f'codec: {ds.codec}')
print(f'tile_cache_size: {ds.tile_cache_size}')
print(f'_tile_index size: {len(ds._build_tile_index())}')
assert ds.layout == 'sparse-tiled-intra', 'v4 layout not detected'
assert len(ds._build_tile_index()) > 0, 'Bug 0.1: tile index empty'
layout: sparse-tiled-intra
codec: zstd
tile_cache_size: 256
_tile_index size: 1
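The normalization described in this section can be sketched as a single lookup that accepts either metadata shape. Only the two key paths (`meta["tiles"]` and `meta["regions"][0]["tile_bboxes"]`) come from the text above; the function name `tile_entries` and the contents of each tile entry are made-up placeholders, not the real GZCM schema or the actual `_init_compressed` code.

```python
def tile_entries(meta):
    """Return the tile list from v4-style or v3-style metadata (sketch)."""
    regions = meta.get("regions")
    if regions:  # v4-style: tiles live under the first region
        return list(regions[0].get("tile_bboxes", []))
    return list(meta.get("tiles", []))  # v3-style: top-level list


# Placeholder entries; the real per-tile fields are not shown in this tutorial.
v3_meta = {"tiles": [{"id": 0}, {"id": 1}]}
v4_meta = {"regions": [{"tile_bboxes": [{"id": 0}, {"id": 1}]}]}

# A v3-only lookup finds nothing in v4 metadata, which is the Bug 0.1 symptom.
v3_only_lookup = v4_meta.get("tiles", [])
print(len(v3_only_lookup))         # 0
print(len(tile_entries(v4_meta)))  # 2
```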
## 4. Roundtrip a few items to verify the cache works

Each call to `ds[i]` walks the tile index, finds the tiles that
overlap the requested window, decodes them via the registry,
and caches the result. Repeated calls to the same `i` should
hit the cache after the first decode.
import time
for i in (0, len(ds) // 2, len(ds) - 1):
    item = ds[i]
    n_coords = item['coords'].shape[0]
    print(f' ds[{i}]: n_coords={n_coords}, target.shape={tuple(item["target"].shape)}')
print()
print('Cache after 3 distinct fetches:')
print(f' cache size: {len(ds._tile_cache)}')
print(f' cache max: {ds._tile_cache.maxsize}')
ds[0]: n_coords=68, target.shape=(68,)
ds[5]: n_coords=179, target.shape=(179,)
ds[9]: n_coords=148, target.shape=(148,)
Cache after 3 distinct fetches:
cache size: 1
cache max: 256
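The "find the tiles that overlap the requested window" step can be illustrated with a plain rectangle-intersection test. This is a sketch under assumptions: the bounding-box layout (half-open row and column ranges), the `TileBBox` type, and the `overlaps` helper are hypothetical and may not match how `GzcmDataset` stores its tile index.

```python
from typing import NamedTuple


class TileBBox(NamedTuple):
    """Hypothetical half-open bounding box in bin coordinates."""
    row_start: int
    row_end: int  # exclusive
    col_start: int
    col_end: int  # exclusive


def overlaps(tile, window):
    """True when the two half-open rectangles share at least one cell."""
    return (
        tile.row_start < window.row_end
        and window.row_start < tile.row_end
        and tile.col_start < window.col_end
        and window.col_start < tile.col_end
    )


tile_index = {
    0: TileBBox(0, 256, 0, 256),
    1: TileBBox(0, 256, 256, 512),
    2: TileBBox(256, 512, 256, 512),
}
window = TileBBox(100, 300, 300, 400)

# Only tiles whose rows AND columns intersect the window need decoding.
needed = [tid for tid, bbox in tile_index.items() if overlaps(bbox, window)]
print(needed)  # [1, 2]
```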
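## 5. Measure cold vs warm fetch time

The v2.26.0 cache design is meant to make repeated reads fast.
The first call decodes each tile; subsequent calls hit the cache.

Note that section 4 already fetched `ds[0]` on this dataset and left
its single tile in the cache, so the "cold" figure below likely
reflects first-call overhead rather than a full tile decode. Construct
a fresh `GzcmDataset` first if you want a truly cold measurement.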
def time_one(i):
    """Return the wall-clock time of one ds[i] fetch, in milliseconds."""
    t0 = time.perf_counter()
    _ = ds[i]
    return (time.perf_counter() - t0) * 1000
cold = time_one(0)
warm = [time_one(0) for _ in range(5)]
print(f'cold fetch ds[0]: {cold:.3f} ms')
print(f'warm fetch ds[0]: min={min(warm):.3f} ms, mean={sum(warm)/len(warm):.3f} ms')
print(f'speedup: {cold / max(0.001, min(warm)):.1f}x')
cold fetch ds[0]: 0.336 ms
warm fetch ds[0]: min=0.117 ms, mean=0.134 ms
speedup: 2.9x
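Timing shows that warm reads are faster, but it does not report a hit rate directly. One way to see hit and miss counts is to wrap a decode function in the standard library's `functools.lru_cache` and read `cache_info()`. This is a stand-in experiment: `decode_tile` is a hypothetical function, and the sketch does not assume that `GzcmDataset` exposes hit counters of its own.

```python
from functools import lru_cache


@lru_cache(maxsize=256)
def decode_tile(tile_id):
    """Hypothetical stand-in for an expensive tile decode."""
    return bytes(tile_id % 256 for _ in range(1024))


# Access pattern with repeats: tiles 0 and 1 are each requested again.
for tile_id in [0, 1, 0, 0, 2, 1]:
    decode_tile(tile_id)

info = decode_tile.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hits={info.hits} misses={info.misses} hit_rate={hit_rate:.2f}")
# hits=3 misses=3 hit_rate=0.50
```

For training workloads, the access pattern of the sampler matters as much as the cache size: a shuffled sampler over many tiles will see a lower hit rate than sequential windows that share tiles.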
## 6. Disable caching for memory-constrained workloads

For very large matrices, the LRU cache can use significant memory
(256 tiles × 256 KB each ≈ 64 MB at `tile_size=256`). Set
`tile_cache_size=0` to disable caching entirely; each call
re-decodes the tiles it needs.
ds_no_cache = GzcmDataset(str(out), window_size=1_000_000, tile_cache_size=0)
print(f'no-cache dataset._tile_cache: {ds_no_cache._tile_cache}')
print('no-cache reads still work:')
for i in (0, len(ds_no_cache) // 2):
    n = ds_no_cache[i]['coords'].shape[0]
    print(f' ds_no_cache[{i}]: n_coords={n}')
no-cache dataset._tile_cache: None
no-cache reads still work:
ds_no_cache[0]: n_coords=68
ds_no_cache[5]: n_coords=179
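The 64 MB figure quoted in this section follows from simple arithmetic, sketched below so you can plug in your own settings. The helper `cache_bytes` is hypothetical, and it assumes each cached tile is held as a dense `tile_size × tile_size` array of 4-byte values; sparse tiles would take less, so treat the result as an upper bound.

```python
def cache_bytes(n_tiles, tile_size, itemsize=4):
    """Upper bound on cache memory, assuming dense tiles of 4-byte values."""
    return n_tiles * tile_size * tile_size * itemsize


per_tile_kib = cache_bytes(1, tile_size=256) / 2**10
full_cache_mib = cache_bytes(256, tile_size=256) / 2**20
print(f"one tile: {per_tile_kib:.0f} KiB")             # one tile: 256 KiB
print(f"256 cached tiles: {full_cache_mib:.0f} MiB")   # 256 cached tiles: 64 MiB

# Smaller tile_cache_size values scale the bound linearly.
for n_tiles in (64, 0):
    print(n_tiles, cache_bytes(n_tiles, tile_size=256) / 2**20)
```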