# Tutorial 27: Writing GZCM v4 Files with Adaptive Codec Picker

GZCM v4 is an opt-in container format that differs from v3 in two ways: (1) tiles are enumerated in anti-diagonal order instead of row-major, which keeps near-diagonal (dense) tiles in cache longer during sequential reads; (2) the writer can use the adaptive codec picker to choose a codec per region. The file wire format is backward-compatible with v3 decoders; only the metadata structure changes (v4 stores per-tile bboxes in `meta["regions"][0]["tile_bboxes"]` instead of v3’s `meta["tiles"]` dict).

This tutorial walks through the full v4 writer API: the `adaptive_codec=True` flag, the `region_layouts` parameter, and the `codec_candidates` argument that lists the picker's candidates. You will write a v4 file from synthetic data, inspect the resulting metadata, and round-trip the contacts back through the v4 reader.

## Learning Objectives

* Call `convert_to_gzcm(..., version=4, adaptive_codec=True)`
* Inspect the `meta["codec_picker"]` block the writer records
* Understand the `region_layouts` parameter and the `sparse-tiled-intra` layout
* Verify a v4 file round-trips end-to-end

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 15 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing `notebooks/_synthetic_data.py` helper. No external data files are needed.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2

import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')

Repo root: /home/adhisant/workspace/gunz-cm

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import numpy as np
import pandas as pd
import tempfile
import pathlib
from pathlib import Path
sys.path.insert(0, str(Path(ROOT) / 'notebooks'))
import _synthetic_data
from gunz_cm.compressions import get_codec, WireFormat, UnknownCodecError
from gunz_cm.io.gnz import GzcmReader, GzcmWriter
from gunz_cm.consts import DataFrameSpecs, Balancing, Backend
from gunz_cm.converters import convert_to_gzcm
from gunz_cm.datasets.gzcm import GzcmDataset, _DEFAULT_TILE_CACHE_SIZE

def _row_col_counts(mat):
    """Extract (row_ids, col_ids, counts) from a ContactMatrix or ndarray.

    GZCM's synthetic_data helper returns a ContactMatrix, not a raw
    ndarray, so we unwrap via .data when present.
    """
    arr = mat.data if hasattr(mat, 'data') else mat
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

rng = np.random.default_rng(42)

## 1. Build a synthetic v4-compatible test bed

v4 works with any source format the loaders support. To keep this tutorial self-contained, we mock the loader to return a small synthetic contact matrix; the writer doesn’t care where the COO data came from.

n_bins = 200
mat = _synthetic_data.make_synthetic_hic(n_bins=n_bins, bin_size_bp=50_000)
# Reuse the helper defined above instead of repeating the unwrap/nonzero logic inline.
row_ids, col_ids, counts = _row_col_counts(mat)
print(f'Synthetic chr1: {n_bins}x{n_bins} bins, {len(row_ids):,} contacts')

Synthetic chr1: 200x200 bins, 40,000 contacts
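The COO extraction used above can be sanity-checked on a tiny matrix. The sketch below is illustrative only: `dense_to_coo` mirrors the `np.nonzero` pattern from this tutorial, while `coo_to_dense` is a hypothetical helper (not part of gunz-cm) that rebuilds the dense array so the two can be compared.

```python
import numpy as np

def dense_to_coo(arr):
    """Return (row_ids, col_ids, counts) for the nonzero entries of a 2-D array."""
    row_ids, col_ids = np.nonzero(arr)
    counts = arr[row_ids, col_ids].astype(np.uint32)
    return row_ids, col_ids, counts

def coo_to_dense(row_ids, col_ids, counts, shape):
    """Hypothetical inverse: rebuild a dense matrix; unlisted entries stay zero."""
    out = np.zeros(shape, dtype=np.uint32)
    out[row_ids, col_ids] = counts
    return out

toy = np.array([[3, 0, 1],
                [0, 5, 0],
                [1, 0, 2]], dtype=np.uint32)

rows, cols, vals = dense_to_coo(toy)
rebuilt = coo_to_dense(rows, cols, vals, toy.shape)
print(len(rows), np.array_equal(toy, rebuilt))  # 5 True
```

Note that `astype(np.uint32)` truncates any fractional part, so a comparison like this is only exact when the source matrix already holds non-negative integer counts.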

## 2. Write a v4 file with the adaptive codec picker

`convert_to_gzcm(..., version=4, adaptive_codec=True)` calls the codec picker to choose a codec per region (chromosome) based on a 5% sample. The default candidates are `('cmc', 'zstd-3', 'lz4-hc-9')`; we narrow to just `zstd-3` here so the picker always returns it.
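To make the idea of a sample-based picker concrete, here is a minimal sketch of one possible strategy: compress a 5% sample with every candidate and keep the one with the smallest output. This is an assumption about how such a picker could work, not the gunz-cm implementation, which may weigh other factors such as speed. The candidates are standard-library stand-ins (`zlib`, `bz2`, `lzma`) rather than the real `cmc`, `zstd-3`, and `lz4-hc-9` codecs, and `pick_codec` is a hypothetical name.

```python
import bz2
import lzma
import random
import zlib

# Stand-in codecs from the standard library. These are NOT the gunz-cm
# candidates; they only illustrate the selection logic.
CANDIDATES = {
    "zlib-6": lambda b: zlib.compress(b, 6),
    "bz2-9": lambda b: bz2.compress(b, 9),
    "lzma": lzma.compress,
}

def pick_codec(chunks, candidates=CANDIDATES, sample_frac=0.05, seed=0):
    """Compress a random sample of chunks with each candidate.

    Returns (name_of_smallest, {name: compressed_size}).
    """
    if not chunks:
        raise ValueError("need at least one chunk to sample from")
    rng = random.Random(seed)
    k = max(1, int(len(chunks) * sample_frac))  # 5% sample, at least one chunk
    sample = b"".join(rng.sample(chunks, k))
    sizes = {name: len(fn(sample)) for name, fn in candidates.items()}
    chosen = min(sizes, key=sizes.get)
    return chosen, sizes

# 200 small byte chunks standing in for serialized tiles.
chunks = [bytes([i % 7]) * 64 for i in range(200)]
chosen, sizes = pick_codec(chunks)
print(chosen, sizes)

# With a single candidate the choice is forced, as in this tutorial.
only, _ = pick_codec(chunks, {"zlib-6": CANDIDATES["zlib-6"]})
print(only)  # zlib-6
```

Narrowing the candidate set to one entry, as the tutorial does with `codec_candidates=('zstd-3',)`, makes the result deterministic regardless of the data.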

df = pd.DataFrame({
    DataFrameSpecs.ROW_IDS: row_ids,
    DataFrameSpecs.COL_IDS: col_ids,
    DataFrameSpecs.COUNTS: counts,
})

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')

    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df

    try:
        convert_to_gzcm(
            fpath=fake_hic,
            output_fpath=out,
            region1='chr1',
            bin_size_bp=50_000,
            version=4,
            tile_size=256,
            compression='zstd',
            overwrite=True,
            adaptive_codec=True,
            codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original

    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f'  layout: {saved["layout"]}')
    print(f'  version_gzcm: {saved["version_gzcm"]}')
    print(f'  n_tiles: {saved["n_tiles"]}')
    print(f'  codec_picker: {saved["codec_picker"]}')

  layout: sparse-tiled-intra
  version_gzcm: 4
  n_tiles: 1
  codec_picker: {'adaptive': True, 'candidates': ['zstd-3'], 'chosen': 'zstd-3', 'writer_codec': 'zstd'}
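The manual save/replace/restore around `loaders.load_cm_data` can also be written with the standard library's `unittest.mock.patch.object`, which restores the original attribute on exit even if the body raises. The sketch below uses a `SimpleNamespace` as a hypothetical stand-in for the `gunz_cm.loaders` module so that it runs without gunz-cm installed.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the `gunz_cm.loaders` module.
fake_loaders = SimpleNamespace(load_cm_data=lambda *a, **kw: "real data")

# Inside the block, the attribute is replaced by the synthetic loader.
with mock.patch.object(fake_loaders, "load_cm_data", lambda *a, **kw: "synthetic df"):
    inside = fake_loaders.load_cm_data("fake.hic")

# After the block, the original function is back in place.
after = fake_loaders.load_cm_data("fake.hic")
print(inside, "|", after)  # synthetic df | real data
```

The explicit `try`/`finally` used in this tutorial is equivalent; the context manager mainly removes the repeated boilerplate when the same patch is applied in several cells.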

## 3. Inspect the per-tile bboxes

v4 stores per-tile bounding boxes in `meta["regions"][0]["tile_bboxes"]` (a list of dicts) instead of v3’s `meta["tiles"]` (a dict). Each bbox has `tile_name`, `row_start`, `col_start`, `row_end`, `col_end`, and `diagonal`. The `diagonal` field is the v4-specific addition: tiles are written in increasing `diagonal` order, which keeps near-diagonal (dense) tiles adjacent on disk for better cache locality.

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            adaptive_codec=True, codec_candidates=('zstd-3',),
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    bboxes = saved['regions'][0]['tile_bboxes']
    print(f'  {len(bboxes)} tiles written')
    print('  first 3 bboxes:')
    for bb in bboxes[:3]:
        print(f'    {bb}')

  1 tiles written
  first 3 bboxes:
    {'col_end': 200, 'col_start': 0, 'diagonal': 0, 'row_end': 200, 'row_start': 0, 'tile_name': 'tile_0'}
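With only one tile, the ordering is not visible in the output above. The following sketch shows what a diagonal-first enumeration could look like for a larger matrix. It rests on several assumptions that the tutorial does not confirm: that `diagonal` is the tile's offset from the main diagonal (`col_tile - row_tile`), that only upper-triangle tiles are stored, and that tiles are named `tile_<index>` in write order. `enumerate_tiles` is a hypothetical function, not the gunz-cm writer.

```python
def enumerate_tiles(n_bins, tile_size):
    """List tile bboxes in increasing diagonal offset (assumed v4 order)."""
    n = -(-n_bins // tile_size)  # number of tiles per axis (ceiling division)
    bboxes = []
    for d in range(n):          # d = assumed `diagonal`: offset from the main diagonal
        for r in range(n - d):  # walk along that diagonal
            c = r + d
            bboxes.append({
                "tile_name": f"tile_{len(bboxes)}",  # assumed naming scheme
                "row_start": r * tile_size,
                "col_start": c * tile_size,
                "row_end": min((r + 1) * tile_size, n_bins),
                "col_end": min((c + 1) * tile_size, n_bins),
                "diagonal": d,
            })
    return bboxes

# 200 bins with tile_size=256 gives the single tile seen in the output above.
print(enumerate_tiles(200, 256))

# 600 bins gives a 3x3 tile grid; the upper triangle is visited diagonal by diagonal.
print([bb["diagonal"] for bb in enumerate_tiles(600, 256)])  # [0, 0, 0, 1, 1, 2]
```

Under these assumptions the three main-diagonal tiles, which hold most Hi-C contacts, are written first and sit next to each other on disk.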

## 4. `region_layouts` and the v4 layout vocabulary

`region_layouts` is a dict mapping chromosome name to a layout identifier. v4 only writes one layout (`sparse-tiled-intra`); future versions will add `sparse-roaring` for inter-chromosomal regions. The v4.1 conversion stores the layout under the region’s `layout` key, which the reader normalizes to the same internal shape as v3’s `meta["tiles"]`.

# The default region_layouts (when None) is:
#   {region1.split(":")[0]: layout or 'sparse-tiled-intra'}
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'tutorial_v4_write.gzcm'
    fake_hic = pathlib.Path(tmp) / 'fake.hic'
    fake_hic.write_bytes(b'')
    from gunz_cm import loaders
    original = loaders.load_cm_data
    loaders.load_cm_data = lambda *a, **kw: df
    try:
        convert_to_gzcm(
            fpath=fake_hic, output_fpath=out, region1='chr1',
            bin_size_bp=50_000, version=4, tile_size=256,
            compression='zstd', overwrite=True,
            layout='sparse-tiled-intra',
        )
    finally:
        loaders.load_cm_data = original
    reader = GzcmReader(out)
    saved = reader.get_metadata()
    print(f'  layout: {saved["layout"]}')
    print(f'  regions[0][layout]: {saved["regions"][0]["layout"]}')

  layout: sparse-tiled-intra
  regions[0][layout]: sparse-tiled-intra
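The default described in the code comment above can be written out as a small function. This is a sketch of that one expression only; `default_region_layouts` is a hypothetical name, and the `chr2:1000-2000` region string is an assumed format implied by the `split(":")` call.

```python
def default_region_layouts(region1, layout=None):
    """Build the default mapping used when region_layouts is None."""
    chrom = region1.split(":")[0]  # drop any ":start-end" suffix
    return {chrom: layout or "sparse-tiled-intra"}

print(default_region_layouts("chr1"))            # {'chr1': 'sparse-tiled-intra'}
print(default_region_layouts("chr2:1000-2000"))  # {'chr2': 'sparse-tiled-intra'}
```

Because of the `layout or ...` fallback, passing `layout=None` or an empty string both resolve to `sparse-tiled-intra`.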

## 5. Summary

* `convert_to_gzcm(..., version=4, adaptive_codec=True)` produces a v4 file with the picker choosing a codec per region.
* The writer records the picker’s decision in `meta["codec_picker"]` and the per-tile bboxes (including `diagonal`) in `meta["regions"][0]["tile_bboxes"]`.
* `region_layouts` and `layout` set the v4 layout vocabulary.
* Custom codecs (Tutorial 25) plug into the picker by being registered in the registry.

## Where to go from here

* Tutorial 28: read this v4 file with `GzcmDataset`.
* Tutorial 25: codec registry and the v5.1 wire-format contract.