# Tutorial 25: GZCM Codec Registry and the v5.1 Wire-Format Contract

GZCM's v3 and v4 storage layers are codec-agnostic: the writer asks each codec to encode a tile, then prepends an 8-byte `(rows, cols)` shape header so the decoder can reshape edge tiles. This convention breaks for any codec that already prepends its own header (LZ4 adds a 12-byte `(rows, cols, uncompressed_size)` header, for example). The v5.1 codec registry fixes this by declaring the wire format explicitly per codec, so the writer knows whether to prepend a shape header or leave the encoder's payload alone.

This tutorial walks through the registry API, the wire-format classification of the six built-in codecs, and a worked example of registering a custom codec. The example also demonstrates the registry's guard against Bug 0.3 (a writer-prepended shape header placed on top of LZ4's own header, causing `LZ4BlockError` on decode).

## Learning Objectives

* Inspect the GZCM codec registry and see which codecs are built-in
* Read a codec's `wire_format` to know whether the writer prepends a shape header
* Add a custom codec at runtime via `register_codec(...)`
* Diagnose why a codec's files fail to roundtrip (the Bug 0.3 lesson)

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 10 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing `notebooks/_synthetic_data.py` helper. No external data files are needed.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo

repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from pathlib import Path
import numpy as np

from gunz_cm.compressions import (
    WireFormat, get_codec, list_codecs, register_codec,
    UnknownCodecError,
    ZstdEncoder, ZstdDecoder, Lz4Encoder, Lz4Decoder,
)
rng = np.random.default_rng(42)

## 1. List the built-in codecs

The registry is populated at module load by the auto-registration in `gunz_cm.compressions.__init__`. Each codec is registered with three fields: encoder class, decoder class, and `wire_format` (one of `WireFormat.OPAQUE_PAYLOAD` or `WireFormat.SELF_DESCRIBING`).

codecs = list_codecs()
print('Built-in codecs:', codecs)
print()
for name in codecs:
    enc_cls, dec_cls, wf = get_codec(name)
    print(f'  {name:10s}  wire_format={wf.value:18s}  '
          f'enc={enc_cls.__name__:14s}  dec={dec_cls.__name__}')
Built-in codecs: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'lz4', 'zstd']

  bsc         wire_format=opaque              enc=BscEncoder      dec=BscDecoder
  bsc_cmc     wire_format=opaque              enc=BscCmcEncoder   dec=BscCmcDecoder
  cmc         wire_format=opaque              enc=CmcEncoder      dec=CmcDecoder
  cmc_zstd    wire_format=opaque              enc=CmcZstdEncoder  dec=CmcZstdDecoder
  lz4         wire_format=self                enc=Lz4Encoder      dec=Lz4Decoder
  zstd        wire_format=opaque              enc=ZstdEncoder     dec=ZstdDecoder

### Interpretation

Five of the six built-in codecs are `OPAQUE_PAYLOAD`: the writer prepends the 8-byte `(rows, cols)` shape header. Only LZ4 is `SELF_DESCRIBING`, because the LZ4 encoder at `gunz_cm.compressions.lz4` already prepends its own 12-byte `(rows, cols, uncompressed_size)` header. If the writer were to prepend the 8-byte header on top of LZ4, the LZ4 decoder would see garbage data and raise `LZ4BlockError`. That was Bug 0.3 in the v2.15.0 release; the registry is the architectural fix.
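The double-header failure can be reproduced without LZ4 at all. The sketch below is a toy stand-in, not the GZCM implementation: it uses `zlib` in place of LZ4, assumes little-endian 32-bit header fields, and every `toy_*` name is hypothetical. It shows a self-describing payload decoding cleanly on its own and failing once an extra 8-byte shape header is placed in front of it.

```python
import struct
import zlib

import numpy as np


def toy_self_describing_encode(tile: np.ndarray) -> bytes:
    """Toy SELF_DESCRIBING codec: 12-byte (rows, cols, size) header + body."""
    raw = tile.tobytes()
    header = struct.pack('<III', tile.shape[0], tile.shape[1], len(raw))
    return header + zlib.compress(raw)  # zlib stands in for LZ4 here


def toy_self_describing_decode(payload: bytes) -> np.ndarray:
    """Parse the codec's own 12-byte header, then decompress the body."""
    rows, cols, size = struct.unpack('<III', payload[:12])
    raw = zlib.decompress(payload[12:])
    if len(raw) != size:
        raise ValueError('uncompressed size mismatch')
    return np.frombuffer(raw, dtype=np.uint32).reshape(rows, cols)


toy_tile = np.arange(12, dtype=np.uint32).reshape(3, 4)

# Correct path: the payload is left alone and decodes cleanly.
good = toy_self_describing_encode(toy_tile)
assert np.array_equal(toy_self_describing_decode(good), toy_tile)

# Buggy path: a writer prepends its own 8-byte (rows, cols) header,
# which shifts the codec's header and body by 8 bytes.
bad = struct.pack('<ii', 3, 4) + good
try:
    toy_self_describing_decode(bad)
    double_header_failed = False
except (zlib.error, ValueError) as exc:
    double_header_failed = True
    print(f'decode failed as expected: {type(exc).__name__}')
```

The decoder has no way to know the first 8 bytes are foreign, so it parses them as part of its own header and hands a misaligned body to the decompressor. Declaring the wire format in the registry moves that knowledge to the writer, where the decision is actually made.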

## 2. Try an unknown codec name

`get_codec` raises `UnknownCodecError` (a `KeyError` subclass) for names that are not registered. The exception's `.available` attribute lists the registered codec names, which is useful for error messages.

try:
    get_codec('not_a_real_codec')
except UnknownCodecError as exc:
    print(f'Unknown codec: {exc.name!r}')
    print(f'Available: {exc.available}')
Unknown codec: 'not_a_real_codec'
Available: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'lz4', 'zstd']
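For readers wondering how an exception of this shape can be built, here is a minimal sketch. It is an assumption about the general pattern, not the actual `UnknownCodecError` source; `ToyUnknownCodecError` and `toy_lookup` are hypothetical names used only for illustration.

```python
class ToyUnknownCodecError(KeyError):
    """Hypothetical KeyError subclass that carries the valid names."""

    def __init__(self, name, available):
        super().__init__(name)
        self.name = name
        self.available = sorted(available)

    def __str__(self):
        return f'unknown codec {self.name!r}; available: {self.available}'


def toy_lookup(name, table):
    """Look up a codec entry, raising the richer error on a miss."""
    try:
        return table[name]
    except KeyError:
        raise ToyUnknownCodecError(name, table) from None


toy_table = {'zstd': object(), 'lz4': object()}

caught = None
try:
    toy_lookup('nope', toy_table)
except KeyError as exc:  # existing `except KeyError` handlers still match
    caught = exc

print(caught)
```

Subclassing `KeyError` keeps older call sites that catch `KeyError` working, while newer code can catch the specific type and read `.available` to build a helpful message.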

## 3. Register a custom codec at runtime

The registry is the single point of extension for codec support. Adding a new codec means implementing an encoder and a decoder, then registering them. The writer and reader both consult the registry, so no other code in the codec layer needs to change.

class IdentityEncoder:
    """Toy encoder: just stores the tile bytes verbatim.

    Demonstrates that a custom codec is a one-class change.
    The decoder strips the 8-byte OPAQUE_PAYLOAD shape header so
    the roundtrip works through the same path the writer takes.
    """
    def __init__(self, tile_size):
        self.tile_size = tile_size

    @property
    def wire_format(self):
        return WireFormat.OPAQUE_PAYLOAD

    def encode_tile(self, tile):
        return tile.tobytes()

    def decode_tile(self, payload, shape):
        # OPAQUE_PAYLOAD: the writer prepends an 8-byte (rows, cols)
        # header. Strip it before decoding the body.
        body = payload[8:]
        return np.frombuffer(body, dtype=np.uint32).reshape(shape)


class IdentityDecoder(IdentityEncoder):
    pass


register_codec('identity', IdentityEncoder, IdentityDecoder, WireFormat.OPAQUE_PAYLOAD)
print('After registration:', list_codecs())
After registration: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'identity', 'lz4', 'zstd']
tile = rng.integers(0, 1000, size=(256, 256), dtype=np.uint32)
enc_cls, dec_cls, wf = get_codec('identity')
enc = enc_cls(tile_size=256)
dec = dec_cls(tile_size=256)

encoded = enc.encode_tile(tile)
if wf == WireFormat.OPAQUE_PAYLOAD:
    header = np.array([256, 256], dtype=np.int32).tobytes()
    wire = header + encoded
else:
    wire = encoded

decoded = dec.decode_tile(wire, shape=(256, 256))
assert np.array_equal(decoded, tile), 'roundtrip failed'
print(f'identity roundtrip OK ({len(wire)} bytes on the wire)')
identity roundtrip OK (262152 bytes on the wire)
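The branch on `wf` in the cell above is the whole wire-format contract in miniature. As a rough sketch of how a registry and a writer-side framing helper could fit together, consider the following. This is an illustrative assumption, not the `gunz_cm.compressions` source: all `Toy*` and `toy_*` names are hypothetical, the enum values mirror the `opaque` and `self` strings printed in section 1, and the header is assumed to be two little-endian 32-bit integers.

```python
import enum
import struct
from typing import NamedTuple


class ToyWireFormat(enum.Enum):
    OPAQUE_PAYLOAD = 'opaque'    # writer prepends the shape header
    SELF_DESCRIBING = 'self'     # codec already carries its own header


class ToyCodecEntry(NamedTuple):
    encoder_cls: type
    decoder_cls: type
    wire_format: ToyWireFormat


_TOY_REGISTRY: dict[str, ToyCodecEntry] = {}


def toy_register_codec(name, encoder_cls, decoder_cls, wire_format):
    """Record one codec -> (encoder, decoder, wire_format) mapping."""
    _TOY_REGISTRY[name] = ToyCodecEntry(encoder_cls, decoder_cls, wire_format)


def toy_list_codecs():
    """Return registered names in sorted order."""
    return sorted(_TOY_REGISTRY)


def toy_frame_tile(name, payload, shape):
    """Writer-side framing: prepend the header only for opaque payloads."""
    entry = _TOY_REGISTRY[name]
    if entry.wire_format is ToyWireFormat.OPAQUE_PAYLOAD:
        return struct.pack('<ii', *shape) + payload
    return payload


# Placeholder classes; a real codec would implement encode/decode methods.
toy_register_codec('raw', object, object, ToyWireFormat.OPAQUE_PAYLOAD)
toy_register_codec('framed', object, object, ToyWireFormat.SELF_DESCRIBING)

body = b'\x00' * 16
opaque_wire = toy_frame_tile('raw', body, (2, 2))
self_wire = toy_frame_tile('framed', body, (2, 2))
print(len(opaque_wire), len(self_wire))
```

Keeping the framing decision in one helper that reads the registry means a new codec cannot silently pick up the wrong behavior: the only way to get on the wire is to declare a format first.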

## 4. Summary

* The codec registry (`gunz_cm.compressions`) is the single source of truth for codec -> (encoder, decoder, wire_format) mappings.
* `WireFormat.OPAQUE_PAYLOAD` codecs get a writer-prepended 8-byte `(rows, cols)` shape header. `WireFormat.SELF_DESCRIBING` codecs (LZ4 today, anything else in the future) do not.
* To add a custom codec, implement the Codec protocol (or follow the existing pattern) and call `register_codec(...)`.
* The wire-format contract prevents Bug 0.3 from recurring: any codec that adds its own header must declare `SELF_DESCRIBING` in the registry, and the writer will not double-prepend.

## Where to go from here

* Tutorial 26: the codec picker, covering how the per-region adaptive codec choice is made and how to retune the weights.
* Tutorial 27: v4 file writing with `convert_to_gzcm(version=4, adaptive_codec=True, ...)`.
* Tutorial 28: reading v4 files with `GzcmDataset` and the v2.26.0+ thread-safe LRU tile cache.
* Tutorial 29: wiring GZCM into a PyTorch `DataLoader` for NN training.