# Tutorial 25: GZCM Codec Registry and the v5.1 Wire-Format Contract

GZCM’s v3 and v4 storage layers are codec-agnostic: the writer asks each codec to encode a tile, then prepends an 8-byte (rows, cols) shape header so the decoder can reshape edge tiles. This convention breaks for any codec that already prepends its own header (LZ4 adds a 12-byte (rows, cols, uncompressed_size) header, for example). The v5.1 codec registry fixes this by declaring the wire format explicitly per codec, so the writer knows whether to prepend a shape header or leave the encoder’s payload alone.

This tutorial walks through the registry API, the wire-format classification of the six built-in codecs, and a worked example of registering a custom codec. The example also demonstrates the registry’s guard against Bug 0.3 (writer-prepended shape header over LZ4’s own header, causing LZ4BlockError on decode).

## Learning Objectives

* Inspect the GZCM codec registry and see which codecs are built-in
* Read a codec’s `wire_format` to know whether the writer prepends a shape header
* Add a custom codec at runtime via `register_codec(...)`
* Diagnose why a codec’s files fail to roundtrip (the Bug 0.3 lesson)

## Prerequisites

* gunz-cm installed: `pip install gunz-cm`
* Python 3.11+
* Familiarity with NumPy

## Estimated Time

⏱️ 10 minutes

## Data

This tutorial uses a small synthetic Hi-C matrix generated inline via the existing `notebooks/_synthetic_data.py` helper. No external data files are needed.

---

# Set auto-reload so the notebook picks up code changes without restarting the kernel.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from pathlib import Path
import numpy as np
from gunz_cm.compressions import (
    WireFormat, get_codec, list_codecs, register_codec,
    UnknownCodecError,
    ZstdEncoder, ZstdDecoder, Lz4Encoder, Lz4Decoder,
)
rng = np.random.default_rng(42)

## 1. List the built-in codecs

The registry is populated at module load by the auto-registration in `gunz_cm.compressions.__init__`. Each codec is registered with three fields: encoder class, decoder class, and `wire_format` (one of `WireFormat.OPAQUE_PAYLOAD` or `WireFormat.SELF_DESCRIBING`).

codecs = list_codecs()
print('Built-in codecs:', codecs)
print()
for name in codecs:
    enc_cls, dec_cls, wf = get_codec(name)
    print(f' {name:10s} wire_format={wf.value:18s} '
          f'enc={enc_cls.__name__:14s} dec={dec_cls.__name__}')
Built-in codecs: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'lz4', 'zstd']
bsc wire_format=opaque enc=BscEncoder dec=BscDecoder
bsc_cmc wire_format=opaque enc=BscCmcEncoder dec=BscCmcDecoder
cmc wire_format=opaque enc=CmcEncoder dec=CmcDecoder
cmc_zstd wire_format=opaque enc=CmcZstdEncoder dec=CmcZstdDecoder
lz4 wire_format=self enc=Lz4Encoder dec=Lz4Decoder
zstd wire_format=opaque enc=ZstdEncoder dec=ZstdDecoder

### Interpretation

Five of the six built-in codecs are `OPAQUE_PAYLOAD`: the writer prepends the 8-byte (rows, cols) shape header. Only LZ4 is `SELF_DESCRIBING`, because the LZ4 encoder at `gunz_cm.compressions.lz4` already prepends its own 12-byte (rows, cols, uncompressed_size) header. If the writer were to prepend the 8-byte header on top of LZ4, the LZ4 decoder would see garbage data and raise LZ4BlockError. That was Bug 0.3 in the v2.15.0 release; the registry is the architectural fix.

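The double-header failure can be reproduced without LZ4 installed. The sketch below is a toy, not GZCM's code: `toy_encode` and `toy_decode` are hypothetical stand-ins that use `zlib` in place of LZ4, and the little-endian int32 header layouts are assumptions made for illustration. It only aims to show why a decoder that expects its own header at offset 0 cannot survive an extra 8 bytes in front of it.

```python
import struct
import zlib

# Assumed layouts (illustrative only, not confirmed against GZCM):
SHAPE_HEADER = struct.Struct("<ii")   # writer header: (rows, cols), 8 bytes
SELF_HEADER = struct.Struct("<iii")   # codec header: (rows, cols, uncompressed_size), 12 bytes


def toy_encode(raw: bytes, rows: int, cols: int) -> bytes:
    """Hypothetical self-describing encoder: own 12-byte header + compressed body."""
    return SELF_HEADER.pack(rows, cols, len(raw)) + zlib.compress(raw)


def toy_decode(payload: bytes) -> tuple[int, int, bytes]:
    """Hypothetical decoder: expects its own header at offset 0."""
    rows, cols, size = SELF_HEADER.unpack_from(payload, 0)
    raw = zlib.decompress(payload[SELF_HEADER.size:])
    if len(raw) != size:
        raise ValueError("uncompressed size mismatch")
    return rows, cols, raw


raw = bytes(range(256)) * 4           # stand-in for a 16 x 16 tile of 4-byte values
good = toy_encode(raw, 16, 16)

# Correct path for a self-describing codec: the payload is left alone.
assert toy_decode(good) == (16, 16, raw)

# Bug 0.3 pattern: the writer prepends its shape header anyway, so the decoder
# reads the wrong 12 bytes as its header and the compressed body starts
# 8 bytes later than it expects.
bad = SHAPE_HEADER.pack(16, 16) + good
try:
    toy_decode(bad)
    double_header_failed = False
except (zlib.error, ValueError):
    double_header_failed = True

print("double header rejected:", double_header_failed)
```

In this toy the failure surfaces as `zlib.error`; with the real LZ4 codec the tutorial reports the analogous `LZ4BlockError`.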

## 2. Try an unknown codec name

`get_codec` raises `UnknownCodecError` (a `KeyError` subclass) for names that are not registered. The exception’s `.available` attribute lists the registered codec names, which is useful for error messages.

try:
    get_codec('not_a_real_codec')
except UnknownCodecError as exc:
    print(f'Unknown codec: {exc.name!r}')
    print(f'Available: {exc.available}')
Unknown codec: 'not_a_real_codec'
Available: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'lz4', 'zstd']
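
To make the lookup behaviour above concrete, here is a minimal sketch of how a registry with this shape could be built. It is not the `gunz_cm` implementation: `ToyRegistry`, `ToyUnknownCodecError`, and `ToyWireFormat` are hypothetical names chosen so they do not shadow the real imports, and the enum values are taken from the `wire_format=` column printed in section 1.

```python
from enum import Enum


class ToyWireFormat(Enum):
    # Values mirror the strings printed in the section 1 output.
    OPAQUE_PAYLOAD = "opaque"
    SELF_DESCRIBING = "self"


class ToyUnknownCodecError(KeyError):
    """Hypothetical error type: a KeyError that also carries the known names."""

    def __init__(self, name, available):
        super().__init__(name)
        self.name = name
        self.available = available


class ToyRegistry:
    """Hypothetical registry: name -> (encoder class, decoder class, wire format)."""

    def __init__(self):
        self._codecs = {}

    def register(self, name, encoder_cls, decoder_cls, wire_format):
        self._codecs[name] = (encoder_cls, decoder_cls, wire_format)

    def get(self, name):
        try:
            return self._codecs[name]
        except KeyError:
            raise ToyUnknownCodecError(name, self.names()) from None

    def names(self):
        return sorted(self._codecs)


registry = ToyRegistry()
# `object` is a placeholder for real encoder/decoder classes.
registry.register("zstd", object, object, ToyWireFormat.OPAQUE_PAYLOAD)
registry.register("lz4", object, object, ToyWireFormat.SELF_DESCRIBING)

try:
    registry.get("not_a_real_codec")
except KeyError as exc:  # a plain KeyError handler also catches the subclass
    caught = exc

print(type(caught).__name__, caught.name, caught.available)
```

Subclassing `KeyError` means callers that already guard dictionary-style lookups keep working, while callers that want a helpful message can read `.name` and `.available`.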

## 3. Register a custom codec at runtime

The registry is the single point of extension for codec support. Adding a new codec means implementing an encoder and a decoder, then registering them. The writer and reader both consult the registry, so no other code in the codec layer needs to change.

class IdentityEncoder:
    """Toy encoder: just stores the tile bytes verbatim.

    Demonstrates that a custom codec is a one-class change.
    The decoder strips the 8-byte OPAQUE_PAYLOAD shape header so
    the roundtrip works through the same path the writer takes.
    """

    def __init__(self, tile_size):
        self.tile_size = tile_size

    @property
    def wire_format(self):
        return WireFormat.OPAQUE_PAYLOAD

    def encode_tile(self, tile):
        return tile.tobytes()

    def decode_tile(self, payload, shape):
        # OPAQUE_PAYLOAD: the writer prepends an 8-byte (rows, cols)
        # header. Strip it before decoding the body.
        body = payload[8:]
        return np.frombuffer(body, dtype=np.uint32).reshape(shape)


class IdentityDecoder(IdentityEncoder):
    pass

register_codec('identity', IdentityEncoder, IdentityDecoder, WireFormat.OPAQUE_PAYLOAD)
print('After registration:', list_codecs())
After registration: ['bsc', 'bsc_cmc', 'cmc', 'cmc_zstd', 'identity', 'lz4', 'zstd']
tile = rng.integers(0, 1000, size=(256, 256), dtype=np.uint32)
enc_cls, dec_cls, wf = get_codec('identity')
enc = enc_cls(tile_size=256)
dec = dec_cls(tile_size=256)
encoded = enc.encode_tile(tile)
if wf == WireFormat.OPAQUE_PAYLOAD:
    header = np.array([256, 256], dtype=np.int32).tobytes()
    wire = header + encoded
else:
    wire = encoded
decoded = dec.decode_tile(wire, shape=(256, 256))
assert np.array_equal(decoded, tile), 'roundtrip failed'
print(f'identity roundtrip OK ({len(wire)} bytes on the wire)')
identity roundtrip OK (262152 bytes on the wire)
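
The `if wf == WireFormat.OPAQUE_PAYLOAD` branch in the roundtrip cell is the whole wire-format contract in miniature. As a sketch, that decision can be factored into a pair of helpers. `frame_tile`, `read_shape`, and `ToyWireFormat` are hypothetical names, not part of `gunz_cm`, and the header layout (two little-endian int32 values) is an assumption that matches the `np.int32` header built above on a little-endian machine.

```python
import struct
from enum import Enum


class ToyWireFormat(Enum):
    # Local stand-in so the sketch runs without gunz_cm installed.
    OPAQUE_PAYLOAD = "opaque"
    SELF_DESCRIBING = "self"


# Assumed layout: (rows, cols) as two little-endian int32 values, 8 bytes.
SHAPE_HEADER = struct.Struct("<ii")


def frame_tile(encoded: bytes, shape: tuple[int, int], wire_format: ToyWireFormat) -> bytes:
    """Hypothetical writer-side helper: add the shape header only when needed."""
    if wire_format is ToyWireFormat.OPAQUE_PAYLOAD:
        return SHAPE_HEADER.pack(*shape) + encoded
    # SELF_DESCRIBING: the encoder already wrote its own header.
    return encoded


def read_shape(wire: bytes, wire_format: ToyWireFormat):
    """Hypothetical reader-side helper: recover the writer's shape header, if any."""
    if wire_format is ToyWireFormat.OPAQUE_PAYLOAD:
        return SHAPE_HEADER.unpack_from(wire, 0)
    # SELF_DESCRIBING: the codec's own decoder is responsible for the shape.
    return None


body = b"\x00" * 16  # stand-in for an encoded 2 x 2 tile of 4-byte values
opaque_wire = frame_tile(body, (2, 2), ToyWireFormat.OPAQUE_PAYLOAD)
self_wire = frame_tile(body, (2, 2), ToyWireFormat.SELF_DESCRIBING)

print(len(opaque_wire), read_shape(opaque_wire, ToyWireFormat.OPAQUE_PAYLOAD))
print(len(self_wire), read_shape(self_wire, ToyWireFormat.SELF_DESCRIBING))
```

Keeping the branch in one place on each side is the point of declaring the wire format per codec: a new codec states which case it belongs to once, at registration time, instead of every call site guessing.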