GZCM v3 → v4 migration guide#
date: 2026-06-25
target_audience: gunz-cm users who currently produce or consume v3 .gzcm files
TL;DR#
v3 files remain readable indefinitely. No action required.
v4 is opt-in via convert_to_gzcm(..., version=4).
v4 produces smaller files (projected ~30-50% smaller than v3 zstd on chr1 of 4DNFI1UEG1HD) at comparable or better read speed.
New defaults: v4.0 will bump convert_to_gzcm’s default version from 1 (current; unhelpfully produces uncompressed dense) to 3 (current compressed-tile). v4 itself remains opt-in.
What changes for users#
Reading files (transparent)#
from gunz_cm.datasets import GzcmTileDataset
# v3 file (existing) — works unchanged.
ds_v3 = GzcmTileDataset("chr1_v3_zstd.gzcm", window_size=1_000_000)
# v4 file (new) — same constructor, same API.
ds_v4 = GzcmTileDataset("chr1_v4.gzcm", window_size=1_000_000)
The reader dispatches on the header’s version field. v3 files use the
existing GzcmReader path (unchanged). v4 files use the new GzcmV4Reader.
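The version-based dispatch described above can be sketched as a lookup table. This is a minimal illustration and not the library's implementation: the stub classes only borrow the names GzcmReader and GzcmV4Reader from this guide, while READERS and open_reader are hypothetical helpers, and the header is assumed to be an already-parsed dict.

```python
from typing import Any, Callable, Dict


# Hypothetical stand-ins for the real reader classes; only the class
# names come from this guide.
class GzcmReader:
    def __init__(self, path: str) -> None:
        self.path = path


class GzcmV4Reader:
    def __init__(self, path: str) -> None:
        self.path = path


# Map each on-disk format version to the reader that handles it.
# Only the two versions discussed in this section are listed.
READERS: Dict[int, Callable[[str], Any]] = {
    3: GzcmReader,
    4: GzcmV4Reader,
}


def open_reader(path: str, header: Dict[str, Any]) -> Any:
    """Pick a reader from the header's version field (illustrative only)."""
    version = header["version"]
    try:
        reader_cls = READERS[version]
    except KeyError:
        raise ValueError(f"unsupported GZCM version: {version!r}") from None
    return reader_cls(path)
```

Raising on an unknown version, rather than falling back to an older reader, keeps a newer file from being decoded with the wrong layout.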
Writing files (opt-in)#
from gunz_cm.converters.gzcm import convert_to_gzcm
# v3 (current behavior).
convert_to_gzcm("in.hic", "out_v3.gzcm", region1="1", bin_size_bp=50_000, version=3)
# v4 (new).
convert_to_gzcm(
    "in.hic", "out_v4.gzcm",
    region1="1", bin_size_bp=50_000,
    version=4,                              # NEW flag
    chunk_size=10_000_000,                  # NEW: memory-bounded conversion
    enable_region_layout_picker=True,       # NEW: adaptive scheme picker
    enable_anti_diagonal_layout=True,       # NEW: .hic-style for intra
    enable_roaring_row_index=True,          # NEW: sparse-roaring regions
    codec_picker_weights=(0.5, 0.3, 0.2),   # NEW: (decode, size, encode) weights
)
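One way to read the codec_picker_weights tuple is as a weighted sum over per-codec costs. The sketch below is an assumption about how such a picker could work, not a description of gunz-cm internals: pick_codec is a hypothetical function, and the cost values are made-up placeholders (normalised so that lower is better), not measurements.

```python
from typing import Dict, Tuple

# (decode, size, encode) in the same order as codec_picker_weights.
Costs = Tuple[float, float, float]


def pick_codec(costs: Dict[str, Costs], weights: Costs = (0.5, 0.3, 0.2)) -> str:
    """Return the codec with the lowest weighted cost (hypothetical rule)."""

    def score(name: str) -> float:
        return sum(w * c for w, c in zip(weights, costs[name]))

    return min(costs, key=score)


# Placeholder costs for illustration only; not benchmark results.
example_costs: Dict[str, Costs] = {
    "zstd-3": (0.2, 0.6, 0.3),  # cheap to decode, mid-size output
    "cmc": (0.9, 0.1, 0.8),     # small output, expensive to decode
}

print(pick_codec(example_costs))                   # decode-heavy weights
print(pick_codec(example_costs, (0.1, 0.8, 0.1)))  # size-heavy weights
```

With these placeholder costs, the default decode-heavy weights favour zstd-3, while size-heavy weights favour cmc.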
Default behavior change (v2.15.0): convert_to_gzcm’s version default
goes from 1 to 3. Rationale: version=1 produces 47 MB files (uncompressed
dense) for typical Hi-C data; version=3 produces 9-16 MB (compressed tiles)
and has been the de facto default in benchmarks. This is a behavior change
that affects any user relying on the default — see “Mitigation” below.
Mitigation for the default-version bump#
Two options, both available:
(a) Explicit version in your pipeline (recommended):
# Before upgrade (relies on default):
convert_to_gzcm("in.hic", "out.gzcm", region1="1", bin_size_bp=50_000)
# After upgrade (explicit v3, same behavior):
convert_to_gzcm("in.hic", "out.gzcm", region1="1", bin_size_bp=50_000, version=3)
(b) Environment variable override (if you can’t audit all call sites):
export GUNZ_CM_GZCM_DEFAULT_VERSION=1 # restores old default
We will emit a DeprecationWarning if version is not specified at the
call site, and the warning will become a FutureWarning in v2.17.0 before
the default is actually bumped. Plan for this.
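The interaction between an explicit version, the environment override, and the warning can be sketched as follows. This is a hedged illustration: resolve_version and LIBRARY_DEFAULT are hypothetical, the precedence (explicit argument, then environment variable, then library default) is an assumption, and only the environment variable name and the warning category come from this guide.

```python
import os
import warnings
from typing import Optional

ENV_VAR = "GUNZ_CM_GZCM_DEFAULT_VERSION"  # name taken from this guide
LIBRARY_DEFAULT = 3  # assumed default after the bump


def resolve_version(version: Optional[int] = None) -> int:
    """Return the GZCM version to write, warning if the caller omitted it."""
    if version is not None:
        return version
    warnings.warn(
        "convert_to_gzcm called without an explicit version; "
        "the default is changing from 1 to 3",
        DeprecationWarning,
        stacklevel=2,
    )
    override = os.environ.get(ENV_VAR)
    return int(override) if override else LIBRARY_DEFAULT
```

To find call sites that rely on the default, one option is to run the test suite with `python -W error::DeprecationWarning`, which turns each such warning into an exception with a traceback.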
What changes for downstream tools#
hictk users#
GZCM v4 is not a .cool file. To interoperate with .cool-based tooling
(cooler, cooltools, higlass), convert v4 → v3 first:
# gunz-cm roundtrip: v4 → v3
ds_v4 = GzcmTileDataset("chr1_v4.gzcm", window_size=...)
# Read all regions, write to v3
convert_to_gzcm("chr1_v4.gzcm", "chr1_v3_for_cooltools.gzcm", region1="1", version=3)
Future: a gunz-cm dump-to-cool CLI command that uses tiledbsoma or
cooler.create to write .cool files directly from GZCM v4 reads.
(Not in v4.0 scope.)
TileDB-SOMA users#
Not yet applicable. GZCM v4 is forward-compatible with a future TileDB-SOMA
backend (see specs/gzcm-v4-design.md §13) but no direct bridge ships in v4.0.
HDF5 / cooler users#
No change. v3 GZCM files remain readable. v4 files require conversion to
v3 (or to .cool via cooler CLI) for use with HDF5-based tools.
File-format compatibility matrix#
| Writer version | Reader (gunz-cm v2.13.x) | Reader (gunz-cm v2.14.x) | Reader (gunz-cm v2.15+) |
|---|---|---|---|
| v1, v2, v3 | reads OK | reads OK | reads OK |
| v4 | ERROR: unknown version, defaults to v1 (silent corruption risk) | ERROR: same | reads OK via new GzcmV4Reader |
Important: gunz-cm versions < v2.15.0 (the version that ships GZCM v4 support)
will silently misread v4 files as v1 because GzcmReader._parse_header
defaults to version = header.get("version", 1) if the version key is missing
(src/gunz_cm/io/gnz.py:592).
Mitigation: v4.0 always writes "version": 4 in the JSON header. If you write
v4 files, do not downgrade the gunz-cm reader to a version < v2.15.0.
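Pipelines that cannot control which reader version is installed may want an application-side guard before decoding. The sketch below contrasts the lenient fallback quoted above with a strict check. Both functions are hypothetical helpers written for this guide, and they assume the JSON header has already been parsed into a dict.

```python
from typing import Any, Dict, FrozenSet


def lenient_version(header: Dict[str, Any]) -> int:
    # Mirrors the fallback quoted above: a missing key silently becomes 1.
    return header.get("version", 1)


def strict_version(header: Dict[str, Any], supported: FrozenSet[int]) -> int:
    """Raise instead of guessing when the version is missing or unknown."""
    if "version" not in header:
        raise ValueError("GZCM header has no 'version' key")
    version = header["version"]
    if version not in supported:
        raise ValueError(
            f"GZCM version {version!r} is not supported by this reader"
        )
    return version
```

For example, calling strict_version(header, frozenset({1, 2, 3})) in a pre-v4 environment would reject a v4 file up front instead of letting it reach a decoder that does not understand it.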
v3 → v4 metadata key differences#
Verified against a real v3 file (4DNFI1UEG1HD chr1 @ 50 kb, KR, zstd):
v3 metadata keys (top-level metadata dict):
balancing, chromosome1, compression, n_tiles, original_shape,
padded_shape, region, resolution, source_file, tile_size,
tiles, version, version_gzcm
v4 metadata keys (added/replaced vs v3):
All v3 keys are preserved (backward compatibility).
metadata["regions"] is added: list of per-region descriptors (see spec §4.3 for full shape).
metadata["tiles"] (v3 per-tile bbox dict) is superseded by metadata["regions"][i]["tile_bboxes"] in v4. v4 readers still parse tiles if present (legacy compat).
NEW weights_KR, weights_VC arrays in arrays: v4 stores balancing weight vectors in the file. v3 did not (weights were fetched on-demand from the source .hic).
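Code that reads metadata directly can handle both layouts by preferring the v4 location and falling back to the v3 one. The helper below is a hypothetical sketch: region_tile_bboxes is not part of gunz-cm, and the bounding-box values are treated as opaque because their exact shape is defined in the spec rather than in this guide.

```python
from typing import Any, Dict


def region_tile_bboxes(metadata: Dict[str, Any], region_index: int = 0) -> Any:
    """Return tile bounding boxes, preferring the v4 per-region location."""
    regions = metadata.get("regions") or []
    if region_index < len(regions) and "tile_bboxes" in regions[region_index]:
        # v4: bounding boxes live on each region descriptor.
        return regions[region_index]["tile_bboxes"]
    # v3, or a v4 file that still carries the legacy key.
    return metadata["tiles"]
```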
v4 regions list example (chrom 1 only, KR):
"regions": [
  {
    "id": 0,
    "name": "chr1:chr1",
    "layout": "sparse-tiled-intra",
    "n_tiles": 16,
    "tile_size": 256,
    "codec_per_tile": ["zstd-3", "zstd-3", "cmc", "cmc", ...]
  }
]
For inter-chromosomal regions (e.g., chr1:chr2), the layout is typically sparse-roaring. For dense regions (>50% density), it’s dense with blosc-lz4 codec.
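The layout rules of thumb above can be written as a small decision function. This is only a sketch under stated assumptions: pick_layout is hypothetical, the 50% density threshold and the layout names come from this guide, and letting the density check take precedence over the intra/inter distinction is a guess about ordering.

```python
def pick_layout(is_intra: bool, density: float) -> str:
    """Choose a region layout name from density and region type (illustrative)."""
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be a fraction between 0 and 1")
    if density > 0.5:
        # Dense regions: the guide pairs this layout with the blosc-lz4 codec.
        return "dense"
    # Sparse regions: intra-chromosomal vs inter-chromosomal.
    return "sparse-tiled-intra" if is_intra else "sparse-roaring"
```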
Storage cost comparison (preliminary benchmarks)#
Tested on 4DNFI1UEG1HD.hic chr1 at 50 kb resolution, 1 Mb window
(window_size = 1,000,000):
| Format | File size | Single-thread throughput | Notes |
|---|---|---|---|
| GZCM v1 dense | 47 MB | 11,926 items/s | Uncompressed baseline |
| GZCM v3 cmc | 9.0 MB | 87 items/s | Slowest decode, smallest file |
| GZCM v3 zstd | 16 MB | 4,467 items/s | v2.14.0 default fallback codec |
| GZCM v4 (predicted) | 6-10 MB | 5,000-7,000 items/s | BTRBlocks adaptive + Roaring + simdcomp |
The v4 figures are projections based on the v2.14.0 benchmark at
benchmarks/results/tile_dataset_benchmark_v2.14.0.md. Actual v4 numbers
will be measured by the new benchmarks/datasets/benchmark_gzcm_v4_vs_v3.py
when v4 ships.
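As a quick sanity check, the projected v4 sizes can be compared with the v3 zstd row and with the 70% size target in the pre-flight checklist. The arithmetic below uses only the figures from the table; the names are illustrative.

```python
V3_ZSTD_MB = 16.0               # GZCM v3 zstd row
V4_PROJECTED_MB = (6.0, 10.0)   # GZCM v4 (predicted) row, low and high end
TARGET_RATIO = 0.70             # checklist target: v4 size <= 70% of v3 zstd


def size_ratio(v4_mb: float, v3_mb: float) -> float:
    """Return the v4 size as a fraction of the v3 size."""
    return v4_mb / v3_mb


for v4_mb in V4_PROJECTED_MB:
    ratio = size_ratio(v4_mb, V3_ZSTD_MB)
    print(
        f"{v4_mb:.0f} MB -> {ratio:.1%} of v3 zstd, "
        f"{1 - ratio:.1%} smaller, meets target: {ratio <= TARGET_RATIO}"
    )
```

The projected range works out to 37.5% to 62.5% of the v3 zstd size, so both ends of the projection would meet the 70% target if the measured numbers match.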
When to migrate to v4#
Migrate now if you:
Store large Hi-C matrices (>10M pixels per region).
Run many training epochs over the same file (decode speed matters).
Need sparse row index access (Roaring indexes are 2-4× smaller than uint32 arrays).
Hit the v3 zstd/bsc decode speed ceiling.
Stay on v3 if you:
Have existing production pipelines depending on v3 file format.
Have downstream tools that read GZCM directly (rare — most use cooler/h5).
Are not CPU-bound on data loading (v3 is already fast enough for most workloads).
Defer the decision if you:
Have small data (<1M pixels per file).
Don’t iterate frequently over the same file.
Are still evaluating gunz-cm for adoption.
Pre-flight checklist for v4.0 release#
Before tagging v2.15.0:
[ ] GzcmV4Reader and GzcmV4Writer implemented.
[ ] _write_gzcm_v4 and _convert_chunked_v4 implemented.
[ ] convert_to_gzcm(..., version=4) smoke-tested on 4DNFI1UEG1HD chr1.
[ ] GzcmTileDataset reads v3 + v4 transparently.
[ ] benchmark_gzcm_v4_vs_v3.py reports v4 file size ≤ 70% of v3 zstd.
[ ] All v3 regression tests pass unchanged.
[ ] CHANGELOG entry under v2.15.0 with migration note.
[ ] docs/source/user_guide/datasets.md updated with v4 examples.
[ ] Deprecation warning fires on convert_to_gzcm without explicit version argument.
This guide is a living document. As v4 ships and we get user feedback, this guide will be updated with concrete migration examples.