GZCM Format#
GZCM (GunZ Contact Matrix) is a binary container format for Hi-C contact matrices, designed for efficient storage, memory-mapped access, and optional compression.
Version Overview#
| Version | Storage Type | Compression | Use Case |
|---|---|---|---|
| v1 | Dense row-major | None | Small matrices, simple use |
| v2 | Tiled 4D blocks | None | Large matrices, streaming writes |
| v3 | Tiled + CMC | BSC, ZSTD, CMC | Maximum compression for large Hi-C data |
File Layout#
All versions share the same header structure:
┌─────────────────────────────────────┐
│ Magic: "GZCM" (4 bytes)             │
├─────────────────────────────────────┤
│ Header Length (4 bytes, uint32)     │
├─────────────────────────────────────┤
│ Header JSON (4096-byte aligned)     │
├─────────────────────────────────────┤
│ Array Data (version-dependent)      │
└─────────────────────────────────────┘
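The layout above can be read with a few lines of standard-library Python. This is a minimal sketch, not the library's reader: the function name `read_gzcm_header` is hypothetical, and it assumes the length field is a little-endian uint32 and that any alignment padding counted in that length consists of NUL bytes or whitespace. Neither detail is stated here, so check the version specifications before relying on it.

```python
import json
import struct

MAGIC = b"GZCM"


def read_gzcm_header(path):
    """Return the parsed JSON header of a GZCM file (sketch, see assumptions)."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
        if magic != MAGIC:
            raise ValueError(f"not a GZCM file: magic={magic!r}")
        # Assumption: the header length is stored as a little-endian uint32.
        (header_len,) = struct.unpack("<I", fh.read(4))
        raw = fh.read(header_len)
    # Assumption: padding, if included in the length, is NUL bytes or whitespace.
    return json.loads(raw.rstrip(b"\x00 \n").decode("utf-8"))
```

Calling `read_gzcm_header("matrix.gzcm")["arrays"]` would then give the array descriptions without touching the array data.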
Header Schema#
The JSON header contains version, metadata, and array descriptions:
{
  "version": 3,
  "metadata": {
    "resolution": 10000,
    "region": "chr1",
    "source_file": "/path/to/input.hic"
  },
  "arrays": {
    "matrix": {
      "dtype": "float32",
      "shape": [1000, 1000],
      "offset": 4096,
      "order": "C"
    }
  }
}
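As an illustration of how an array description can be used, the sketch below computes the byte range an uncompressed array would occupy from its `dtype`, `shape`, and `offset`. The helper `array_byte_range` and the `ITEMSIZE` table are hypothetical, and only `float32` is confirmed as a dtype by the example above. The calculation applies to uncompressed dense storage only, so the sample header here uses version 1; compressed v3 data would occupy fewer bytes on disk.

```python
# Bytes per element; only float32 appears in the documented example,
# float64 is included as an assumption.
ITEMSIZE = {"float32": 4, "float64": 8}


def array_byte_range(header, name):
    """Return (start, end) byte positions of an uncompressed array."""
    desc = header["arrays"][name]
    n_items = 1
    for dim in desc["shape"]:
        n_items *= dim
    start = desc["offset"]
    return start, start + n_items * ITEMSIZE[desc["dtype"]]


header = {
    "version": 1,
    "arrays": {
        "matrix": {"dtype": "float32", "shape": [1000, 1000],
                   "offset": 4096, "order": "C"}
    },
}
print(array_byte_range(header, "matrix"))  # (4096, 4004096)
```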
Version Details#
v1 — Dense Storage#
The simplest format. Matrix is stored as a flat row-major array.
Characteristics:
- No chunking: the entire matrix must fit in memory
- No compression: uses full float32 storage
- Direct memory-mapping available
When to use: Small matrices (< 100k bins) or when simplicity is preferred.
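To make the memory requirement concrete, here is a small sketch of the arithmetic behind dense row-major storage: the total size of an n x n float32 matrix, and the byte position of element (i, j). The function names are hypothetical, and the data offset of 4096 is taken from the example header rather than guaranteed for every file.

```python
def dense_nbytes(n_bins, itemsize=4):
    """Size in bytes of a dense n_bins x n_bins matrix (float32 by default)."""
    return n_bins * n_bins * itemsize


def element_offset(data_offset, n_cols, i, j, itemsize=4):
    """Byte position of element (i, j) in a row-major (C order) array."""
    return data_offset + (i * n_cols + j) * itemsize


print(dense_nbytes(1_000))               # 4000000 (about 4 MB)
print(dense_nbytes(100_000))             # 40000000000 (about 40 GB)
print(element_offset(4096, 1000, 2, 5))  # 12116
```

Because the size grows with the square of the bin count, it is worth checking the result against available memory before choosing v1.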
v2 — Tiled Storage#
Matrix is divided into fixed-size blocks (tiles) for cache-efficient access.
Characteristics:
- Streaming writes without holding the full matrix in memory
- Cache-efficient block processing
- Random block access
- Block sizes: 256, 512, or 1024
When to use: Large matrices where you want block-wise processing without loading the full matrix.
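The index arithmetic behind tiling can be sketched as follows. This shows only how a matrix coordinate maps to a tile and a position inside that tile; the order in which tiles are laid out on disk is defined by the v2 specification and is not assumed here. Both function names are hypothetical.

```python
def tiles_per_side(n_bins, block):
    """Number of tiles needed to cover n_bins along one axis (ceiling division)."""
    return -(-n_bins // block)


def locate(i, j, block):
    """Map matrix coordinate (i, j) to ((tile_row, tile_col), (local_row, local_col))."""
    tile_row, local_row = divmod(i, block)
    tile_col, local_col = divmod(j, block)
    return (tile_row, tile_col), (local_row, local_col)


print(tiles_per_side(1000, 256))  # 4
print(locate(700, 300, 256))      # ((2, 1), (188, 44))
```

With 1000 bins and a block size of 256, the last tile along each axis is only partly filled, so a reader has to know how edge tiles are padded or clipped.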
v3 — Compressed Tiles#
Tiles are independently compressed using CMC (Contact Matrix Codec).
Characteristics:
- ~10-25x compression for Hi-C data
- Random tile access (decompress a single tile)
- Supported codecs: cmc, cmc_zstd (recommended), bsc, bsc_cmc, zstd
- Lossless only
When to use: Large matrices where storage size is a concern. Recommended for most production use.
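The idea of independently compressed tiles can be illustrated with the standard library. The sketch below uses `zlib` purely as a stand-in, since CMC, BSC, and ZSTD are not part of the Python standard library; it does not reproduce the GZCM v3 on-disk layout or its codecs. Each tile is compressed on its own, so one tile can be decompressed without reading the others.

```python
import struct
import zlib


def compress_tiles(tiles):
    """Compress each tile separately. tiles maps (row, col) -> list of floats."""
    blobs = {}
    for key, values in tiles.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # float32, little-endian
        blobs[key] = zlib.compress(raw, 9)
    return blobs


def read_tile(blobs, key):
    """Decompress a single tile without touching the others."""
    raw = zlib.decompress(blobs[key])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Two mostly empty 64 x 64 tiles (4096 values each), as sparse Hi-C data often is.
tiles = {
    (0, 0): [0.0] * 4000 + [1.0, 2.0, 3.0] + [0.0] * 93,
    (0, 1): [0.0] * 4096,
}
blobs = compress_tiles(tiles)
tile = read_tile(blobs, (0, 0))

raw_size = sum(4 * len(v) for v in tiles.values())
packed_size = sum(len(b) for b in blobs.values())
print(f"compression ratio: {raw_size / packed_size:.1f}x")
```

The ratio printed for this toy input says nothing about real Hi-C data or about the codecs listed above; it only shows how such a ratio is measured.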
CLI Usage#
Convert to GZCM#
# Convert .hic to GZCM v3 (recommended)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 3 --compression bsc_cmc
# Convert to GZCM v2 (tiled, no compression)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 2
# Convert to GZCM v1 (dense)
gunz-cm converters to-gzcm input.hic output.gzcm chr1 10000 --version 1
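When several regions need converting, the command line can be assembled from a script. The sketch below only builds and prints the argument lists, using the positional arguments and flags shown above; the helper name `to_gzcm_command`, the output file naming, and the extra region names are illustrative assumptions.

```python
import shlex


def to_gzcm_command(src, dst, region, resolution, version=3, compression=None):
    """Build the argument list for `gunz-cm converters to-gzcm`."""
    cmd = ["gunz-cm", "converters", "to-gzcm", src, dst, region,
           str(resolution), "--version", str(version)]
    if compression is not None:
        cmd += ["--compression", compression]
    return cmd


for region in ["chr1", "chr2", "chr3"]:  # illustrative region names
    cmd = to_gzcm_command("input.hic", f"{region}.gzcm", region, 10000,
                          compression="bsc_cmc")
    # To execute instead of printing: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```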
Normalize GZCM#
# Knight-Ruiz normalization
gunz-cm converters normalize input.gzcm normalized.gzcm --method kr
# ICE normalization
gunz-cm converters normalize input.gzcm normalized.gzcm --method ice
Python API#
Writing#
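import numpy as np
from gunz_cm.io.gnz import GzcmWriter

# Placeholder data; substitute your own (1000, 1000) float32 contact matrix.
contact_matrix = np.zeros((1000, 1000), dtype=np.float32)

# Write GZCM v3 (compressed)
writer = GzcmWriter("matrix.gzcm", overwrite=True, version=3)
writer.set_metadata({"resolution": 10000, "region": "chr1"})
writer.init_streaming_array("matrix", (1000, 1000), dtype=np.float32)
writer.write()

# Fill the writable array, then flush it to disk.
mm = writer.get_array_writable("matrix")
mm[:] = contact_matrix
mm.flush()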
Reading#
from gunz_cm.io.gnz import GzcmReader
reader = GzcmReader("matrix.gzcm")
matrix = reader.get_array("matrix")
print(f"Shape: {matrix.shape}, Metadata: {reader.metadata}")
Chunked Access#
from gunz_cm.io.gnz import GzcmChunkedReader

chunked = GzcmChunkedReader("matrix.gzcm", chunk_size=1024)
for chunk, r, c in chunked.iter_chunks():
    process(chunk)  # process() stands in for your own per-chunk function
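To show what chunk-wise processing looks like in practice, the sketch below accumulates row sums one chunk at a time. It uses a stand-in `iter_chunks` generator over a list of lists rather than `GzcmChunkedReader`, and it assumes that `r` and `c` are the starting row and column of each chunk; the meaning of those values in the real reader is not documented here.

```python
def iter_chunks(matrix, chunk_size):
    """Stand-in generator: yield (chunk, row_start, col_start) over a 2D list."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r in range(0, n_rows, chunk_size):
        for c in range(0, n_cols, chunk_size):
            chunk = [row[c:c + chunk_size] for row in matrix[r:r + chunk_size]]
            yield chunk, r, c


def chunked_row_sums(matrix, chunk_size):
    """Accumulate row sums without ever handling more than one chunk at a time."""
    sums = [0.0] * len(matrix)
    for chunk, r, _c in iter_chunks(matrix, chunk_size):
        for k, row in enumerate(chunk):
            sums[r + k] += sum(row)
    return sums


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(chunked_row_sums(matrix, 2))  # [6.0, 15.0, 24.0]
```

Row sums computed this way are the basic ingredient of matrix balancing, which is why chunked access and streaming normalization fit together.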
Streaming Normalization#
from gunz_cm.io.gnz import kr_normalize_gzcm, ice_normalize_gzcm
# Knight-Ruiz normalization
weights = kr_normalize_gzcm("matrix.gzcm", "normalized.gzcm")
# ICE normalization
weights = ice_normalize_gzcm("matrix.gzcm", "normalized.gzcm")
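For readers unfamiliar with ICE, the sketch below shows the general iterative-correction idea on a small in-memory matrix: repeatedly divide each entry by the product of its row and column scale factors until all row sums are equal. It is not the implementation behind `ice_normalize_gzcm`, it is not streaming, and the convention used here (normalized = raw / (bias_i * bias_j)) is an assumption; the returned `weights` of the library functions may follow a different convention.

```python
def ice_normalize(matrix, n_iter=50):
    """Balance a symmetric matrix by iterative correction (in-memory sketch).

    Returns (normalized, bias) with normalized[i][j] = raw[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    work = [[float(x) for x in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in work]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Empty rows get a neutral scale factor so they are left untouched.
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return work, bias


raw = [[4, 2, 1], [2, 3, 2], [1, 2, 5]]
normalized, bias = ice_normalize(raw)
print([round(sum(row), 6) for row in normalized])  # three equal row sums
```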
Supported Codecs (v3)#
| Codec | Description | Recommended |
|---|---|---|
| bsc_cmc | BSC + CMC (default) | Yes, best compression |
| cmc_zstd | CMC + ZSTD | Yes, good cross-platform |
| cmc | CMC only | For compatibility |
| bsc | BSC only | Good but less portable |
| zstd | ZSTD only | Fast but less compression |
Compression Ratios#
Typical compression ratios for Hi-C data at 10 kb resolution:

| Tile Size | Compression Ratio |
|---|---|
| 64 | ~10-20x |
| 128 | ~15-25x |
See Also#
GZCM v1 Specification — Dense storage format
GZCM v2 Specification — Tiled storage format
GZCM v3 Specification — CMC compressed format
Codec Guide — Detailed codec selection