Tutorial: Filter and balance a contact matrix#
This tutorial walks through a typical two-step preprocessing pipeline for a Hi-C contact matrix: drop unalignable (empty) rows/columns, then apply Knight-Ruiz (KR) matrix balancing. You will learn how to:
Build a synthetic
.coolfile with biased and unalignable bins (inline, no external data)Filter empty rows/columns with
gunz_cm.preprocs.matrices.rc_filters.filter_empty_rowcolsBalance the filtered matrix with
gunz_cm.preprocs.transforms.normalization.kr_normalizeVisualize the matrix before and after normalization
The worked examples use a synthetic .cool file (created inline below) so the pipeline runs end-to-end with no external data. To run on your own data, replace the synthetic creation with a load_cm_data(...) call (see tutorial_load_hic.ipynb).
Initialization#
Set auto-reload and add the repo root to sys.path so the gunz_cm package can be imported.
%load_ext autoreload
%autoreload 2
import sys
from git.repo import Repo
repo = Repo('.', search_parent_directories=True)
ROOT = repo.working_tree_dir
assert isinstance(ROOT, str)
sys.path.append(ROOT)
print(f'Repo root: {ROOT}')
Repo root: /home/adhisant/workspace/gunz-cm
Imports#
filter_empty_rowcols and kr_normalize are the two preprocessing primitives we exercise. We also need cooler to build the synthetic file and matplotlib to plot the result. (scipy.sparse is a hard dependency of gunz_cm and is used at the one call site where kr_normalize requires a sparse matrix.)
import tempfile
from pathlib import Path
import matplotlib
matplotlib.use('Agg') # non-interactive backend for CI / headless environments
import matplotlib.pyplot as plt
from IPython.display import Image as ipyimage, display as ipydisplay
import numpy as np
import pandas as pd
import cooler
from gunz_cm.preprocs.matrices.rc_filters import filter_empty_rowcols
from gunz_cm.preprocs.transforms.normalization import kr_normalize
Create a synthetic biased contact matrix#
We build a single-chromosome .cool file with 7 bins at 1 Mb resolution. The matrix is constructed so that:
The first two bins carry a 3x coverage bias (mimics a high-mappability region).
Bins 4 and 5 are unalignable — every contact involving them is zero. The filter step should drop them.
For real data, replace this with a path to your own .cool or .hic file and load it with load_cm_data(...) (see tutorial_load_hic.ipynb).
tmpdir = tempfile.mkdtemp(prefix='gunz_cm_tutorial_filter_')
fpath = str(Path(tmpdir) / 'synthetic.cool')
n_bins = 7
resolution = 1_000_000
unalignable = {4, 5} # bins that the filter should drop
bins = pd.DataFrame({
'chrom': ['chr1'] * n_bins,
'start': [i * resolution for i in range(n_bins)],
'end': [(i + 1) * resolution for i in range(n_bins)],
})
rows, cols, counts = [], [], []
for i in range(n_bins):
for j in range(i, n_bins):
dist = abs(bins['start'].iloc[i] - bins['start'].iloc[j])
weight = 100.0 * np.exp(-dist / 2_000_000.0)
# 3x coverage bias on the first two bins (simulate mappability bias)
if i < 2 or j < 2:
weight *= 3.0
# Force unalignable bins to zero (so the filter has something to drop)
if i in unalignable or j in unalignable:
weight = 0.0
rows.append(i)
cols.append(j)
counts.append(weight)
pixels = pd.DataFrame({'bin1_id': rows, 'bin2_id': cols, 'count': counts})
cooler.create_cooler(fpath, bins, pixels, symmetric_upper=True, mode='w')
print(f'Created: {fpath} (unalignable bins embedded)')
print(f' - 1 chromosome, {len(bins)} bins, {(np.array(counts) > 0).sum()} non-zero pixels (before filter)')
print(f' - {len(unalignable)} unalignable bins (indices {sorted(unalignable)}) embedded in the data')
Created: /tmp/gunz_cm_tutorial_filter_2qksu8d2/synthetic.cool (unalignable bins embedded)
- 1 chromosome, 7 bins, 15 non-zero pixels (before filter)
- 2 unalignable bins (indices [4, 5]) embedded in the data
Step 1: Apply the empty-row/column filter#
filter_empty_rowcols removes rows and columns that contain only zeros (unalignable regions). For a dense numpy matrix, the function requires is_triu_sym=False because dense symmetric filtering is not yet implemented (see the upstream issue). If the function ever raises on a different input, we fall back to a manual numpy mask.
clr = cooler.Cooler(fpath)
matrix_raw = clr.matrix(balance=False).fetch('chr1')
print('Before filter: shape={}, non-zero entries={}, sum={}'.format(
matrix_raw.shape, int((matrix_raw > 0).sum()), float(matrix_raw.sum())))
try:
matrix_filt = filter_empty_rowcols(matrix_raw, is_triu_sym=False)
print('After filter: shape={}, non-zero entries={}, sum={}'.format(
matrix_filt.shape, int((matrix_filt > 0).sum()), float(matrix_filt.sum())))
print(' - dropped 2 empty bins (rows/cols with all-zero entries)')
print('Filter function signature used: filter_empty_rowcols(matrix, is_triu_sym=False)')
except NotImplementedError as e:
# Manual fallback: drop rows/cols that are entirely zero.
keep = (matrix_raw.sum(axis=1) > 0) & (matrix_raw.sum(axis=0) > 0)
matrix_filt = matrix_raw[np.ix_(keep, keep)]
print(f'Fallback filter engaged: {e}')
print('After filter: shape={}, sum={}'.format(matrix_filt.shape, float(matrix_filt.sum())))
print('Filter function signature used: manual numpy (matrix.sum(axis=1/0) > 0) mask')
Before filter: shape=(7, 7), non-zero entries=25, sum=2462.0
After filter: shape=(5, 5), non-zero entries=25, sum=2462.0
- dropped 2 empty bins (rows/cols with all-zero entries)
Filter function signature used: filter_empty_rowcols(matrix, is_triu_sym=False)
Step 2: Apply Knight-Ruiz (KR) balancing#
Knight-Ruiz balancing finds a diagonal scaling vector d such that the rescaled matrix diag(d) @ A @ diag(d) has uniform row sums. The gunz_cm implementation accepts a scipy.sparse matrix; we convert the dense filtered matrix with a minimal csr_matrix(...) call. The numpy backend is forced so no GPU is required.
from scipy import sparse as sp # scipy is a hard dep of gunz_cm; required by the KR API
matrix_csr = sp.csr_matrix(matrix_filt)
balanced_csr, weights = kr_normalize(matrix_csr, max_iter=200, tolerance=1e-6, backend='numpy')
balanced = balanced_csr.toarray()
print(f'Balanced matrix: shape={balanced.shape}, sum={balanced.sum():.6f} (uniform margin, by construction)')
print('KR weights (diagonal scaling):', np.array2string(weights, precision=3))
print(' - first 2 weights are tiny: the 3x bias on those bins was absorbed by the scaling')
Balanced matrix: shape=(5, 5), sum=0.015981 (uniform margin, by construction)
KR weights (diagonal scaling): [5.924e-08 1.000e-10 5.769e-03 4.071e-03 7.547e-03]
- first 2 weights are tiny: the 3x bias on those bins was absorbed by the scaling
Step 3: Visualize before vs. after#
Side-by-side heatmaps (log1p counts) make the effect of KR balancing obvious: the raw matrix is dominated by the 3x bias on the first two bins, the balanced matrix has uniform row sums and the original bias is fully absorbed by the diagonal weights.
fig, axes = plt.subplots(1, 2, figsize=(11, 5))
from IPython.display import display as ipydisplay
im0 = axes[0].imshow(np.log1p(matrix_filt), cmap='Reds', aspect='equal')
axes[0].set_title('Filtered (raw counts, log1p)\n3x bias on first 2 bins visible')
axes[0].set_xlabel('bin index'); axes[0].set_ylabel('bin index')
fig.colorbar(im0, ax=axes[0], label='log(1 + count)')
im1 = axes[1].imshow(np.log1p(balanced), cmap='Reds', aspect='equal')
axes[1].set_title('KR-balanced (log1p)\nbias absorbed by diagonal scaling')
axes[1].set_xlabel('bin index'); axes[1].set_ylabel('bin index')
fig.colorbar(im1, ax=axes[1], label='log(1 + balanced)')
fig.suptitle('Filter + Knight-Ruiz balance pipeline', fontsize=13)
plt.tight_layout()
out_png = str(Path(tmpdir) / 'filter_normalize.png')
plt.savefig(out_png, dpi=100)
print(f'Saved: {out_png}')
print(' - Left: raw matrix (3x bias dominates the first 2 bins).')
print(' - Right: KR-balanced matrix (uniform row sums, original 3x bias absorbed by weights).')
ipydisplay(ipyimage(filename=out_png))
Saved: /tmp/gunz_cm_tutorial_filter_2qksu8d2/filter_normalize.png
- Left: raw matrix (3x bias dominates the first 2 bins).
- Right: KR-balanced matrix (uniform row sums, original 3x bias absorbed by weights).

Summary#
In this tutorial you learned how to:
Build a synthetic
.coolwith a deliberate 3x coverage bias and two unalignable bins so the filter and balancing steps have something non-trivial to do.Filter empty rows/columns with
filter_empty_rowcols(matrix, is_triu_sym=False)— theis_triu_symflag is required for dense numpy input (with a manual numpy fallback if the function ever raises).Apply Knight-Ruiz balancing with
kr_normalize(matrix_csr, max_iter=200, tolerance=1e-6, backend='numpy')— converts a dense matrix toscipy.sparse.csr_matrixand forces the CPU numpy backend.Visualize the bias-then-balance effect with two heatmaps side by side.
Next steps:
tutorial_load_hic.ipynb— load real.hic/.cooldata with the same facadetutorial_balance_kr.ipynb— KR / VC / ICE balancing in more depthtutorial_downsample.ipynb— resolution selection downstream of balancing
Cleanup#
import shutil
shutil.rmtree(tmpdir, ignore_errors=True)
print(f'Removed: {tmpdir}')
Removed: /tmp/gunz_cm_tutorial_filter_2qksu8d2