Tutorial: Preprocessing Pipeline (matrices module)
version: 2.25.0
This tutorial covers the matrix-preprocessing pipeline in gunz_cm.preprocs.matrices. After working through it you will know how to:
- Convert between DataFrame, COO sparse, and tuple formats (to_coo_matrix, to_dataframe, infer_mat_shape).
- Build an adjacency matrix from a sparse contacts matrix (comp_single_graph_adj_mat).
- Apply distribution transforms: min-max / z-score (scale_matrix), log scaling (log_scale_matrix), and rank-based Gaussian (transform_to_gaussian).
- Mirror an upper-triangle DataFrame to a full symmetric matrix (mirror_upper_to_lower_triangle_df).
- Add synthetic ligation noise to test downstream algorithms (add_rand_ligation_noise).
All examples use synthetic data only (fixed RNG seed for reproducibility).
import numpy as np
import pandas as pd
import scipy.sparse as sp
rng = np.random.default_rng(42)
# Build a synthetic sparse COO matrix (10x10, 15 random edges)
n = 10
n_edges = 15
row_ids = rng.integers(0, n, n_edges)
col_ids = rng.integers(0, n, n_edges)
counts = rng.integers(1, 100, n_edges) # integer dtype for noise functions
coo = sp.coo_matrix((counts, (row_ids, col_ids)), shape=(n, n))
print(f"Synthetic COO: shape={coo.shape}, nnz={coo.nnz}")
Synthetic COO: shape=(10, 10), nnz=15
1. Conversion round-trip: infer_mat_shape, to_coo_matrix, to_dataframe
The canonical Hi-C format in gunz-cm is the COO sparse matrix; its DataFrame form uses the column names row_ids / col_ids / counts. Use these three functions to convert between formats:
- infer_mat_shape(data): infer (n_rows, n_cols) from a tuple, COO matrix, or DataFrame
- to_coo_matrix(data): convert a DataFrame or tuple to a SciPy COO matrix
- to_dataframe(coo): convert a COO matrix back to a DataFrame with the canonical column names
from gunz_cm.preprocs.matrices.converters import to_coo_matrix, to_dataframe
from gunz_cm.preprocs.matrices.infer_shape import infer_mat_shape
# 1. infer_mat_shape: works on tuple, COO, or DF
shape_from_tuple = infer_mat_shape((row_ids, col_ids))
shape_from_coo = infer_mat_shape(coo)
print(f"Shape from tuple: {shape_from_tuple}")
print(f"Shape from COO: {shape_from_coo}")
assert shape_from_tuple == shape_from_coo == (n, n)
print("Verified: both infer to the same shape.")
Shape from tuple: (np.int64(10), np.int64(10))
Shape from COO: (10, 10)
Verified: both infer to the same shape.
# 2. to_dataframe: COO → DataFrame with canonical columns
df = to_dataframe(coo)
print(f"DataFrame: {df.shape}")
print(df.head())
assert list(df.columns) == ['row_ids', 'col_ids', 'counts']
print("\nVerified: DataFrame has canonical columns 'row_ids', 'col_ids', 'counts'.")
DataFrame: (15, 3)
row_ids col_ids counts
0 0 7 45
1 7 5 23
2 6 1 10
3 4 8 55
4 4 4 88
Verified: DataFrame has canonical columns 'row_ids', 'col_ids', 'counts'.
# 3. Round-trip: DataFrame → COO → DataFrame
coo2 = to_coo_matrix(df, is_triu_sym=False) # our synthetic data is NOT symmetric
df2 = to_dataframe(coo2)
print(f"Round-trip: nnz={coo2.nnz}, columns={list(df2.columns)}")
# Sort both for comparison (order may differ)
df_sorted = df.sort_values(['row_ids', 'col_ids']).reset_index(drop=True)
df2_sorted = df2.sort_values(['row_ids', 'col_ids']).reset_index(drop=True)
assert df_sorted.equals(df2_sorted), "Round-trip data mismatch"
print("Verified: DataFrame → COO → DataFrame round-trip preserves data.")
Round-trip: nnz=15, columns=['row_ids', 'col_ids', 'counts']
Verified: DataFrame → COO → DataFrame round-trip preserves data.
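To make the round-trip above concrete, here is a minimal sketch of the same idea using only SciPy and pandas. It is not the gunz_cm implementation: the helper names coo_to_frame and frame_to_coo are hypothetical, and the shape rule (largest index + 1 per axis) is an assumption about how a shape could be inferred when none is given.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def coo_to_frame(coo: sp.coo_matrix) -> pd.DataFrame:
    """Flatten a COO matrix into the three canonical columns (hypothetical helper)."""
    return pd.DataFrame(
        {"row_ids": coo.row, "col_ids": coo.col, "counts": coo.data}
    )


def frame_to_coo(df: pd.DataFrame, shape=None) -> sp.coo_matrix:
    """Rebuild a COO matrix from the canonical columns (hypothetical helper).

    Assumption: when no shape is given, use the largest index + 1 on each axis.
    """
    if shape is None:
        shape = (int(df["row_ids"].max()) + 1, int(df["col_ids"].max()) + 1)
    return sp.coo_matrix(
        (df["counts"].to_numpy(), (df["row_ids"].to_numpy(), df["col_ids"].to_numpy())),
        shape=shape,
    )


# Toy 3x3 matrix with three nonzero entries
toy = sp.coo_matrix(([5, 7, 9], ([0, 1, 2], [2, 0, 1])), shape=(3, 3))
frame = coo_to_frame(toy)
back = frame_to_coo(frame)

# The round-trip preserves every entry
print(np.array_equal(toy.toarray(), back.toarray()))
```

Note that inferring the shape from the largest index underestimates the matrix size whenever the trailing rows or columns are empty, which is why passing an explicit shape is the safer choice in this sketch.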
2. comp_single_graph_adj_mat: build an adjacency matrix
Convert a sparse contacts matrix to a dense or sparse adjacency matrix. Useful for graph algorithms (shortest path, connected components) that operate on adjacencies.
from gunz_cm.preprocs.matrices.graphs import comp_single_graph_adj_mat
adj_dense = comp_single_graph_adj_mat(coo, allow_loop=True)
print(f"Adjacency (dense): shape={adj_dense.shape}, dtype={adj_dense.dtype}")
print(f" nonzero edges: {(adj_dense != 0).sum()}")
print(f" sum: {adj_dense.sum():.1f}")
Adjacency (dense): shape=(10, 10), dtype=int64
nonzero edges: 7
sum: 9.0
# Same operation, starting from a tuple. comp_single_graph_adj_mat does not
# accept tuples directly, so convert tuple → COO first (to_coo_matrix was
# imported in section 1).
coo_from_tuple = to_coo_matrix((row_ids, col_ids, counts), is_triu_sym=False)
adj_from_tuple = comp_single_graph_adj_mat(coo_from_tuple, allow_loop=True, is_triu_sym=False)
print(f"From tuple: shape={adj_from_tuple.shape}")
# Compare the total edge weight of the two results. Call .sum() on the result
# directly: wrapping a sparse matrix in np.asarray() yields a 0-d object array,
# so its "sum" would be the matrix itself rather than a number.
print(f"sum(adj_dense): {adj_dense.sum()}, sum(adj_from_tuple): {adj_from_tuple.sum()}")
# The first call reported 7 nonzero entries summing to 9, which is consistent
# with only the upper-triangle entries being kept; this call keeps all 15.
print("(totals differ: the first call relies on the default is_triu_sym, this one passes is_triu_sym=False)")
From tuple: shape=(10, 10)
sum(adj_dense): 9, sum(adj_from_tuple): 16
(totals differ: the first call relies on the default is_triu_sym, this one passes is_triu_sym=False)
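As an illustration of why an adjacency matrix is handy for graph algorithms, the following sketch builds a binary, symmetric adjacency from a small contacts matrix and runs connected components on it. It uses plain SciPy only and does not reproduce what comp_single_graph_adj_mat does internally; the binarization rule (edge wherever the count is nonzero) is an assumption made for the example.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

# Toy contacts: bins 0-1-2 are linked, bins 3-4 are linked
rows = np.array([0, 1, 3])
cols = np.array([1, 2, 4])
vals = np.array([12, 5, 8])
contacts = sp.coo_matrix((vals, (rows, cols)), shape=(5, 5))

# Assumption: an edge exists wherever the contact count is nonzero
adj = (contacts.toarray() != 0).astype(np.int8)
# Symmetrize so the graph is undirected
adj = np.maximum(adj, adj.T)

# Label each bin with the component it belongs to
n_components, labels = connected_components(sp.csr_matrix(adj), directed=False)
print(n_components, labels)  # two components: {0, 1, 2} and {3, 4}
```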
3. scale_matrix: min-max or z-score
Scale a matrix to a target range ('minmax', default [0, 1]) or to z-scores ('zscore', mean 0, standard deviation 1). In the examples below both methods are applied per column. The function operates on dense or sparse input.
from gunz_cm.preprocs.matrices.linear_scaler import scale_matrix
# Build a dense matrix with values spanning multiple orders of magnitude
dense = np.array([
[1.0, 10.0, 100.0, 1000.0],
[2.0, 20.0, 200.0, 2000.0],
[3.0, 30.0, 300.0, 3000.0],
[4.0, 40.0, 400.0, 4000.0],
])
print(f"Original matrix:\n{dense}")
print(f" range: [{dense.min()}, {dense.max()}], mean: {dense.mean():.1f}")
Original matrix:
[[1.e+00 1.e+01 1.e+02 1.e+03]
[2.e+00 2.e+01 2.e+02 2.e+03]
[3.e+00 3.e+01 3.e+02 3.e+03]
[4.e+00 4.e+01 4.e+02 4.e+03]]
range: [1.0, 4000.0], mean: 694.4
# Min-max scaling to [0, 1]
scaled_minmax = scale_matrix(dense, scaling_method='minmax', min_val=0, max_val=1)
print(f"Min-max scaled:\n{scaled_minmax}")
print(f" range: [{scaled_minmax.min()}, {scaled_minmax.max()}]")
assert scaled_minmax.min() == 0 and scaled_minmax.max() == 1
print("\nVerified: min-max scaling maps to [0, 1].")
Min-max scaled:
[[0. 0. 0. 0. ]
[0.33333333 0.33333333 0.33333333 0.33333333]
[0.66666667 0.66666667 0.66666667 0.66666667]
[1. 1. 1. 1. ]]
range: [0.0, 1.0]
Verified: min-max scaling maps to [0, 1].
# Z-score scaling (mean 0, std 1) — per column by default
scaled_z = scale_matrix(dense, scaling_method='zscore')
print(f"Z-scored (per column):\n{scaled_z}")
print(f" column means: {scaled_z.mean(axis=0)}")
print(f" column stds: {scaled_z.std(axis=0)}")
import numpy.testing as npt
npt.assert_allclose(scaled_z.mean(axis=0), 0, atol=1e-6)
npt.assert_allclose(scaled_z.std(axis=0), 1, atol=1e-6)
print("\nVerified: z-scoring gives mean=0, std=1 per column.")
Z-scored (per column):
[[-1.34164079 -1.34164079 -1.34164079 -1.34164079]
[-0.4472136 -0.4472136 -0.4472136 -0.4472136 ]
[ 0.4472136 0.4472136 0.4472136 0.4472136 ]
[ 1.34164079 1.34164079 1.34164079 1.34164079]]
column means: [0. 0. 0. 0.]
column stds: [1. 1. 1. 1.]
Verified: z-scoring gives mean=0, std=1 per column.
# Custom range scaling
scaled_custom = scale_matrix(dense, scaling_method='minmax', min_val=-1, max_val=1)
print(f"Min-max scaled to [-1, 1]:\n{scaled_custom}")
print(f" range: [{scaled_custom.min()}, {scaled_custom.max()}]")
Min-max scaled to [-1, 1]:
[[-1. -1. -1. -1. ]
[-0.33333333 -0.33333333 -0.33333333 -0.33333333]
[ 0.33333333 0.33333333 0.33333333 0.33333333]
[ 1. 1. 1. 1. ]]
range: [-1.0, 1.0000000000000002]
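The column-wise behaviour seen in the outputs above can be written out in a few lines of NumPy. This is a sketch for intuition, not the scale_matrix implementation: the function names minmax_columns and zscore_columns are hypothetical, and the handling of constant columns (divide by 1 instead of 0) is an assumption.

```python
import numpy as np


def minmax_columns(x, lo=0.0, hi=1.0):
    """Rescale each column linearly so its minimum maps to lo and its maximum to hi."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # assumption: constant columns map to lo
    return lo + (x - col_min) / col_range * (hi - lo)


def zscore_columns(x):
    """Subtract each column's mean and divide by its population standard deviation."""
    x = np.asarray(x, dtype=float)
    col_std = x.std(axis=0)
    col_std[col_std == 0] = 1.0  # assumption: constant columns map to 0
    return (x - x.mean(axis=0)) / col_std


dense_example = np.array([
    [1.0, 10.0, 100.0, 1000.0],
    [2.0, 20.0, 200.0, 2000.0],
    [3.0, 30.0, 300.0, 3000.0],
    [4.0, 40.0, 400.0, 4000.0],
])
print(minmax_columns(dense_example)[:, 0])  # first column: 0, 1/3, 2/3, 1
print(zscore_columns(dense_example)[:, 0])  # first column: about -1.342, -0.447, 0.447, 1.342
```

Because floating-point arithmetic can overshoot the target range by one unit in the last place (as in the 1.0000000000000002 above), tolerance-based checks such as numpy.testing.assert_allclose are more robust than exact equality on the scaled minimum and maximum.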
4. log_scale_matrix: optimized log1p transform
Hi-C contact counts span multiple orders of magnitude. log_scale_matrix applies log(1 + x) (i.e. numpy.log1p), which is numerically stable for small values and avoids the log(0) singularity.
from gunz_cm.preprocs.matrices.log_scaler import log_scale_matrix
logged = log_scale_matrix(dense, exclude_diagonal=False)
print(f"Original:\n{dense}")
print(f"\nlog(1+x):\n{logged}")
import numpy.testing as npt
npt.assert_allclose(logged, np.log1p(dense), atol=1e-6)
print("\nVerified: log_scale_matrix applies log1p element-wise.")
Original:
[[1.e+00 1.e+01 1.e+02 1.e+03]
[2.e+00 2.e+01 2.e+02 2.e+03]
[3.e+00 3.e+01 3.e+02 3.e+03]
[4.e+00 4.e+01 4.e+02 4.e+03]]
log(1+x):
[[0.69314718 2.39789527 4.61512052 6.90875478]
[1.09861229 3.04452244 5.30330491 7.60140233]
[1.38629436 3.4339872 5.70711026 8.00670085]
[1.60943791 3.71357207 5.99396143 8.29429961]]
Verified: log_scale_matrix applies log1p element-wise.
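The numerical-stability point is easiest to see with very small inputs. The sketch below compares numpy.log1p with the naive log(1 + x) and shows that numpy.expm1 inverts the transform; it uses NumPy only and makes no claim about how log_scale_matrix is implemented.

```python
import numpy as np

tiny = np.array([0.0, 1e-20, 1e-10])

# Naive form: 1 + 1e-20 rounds to exactly 1.0, so the information is lost
naive = np.log(1.0 + tiny)
# log1p evaluates log(1 + x) without forming 1 + x explicitly
stable = np.log1p(tiny)

print(naive[1], stable[1])  # the naive result is 0.0, log1p keeps the small value

# expm1 is the matching inverse, so counts can be recovered after the transform
counts_example = np.array([0.0, 1.0, 10.0, 4000.0])
recovered = np.expm1(np.log1p(counts_example))
print(np.allclose(recovered, counts_example))
```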
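5. transform_to_gaussian: rank-based Gaussian transform
Map a matrix to approximately Gaussian-distributed values with a rank-based inverse-CDF transform: each value is replaced by the normal quantile (inverse CDF) that corresponds to its rank among all matrix entries, so the ordering of values is preserved. This is useful for downstream algorithms that assume Gaussian inputs. In the small 16-element example below, the largest entry maps to 6.36, which pulls the sample mean and standard deviation away from the requested mu=0 and sigma=1.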
from gunz_cm.preprocs.matrices.linear_scaler import transform_to_gaussian
transformed = transform_to_gaussian(dense, mu=0, sigma=1)
print(f"Original mean: {dense.mean():.3f}, std: {dense.std():.3f}")
print(f"Transformed mean: {transformed.mean():.3f}, std: {transformed.std():.3f}")
print(f"Transformed:\n{transformed}")
Original mean: 694.375, std: 1188.185
Transformed mean: 0.398, std: 1.741
Transformed:
[[-1.53412054 -0.48877641 0.15731068 0.88714656]
[-1.15034938 -0.31863936 0.31863936 1.15034938]
[-0.88714656 -0.15731068 0.48877641 1.53412054]
[-0.67448975 0. 0.67448975 6.36134089]]
# Verify rank preservation: rank order in original = rank order in transformed
original_ranks = np.argsort(dense.flatten())
transformed_ranks = np.argsort(transformed.flatten())
ranks_match = np.array_equal(original_ranks, transformed_ranks)
print(f"Rank order preserved: {ranks_match}")
assert ranks_match, "Transform_to_gaussian should preserve rank order"
print("Verified: rank order is preserved (lowest input → lowest output).")
Rank order preserved: True
Verified: rank order is preserved (lowest input → lowest output).
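For readers who want to see the mechanics, here is a minimal rank-based inverse normal transform written with SciPy. It is a sketch under stated assumptions, not the transform_to_gaussian implementation: the function name rank_to_gaussian is hypothetical, and it uses the (rank - 0.5) / n quantile convention. The values printed above are instead consistent with rank / n, with the top rank clipped just below 1, which is what produces the large 6.36 entry.

```python
import numpy as np
from scipy.stats import norm, rankdata


def rank_to_gaussian(x, mu=0.0, sigma=1.0):
    """Replace each value by the normal quantile of its rank (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Rank all entries together; ties share their average rank
    ranks = rankdata(x.ravel(), method="average").reshape(x.shape)
    # Assumption: the (rank - 0.5) / n offset keeps quantiles strictly inside (0, 1)
    quantiles = (ranks - 0.5) / x.size
    return mu + sigma * norm.ppf(quantiles)


values = np.array([[3.0, 1.0], [2.0, 10.0]])
gaussianized = rank_to_gaussian(values)
print(gaussianized)

# Rank order is unchanged by the transform
print(np.array_equal(np.argsort(values.ravel()), np.argsort(gaussianized.ravel())))
```

With this offset the quantiles are symmetric around 0.5, so for tie-free input the transformed values are symmetric around mu and no entry is pushed to an extreme value.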
6. mirror_upper_to_lower_triangle_df: mirror upper to lower
If your DataFrame contains only upper-triangle entries (row_ids <= col_ids), use this function to expand it to a full symmetric matrix by adding the mirrored lower-triangle entries. The example below uses strictly upper-triangle input (row_ids < col_ids), so it contains no diagonal entries.
from gunz_cm.preprocs.matrices.mirrors import mirror_upper_to_lower_triangle_df
# Build a STRICTLY upper-triangle DataFrame (row_ids < col_ids).
# The mirror function expects strict-upper input + adds lower-mirror entries.
upper_df = pd.DataFrame({
'row_ids': [0, 0, 1, 2],
'col_ids': [1, 2, 2, 3],
'counts': [2.0, 3.0, 4.0, 5.0],
})
print(f"Upper-triangle: {len(upper_df)} rows")
print(upper_df)
Upper-triangle: 4 rows
row_ids col_ids counts
0 0 1 2.0
1 0 2 3.0
2 1 2 4.0
3 2 3 5.0
full_df = mirror_upper_to_lower_triangle_df(upper_df)
print(f"Full symmetric: {len(full_df)} rows")
print(full_df.sort_values(['row_ids', 'col_ids']).reset_index(drop=True))
# The input has no diagonal entries, so every row gains exactly one mirror: 4 upper + 4 mirrored = 8
assert len(full_df) == 2 * len(upper_df)
print(f"\nVerified: {len(upper_df)} upper → {len(full_df)} full (no diagonal entries in the input).")
Full symmetric: 8 rows
row_ids col_ids counts
0 0 1 2.0
1 0 2 3.0
2 1 0 2.0
3 1 2 4.0
4 2 0 3.0
5 2 1 4.0
6 2 3 5.0
7 3 2 5.0
Verified: 4 upper → 8 full (no diagonal entries in the input).
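The mirroring step itself is a column swap followed by a concatenation. The sketch below shows one way to do it in pandas, including input that does contain diagonal entries; the helper name mirror_upper is hypothetical, and skipping the diagonal when mirroring is an assumption, not a statement about how mirror_upper_to_lower_triangle_df treats it.

```python
import pandas as pd


def mirror_upper(df: pd.DataFrame) -> pd.DataFrame:
    """Add the lower-triangle mirror of every off-diagonal entry (hypothetical helper)."""
    # Assumption: diagonal entries are their own mirror, so they are not duplicated
    off_diag = df[df["row_ids"] != df["col_ids"]]
    # Swap the two index columns, then restore the original column order
    mirrored = off_diag.rename(columns={"row_ids": "col_ids", "col_ids": "row_ids"})
    mirrored = mirrored[df.columns]
    return pd.concat([df, mirrored], ignore_index=True)


upper_example = pd.DataFrame({
    "row_ids": [0, 0, 1],
    "col_ids": [0, 1, 1],
    "counts": [9.0, 2.0, 7.0],
})
full_example = mirror_upper(upper_example)
print(full_example.sort_values(["row_ids", "col_ids"]).reset_index(drop=True))
```

Here two of the three input rows lie on the diagonal, so only the (0, 1) entry gains a mirror at (1, 0) and the result has four rows.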
7. add_rand_ligation_noise: synthetic ligation noise
Hi-C libraries contain spurious ligation products that do not reflect proximity in 3D space. To test whether downstream algorithms are robust to this kind of noise, use this function to add random ligation noise to a contacts matrix.
from gunz_cm.preprocs.matrices.noises import add_rand_ligation_noise
ratio = 0.1 # 10% of counts are noise
noisy = add_rand_ligation_noise(coo, ratio=ratio, use_pseudo=False, is_triu_sym=False)
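# NOTE: `dense` is the 4x4 example matrix from section 3, not the `coo` matrix
# that was passed to add_rand_ligation_noise, so the "Original" figures printed
# below are not a like-for-like baseline. Compare against coo.mean() and
# coo.sum() to measure the noise that was actually added.
print(f"Original mean: {dense.mean():.3f}")
print(f"Noisy mean: {noisy.mean():.3f}")
print(f"Noisy:\n{noisy}")
# Report total counts (no assertion: the exact ratio depends on the random state)
original_total = dense.sum()
noisy_total = noisy.sum()
print(f"\nTotal counts: original={original_total}, noisy={noisy_total}")
print("(Exact ratio depends on the random state; not asserting exact match)")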
Original mean: 694.375
Noisy mean: 7.610
Noisy:
<COOrdinate sparse matrix of dtype 'int64'
with 52 stored elements and shape (10, 10)>
Coords Values
(0, 0) 2
(0, 1) 1
(0, 3) 86
(0, 5) 2
(0, 7) 109
(0, 8) 1
(1, 0) 2
(1, 4) 1
(1, 5) 1
(1, 6) 1
(1, 7) 3
(1, 8) 1
(2, 0) 1
(2, 2) 2
(2, 3) 2
(2, 8) 2
(2, 9) 28
(3, 5) 1
(3, 7) 2
(3, 8) 1
(3, 9) 3
(4, 0) 2
(4, 2) 3
(4, 4) 88
(4, 8) 56
: :
(5, 3) 1
(5, 4) 1
(5, 6) 18
(5, 7) 3
(5, 8) 1
(5, 9) 1
(6, 1) 94
(6, 5) 1
(6, 6) 1
(6, 9) 1
(7, 2) 1
(7, 4) 8
(7, 5) 59
(7, 7) 1
(7, 8) 70
(7, 9) 1
(8, 0) 1
(8, 5) 7
(8, 7) 1
(8, 8) 2
(9, 1) 2
(9, 4) 79
(9, 5) 1
(9, 6) 1
(9, 8) 1
Total counts: original=11110.0, noisy=761
(Exact ratio depends on the random state; not asserting exact match)
# Higher noise ratio (30%)
noisy_30 = add_rand_ligation_noise(coo, ratio=0.3, is_triu_sym=False)
print(f"30% noise mean: {noisy_30.mean():.3f}")
print(f"30% noise:\n{noisy_30}")
30% noise mean: 8.990
30% noise:
<COOrdinate sparse matrix of dtype 'int64'
with 92 stored elements and shape (10, 10)>
Coords Values
(0, 1) 3
(0, 2) 2
(0, 3) 85
(0, 5) 5
(0, 6) 2
(0, 7) 110
(0, 8) 3
(1, 0) 1
(1, 1) 2
(1, 2) 4
(1, 3) 3
(1, 4) 5
(1, 5) 1
(1, 6) 2
(1, 8) 3
(1, 9) 1
(2, 0) 1
(2, 1) 3
(2, 2) 2
(2, 3) 2
(2, 4) 1
(2, 5) 2
(2, 6) 5
(2, 7) 1
(2, 8) 1
: :
(7, 3) 5
(7, 4) 9
(7, 5) 62
(7, 6) 3
(7, 7) 1
(7, 8) 75
(7, 9) 4
(8, 0) 3
(8, 1) 2
(8, 2) 2
(8, 3) 2
(8, 4) 3
(8, 5) 9
(8, 6) 2
(8, 8) 4
(8, 9) 1
(9, 0) 1
(9, 1) 3
(9, 2) 2
(9, 3) 1
(9, 4) 77
(9, 5) 1
(9, 7) 1
(9, 8) 1
(9, 9) 1
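To give a feel for what random ligation noise looks like in code, here is a small sketch that scatters extra single-count contacts uniformly over the matrix. The noise model is an assumption chosen for illustration (uniform bin pairs, number of noise contacts equal to ratio times the original total); the helper name add_uniform_noise is hypothetical and the sketch does not claim to match add_rand_ligation_noise.

```python
import numpy as np
import scipy.sparse as sp


def add_uniform_noise(coo, ratio, rng):
    """Add ratio * total extra contacts at uniformly random positions (hypothetical helper)."""
    n_rows, n_cols = coo.shape
    # Assumption: the number of noise contacts is a fixed fraction of the clean total
    n_noise = int(round(ratio * coo.sum()))
    noise_rows = rng.integers(0, n_rows, n_noise)
    noise_cols = rng.integers(0, n_cols, n_noise)
    # Append one count per noise contact, then merge repeated coordinates
    rows = np.concatenate([coo.row, noise_rows])
    cols = np.concatenate([coo.col, noise_cols])
    data = np.concatenate([coo.data, np.ones(n_noise, dtype=coo.data.dtype)])
    noisy = sp.coo_matrix((data, (rows, cols)), shape=coo.shape)
    noisy.sum_duplicates()
    return noisy


noise_rng = np.random.default_rng(0)
clean = sp.coo_matrix((np.array([60, 40]), ([0, 2], [1, 3])), shape=(4, 4))
noisy_example = add_uniform_noise(clean, ratio=0.1, rng=noise_rng)
print(clean.sum(), noisy_example.sum())  # the total grows by exactly 10% in this sketch
```

Comparing the noisy total against the total of the same input matrix, as done here, is the like-for-like check for how much noise was added.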
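Summary
Quick reference for the matrices module:
| Use case | Function | Notes |
|---|---|---|
| Convert DF or tuple to COO | to_coo_matrix | With canonical row_ids / col_ids / counts columns |
| Convert COO back to DF | to_dataframe | Same canonical columns |
| Infer matrix shape | infer_mat_shape | Polymorphic (tuple, COO, DF) |
| Build adjacency matrix | comp_single_graph_adj_mat | For graph algorithms |
| Scale to [0, 1] or z-score | scale_matrix | scaling_method='minmax' or 'zscore' |
| Log transform | log_scale_matrix | Uses log1p |
| Rank-based Gaussian | transform_to_gaussian | Preserves rank order |
| Mirror upper to lower | mirror_upper_to_lower_triangle_df | Expands upper-triangle DF |
| Add ligation noise | add_rand_ligation_noise | For robustness testing |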
All functions operate polymorphically on dense numpy, sparse scipy, and DataFrames where possible. Use the inplace=True parameter on the mutating variants (scale_matrix, log_scale_matrix, transform_to_gaussian, add_rand_ligation_noise) to avoid copies.