gunz_cm.preprocs package

Subpackages

Module contents

Preprocessing module for contact matrix operations.

Sub-modules:

- matrices: Matrix operations (filtering, scaling, masking, etc.)
- points: 3D point manipulation (downsample, filter, mask)
- transforms: Transformations between representations (EDM, Gram)

gunz_cm.preprocs.add_rand_ligation_noise(data: numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, ratio: float, use_pseudo: bool = False, is_triu_sym: bool = True, inplace: bool = False) → numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame[source]#

Add random ligation noise to the input data.

Parameters:

data (Union[numpy.ndarray, scipy.sparse.coo_matrix, pandas.DataFrame]) – Input data.
is_triu_sym (bool, optional) – Whether the matrix is symmetric and stored in upper triangular form (default is True).
inplace (bool, optional) – Whether to modify the input data in place (default is False).

Returns:

Data with added random ligation noise.

Return type:

Union[numpy.ndarray, scipy.sparse.coo_matrix, pandas.DataFrame]

Notes

This function adds random ligation noise to the input data. It supports NumPy arrays, SciPy sparse matrices, and pandas DataFrames. The effect of inplace depends on the input type: for NumPy arrays and SciPy sparse matrices, inplace=True modifies the original data; for pandas DataFrames, inplace=True does not modify the original data.

gunz_cm.preprocs.add_rand_ligation_noise_coo(cm_coo: coo_matrix, ratio: float, use_pseudo: bool = False, is_triu_sym: bool = True, inplace: bool = False) → coo_matrix[source]#

Add random ligation noise to a scipy sparse matrix.

Notes

This function adds random ligation noise to the input SciPy sparse matrix. If inplace is False, a copy of the input matrix is created before adding noise. If is_triu_sym is True, the matrix is assumed to be symmetric and stored in upper triangular form.

Parameters:

cm_coo (scipy.sparse.coo_matrix) – Input scipy sparse matrix.
is_triu_sym (bool, optional) – Whether the matrix is symmetric and stored in upper triangular form (default is True).
inplace (bool, optional) – Whether to modify the input data in place (default is False).

Returns:

Scipy sparse matrix with added random ligation noise.

Return type:

scipy.sparse.coo_matrix

gunz_cm.preprocs.add_rand_ligation_noise_df(cm_df: DataFrame, ratio: float, use_pseudo: bool = False, is_triu_sym: bool = True, inplace: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', vals_colname: str = 'counts') → DataFrame[source]#

Add random ligation noise to a pandas DataFrame.

Notes

This function adds random ligation noise to the input pandas DataFrame. If inplace is False, a copy of the input DataFrame is created before adding noise. If is_triu_sym is True, the matrix is assumed to be symmetric and stored in upper triangular form.

Parameters:

cm_df (pd.DataFrame) – Input pandas DataFrame.
is_triu_sym (bool, optional) – Whether the matrix is symmetric and stored in upper triangular form (default is True).
inplace (bool, optional) – Whether to modify the input data in place (default is False).
row_ids_colname (str, optional) – Column name for row IDs (default is 'row_ids').
col_ids_colname (str, optional) – Column name for column IDs (default is 'col_ids').
vals_colname (str, optional) – Column name for values (default is 'counts').

Returns:

Pandas DataFrame with added random ligation noise.

Return type:

pd.DataFrame

gunz_cm.preprocs.add_rand_ligation_noise_mat(cm_mat: coo_matrix, ratio: float, is_triu_sym: bool = True, inplace: bool = False)

Implement add_rand_ligation_noise_mat.

gunz_cm.preprocs.comp_single_graph_adj_mat(data: numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, allow_loop: bool = True, is_triu_sym: bool = True, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', counts_colname: str = 'counts') → numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame[source]#

gunz_cm.preprocs.comp_single_graph_adj_mat(cm_coo: coo_matrix, allow_loop: bool = True, is_triu_sym: bool = True, **kwargs) → coo_matrix

gunz_cm.preprocs.comp_single_graph_adj_mat(cm_df: DataFrame, allow_loop: bool = True, is_triu_sym: bool = True, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', counts_colname: str = 'counts') → DataFrame

Compute the adjacency matrix from a given data structure.

Notes

This function assumes that the input matrix is symmetric and processes only its upper triangular part and diagonal. If allow_loop is True, the diagonal (self-loops) receives the value 2 in the adjacency matrix. If allow_loop is False, the diagonal positions are set to 0 in the adjacency matrix, so no self-loop is encoded.

Parameters:

allow_loop (bool, optional) – Determines if a self-loop should be included in the resulting matrix. Default is True.
is_triu_sym (bool, optional) – Determines if the input matrix is symmetric and only the upper triangular part is used. Default is True.
row_ids_colname (str, optional) – The column name for row IDs in the input DataFrame. Default is cm_consts.ROW_IDS_COLNAME.
col_ids_colname (str, optional) – The column name for column IDs in the input DataFrame. Default is cm_consts.COL_IDS_COLNAME.
counts_colname (str, optional) – The column name for counts in the input DataFrame. Default is cm_consts.COUNTS_COLNAME.

Returns:

adj_matrix – The adjacency matrix.

Return type:

t.Union[np.ndarray, sp.coo_matrix, pd.DataFrame]

gunz_cm.preprocs.comp_sparse_wish_dist(data, alpha: float = -0.25, na_inf_val: float | None = None)

gunz_cm.preprocs.comp_sparse_wish_dist(cm_coo: coo_matrix, alpha: float = -0.25, na_inf_val: float | None = None, **kwargs)

gunz_cm.preprocs.comp_sparse_wish_dist(cm_df: DataFrame, alpha: float = -0.25, na_inf_val: float | None = None, **kwargs) → DataFrame

Implement comp_sparse_wish_dist.

gunz_cm.preprocs.comp_sparse_wish_dist_rc_ids(row_ids: numpy.ndarray | list[int], col_ids: numpy.ndarray | list[int], C_vals: ndarray, alpha: float = -0.25, na_inf_val: float | None = None) → tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]#

Calculate sparse form of Euclidean distance matrix from contact matrix.

Create a tuple of row indices, column indices, contact matrix values, and Euclidean distance values.

Parameters:

row_ids (t.Union[np.ndarray, t.List[int]]) – Array of row indices.
col_ids (t.Union[np.ndarray, t.List[int]]) – Array of column indices.
C_vals (np.ndarray) – Array of contact matrix values.
alpha (float, optional) – Conversion factor from contact matrix to Euclidean distance matrix (default is -0.25).
na_inf_val (t.Optional[float], optional) – Value to replace NaN or infinite values (default is None).

Returns:

A tuple containing (row_ids, col_ids, C_vals, D_vals).

Return type:

t.Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Notes

Removes the main diagonal of the matrix. NaN handling is not yet implemented and will raise a NotImplementedError if invalid values are found.
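The conversion described above can be sketched in plain NumPy. This is a minimal illustration, not the package's implementation: it assumes the common power-law relation D = C ** alpha between contact counts and distances, and the helper name contacts_to_distances is hypothetical.

```python
import numpy as np

def contacts_to_distances(row_ids, col_ids, c_vals, alpha=-0.25):
    """Hypothetical sketch: drop the main diagonal, then map C -> C ** alpha."""
    row_ids = np.asarray(row_ids)
    col_ids = np.asarray(col_ids)
    c_vals = np.asarray(c_vals, dtype=float)

    # Remove main-diagonal entries (row index equals column index).
    off_diag = row_ids != col_ids
    row_ids, col_ids, c_vals = row_ids[off_diag], col_ids[off_diag], c_vals[off_diag]

    # Assumed power-law conversion; the real function may differ.
    d_vals = np.power(c_vals, alpha)

    # Mirror the documented behaviour: invalid values are not handled.
    if not np.all(np.isfinite(d_vals)):
        raise NotImplementedError("NaN/inf handling is not sketched here")
    return row_ids, col_ids, c_vals, d_vals

r, c, cv, dv = contacts_to_distances([0, 0, 1], [0, 1, 2], [5.0, 16.0, 1.0])
```

With alpha = -0.25, larger contact counts map to smaller distances, which is why the exponent is negative.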

gunz_cm.preprocs.create_band_matrix(matrix: numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, max_k: int | None = None, remove_main_diag: bool = False, *, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids') → numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame[source]#

gunz_cm.preprocs.create_band_matrix(matrix: ndarray, max_k: int | None, remove_main_diag: bool, **kwargs) → ndarray

gunz_cm.preprocs.create_band_matrix(matrix: coo_matrix, max_k: int | None, remove_main_diag: bool, **kwargs) → coo_matrix

gunz_cm.preprocs.create_band_matrix(matrix: DataFrame, max_k: int | None, remove_main_diag: bool, *, row_ids_colname: str, col_ids_colname: str) → DataFrame

Create a band matrix by keeping elements near the main diagonal.

This function filters a matrix to retain only the elements where the absolute difference between the row and column index is less than or equal to max_k.

Parameters:

matrix (np.ndarray, sp.coo_matrix, or pd.DataFrame) – The input matrix to filter.
max_k (int, optional) – The maximum distance from the main diagonal to keep. If None, all elements are kept (no filtering by distance). Defaults to None.
remove_main_diag (bool, optional) – If True, elements on the main diagonal (k=0) are removed. Defaults to False.
row_ids_colname (str, optional) – Column name for row IDs (for DataFrame input).
col_ids_colname (str, optional) – Column name for column IDs (for DataFrame input).

Returns:

A new matrix of the same type as the input, containing only the elements within the specified band.

Return type:

np.ndarray, sp.coo_matrix, or pd.DataFrame
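For a dense array, the band rule (keep entries with |row - col| <= max_k) can be sketched as follows. The helper band_filter_dense is hypothetical and only illustrates the filtering criterion; it is not the package's code.

```python
import numpy as np

def band_filter_dense(matrix, max_k=None, remove_main_diag=False):
    """Hypothetical sketch: zero out entries outside the requested band."""
    rows, cols = np.indices(matrix.shape)
    offset = np.abs(rows - cols)  # distance of each cell from the main diagonal

    keep = np.ones(matrix.shape, dtype=bool)
    if max_k is not None:
        keep &= offset <= max_k
    if remove_main_diag:
        keep &= offset != 0

    # Return a new array; the input is left untouched.
    return np.where(keep, matrix, 0)

m = np.arange(1, 17).reshape(4, 4)
banded = band_filter_dense(m, max_k=1)
```

For sparse or DataFrame inputs the same criterion would presumably be applied to the stored row and column indices rather than to a full index grid.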

gunz_cm.preprocs.create_triu_matrix(cm_mat: ndarray, min_k: int | None = None, max_k: int | None = None, remove_main_diag: bool = False, **kwargs) → ndarray

gunz_cm.preprocs.create_triu_matrix(cm_coo: coo_matrix, min_k: int | None = None, max_k: int | None = None, remove_main_diag: bool = False, **kwargs) → coo_matrix

gunz_cm.preprocs.create_triu_matrix(cm_df: DataFrame, min_k: int | None = None, max_k: int | None = None, remove_main_diag: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', **kwargs) → DataFrame

Create a triangular matrix.

Notes

This function creates a triangular matrix based on the input data. The min_k and max_k parameters control the minimum and maximum distance from the main diagonal. If remove_main_diag is True, the main diagonal elements are removed.

Parameters:

data (t.Union[np.ndarray, sp.coo_matrix, pd.DataFrame]) – The input data to be converted to a triangular matrix.
min_k (t.Optional[int], optional) – The minimum distance from the main diagonal (default is None).
max_k (t.Optional[int], optional) – The maximum distance from the main diagonal (default is None).
remove_main_diag (bool, optional) – Whether to remove the main diagonal elements (default is False).

Returns:

The triangular matrix.

Return type:

t.Union[np.ndarray, tuple, sp.coo_matrix, pd.DataFrame]

gunz_cm.preprocs.downsample_points(points: ndarray, ds_ratio: int, def_coor: float = nan) → ndarray[source]#

Downsample the given points by a specified ratio.

Notes

The function ensures that the ds_ratio is greater than 1.
Points with all NaN values are ignored during downsampling.
The resulting array is filled with def_coor for indices without valid points.

Parameters:

points (np.ndarray) – The array of points to be downsampled.
ds_ratio (int) – The downsampling ratio. Must be greater than 1.
def_coor (float, optional) – The default coordinate value for indices without valid points, by default np.nan.

Returns:

The downsampled points array.

Return type:

np.ndarray
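One way to picture the behaviour described in the notes is the sketch below. It assumes that consecutive points are grouped into bins of ds_ratio and averaged, which the documentation above does not state explicitly; the helper name downsample_by_mean is hypothetical.

```python
import numpy as np

def downsample_by_mean(points, ds_ratio, def_coor=np.nan):
    """Hypothetical sketch: average valid points within each coarse bin."""
    if ds_ratio <= 1:
        raise ValueError("ds_ratio must be greater than 1")

    n_out = -(-len(points) // ds_ratio)  # ceiling division
    out = np.full((n_out, points.shape[1]), def_coor, dtype=float)

    # Points whose coordinates are all NaN are ignored.
    valid = ~np.isnan(points).all(axis=1)
    coarse_ids = np.arange(len(points)) // ds_ratio

    # Bins without any valid point keep the default coordinate.
    for b in np.unique(coarse_ids[valid]):
        out[b] = points[valid & (coarse_ids == b)].mean(axis=0)
    return out

pts = np.array([[0.0, 0.0, 0.0],
                [2.0, 2.0, 2.0],
                [np.nan, np.nan, np.nan],
                [np.nan, np.nan, np.nan],
                [4.0, 4.0, 4.0]])
coarse = downsample_by_mean(pts, ds_ratio=2)
```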

gunz_cm.preprocs.expand_with_nans(points_filtered: ndarray, mask: ndarray, full_length: int | None = None) → ndarray

Expand a filtered point cloud back to genomic length, inserting NaNs where the mask is False.
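The expansion can be sketched as below. This is an illustration only: the helper name is hypothetical, and treating full_length as the output length when given (otherwise the mask length) is an assumption.

```python
import numpy as np

def expand_with_nans_sketch(points_filtered, mask, full_length=None):
    """Hypothetical sketch: scatter filtered points back to their original rows."""
    mask = np.asarray(mask, dtype=bool)
    n = len(mask) if full_length is None else full_length  # assumed semantics

    # Start from an all-NaN array and fill only the rows where mask is True.
    out = np.full((n, points_filtered.shape[1]), np.nan)
    out[np.flatnonzero(mask)] = points_filtered
    return out

filtered = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
expanded = expand_with_nans_sketch(filtered, [True, False, True])
```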

gunz_cm.preprocs.filter_by_raw_counts(matrix: DataFrame, min_val: int | None, max_val: int | None, raw_counts_colname: str) → DataFrame

gunz_cm.preprocs.filter_by_raw_counts(matrix: coo_matrix, min_val: int | None, max_val: int | None, raw_counts_colname: str) → coo_matrix

gunz_cm.preprocs.filter_by_raw_counts(matrix: csr_matrix, min_val: int | None, max_val: int | None, raw_counts_colname: str) → csr_matrix

gunz_cm.preprocs.filter_by_raw_counts(matrix: ndarray, min_val: int | None, max_val: int | None, raw_counts_colname: str) → ndarray

Filter entries of a matrix based on raw interaction counts.

This function uses Pydantic to validate inputs and single dispatch to route to the correct implementation based on the input data type.

Parameters:

matrix (pd.DataFrame, sp.coo_matrix, sp.csr_matrix, or np.ndarray) – The input data. For NumPy arrays, this filters by setting values outside the range to 0. For sparse matrices and DataFrames, it removes the entries.
min_val (int, optional) – The minimum raw count value to include (inclusive). Defaults to None.
max_val (int, optional) – The maximum raw count value to include (inclusive). Defaults to None.
raw_counts_colname (str, optional) – The name of the column containing raw counts. This is only used if the input is a pandas DataFrame. Defaults to DataFrameSpecs.RAW_COUNTS.

Returns:

A new data object of the same type as the input, containing only the filtered entries.

Return type:

pd.DataFrame, sp.coo_matrix, sp.csr_matrix, or np.ndarray

Raises:

pydantic.ValidationError – If any argument’s type is incorrect.
ValueError – If min_val > max_val, or if raw_counts_colname is not found.
TypeError – If the target column in a DataFrame is not numeric.
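For the NumPy case, where values outside the inclusive range are set to 0, a minimal sketch looks like this. The helper filter_dense_by_range is hypothetical and omits the Pydantic validation and dispatch described above.

```python
import numpy as np

def filter_dense_by_range(matrix, min_val=None, max_val=None):
    """Hypothetical sketch: zero out values outside [min_val, max_val]."""
    if min_val is not None and max_val is not None and min_val > max_val:
        raise ValueError("min_val must not exceed max_val")

    keep = np.ones(matrix.shape, dtype=bool)
    if min_val is not None:
        keep &= matrix >= min_val  # inclusive lower bound
    if max_val is not None:
        keep &= matrix <= max_val  # inclusive upper bound

    # A new array is returned; the input is not modified.
    return np.where(keep, matrix, 0)

counts = np.array([[1, 5], [10, 3]])
filtered = filter_dense_by_range(counts, min_val=3, max_val=5)
```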

Filter a DataFrame based on weight quantiles.

Notes

This function calculates weights based on the ratio of normalized counts to raw counts. It then applies log transformation if specified and filters the DataFrame based on the weight quantiles.

Parameters:

cm_df (pd.DataFrame) – The input DataFrame containing count data.
q1 (float, optional) – The lower quantile value (default is 0).
q3 (float, optional) – The upper quantile value (default is 1.0).
log (bool, optional) – Whether to apply log transformation to the weights. Default is True.
val_colname (str, optional) – The column name for normalized counts. Default is cm_consts.COUNTS_COLNAME.
orig_val_colname (str, optional) – The column name for raw counts. Default is cm_consts.RAW_COUNTS_COLNAME.

Returns:

A new DataFrame filtered based on the weight quantiles.

Return type:

pd.DataFrame

gunz_cm.preprocs.filter_common_empty_rowcols(data1, data2, op: str = 'union', is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False)[source]#

gunz_cm.preprocs.filter_common_empty_rowcols(cm_mat1: ndarray, cm_mat2: ndarray, is_triu_sym: bool = True, axis: int = None, ret_mapping: bool = False, **kwargs) → ndarray

gunz_cm.preprocs.filter_common_empty_rowcols(data1: tuple[numpy.ndarray, numpy.ndarray], data2: tuple[numpy.ndarray, numpy.ndarray], op: str = 'union', is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, **kwargs) → tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray | None, numpy.ndarray | None]

gunz_cm.preprocs.filter_common_empty_rowcols(cm_coo1: coo_matrix, cm_coo2: coo_matrix, op: str = 'union', is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, **kwargs) → tuple[scipy.sparse._matrix.spmatrix, tuple[numpy.ndarray, ...] | None]

gunz_cm.preprocs.filter_common_empty_rowcols(cm_df1: DataFrame, cm_df2: DataFrame, op: str = 'union', is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', **kwargs) → tuple[pandas.core.frame.DataFrame, ...] | pandas.core.frame.DataFrame

Filter out unalignable regions from the input data.

Parameters:

is_triu_sym (bool, optional) – If the input is symmetric but only the upper triangle part of the matrix is given. Defaults to True.
axis (int, optional) – The axis to filter on. Defaults to None.
ret_mapping (bool, optional) – Whether to return the mapping of the original ids to the new ids. Defaults to False.

Returns:

filtered_data – The filtered data.

Return type:

pandas.DataFrame or scipy.sparse matrix

gunz_cm.preprocs.filter_empty_rowcols(data: numpy.ndarray | tuple | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids') → numpy.ndarray | tuple | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame[source]#

gunz_cm.preprocs.filter_empty_rowcols(cm_mat: ndarray, is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, **kwargs) → ndarray

gunz_cm.preprocs.filter_empty_rowcols(data: tuple[numpy.ndarray, numpy.ndarray], is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, **kwargs) → tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray | None, numpy.ndarray | None]

gunz_cm.preprocs.filter_empty_rowcols(cm_coo: coo_matrix, is_triu_sym: bool = True, axis: int | None = None, ret_mapping: bool = False, ret_unique_ids: bool = False, **kwargs) → tuple[scipy.sparse._coo.coo_matrix, tuple[numpy.ndarray, ...] | None]

gunz_cm.preprocs.filter_empty_rowcols(df: DataFrame, is_triu_sym: bool = True, axis: int = None, ret_mapping: bool = False, ret_unique_ids: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', **kwargs) → pandas.core.frame.DataFrame | tuple[pandas.core.frame.DataFrame, ...]

Filter out rows or columns whose entries are all zero (unalignable regions) and project the row and/or column IDs.

Notes

This function filters out empty rows and columns from the input data.

Parameters:

data (np.ndarray or tuple or scipy.sparse.coo_matrix or pd.DataFrame) – The input data.
is_triu_sym (bool, optional) – If the input is symmetric but only the upper triangle part of the matrix is given. Defaults to True.
axis (int, optional) – The axis to filter on. Defaults to None.
ret_mapping (bool, optional) – Whether to return the mapping of the original ids to the new ids. Defaults to False.
ret_unique_ids (bool, optional) – Whether to return unique ids. Defaults to False.

Returns:

filtered_data – The filtered data.

Return type:

np.ndarray or tuple or scipy.sparse.coo_matrix or pd.DataFrame
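For a dense square matrix, the idea can be sketched as follows. The helper drop_empty_rowcols is hypothetical; it treats a bin as empty only when both its row and its column are all zero, which is one plausible reading of the is_triu_sym case, and it returns the kept original indices as a simple stand-in for the ID mapping.

```python
import numpy as np

def drop_empty_rowcols(sym_mat):
    """Hypothetical sketch: remove bins whose row and column are all zero."""
    nonzero = sym_mat != 0
    # With upper-triangular storage, a bin's contacts may sit in its row
    # or in its column, so both are checked.
    nonempty = nonzero.any(axis=0) | nonzero.any(axis=1)
    kept_ids = np.flatnonzero(nonempty)

    # Select the same indices on both axes to keep the matrix square.
    return sym_mat[np.ix_(kept_ids, kept_ids)], kept_ids

mat = np.array([[1, 0, 2],
                [0, 0, 0],
                [2, 0, 3]])
compact, kept = drop_empty_rowcols(mat)
```

Here kept[i] gives the original index of row/column i in the compacted matrix.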

gunz_cm.preprocs.filter_points(points: ndarray, ret_mask: bool = False) → numpy.ndarray | tuple[numpy.ndarray, numpy.ndarray][source]#

Filter out points with any NaN values.

Notes

If ret_mask is True, the function returns both the filtered points and the mask used for filtering.
If ret_mask is False, only the filtered points are returned.

Parameters:

points (np.ndarray) – The array of points to be filtered.
ret_mask (bool, optional) – Whether to return the mask used for filtering, by default False.

Returns:

The filtered points, and optionally the mask used for filtering.

Return type:

np.ndarray or Tuple[np.ndarray, np.ndarray]
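The filtering rule (drop any point with at least one NaN coordinate) can be sketched in a few lines. The helper name drop_nan_points is hypothetical.

```python
import numpy as np

def drop_nan_points(points, ret_mask=False):
    """Hypothetical sketch: keep only rows without any NaN coordinate."""
    mask = ~np.isnan(points).any(axis=1)  # True for fully valid points
    filtered = points[mask]
    return (filtered, mask) if ret_mask else filtered

pts = np.array([[0.0, 1.0, 2.0],
                [np.nan, 1.0, 2.0],
                [3.0, 4.0, 5.0]])
kept, mask = drop_nan_points(pts, ret_mask=True)
```

Returning the mask makes it possible to restore the original length later, for example with a NaN-expansion step.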

gunz_cm.preprocs.filter_valid_points(points: ndarray, cm_df: DataFrame, ds_ratio: int = 1) → ndarray[source]#

Filter valid points based on the provided DataFrame and downsampling ratio.

This function is used when the point coordinates also cover the empty regions.

Notes

The function ensures that the ds_ratio is a positive integer.
It extracts unique row and column IDs from the DataFrame and filters the points accordingly.
If ds_ratio is greater than 1, it performs downsampling by averaging points within the same coarse-grained bin ID.

Parameters:

points (np.ndarray) – The array of points to be filtered.
cm_df (pd.DataFrame) – The DataFrame containing row and column IDs.
ds_ratio (int, optional) – The downsampling ratio, by default 1. Must be a positive integer.

Returns:

The filtered and optionally downsampled points.

Return type:

np.ndarray

gunz_cm.preprocs.get_genomic_mask(bin_size_bp: int, region: str, hic_path: str | os.PathLike, balancing: str = 'KR', root: str | os.PathLike | None = None) → ndarray[source]#

Identify valid (aligned) bins from Hi-C data by inspecting non-zero contacts.

Parameters:

bin_size_bp (int) – The bin size (in bp).
region (str) – Chromosome/region identifier.
hic_path (str | os.PathLike) – Path to the .hic file.
balancing (str) – Normalization scheme (e.g., 'KR').
root (str | t.Optional[os.PathLike]) – Project root directory.

Returns:

Boolean mask of valid bins.

Return type:

np.ndarray

gunz_cm.preprocs.get_optimization_mask(points: ndarray, threshold: float = 1e-05) → ndarray

Identify points that have moved from the origin (stagnant noise filter).
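A possible reading of this mask is sketched below. It assumes "moved from the origin" means that a point's Euclidean norm exceeds threshold; the actual criterion is not documented here, and the helper name is hypothetical.

```python
import numpy as np

def moved_from_origin_mask(points, threshold=1e-05):
    """Hypothetical sketch: True where a point lies farther than threshold from the origin."""
    # Assumption: distance is measured with the Euclidean norm per point.
    return np.linalg.norm(points, axis=1) > threshold

pts = np.array([[0.0, 0.0, 0.0],
                [1e-7, 0.0, 0.0],
                [0.1, 0.2, 0.3]])
moved = moved_from_origin_mask(pts)
```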

gunz_cm.preprocs.get_unified_mask(points: ndarray, bin_size_bp: int, region: str, hic_path: str | os.PathLike, balancing: str = 'KR', root: str | os.PathLike | None = None) → ndarray

Combine Genomic (Hi-C) and Optimization (Movement) masks.

gunz_cm.preprocs.infer_mat_shape(matrix: tuple[numpy.ndarray, numpy.ndarray] | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, is_triu_sym: bool = True, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids') → tuple[int, int][source]#

gunz_cm.preprocs.infer_mat_shape(matrix: tuple[numpy.ndarray, numpy.ndarray], is_triu_sym: bool, **kwargs) → tuple[int, int]

gunz_cm.preprocs.infer_mat_shape(matrix: coo_matrix, is_triu_sym: bool, **kwargs) → tuple[int, int]

gunz_cm.preprocs.infer_mat_shape(matrix: DataFrame, is_triu_sym: bool, row_ids_colname: str, col_ids_colname: str) → tuple[int, int]

Infer the shape of a matrix from different data types.

This function uses Pydantic to validate inputs and single dispatch to route to the correct implementation based on the input data type.

Parameters:

matrix (tuple, sp.coo_matrix, or pd.DataFrame) – Input data, which can be:

- A tuple of (row_indices, column_indices) NumPy arrays.
- A SciPy COO sparse matrix.
- A pandas DataFrame with coordinate columns.
is_triu_sym (bool, optional) – If True, the shape is inferred as a square matrix (N x N) based on the maximum index found. Defaults to True.
row_ids_colname (str, optional) – Name of the column for row indices. Only used for DataFrames. Defaults to DataFrameSpecs.ROW_IDS.
col_ids_colname (str, optional) – Name of the column for column indices. Only used for DataFrames. Defaults to DataFrameSpecs.COL_IDS.

Returns:

The inferred (rows, columns) shape of the matrix.

Return type:

t.Tuple[int, int]

Raises:

pydantic.ValidationError – If any argument’s type is incorrect (e.g., a tuple with != 2 elements).
ValueError – If required columns are missing from a DataFrame.
TypeError – If the input data type is not supported or if index arrays are not of an integer dtype.
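For the tuple input, the inference can be sketched as below. The helper is hypothetical and assumes 0-based indices; the non-symmetric branch (max row + 1, max column + 1) is also an assumption, since only the square case is described above.

```python
import numpy as np

def infer_shape_from_ids(row_ids, col_ids, is_triu_sym=True):
    """Hypothetical sketch: infer (rows, cols) from 0-based index arrays."""
    row_ids = np.asarray(row_ids)
    col_ids = np.asarray(col_ids)
    if not (np.issubdtype(row_ids.dtype, np.integer)
            and np.issubdtype(col_ids.dtype, np.integer)):
        raise TypeError("index arrays must have an integer dtype")

    n_rows = int(row_ids.max()) + 1
    n_cols = int(col_ids.max()) + 1
    if is_triu_sym:
        # Square N x N matrix based on the largest index on either axis.
        n = max(n_rows, n_cols)
        return n, n
    return n_rows, n_cols  # assumed behaviour for the non-symmetric case

square = infer_shape_from_ids([0, 1], [2, 4])
rect = infer_shape_from_ids([0, 1], [2, 4], is_triu_sym=False)
```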

gunz_cm.preprocs.intersect_masks(masks: list[numpy.ndarray]) → ndarray

Compute bitwise-AND across multiple masks.
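An element-wise AND over several boolean masks can be sketched with NumPy's ufunc reduction. The helper name is hypothetical, and equal-length masks are assumed.

```python
import numpy as np

def intersect_masks_sketch(masks):
    """Hypothetical sketch: element-wise AND over a list of boolean masks."""
    # Stack the masks and reduce along the first axis.
    return np.logical_and.reduce([np.asarray(m, dtype=bool) for m in masks])

combined = intersect_masks_sketch([
    np.array([True, True, False]),
    np.array([True, False, False]),
    np.array([True, True, True]),
])
```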

gunz_cm.preprocs.log_scale_matrix(matrix: numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, exclude_diagonal: bool = False, inplace: bool = False) → numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix[source]#

gunz_cm.preprocs.log_scale_matrix(matrix: ndarray, exclude_diagonal: bool, inplace: bool, **kwargs) → ndarray

gunz_cm.preprocs.log_scale_matrix(matrix: scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, exclude_diagonal: bool, inplace: bool, **kwargs) → scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix

Optimized log(1+v) scaling with in-place operation support.

Notes

This function applies a log(1+v) transformation to the input matrix. It supports both dense and sparse matrices. If exclude_diagonal is True, the diagonal elements are set to zero for dense matrices or removed for sparse matrices. The inplace parameter allows modifying the matrix in-place to save memory.

Parameters:

matrix (Union[np.ndarray, coo_matrix, csr_matrix]) – Input matrix for log scaling.
exclude_diagonal (bool, optional) – Zero diagonal (dense) or remove entries (sparse), default False.
inplace (bool, optional) – Modify matrix in-place instead of creating new, default False.

Returns:

Log-scaled matrix (original if inplace=True).

Return type:

Union[np.ndarray, coo_matrix, csr_matrix]
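For a dense array, the log(1+v) transform with optional diagonal zeroing can be sketched as follows. The helper log1p_dense is hypothetical; in this sketch, inplace=True requires a floating-point input because the result is written back into the same buffer.

```python
import numpy as np

def log1p_dense(matrix, exclude_diagonal=False, inplace=False):
    """Hypothetical sketch: apply log(1 + v) to a dense matrix."""
    # Copy to float unless the caller asks for in-place modification.
    out = matrix if inplace else matrix.astype(float, copy=True)
    np.log1p(out, out=out)
    if exclude_diagonal:
        np.fill_diagonal(out, 0.0)  # dense case: diagonal set to zero
    return out

m = np.array([[1.0, np.e - 1.0],
              [3.0, 2.0]])
scaled = log1p_dense(m)
no_diag = log1p_dense(m, exclude_diagonal=True)
```

np.log1p is used instead of np.log(1 + v) because it is more accurate for small values.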

gunz_cm.preprocs.mask_points(points: ndarray, cm_df: DataFrame, ds_ratio: int = 1) → ndarray[source]#

Mask points based on the provided DataFrame and downsampling ratio.

Notes

This function processes the input points and masks them based on the unique row and column IDs from the DataFrame. If the downsampling ratio is greater than 1, it further processes the points to downsample them.

Parameters:

points (np.ndarray) – The array of points to be masked.
cm_df (pd.DataFrame) – The DataFrame containing row and column IDs.
ds_ratio (int, optional) – The downsampling ratio, by default 1. Must be an integer greater than or equal to 1.

Returns:

The masked points array.

Return type:

np.ndarray

gunz_cm.preprocs.mirror_upper_to_lower_triangle(mat: Any, remove_diag: bool = False, double_diag: bool = False) → Any[source]#

gunz_cm.preprocs.mirror_upper_to_lower_triangle(cm_coo: coo_matrix, remove_diag: bool = False, double_diag: bool = False) → coo_matrix

gunz_cm.preprocs.mirror_upper_to_lower_triangle(cm_df: DataFrame, remove_diag: bool = False, double_diag: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', vals_colname: str = 'counts') → DataFrame

Mirror the upper triangle part to the lower triangle part of a matrix.

Parameters:

mat (pandas.DataFrame or scipy.sparse.spmatrix) – Input matrix. Supported types are pandas DataFrame and any scipy sparse matrix.
remove_diag (bool, optional) – Whether to remove the main diagonal. Defaults to False.
double_diag (bool, optional) – Whether to double the diagonal entries. Defaults to False. This is useful for preserving behavior of certain legacy implementations. Ignored if remove_diag is True.

Returns:

Resulting matrix with the upper triangle mirrored to the lower triangle, in the same format as input.

Return type:

Any

Raises:

PreprocError – If the input type is not supported.
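The arithmetic behind the mirroring is easiest to see on a dense array, even though the function itself accepts DataFrames and sparse matrices. The sketch below is illustrative only; the helper mirror_upper_dense is hypothetical.

```python
import numpy as np

def mirror_upper_dense(mat, remove_diag=False, double_diag=False):
    """Hypothetical sketch: copy the upper triangle onto the lower triangle."""
    upper = np.triu(mat)               # upper triangle including the diagonal
    strict_upper = np.triu(mat, k=1)   # upper triangle without the diagonal
    full = upper + strict_upper.T      # diagonal is counted once

    if remove_diag:
        np.fill_diagonal(full, 0)
    elif double_diag:
        # Same result as the naive upper + upper.T, which counts the diagonal twice.
        full[np.diag_indices_from(full)] *= 2
    return full

tri = np.array([[1, 2],
                [0, 3]])
sym = mirror_upper_dense(tri)
sym_doubled = mirror_upper_dense(tri, double_diag=True)
```

As in the parameter description, double_diag has no effect in this sketch when remove_diag is True.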

gunz_cm.preprocs.mirror_upper_to_lower_triangle_coo(cm_coo: coo_matrix, remove_diag: bool = False, double_diag: bool = False) → coo_matrix[source]#

Implement mirror_upper_to_lower_triangle for COO matrices.

Parameters:

cm_coo (scipy.sparse.coo_matrix) – The input sparse matrix.
remove_diag (bool, optional) – Whether to remove the main diagonal. Defaults to False.
double_diag (bool, optional) – Whether to double the diagonal entries. Defaults to False.

Returns:

The resulting symmetric sparse matrix.

Return type:

scipy.sparse.coo_matrix

gunz_cm.preprocs.mirror_upper_to_lower_triangle_df(cm_df: DataFrame, remove_diag: bool = False, double_diag: bool = False, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', vals_colname: str = 'counts') → DataFrame[source]#

Implement mirror_upper_to_lower_triangle for pandas DataFrames.

Parameters:

cm_df (pandas.DataFrame) – The input DataFrame representing the matrix.
remove_diag (bool, optional) – Whether to remove the main diagonal. Defaults to False.
double_diag (bool, optional) – Whether to double the diagonal entries. Defaults to False.
row_ids_colname (str, optional) – Column name for row IDs. Defaults to 'row_ids'.
col_ids_colname (str, optional) – Column name for column IDs. Defaults to 'col_ids'.
vals_colname (str, optional) – Column name for contact counts. Defaults to 'counts'.

Returns:

The resulting symmetric DataFrame.

Return type:

pandas.DataFrame

gunz_cm.preprocs.rand_downsample(data: numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame, ratio: float, val_colname: str = 'counts') → numpy.ndarray | scipy.sparse._coo.coo_matrix | pandas.core.frame.DataFrame[source]#

gunz_cm.preprocs.rand_downsample(cm_mat: ndarray, ratio: int, **kwargs) → ndarray

gunz_cm.preprocs.rand_downsample(cm_coo: coo_matrix, ratio: int, **kwargs) → coo_matrix

gunz_cm.preprocs.rand_downsample(cm_df: DataFrame, ratio: int, val_colname: str = 'counts', **kwargs) → DataFrame

Randomly downsample a matrix or dataframe by a specified ratio.

Notes

This function dispatches to different downsampling functions based on the input data type.

Parameters:

data (Union[np.ndarray, sp.coo_matrix, pd.DataFrame]) – Input data to downsample.
ratio (int) – Downsample ratio.
val_colname (str, optional) – Column name for values in dataframe (default is cm_consts.COUNTS_COLNAME).

Returns:

Downsampled data.

Return type:

Union[np.ndarray, sp.coo_matrix, pd.DataFrame]

gunz_cm.preprocs.scale_matrix(matrix: numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, scaling_method: str = 'minmax', min_val: float = 0, max_val: float = 1, exclude_diagonal: bool = False, inplace: bool = False) → numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix[source]#

gunz_cm.preprocs.scale_matrix(matrix: ndarray, scaling_method: str, min_val: float, max_val: float, exclude_diagonal: bool = False, inplace: bool = False, **kwargs) → ndarray

gunz_cm.preprocs.scale_matrix(matrix: scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, scaling_method: str, min_val: float, max_val: float, exclude_diagonal: bool = False, inplace: bool = False, **kwargs) → scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix

Scale a matrix using the specified method.

Notes

This function supports both dense and sparse matrices and can scale using either min-max scaling or normalization. It can also exclude diagonal elements from scaling and perform operations in-place if specified.

Parameters:

matrix (Union[np.ndarray, coo_matrix, csr_matrix]) – The matrix to scale.
scaling_method (str, optional) – The scaling method to use ('minmax' or 'normal'; default is 'minmax').
min_val (float, optional) – The minimum value for min-max scaling (default is 0).
max_val (float, optional) – The maximum value for min-max scaling (default is 1).
exclude_diagonal (bool, optional) – Whether to exclude diagonal elements from scaling (default is False).
inplace (bool, optional) – Whether to perform the scaling in-place (default is False).

Returns:

The scaled matrix.

Return type:

Union[np.ndarray, coo_matrix, csr_matrix]
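The 'minmax' option corresponds to the usual linear rescaling into [min_val, max_val]. A minimal dense sketch, with a hypothetical helper name and without the exclude_diagonal and inplace handling, could look like this:

```python
import numpy as np

def minmax_scale_dense(matrix, min_val=0.0, max_val=1.0):
    """Hypothetical sketch: linearly rescale values into [min_val, max_val]."""
    matrix = np.asarray(matrix, dtype=float)
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        # Constant input: avoid division by zero (a choice made for this sketch).
        return np.full_like(matrix, min_val)
    return (matrix - lo) / (hi - lo) * (max_val - min_val) + min_val

raw = np.array([[0.0, 5.0],
                [10.0, 5.0]])
unit = minmax_scale_dense(raw)
wide = minmax_scale_dense(raw, min_val=-1.0, max_val=1.0)
```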

gunz_cm.preprocs.to_coo_matrix(matrix: pandas.core.frame.DataFrame | tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], is_triu_sym: bool = True, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', vals_colname: str = 'counts', shape: tuple[int, int] | None = None) → coo_matrix[source]#

gunz_cm.preprocs.to_coo_matrix(matrix: DataFrame, is_triu_sym: bool, row_ids_colname: str, col_ids_colname: str, vals_colname: str, shape: tuple[int, int] | None = None) → coo_matrix

gunz_cm.preprocs.to_coo_matrix(matrix: tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], is_triu_sym: bool, shape: tuple[int, int] | None = None, **kwargs) → coo_matrix

Convert various data types to a SciPy COO sparse matrix.

Parameters:

matrix (pd.DataFrame or tuple) – Input data, which can be:

- A pandas DataFrame with coordinate and value columns.
- A tuple of (rows, columns, values) NumPy arrays.
is_triu_sym (bool, optional) – If True, assumes the matrix is symmetric and stored in upper-triangular format, used for inferring the full matrix shape. Defaults to True.
row_ids_colname (str, optional) – Column name for row IDs (for DataFrame input).
col_ids_colname (str, optional) – Column name for column IDs (for DataFrame input).
vals_colname (str, optional) – Column name for values (for DataFrame input).
shape (tuple of ints, optional) – The shape of the matrix. If None, it is inferred from the data.

Returns:

The COO format sparse matrix representation of the data.

Return type:

sp.coo_matrix

gunz_cm.preprocs.to_dataframe(matrix: coo_matrix, row_ids_colname: str = 'row_ids', col_ids_colname: str = 'col_ids', vals_colname: str = 'counts') → DataFrame[source]#

gunz_cm.preprocs.to_dataframe(matrix: coo_matrix, row_ids_colname: str, col_ids_colname: str, vals_colname: str) → DataFrame

Convert a sparse matrix to a pandas DataFrame.

Parameters:

matrix (sp.coo_matrix) – Input COO format sparse matrix.
row_ids_colname (str, optional) – The desired column name for row IDs in the output DataFrame.
col_ids_colname (str, optional) – The desired column name for column IDs in the output DataFrame.
vals_colname (str, optional) – The desired column name for values in the output DataFrame.

Returns:

A DataFrame with columns for row IDs, column IDs, and values.

Return type:

pd.DataFrame

gunz_cm.preprocs.transform_to_gaussian(matrix: numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, mu: float = 0, sigma: float = 1, inplace: bool = False) → numpy.ndarray | scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix[source]#

gunz_cm.preprocs.transform_to_gaussian(matrix: ndarray, mu: float, sigma: float, inplace: bool = False, **kwargs) → ndarray

gunz_cm.preprocs.transform_to_gaussian(matrix: scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix, mu: float, sigma: float, inplace: bool = False, **kwargs) → scipy.sparse._coo.coo_matrix | scipy.sparse._csr.csr_matrix

Transform a matrix to a Gaussian distribution.

Notes

This function transforms the matrix to have a Gaussian distribution with the specified mean and standard deviation. It supports both dense and sparse matrices and can perform operations in-place if specified.

Parameters:

matrix (Union[np.ndarray, coo_matrix, csr_matrix]) – The matrix to transform.
mu (float, optional) – The mean of the Gaussian distribution (default is 0).
sigma (float, optional) – The standard deviation of the Gaussian distribution (default is 1).
inplace (bool, optional) – Whether to perform the transformation in-place (default is False).

Returns:

The transformed matrix.

Return type:

Union[np.ndarray, coo_matrix, csr_matrix]

gunz_cm.preprocs.uniform_resample_mat(cm_mat: ndarray, target_rate: float) → ndarray[source]#

Uniformly resample a matrix by a specified target rate.

Notes

This function simply multiplies the input matrix by the target rate.

Parameters:

cm_mat (np.ndarray) – Input contact matrix to resample.
target_rate (float) – Target rate for resampling.

Returns:

Resampled matrix.

Return type:

np.ndarray