gunz_cm.loaders package

Submodules

gunz_cm.loaders.cool_loader module

gunz_cm.loaders.cool_loader.get_assembly(fpath: str | Path, resolution: int) → str

Retrieve the genome assembly information from a cooler file.

Parameters:

fpath (t.Union[str, pathlib.Path]) – File path of the cooler file.
resolution (int) – Resolution level to query from the file.

Returns:

The genome assembly name (e.g., “hg38”).

Return type:

str

gunz_cm.loaders.cool_loader.get_balancing_weights(fpath: str | Path, resolution: int, region: str, weight_name: str = 'weight') → ndarray

Extracts a vector of balancing weights for a given region.

Parameters:

fpath (t.Union[str, pathlib.Path]) – File path of the cooler file (.cool or .mcool).
resolution (int) – The resolution to use.
region (str) – The chromosome or region to fetch balancing weights for (e.g., “chr1”).
weight_name (str, optional) – The name of the weight column in the bins table. Defaults to ‘weight’.

Returns:

A NumPy array of balancing weights for the specified region. Returns an empty array if the weight column does not exist.

Return type:

np.ndarray
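
In the cooler convention, a balanced value is the raw count multiplied by the weights of its two bins. The following is a minimal sketch of that step, using plain Python lists in place of the NumPy array this function returns; the `apply_weights` helper and the sample numbers are illustrative assumptions, not part of gunz_cm.

```python
import math

def apply_weights(pixels, weights):
    """Scale raw counts by the balancing weights of both bins.

    pixels  -- iterable of (bin1, bin2, count) tuples
    weights -- per-bin weights; NaN marks a bin that was masked during balancing
    """
    balanced = []
    for bin1, bin2, count in pixels:
        value = count * weights[bin1] * weights[bin2]
        balanced.append((bin1, bin2, value))
    return balanced

weights = [0.5, 2.0, math.nan]              # illustrative values only
pixels = [(0, 0, 8), (0, 1, 3), (1, 2, 5)]  # (bin1, bin2, raw count)
balanced = apply_weights(pixels, weights)
```

Any pixel that touches a masked bin comes out as NaN, so downstream code typically has to filter or ignore those values.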

gunz_cm.loaders.cool_loader.get_bins(fpath: str | Path, resolution: int) → DataFrame

Retrieve the binnified index from a cooler file.

Parameters:

fpath (t.Union[str, pathlib.Path]) – Path to the cooler file.
resolution (int) – The resolution to use.

Returns:

DataFrame with columns: ‘chrom’, ‘start’, ‘end’.

Return type:

pd.DataFrame

gunz_cm.loaders.cool_loader.get_chrom_infos(fpath: str | Path) → Dict[str, Dict[str, str | int]]

Extract chromosome information from a cooler file.

For .mcool files, the information is read from the lowest resolution (the largest bin size).

Parameters:

fpath (t.Union[str, pathlib.Path]) – The file path to the cooler file (.cool or .mcool).

Returns:

A dictionary mapping chromosome names to their metadata (name, size).

Return type:

t.Dict[str, t.Dict[str, t.Union[str, int]]]

Examples

>>> from gunz_cm.loaders.cool_loader import get_chrom_infos
>>> info = get_chrom_infos("sample.cool")
>>> print(info['chr1']['size'])
249250621

gunz_cm.loaders.cool_loader.get_resolutions(fpath: str | Path) → List[int]

Retrieve available resolutions from a .mcool file.

Parameters:

fpath (t.Union[str, pathlib.Path]) – The file path to the multi-resolution cooler file (.mcool).

Returns:

A list of integer resolutions available in the file.

Return type:

t.List[int]

Raises:

FileNotFoundError – If the specified fpath does not exist.
LoaderError – If no resolutions can be found in the file.

Examples

>>> from gunz_cm.loaders.cool_loader import get_resolutions
>>> res = get_resolutions("sample.mcool")
>>> print(res)
[1000, 5000, 10000]

gunz_cm.loaders.cool_loader.get_sparsity(fpath: str | Path, resolution: int, region1: str, region2: str | None = None) → float

Calculates the sparsity of a cooler matrix for given regions.
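
The exact definition of sparsity used here is not stated. A common one is the fraction of matrix cells that hold no contacts, and the sketch below assumes that definition. The `sparsity` helper is hypothetical and expects every stored cell to be listed once as a (row, column, count) tuple.

```python
def sparsity(nonzero_pixels, n_rows, n_cols):
    """Fraction of cells in an n_rows x n_cols matrix that are zero."""
    total = n_rows * n_cols
    if total == 0:
        raise ValueError("matrix has no cells")
    # Count only cells whose stored value is actually non-zero.
    nonzero = sum(1 for _, _, count in nonzero_pixels if count != 0)
    return 1.0 - nonzero / total

pixels = [(0, 0, 4), (0, 2, 1), (3, 3, 7)]
value = sparsity(pixels, n_rows=4, n_cols=4)
```

If only the upper triangle of a symmetric matrix is stored, the non-zero count would need to be mirrored before dividing.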

gunz_cm.loaders.cool_loader.load_cooler(fpath: str | Path, resolution: int, region1: str, region2: str | None = None, balancing: Balancing | List[Balancing] | None = Balancing.NONE, output_format: DataStructure = DataStructure.DF, return_raw_counts: bool = False, backend: Backend = Backend.COOLER, chunksize: int | None = None) → ContactMatrix

Load contact matrix data from a cooler file lazily.

Parameters:

fpath (t.Union[str, pathlib.Path]) – Path to the cooler file (.cool or .mcool).
resolution (int) – The resolution level to load from the file.
region1 (str) – First genomic region (e.g., “chr1” or “chr1:1,000,000-2,000,000”).
region2 (str, optional) – Second genomic region. Defaults to region1, which loads intra-chromosomal contacts.
balancing (Balancing | List[Balancing], optional) – Balancing method(s) to apply. Defaults to Balancing.NONE.
output_format (DataStructure, optional) – The desired output format (DataStructure.DF, DataStructure.RCV, or DataStructure.COO). Defaults to DataStructure.DF.
return_raw_counts (bool, optional) – Whether to return raw counts alongside balanced counts. Only supported for output_format=DataStructure.DF. Defaults to False.
backend (Backend, optional) – Select the underlying backend library for loading. Options: ‘cooler’ (default), ‘hictk’.
chunksize (t.Optional[int], optional) – If provided, the data will be loaded in chunks.

Returns:

A ContactMatrix object that can be used to load the data on demand.

Return type:

ContactMatrix
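
How ContactMatrix defers the read is not described here. As a rough illustration of the general idea of lazy loading, the sketch below wraps a loader callable and runs it only on first access. `LazyMatrix`, its `data` attribute, and `fake_loader` are hypothetical names, not the gunz_cm API.

```python
from functools import cached_property

class LazyMatrix:
    """Hypothetical stand-in for a lazily loaded contact matrix."""

    def __init__(self, loader):
        # Zero-argument callable that performs the real file I/O.
        self._loader = loader

    @cached_property
    def data(self):
        # Runs the loader on first access, then caches the result.
        return self._loader()

calls = []

def fake_loader():
    calls.append("read")
    return [(0, 0, 12), (0, 1, 3)]

cm = LazyMatrix(fake_loader)      # nothing has been read yet
calls_before = list(calls)
rows = cm.data                    # first access triggers the read
rows_again = cm.data              # served from the cache, no second read
```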

gunz_cm.loaders.csv_loader module

gunz_cm.loaders.csv_loader.load_csv(fpath: str | Path | BytesIO, region1: str, resolution: int, region2: str | None = None, balancing: Balancing | None = None, delimiter: str = '\\s+', encoding: str = 'utf-8', output_format: DataStructure = DataStructure.DF, column_names: List[str] | None = None) → ContactMatrix

Loads contact data from a CSV-like file path or buffer lazily.

Parameters:

fpath (t.Union[str, pathlib.Path] | io.BytesIO) – The file path or an in-memory byte stream to read from.
region1 (str) – The chromosome to load (e.g., “chr1”).
resolution (int) – The resolution (bin size) to apply to the coordinate columns.
region2 (t.Optional[str], optional) – The second region for inter-chromosomal data. Currently not supported.
balancing (t.Optional[Balancing], optional) – The balancing method reflected in the data.
delimiter (str, optional) – The delimiter to use for parsing.
encoding (str, optional) – The character encoding of the file.
output_format (DataStructure, optional) – The desired output format.
column_names (List[str], optional) – Explicit column names.
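
One way to picture how a resolution is applied to coordinate columns: split each line on the delimiter and floor-divide the positions by the bin size. This is a sketch under the assumption that the file holds two base-pair positions and a count per line; `bin_contacts` is a hypothetical helper, and the real loader may treat the columns differently.

```python
import io
import re

def bin_contacts(buffer, resolution, delimiter=r"\s+", encoding="utf-8"):
    """Yield (row_bin, col_bin, count) from a delimited byte stream."""
    splitter = re.compile(delimiter)
    for raw in buffer:
        line = raw.decode(encoding).strip()
        if not line:
            continue  # skip blank lines
        pos1, pos2, count = splitter.split(line)[:3]
        # Floor division maps a base-pair position to its bin index.
        yield int(pos1) // resolution, int(pos2) // resolution, float(count)

stream = io.BytesIO(b"0 10000 5\n10000\t25000 2.5\n")
contacts = list(bin_contacts(stream, resolution=10_000))
```

Because the delimiter is a regular expression, the same call handles both space- and tab-separated lines.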

gunz_cm.loaders.ginteractions_loader module

gunz_cm.loaders.ginteractions_loader.load_ginteractions(fpath: str | Path | BytesIO, resolution: int, region1: str, region2: str | None = None, encoding: str = 'utf-8', output_format: DataStructure = DataStructure.DF, **kwargs) → ContactMatrix

Loads and processes data from a GInteractions-like tabular file lazily.

Parameters:

fpath (t.Union[str, pathlib.Path] | io.BytesIO) – The file path or an in-memory byte stream to read from.
resolution (int) – The resolution for binning the genomic coordinates.
region1 (str) – The first chromosome to include in the output (e.g., ‘chr1’).
region2 (t.Optional[str], optional) – The second chromosome. If None, it defaults to region1 for intra-chromosomal interactions.
encoding (str, optional) – The file encoding to use when reading the file.
output_format (DataStructure, optional) – The desired output format.
**kwargs – Catches extra keyword arguments.

gunz_cm.loaders.hic_loader module

class gunz_cm.loaders.hic_loader.HiCFooter(cpair_info: Dict[str, int], expected_values: Dict[Tuple[str, str, int], ndarray], norm_factors: Dict[Tuple[str, str, int], ndarray], norm_info: Dict[Tuple[str, str, int, int], Dict[str, int]], available_norms: List[str])

Bases: object

Immutable container for Hi-C file footer information.

available_norms: List[str]

cpair_info: Dict[str, int]

expected_values: Dict[Tuple[str, str, int], ndarray]

norm_factors: Dict[Tuple[str, str, int], ndarray]

norm_info: Dict[Tuple[str, str, int, int], Dict[str, int]]

class gunz_cm.loaders.hic_loader.HiCHeader(version: int, master_index: int, genome: str, metadata: Dict[str, str], chromosomes: Dict[str, Dict[str, int]], resolutions: List[int])

Bases: object

Immutable container for Hi-C file header information.

chromosomes: Dict[str, Dict[str, int]]

genome: str

master_index: int

metadata: Dict[str, str]

resolutions: List[int]

version: int

class gunz_cm.loaders.hic_loader.HiCMetadata(header: HiCHeader, footer: HiCFooter)

Bases: object

A structured container for all Hi-C file metadata.

footer: HiCFooter

header: HiCHeader

gunz_cm.loaders.hic_loader.get_balancing_methods(fpath: str | Path) → list[str]

Retrieves the available balancing (normalization) methods.

gunz_cm.loaders.hic_loader.get_bins(fpath: str | Path, resolution: int) → DataFrame

Retrieve the binnified index from a Hi-C file.

gunz_cm.loaders.hic_loader.get_chrom_infos(fpath: str | Path, use_hicstraw: bool = False) → Dict[str, Dict[str, int]]

Reads chromosome information from a Hi-C file.

gunz_cm.loaders.hic_loader.get_resolutions(fpath: str | Path, use_hicstraw: bool = False) → list[int]

Retrieves the available resolutions from a Hi-C file.

gunz_cm.loaders.hic_loader.get_sparsity(fpath: str | Path, region1: str, resolution: int, region2: str | None = None) → float

Calculates the sparsity of a Hi-C contact matrix for a given region.

gunz_cm.loaders.hic_loader.load_hic(fpath: str | Path, region1: str, resolution: int, region2: str | None = None, balancing: Balancing | List[Balancing] | None = Balancing.NONE, output_format: DataStructure = DataStructure.DF, return_raw_counts: bool = False, backend: Backend = Backend.STRAW, chunksize: int | None = None) → ContactMatrix

Loads contact matrix data from a .hic file lazily.

gunz_cm.loaders.hic_loader.read_hic_metadata(fpath: str | Path, order: Literal['big', 'little'] = 'little') → HiCMetadata

Reads the complete header and footer metadata from a Hi-C file.
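
To show what the `order` argument controls, here is a minimal sketch that parses only the first few header fields from a byte stream. It assumes the published .hic layout: a NUL-terminated "HIC" magic string, a 4-byte version, an 8-byte master index position, and a NUL-terminated genome ID. The helper names are hypothetical and gunz_cm's own parser may work differently.

```python
import io

def read_cstring(stream):
    """Read bytes up to (not including) the next NUL byte."""
    chars = bytearray()
    while (byte := stream.read(1)) not in (b"", b"\x00"):
        chars += byte
    return chars.decode("utf-8")

def read_header_prefix(stream, order="little"):
    """Parse magic, version, master index position and genome ID."""
    magic = read_cstring(stream)
    if magic != "HIC":
        raise ValueError(f"not a .hic stream: magic={magic!r}")
    # Integer fields are decoded with the requested byte order.
    version = int.from_bytes(stream.read(4), order, signed=True)
    master_index = int.from_bytes(stream.read(8), order, signed=True)
    genome = read_cstring(stream)
    return version, master_index, genome

# Synthetic bytes for illustration; not taken from a real file.
raw = (
    b"HIC\x00"
    + (8).to_bytes(4, "little")
    + (123456).to_bytes(8, "little")
    + b"hg19\x00"
)
header = read_header_prefix(io.BytesIO(raw))
```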

gunz_cm.loaders.memmap_loader module

gunz_cm.loaders.memmap_loader.gen_memmap_fpaths(base_fpath: t.Union[str, pathlib.Path]) → tuple[pathlib.Path, pathlib.Path]

Generates paths for the binary data and JSON metadata files.

Parameters:

base_fpath (t.Union[str, pathlib.Path]) – The base path for the memmap, without an extension.

Returns:

A tuple containing the path to the binary (.npdat) file and the metadata (.json) file.

Return type:

tuple[pathlib.Path, pathlib.Path]

gunz_cm.loaders.memmap_loader.is_memmap_exists(base_fpath: t.Union[str, pathlib.Path]) → bool

Checks if both the binary and metadata files for a memmap exist.

Parameters:

base_fpath (t.Union[str, pathlib.Path]) – The base path for the memmap to check.

Returns:

True if both the .npdat and .json files exist, False otherwise.

Return type:

bool
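
The path pairing and the existence check can be pictured as follows. This is a sketch only: the `sketch_*` helpers are hypothetical, and appending the extensions to the base path (rather than replacing an existing suffix) is an assumption about how gunz_cm builds the two paths.

```python
from pathlib import Path
import tempfile

def sketch_memmap_fpaths(base_fpath):
    """Return the (.npdat, .json) paths derived from an extension-less base path."""
    base = Path(base_fpath)
    # Append rather than replace, so a base like "sample.10kb" keeps its dotted part.
    return Path(f"{base}.npdat"), Path(f"{base}.json")

def sketch_memmap_exists(base_fpath):
    """True only when both the data file and the metadata file are present."""
    data_fpath, meta_fpath = sketch_memmap_fpaths(base_fpath)
    return data_fpath.is_file() and meta_fpath.is_file()

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "chr1_10kb"
    before = sketch_memmap_exists(base)      # neither file exists yet
    for fpath in sketch_memmap_fpaths(base):
        fpath.touch()
    after = sketch_memmap_exists(base)       # both files now exist
```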

gunz_cm.loaders.memmap_loader.load_memmap(base_fpath: t.Union[str, pathlib.Path], mode: str = 'r') → ContactMatrix

Loads a NumPy array from a memory-mapped file lazily.

gunz_cm.loaders.narrowpeaks module

gunz_cm.loaders.pickle_loader module

Loads a pickle file containing a contact matrix object lazily.

gunz_cm.loaders.utils module

class gunz_cm.loaders.utils.ClosedInterval(start, end)

Bases: tuple

end: Alias for field number 1

start: Alias for field number 0

class gunz_cm.loaders.utils.Constant

Bases: object

Bind an instance to a name to get a unique sentinel object.
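
The sentinel pattern this describes can be sketched as below. `Sentinel`, `ALL`, and `describe` are hypothetical names chosen for illustration; they mirror the role that Constant and Region.ALL_LOCI appear to play, but are not the library's code.

```python
class Sentinel:
    """Hypothetical marker class; every instance is a distinct object."""

ALL = Sentinel()      # bound to a name so it can be compared by identity
OTHER = Sentinel()    # a second instance is a different object

def describe(region):
    # Identity check: no (start, end) tuple can be mistaken for ALL.
    if region is ALL:
        return "whole chromosome"
    start, end = region
    return f"{start}-{end}"
```

Comparing with `is` rather than `==` is the point of the pattern: the sentinel cannot collide with any legitimate value.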

class gunz_cm.loaders.utils.Region(chromosome: int | str, region: Tuple[int, int] | Constant)

Bases: object

Represent a range of loci and interface with the textual UCSC style in the form ‘chr22:1,000,000-1,500,000’

Use the static method Region.from_string to parse a UCSC string. Converting back to a string canonicalizes it.

chromosome (int or str) – 1-22 or string X/Y/M (use chromname() to get chrN string)
region (ClosedInterval or the constant Region.ALL_LOCI) – For ‘chr1:100-500’, region.start == 100 and region.end == 500

Parsing tries to be more lenient than the canonical form requires.

>>> str(Region.from_string('1:1,000-1,500')) == 'chr1:1000-1500'
True

>>> str(Region.from_string('chry')) == 'chrY'
True
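
The lenient parsing shown in the doctests above could be approximated with a regular expression. This is a sketch, not the implementation of Region.from_string: `canonical_region` is a hypothetical helper that accepts an optional "chr" prefix, any letter case, and thousands separators, and returns the canonical string.

```python
import re

_REGION_RE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]+|[XYM])(?::(?P<start>[\d,]+)-(?P<end>[\d,]+))?$",
    re.IGNORECASE,
)

def canonical_region(text):
    """Return the canonical 'chrN[:start-end]' form of a lenient region string."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    chrom = match["chrom"].upper()
    if match["start"] is None:
        return f"chr{chrom}"          # whole chromosome, no coordinates given
    # Drop thousands separators before converting to integers.
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    return f"chr{chrom}:{start}-{end}"
```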

ALL_LOCI = <gunz_cm.loaders.utils.Constant object>

chromname() → str

Returns the chromosome name as a chrN string (e.g., ‘chrY’).

static from_string(region) → Region

Parses a UCSC-style region string (e.g., ‘chr22:1,000,000-1,500,000’) into a Region.

is_full_chrom()

Checks whether this region describes the full chromosome, i.e. the region is 0:N.

Module contents

This module provides a unified interface to parse various contact matrix file formats and load them into memory.

It acts as a facade, dispatching calls to the appropriate format-specific loader (e.g., for .hic, .cool, .csv) while providing a consistent API to the user.

Functions:

load_cm_data – Load a contact matrix from a file into memory.
get_chrom_infos – Query chromosome names and lengths from a file.
get_resolutions – List the available resolutions in a file.
get_balancing – List available balancing methods for a specific region.

class gunz_cm.loaders.Balancing(value)

Bases: BaseStrEnum

Enumeration for matrix balancing (normalization) methods.

KR = 'KR'

NONE = 'NONE'

VC = 'VC'

VC_SQRT = 'VC_SQRT'

class gunz_cm.loaders.BpFrag(value)

Bases: BaseStrEnum

Enumeration for binning units (Base Pairs vs. Fragments).

BP = 'BP'

FRAG = 'FRAG'

class gunz_cm.loaders.Counts(value)

Bases: BaseStrEnum

Enumeration for different types of interaction counts.

EXPECTED = 'expected'

OBSERVED = 'observed'

OE = 'oe'

class gunz_cm.loaders.DataStructure(value)

Bases: BaseStrEnum

Enumeration for in-memory data representations.

COO = 'coo'

DF = 'df'

RC = 'rc'

RCV = 'rcv'

class gunz_cm.loaders.Format(value)

Bases: BaseStrEnum

Enumeration for supported file formats.

Uses BaseStrEnum for case-insensitivity and aliases.

COO = 'coo'

COOLER = 'cooler'

CSV = 'csv'

GINTERACTIONS = 'ginteractions'

HIC = 'hic'

MCOO = 'mcoo'

MCSV = 'mcsv'

MEMMAP = 'npdat'

NPY = 'npy'

PICKLE = 'pickle'

TSV = 'tsv'
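
How BaseStrEnum achieves case-insensitivity is not shown here. One standard way to get that behavior with the `enum` module is to override `_missing_`, as in the sketch below. `CaseInsensitiveStrEnum` and `FileFormat` are hypothetical names, and alias handling is left out.

```python
from enum import Enum

class CaseInsensitiveStrEnum(str, Enum):
    """Hypothetical base: look up members by value regardless of case."""

    @classmethod
    def _missing_(cls, value):
        # Called by Enum only when the exact-value lookup fails.
        if isinstance(value, str):
            for member in cls:
                if member.value.lower() == value.lower():
                    return member
        return None  # Enum then raises ValueError as usual

class FileFormat(CaseInsensitiveStrEnum):
    HIC = "hic"
    COOLER = "cooler"
    TSV = "tsv"
```

Because the members also subclass `str`, they compare equal to their plain string values, which keeps them convenient to use as dictionary keys or in path handling.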

class gunz_cm.loaders.GenomeBuild(value)

Bases: BaseStrEnum

Enumeration for standard genome builds.

HG19 = 'hg19'

HG38 = 'hg38'

MM10 = 'mm10'

MM9 = 'mm9'

class gunz_cm.loaders.Region(chromosome: int | str, region: Tuple[int, int] | Constant)

Bases: object

Represent a range of loci and interface with the textual UCSC style in the form ‘chr22:1,000,000-1,500,000’

Use the static method Region.from_string to parse a UCSC string. Converting back to a string canonicalizes it.

chromosome (int or str) – 1-22 or string X/Y/M (use chromname() to get chrN string)
region (ClosedInterval or the constant Region.ALL_LOCI) – For ‘chr1:100-500’, region.start == 100 and region.end == 500

Parsing tries to be more lenient than the canonical form requires.

>>> str(Region.from_string('1:1,000-1,500')) == 'chr1:1000-1500'
True

>>> str(Region.from_string('chry')) == 'chrY'
True

ALL_LOCI = <gunz_cm.loaders.utils.Constant object>

chromname() → str

Returns the chromosome name as a chrN string (e.g., ‘chrY’).

static from_string(region) → Region

Parses a UCSC-style region string (e.g., ‘chr22:1,000,000-1,500,000’) into a Region.

is_full_chrom()

Checks whether this region describes the full chromosome, i.e. the region is 0:N.

gunz_cm.loaders.gen_memmap_fpaths(base_fpath: t.Union[str, pathlib.Path]) → tuple[pathlib.Path, pathlib.Path]

Generates paths for the binary data and JSON metadata files.

Parameters:

base_fpath (t.Union[str, pathlib.Path]) – The base path for the memmap, without an extension.

Returns:

A tuple containing the path to the binary (.npdat) file and the metadata (.json) file.

Return type:

tuple[pathlib.Path, pathlib.Path]

gunz_cm.loaders.get_balancing(fpath: str, resolution: int, chrom: str) → list[str]

Gets available balancing methods for a region in a .hic or .cool file.

Parameters:

fpath (str) – The path to the contact matrix file.
resolution (int) – The resolution of the contact matrix.
chrom (str) – The chromosome of interest (e.g., ‘chr1’).

Returns:

A list of available balancing methods (e.g., [‘KR’, ‘VC_SQRT’]).

Return type:

list[str]

gunz_cm.loaders.get_bins(fpath: str | Path, resolution: int) → DataFrame

Gets the binnified index from a .hic or .cool file.

Parameters:

fpath (t.Union[str, pathlib.Path]) – The path to the contact matrix file.
resolution (int) – The resolution to use for binnification.

Returns:

A DataFrame with columns: ‘chrom’, ‘start’, ‘end’.

Return type:

pd.DataFrame

gunz_cm.loaders.get_chrom_infos(fpath: str) → dict[str, int]

Queries chromosome names and lengths from a .hic or .cool file.

Parameters:

fpath (str) – The path to the contact matrix file.

Returns:

A mapping of chromosome names to their lengths.

Return type:

dict[str, int]

gunz_cm.loaders.get_resolutions(fpath: str) → list[int]

Gets the resolutions available in a contact matrix file.

Parameters:

fpath (str) – The path to the contact matrix file.

Returns:

A list of available resolutions.

Return type:

list[int]

gunz_cm.loaders.is_file_standard_cm(fpath: str) → bool

Checks whether the file is in a standard contact matrix file format.

gunz_cm.loaders.is_memmap_exists(base_fpath: t.Union[str, pathlib.Path]) → bool

Checks if both the binary and metadata files for a memmap exist.

Parameters:

base_fpath (t.Union[str, pathlib.Path]) – The base path for the memmap to check.

Returns:

True if both the .npdat and .json files exist, False otherwise.

Return type:

bool

Loads contact matrix data from various file formats.

This function acts as a dispatcher, routing the call to the appropriate format-specific loader based on the file’s extension or the fformat argument.

Parameters:

fpath (pathlib.Path) – Path to the contact matrix file.
resolution (int) – Resolution of the contact matrix to load.
region1 (str, optional) – First genomic region (e.g., ‘chr1’). Defaults to None.
region2 (str, optional) – Second genomic region. If None, loads intra-chromosomal data for region1. Defaults to None.
balancing (Balancing | list[Balancing], optional) – Balancing (normalization) method(s) to apply. Defaults to None.
out_datastructure (DataStructure, optional) – Desired output format (‘df’ or ‘coo’). Defaults to DataStructure.DF.
fformat (Format, optional) – Explicitly specify file format, otherwise inferred from extension. Defaults to None.
backend (Backend, optional) – Select the underlying backend library for loading. For COOLER: ‘cooler’, ‘hictk’. For HIC: ‘hicstraw’, ‘hictk’, ‘straw’. Defaults to None (uses standard backend for format).
use_fast_hic (bool, optional) – If True and file is .hic, use the faster fast_hic_loader. Equivalent to setting backend=’straw’. Defaults to False.
return_raw_counts (bool, optional) – If True, return raw counts alongside the primary (balanced) counts. Defaults to False.
**kwargs – Additional keyword arguments passed to the specific loader (e.g., encoding for CSV files).

Returns:

The loaded contact matrix data in the specified output format.

Return type:

pd.DataFrame | tuple[np.ndarray, …] | np.ndarray | tuple[t.Any, …]

Raises:

FormatError – If the file format is not recognized or supported, or if an invalid backend is selected for the format.
NotImplementedError – If return_raw_counts is True for unsupported formats.
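
The dispatch step described above can be sketched with a small registry keyed by file extension. Everything in this block is illustrative: `dispatch_load`, the `_load_*` stubs, `UnknownFormatError` (standing in for FormatError), and the extension mapping are assumptions, not the library's actual code.

```python
from pathlib import Path

class UnknownFormatError(ValueError):
    """Hypothetical stand-in for the FormatError named above."""

def _load_hic(fpath, **kwargs):
    # Stub loader: a real one would read the .hic file.
    return ("hic", Path(fpath).name)

def _load_cooler(fpath, **kwargs):
    # Stub loader: a real one would read the .cool/.mcool file.
    return ("cooler", Path(fpath).name)

# Extension -> loader registry; the real mapping may differ.
_LOADERS = {".hic": _load_hic, ".cool": _load_cooler, ".mcool": _load_cooler}

def dispatch_load(fpath, fformat=None, **kwargs):
    """Pick a loader from an explicit format or, failing that, the file extension."""
    key = f".{fformat.lower()}" if fformat else Path(fpath).suffix.lower()
    try:
        loader = _LOADERS[key]
    except KeyError:
        raise UnknownFormatError(f"unsupported format: {key!r}") from None
    return loader(fpath, **kwargs)
```

An explicit format takes precedence over the extension, so a file with an unusual name can still be routed to the right loader.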

gunz_cm.loaders.load_memmap(base_fpath: t.Union[str, pathlib.Path], mode: str = 'r') → ContactMatrix

Loads a NumPy array from a memory-mapped file lazily.