gunz_cm.loaders package

Submodules

gunz_cm.loaders.cool_loader module

gunz_cm.loaders.cool_loader.get_assembly(fpath: str | Path, resolution: int) str[source]

Retrieve the genome assembly information from a cooler file.

Parameters:
  • fpath (str | pathlib.Path) – File path of the cooler file.

  • resolution (int) – Resolution level to query from the file.

Returns:

The genome assembly name (e.g., “hg38”).

Return type:

str

Examples

gunz_cm.loaders.cool_loader.get_balancing_weights(fpath: str | Path, resolution: int, region: str, weight_name: str = 'weight') ndarray[source]

Extracts a vector of balancing weights for a given region.

Parameters:
  • fpath (str | pathlib.Path) – File path of the cooler file (.cool or .mcool).

  • resolution (int) – The resolution to use.

  • region (str) – The chromosome or region to fetch balancing weights for (e.g., “chr1”).

  • weight_name (str, optional) – The name of the weight column in the bins table. Defaults to ‘weight’.

Returns:

A NumPy array of balancing weights for the specified region. Returns an empty array if the weight column does not exist.

Return type:

np.ndarray

Examples
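
A minimal sketch of how such a weight vector is typically used, assuming the cooler convention that a balanced value is the raw count multiplied by the weights of its two bins, and that masked bins carry a NaN weight. The helper `apply_weights` and the toy data are hypothetical and not part of gunz_cm.

```python
import math


def apply_weights(pixels, weights):
    """Scale raw (row, col, count) pixels by per-bin balancing weights.

    Hypothetical helper, not part of gunz_cm. Assumes the cooler
    convention: balanced = count * weights[row] * weights[col].
    """
    return [(r, c, n * weights[r] * weights[c]) for r, c, n in pixels]


# Toy data: three bins, the last one masked out with a NaN weight.
weights = [0.5, 2.0, math.nan]
pixels = [(0, 0, 8), (0, 1, 3), (1, 2, 5)]

balanced = apply_weights(pixels, weights)
```

Pixels that touch a masked bin come out as NaN, so downstream code usually has to drop or fill them explicitly.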

gunz_cm.loaders.cool_loader.get_bins(fpath: str | Path, resolution: int) DataFrame[source]

Retrieve the binnified index from a cooler file.

Parameters:
  • fpath (str | pathlib.Path) – Path to the cooler file.

  • resolution (int) – The resolution to use.

Returns:

DataFrame with columns: ‘chrom’, ‘start’, ‘end’.

Return type:

pd.DataFrame

Examples
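
To illustrate what a binnified index contains, the sketch below builds the same three columns from chromosome lengths and a fixed bin width. The helper `make_bins` and the chromosome sizes are hypothetical; truncating the last bin at the chromosome end is an assumption about the layout, and the real function reads the bins from the file instead of computing them.

```python
def make_bins(chrom_sizes, resolution):
    """Build a fixed-width bin table as (chrom, start, end) tuples.

    Hypothetical helper, not part of gunz_cm. The last bin of each
    chromosome is truncated at the chromosome length.
    """
    bins = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, resolution):
            bins.append((chrom, start, min(start + resolution, length)))
    return bins


# Toy chromosome sizes, chosen so that chr1 ends with a partial bin.
bins = make_bins({"chr1": 2500, "chr2": 1000}, resolution=1000)
```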

gunz_cm.loaders.cool_loader.get_chrom_infos(fpath: str | Path) Dict[str, Dict[str, str | int]][source]

Extract chromosome information from a cooler file.

For .mcool files, the information is read from the coarsest resolution (the largest bin size).

Parameters:

fpath (str | pathlib.Path) – The file path to the cooler file (.cool or .mcool).

Returns:

A dictionary mapping chromosome names to their details.

Return type:

t.Dict[str, t.Dict[str, t.Union[str, int]]]

Examples

gunz_cm.loaders.cool_loader.get_resolutions(fpath: str | Path) List[int][source]

Retrieve available resolutions from a .mcool file.

Parameters:

fpath (str | pathlib.Path) – The file path to the multi-resolution cooler file (.mcool).

Returns:

A list of integer resolutions available in the file.

Return type:

t.List[int]

Raises:
  • FileNotFoundError – If the specified fpath does not exist.

  • LoaderError – If no resolutions can be found in the file.

Examples

gunz_cm.loaders.cool_loader.get_sparsity(fpath: str | Path, resolution: int, region1: str, region2: str | None = None) float[source]

Calculates the sparsity of a cooler matrix for given regions.

Examples
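
The documentation does not spell out the formula. The sketch below assumes the common definition, the fraction of matrix cells that hold no contact. The helper `sparsity` and its arguments are hypothetical, and the library may count cells differently (for example, only the upper triangle).

```python
def sparsity(nonzero_pixels, n_rows, n_cols):
    """Return the fraction of matrix cells that hold no contact.

    Hypothetical helper, not part of gunz_cm. Assumes `nonzero_pixels`
    counts the distinct non-zero cells of the full n_rows x n_cols matrix.
    """
    total = n_rows * n_cols
    if total == 0:
        raise ValueError("matrix has no cells")
    return 1.0 - nonzero_pixels / total


# 25 filled cells in a 10 x 10 matrix leave three quarters of it empty.
value = sparsity(nonzero_pixels=25, n_rows=10, n_cols=10)
```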

gunz_cm.loaders.cool_loader.load_cooler(fpath: str | Path, resolution: int, region1: str, region2: str | None = None, balancing: Balancing | list[Balancing] | None = Balancing.NONE, output_format: DataStructure = DataStructure.DF, return_raw_counts: bool = False, backend: Backend = Backend.COOLER, chunksize: int | None = None) ContactMatrix[source]

Load contact matrix data from a cooler file lazily.

Parameters:
  • fpath (str | pathlib.Path) – Path to the cooler file (.cool or .mcool).

  • resolution (int) – The resolution level to load from the file.

  • region1 (str) – First genomic region (e.g., “chr1” or “chr1:1,000,000-2,000,000”).

  • region2 (str, optional) – Second genomic region. Defaults to region1 for intra-chromosomal data.

  • balancing (Balancing, optional) – Balancing method to apply. Defaults to Balancing.NONE.

  • output_format (DS, optional) – The desired output format (DS.DF, DS.RCV, or DS.COO). Defaults to DS.DF.

  • return_raw_counts (bool, optional) – Whether to return raw counts alongside balanced counts. Only supported for output_format=DS.DF. Defaults to False.

  • backend (Backend, optional) – Select the underlying backend library for loading. Options: ‘cooler’ (default), ‘hictk’.

  • chunksize (int | None, optional) – If provided, the data will be loaded in chunks.

Returns:

A ContactMatrix object that can be used to load the data on demand.

Return type:

ContactMatrix

Examples

gunz_cm.loaders.csv_loader module

gunz_cm.loaders.csv_loader.load_csv(fpath: str | Path | BytesIO, region1: str, resolution: int, region2: str | None = None, balancing: Balancing | None = None, delimiter: str = '\\s+', encoding: str = 'utf-8', output_format: DataStructure = DataStructure.DF, column_names: List[str] | None = None) ContactMatrix[source]

Loads contact data from a CSV-like file path or buffer lazily.

Parameters:
  • fpath (str | pathlib.Path | io.BytesIO) – The file path or an in-memory byte stream to read from.

  • region1 (str) – The chromosome to load (e.g., “chr1”).

  • resolution (int) – The resolution (bin size) to apply to the coordinate columns.

  • region2 (str | None, optional) – The second region for inter-chromosomal data. Currently not supported.

  • balancing (Balancing | None, optional) – The balancing method reflected in the data.

  • delimiter (str, optional) – The delimiter to use for parsing.

  • encoding (str, optional) – The character encoding of the file.

  • output_format (DataStructure, optional) – The desired output format.

  • column_names (List[str], optional) – Explicit column names.

Examples
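
As a rough illustration of reading whitespace-delimited contacts from an in-memory buffer, the sketch below parses three-column rows and converts positions to bin indices. The helper `read_binned_contacts`, the `pos1 pos2 count` column layout, and the use of floor division for binning are all assumptions. gunz_cm's own parser may differ.

```python
import io
import re


def read_binned_contacts(buffer, resolution, delimiter=r"\s+", encoding="utf-8"):
    """Parse 'pos1 pos2 count' rows and convert positions to bin indices.

    Hypothetical helper, not gunz_cm's parser. The three-column layout
    and floor-division binning are assumptions made for this sketch.
    """
    splitter = re.compile(delimiter)
    rows = []
    for line in buffer.read().decode(encoding).splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pos1, pos2, count = splitter.split(line)
        rows.append((int(pos1) // resolution, int(pos2) // resolution, float(count)))
    return rows


# Mixed spaces and tabs are both matched by the regex delimiter.
data = io.BytesIO(b"0 10000 4\n10000\t25000 1.5\n")
rows = read_binned_contacts(data, resolution=10_000)
```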

gunz_cm.loaders.ginteractions_loader module

gunz_cm.loaders.ginteractions_loader.load_ginteractions(fpath: str | Path | BytesIO, resolution: int, region1: str, region2: str | None = None, encoding: str = 'utf-8', output_format: DataStructure = DataStructure.DF, **kwargs) ContactMatrix[source]

Loads and processes data from a GInteractions-like tabular file lazily.

Parameters:
  • fpath (str | pathlib.Path | io.BytesIO) – The file path or an in-memory byte stream to read from.

  • resolution (int) – The resolution for binning the genomic coordinates.

  • region1 (str) – The first chromosome to include in the output (e.g., ‘chr1’).

  • region2 (str | None, optional) – The second chromosome. If None, it defaults to region1 for intra-chromosomal interactions.

  • encoding (str, optional) – The file encoding to use when reading the file.

  • output_format (DataStructure, optional) – The desired output format.

  • **kwargs – Catches extra keyword arguments.

Examples

gunz_cm.loaders.hic_loader module

class gunz_cm.loaders.hic_loader.HiCFooter(cpair_info: Dict[str, int], expected_values: Dict[Tuple[str, str, int], ndarray], norm_factors: Dict[Tuple[str, str, int], ndarray], norm_info: Dict[Tuple[str, str, int, int], Dict[str, int]], available_norms: List[str])[source]

Bases: object

Immutable container for Hi-C file footer information.

Examples

available_norms: List[str]
cpair_info: Dict[str, int]
expected_values: Dict[Tuple[str, str, int], ndarray]
norm_factors: Dict[Tuple[str, str, int], ndarray]
norm_info: Dict[Tuple[str, str, int, int], Dict[str, int]]
class gunz_cm.loaders.hic_loader.HiCHeader(version: int, master_index: int, genome: str, metadata: Dict[str, str], chromosomes: Dict[str, Dict[str, int]], resolutions: List[int])[source]

Bases: object

Immutable container for Hi-C file header information.

Examples

chromosomes: Dict[str, Dict[str, int]]
genome: str
master_index: int
metadata: Dict[str, str]
resolutions: List[int]
version: int
class gunz_cm.loaders.hic_loader.HiCMetadata(header: HiCHeader, footer: HiCFooter)[source]

Bases: object

A structured container for all Hi-C file metadata.

Examples

footer: HiCFooter
header: HiCHeader
gunz_cm.loaders.hic_loader.get_balancing_methods(fpath: str | Path) list[str][source]

Retrieves the available balancing (normalization) methods.

Examples

gunz_cm.loaders.hic_loader.get_bins(fpath: str | Path, resolution: int) DataFrame[source]

Retrieve the binnified index from a Hi-C file.

Examples

gunz_cm.loaders.hic_loader.get_chrom_infos(fpath: str | Path, use_hicstraw: bool = False) Dict[str, Dict[str, int]][source]

Reads chromosome information from a Hi-C file.

Examples

gunz_cm.loaders.hic_loader.get_resolutions(fpath: str | Path, use_hicstraw: bool = False) list[int][source]

Retrieves the available resolutions from a Hi-C file.

Examples

gunz_cm.loaders.hic_loader.get_sparsity(fpath: str | Path, region1: str, resolution: int, region2: str | None = None) float[source]

Calculates the sparsity of a Hi-C contact matrix for a given region.

Examples

gunz_cm.loaders.hic_loader.load_hic(fpath: str | Path, region1: str, resolution: int, region2: str | None = None, balancing: Balancing | list[Balancing] | None = Balancing.NONE, output_format: DataStructure = DataStructure.DF, return_raw_counts: bool = False, backend: Backend = Backend.STRAW, chunksize: int | None = None) ContactMatrix[source]

Loads contact matrix data from a .hic file lazily.

Examples

gunz_cm.loaders.hic_loader.read_hic_metadata(fpath: str | Path, order: Literal['big', 'little'] = 'little') HiCMetadata[source]

Reads the complete header and footer metadata from a Hi-C file.

Examples
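
To show what the `order` argument controls, the sketch below decodes the leading fields of a .hic header (magic string, version, master index position, genome ID) from a synthetic byte stream. The helpers `read_cstring` and `read_leading_header` are hypothetical, the field layout follows the public .hic format description, and the real reader parses far more than this.

```python
import io


def read_cstring(stream):
    """Read bytes up to (not including) a NUL terminator."""
    chars = bytearray()
    while (byte := stream.read(1)) not in (b"", b"\x00"):
        chars += byte
    return chars.decode("utf-8")


def read_leading_header(stream, order="little"):
    """Read magic, version, master index and genome ID from a .hic stream.

    Hypothetical sketch, not gunz_cm's reader. Only the first four
    header fields are decoded.
    """
    magic = read_cstring(stream)
    if magic != "HIC":
        raise ValueError(f"not a .hic stream: magic={magic!r}")
    version = int.from_bytes(stream.read(4), order, signed=True)
    master_index = int.from_bytes(stream.read(8), order, signed=True)
    genome = read_cstring(stream)
    return version, master_index, genome


# Synthetic header bytes, for illustration only.
raw = (
    b"HIC\x00"
    + (8).to_bytes(4, "little", signed=True)
    + (123456).to_bytes(8, "little", signed=True)
    + b"hg38\x00"
)
header = read_leading_header(io.BytesIO(raw))
```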

gunz_cm.loaders.memmap_loader module

gunz_cm.loaders.memmap_loader.gen_memmap_fpaths(base_fpath: str | Path) tuple[Path, Path][source]

Generates paths for the binary data and JSON metadata files.

Parameters:

base_fpath (str | pathlib.Path) – The base path for the memmap, without an extension.

Returns:

A tuple containing the path to the binary (.npdat) file and the metadata (.json) file.

Return type:

tuple[pathlib.Path, pathlib.Path]

Examples
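
A minimal sketch of the path derivation, assuming the two suffixes are simply appended to the base name. The helper `sketch_memmap_fpaths` and the example path are hypothetical stand-ins, not the library's implementation.

```python
from pathlib import Path


def sketch_memmap_fpaths(base_fpath):
    """Return (<base>.npdat, <base>.json) for an extension-less base path.

    Hypothetical re-implementation. Assumes the suffixes are appended
    to the base name rather than replacing an existing suffix.
    """
    base = Path(base_fpath)
    return base.with_name(base.name + ".npdat"), base.with_name(base.name + ".json")


data_path, meta_path = sketch_memmap_fpaths("cache/chr1_10kb")
```

Appending to the name instead of calling `with_suffix` keeps base names that contain dots (such as `sample.v2`) intact.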

gunz_cm.loaders.memmap_loader.is_memmap_exists(base_fpath: str | Path) bool[source]

Checks if both the binary and metadata files for a memmap exist.

Parameters:

base_fpath (str | pathlib.Path) – The base path for the memmap to check.

Returns:

True if both the .npdat and .json files exist, False otherwise.

Return type:

bool

Examples

gunz_cm.loaders.memmap_loader.load_memmap(base_fpath: str | Path, mode: str = 'r') ContactMatrix[source]

Loads a NumPy array from a memory-mapped file lazily.

Examples

gunz_cm.loaders.narrowpeaks module

gunz_cm.loaders.narrowpeaks.get_chrom_infos(fpath: str | Path) dict[str, dict[str, str | None]][source]

Retrieves chromosome information from a narrowPeak file.

This function reads a narrowPeak file, extracts unique chromosome names, and initializes a dictionary with chromosome names as keys and their lengths set to None.

Parameters:

fpath (str | pathlib.Path) – The file path to the narrowPeak file.

Returns:

A dictionary with chromosome names as keys and their lengths set to None.

Return type:

dict[str, dict[str, str | None]]

Examples

gunz_cm.loaders.narrowpeaks.load_narrowpeak(fpath: str | Path, chromosome: str | None = None, resolution: int | None = None) DataFrame[source]

Reads and processes a narrowPeak file.

This function reads a narrowPeak file, assigns column names, converts data types, validates the data, and filters by a specified region if provided.

Parameters:
  • fpath (str | pathlib.Path) – The file path to the narrowPeak file.

  • chromosome (str | None, optional) – The chromosome region to filter by (default is None).

  • resolution (int | None, optional) – The resolution parameter. If provided, the start and end coordinates will be binned to this resolution.

Returns:

A DataFrame containing the processed narrowPeak data.

Return type:

pd.DataFrame

Examples
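
The steps described above (assign column names, convert types, optionally bin coordinates) can be sketched for a single record as follows. The helper `parse_narrowpeak_line` and the snake_case column names are hypothetical; the ten-column layout follows the ENCODE narrowPeak format, and treating binning as floor division is an assumption.

```python
NARROWPEAK_COLUMNS = (
    "chrom", "start", "end", "name", "score",
    "strand", "signal_value", "p_value", "q_value", "peak",
)


def parse_narrowpeak_line(line, resolution=None):
    """Parse one tab-separated narrowPeak record into a dict.

    Hypothetical helper, not part of gunz_cm. If `resolution` is given,
    start and end are replaced by bin indices (floor division), which
    is an assumption about what binning means here.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(NARROWPEAK_COLUMNS):
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    record = dict(zip(NARROWPEAK_COLUMNS, fields))
    for key in ("start", "end", "score", "peak"):
        record[key] = int(record[key])
    for key in ("signal_value", "p_value", "q_value"):
        record[key] = float(record[key])
    if resolution is not None:
        record["start"] //= resolution
        record["end"] //= resolution
    return record


# A made-up record, for illustration only.
line = "chr1\t10500\t11250\tpeak_1\t850\t.\t12.3\t5.1\t3.2\t300\n"
record = parse_narrowpeak_line(line, resolution=1000)
```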

gunz_cm.loaders.pickle_loader module

gunz_cm.loaders.pickle_loader.load_pickle(fpath: str | Path, region1: str | None = None, resolution: int | None = None, region2: str | None = None, balancing: str | None = None, output_format: DataStructure = DataStructure.DF) ContactMatrix[source]

Loads a pickle file containing a contact matrix object lazily.

Examples

gunz_cm.loaders.utils module

class gunz_cm.loaders.utils.ClosedInterval(start, end)

Bases: tuple

end

Alias for field number 1

start

Alias for field number 0
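
The "Bases: tuple" line and the "Alias for field number" entries are how Sphinx renders a named tuple. The stand-in below mirrors the documented field layout. It is an illustration, not the library's definition, and treating both endpoints as included is an assumption drawn from the class name.

```python
from collections import namedtuple

# Stand-in with the same field layout as the documented class.
ClosedInterval = namedtuple("ClosedInterval", ["start", "end"])

interval = ClosedInterval(start=100, end=500)
start, end = interval  # unpacks like a plain tuple
length = interval.end - interval.start + 1  # assumes both endpoints are included
```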

class gunz_cm.loaders.utils.Constant[source]

Bases: object

Bind an instance to a name to get a unique object (a sentinel).

Examples

class gunz_cm.loaders.utils.Region(chromosome: int | str, region: Tuple[int, int] | Constant)[source]

Bases: object

Represent a range of loci and interface with the textual UCSC style in the form ‘chr22:1,000,000-1,500,000’

Use the static method Region.from_string to parse a UCSC-style string. Converting back to a string yields the canonical form.

Parameters:
  • chromosome (int or str) – 1-22 or string X/Y/M (use chromname() to get the chrN string).

  • region (ClosedInterval or the constant Region.ALL_LOCI) – For ‘chr1:100-500’, region.start == 100 and region.end == 500.

Parsing tries to be more lenient than the canonical form requires.

>>> str(Region.from_string('1:1,000-1,500')) == 'chr1:1000-1500'
True
>>> str(Region.from_string('chry')) == 'chrY'
True

Examples
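
The lenient parsing shown in the doctest above can be approximated with a single regular expression. The helper `canonicalize_region` is a hypothetical sketch that returns the canonical string directly. It is not the Region implementation and may accept or reject inputs differently.

```python
import re

# Optional "chr" prefix, a chromosome number or X/Y/M, and an optional
# ":start-end" part whose numbers may contain thousands separators.
_REGION_RE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]+|[XYM])(?::(?P<start>[\d,]+)-(?P<end>[\d,]+))?$",
    re.IGNORECASE,
)


def canonicalize_region(text):
    """Return the canonical 'chrN' or 'chrN:start-end' form of `text`.

    Hypothetical helper, not the Region implementation.
    """
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    chrom = "chr" + match["chrom"].upper()
    if match["start"] is None:
        return chrom  # whole chromosome, no coordinates given
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    return f"{chrom}:{start}-{end}"
```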

ALL_LOCI = <gunz_cm.loaders.utils.Constant object>
chromname() str[source]

Return the chromosome name in chrN form (e.g., ‘chr1’).

static from_string(region) Region[source]

Parse a UCSC-style region string (e.g., ‘chr22:1,000,000-1,500,000’) into a Region.

is_full_chrom()[source]

This region describes the full chromosome, so region is 0:N

Examples

Module contents

This module provides a unified interface to parse various contact matrix file formats and load them into memory.

It acts as a facade, dispatching calls to the appropriate format-specific loader (e.g., for .hic, .cool, .csv) while providing a consistent API to the user.

Functions:

  • load_cm_data – Load a contact matrix from a file into memory.

  • get_chrom_infos – Query chromosome names and lengths from a file.

  • get_resolutions – List the available resolutions in a file.

  • get_balancing – List available balancing methods for a specific region.

class gunz_cm.loaders.Balancing(value)[source]

Bases: BaseStrEnum

Enumeration for matrix balancing (normalization) methods.

Examples

KR = 'KR'
NONE = 'NONE'
VC = 'VC'
VC_SQRT = 'VC_SQRT'
class gunz_cm.loaders.BpFrag(value)[source]

Bases: BaseStrEnum

Enumeration for binning units (Base Pairs vs. Fragments).

Examples

BP = 'BP'
FRAG = 'FRAG'
class gunz_cm.loaders.Counts(value)[source]

Bases: BaseStrEnum

Enumeration for different types of interaction counts.

Examples

EXPECTED = 'expected'
OBSERVED = 'observed'
OE = 'oe'
class gunz_cm.loaders.DataStructure(value)[source]

Bases: BaseStrEnum

Enumeration for in-memory data representations.

Examples

COO = 'coo'
DF = 'df'
RC = 'rc'
RCV = 'rcv'
class gunz_cm.loaders.Format(value)[source]

Bases: BaseStrEnum

Enumeration for supported file formats.

Uses BaseStrEnum for case-insensitivity and aliases.

Examples
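
BaseStrEnum itself is not documented on this page. The sketch below shows one way a string enum can accept values regardless of case, using the standard `_missing_` hook. The classes `CaseInsensitiveStrEnum` and `DemoFormat` are hypothetical stand-ins, and alias handling is not covered.

```python
from enum import Enum


class CaseInsensitiveStrEnum(str, Enum):
    """String enum whose value lookup ignores case (hypothetical stand-in)."""

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when an exact value lookup fails.
        if isinstance(value, str):
            for member in cls:
                if member.value.lower() == value.lower():
                    return member
        return None  # Enum then raises ValueError


class DemoFormat(CaseInsensitiveStrEnum):
    HIC = "hic"
    COOLER = "cooler"


fmt = DemoFormat("HiC")
```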

COO = 'coo'
COOLER = 'cooler'
CSV = 'csv'
GINTERACTIONS = 'ginteractions'
HIC = 'hic'
MCOO = 'mcoo'
MCSV = 'mcsv'
MEMMAP = 'npdat'
NPY = 'npy'
PICKLE = 'pickle'
TSV = 'tsv'
class gunz_cm.loaders.GenomeBuild(value)[source]

Bases: BaseStrEnum

Enumeration for standard genome builds.

Examples

HG19 = 'hg19'
HG38 = 'hg38'
MM10 = 'mm10'
MM9 = 'mm9'
class gunz_cm.loaders.Region(chromosome: int | str, region: Tuple[int, int] | Constant)[source]

Bases: object

Represent a range of loci and interface with the textual UCSC style in the form ‘chr22:1,000,000-1,500,000’

Use the static method Region.from_string to parse a UCSC-style string. Converting back to a string yields the canonical form.

Parameters:
  • chromosome (int or str) – 1-22 or string X/Y/M (use chromname() to get the chrN string).

  • region (ClosedInterval or the constant Region.ALL_LOCI) – For ‘chr1:100-500’, region.start == 100 and region.end == 500.

Parsing tries to be more lenient than the canonical form requires.

>>> str(Region.from_string('1:1,000-1,500')) == 'chr1:1000-1500'
True
>>> str(Region.from_string('chry')) == 'chrY'
True

Examples

ALL_LOCI = <gunz_cm.loaders.utils.Constant object>
chromname() str[source]

Return the chromosome name in chrN form (e.g., ‘chr1’).

static from_string(region) Region[source]

Parse a UCSC-style region string (e.g., ‘chr22:1,000,000-1,500,000’) into a Region.

is_full_chrom()[source]

This region describes the full chromosome, so region is 0:N

Examples

gunz_cm.loaders.gen_memmap_fpaths(base_fpath: str | Path) tuple[Path, Path][source]

Generates paths for the binary data and JSON metadata files.

Parameters:

base_fpath (str | pathlib.Path) – The base path for the memmap, without an extension.

Returns:

A tuple containing the path to the binary (.npdat) file and the metadata (.json) file.

Return type:

tuple[pathlib.Path, pathlib.Path]

Examples

gunz_cm.loaders.get_balancing(fpath: str, resolution: int, chrom: str) list[str][source]

Gets available balancing methods for a region in a .hic or .cool file.

Parameters:
  • fpath (str) – The path to the contact matrix file.

  • resolution (int) – The resolution of the contact matrix.

  • chrom (str) – The chromosome of interest (e.g., ‘chr1’).

Returns:

A list of available balancing methods (e.g., [‘KR’, ‘VC_SQRT’]).

Return type:

list[str]

gunz_cm.loaders.get_bins(fpath: str | Path, resolution: int) DataFrame[source]

Gets the binnified index from a .hic or .cool file.

Parameters:
  • fpath (str | pathlib.Path) – The path to the contact matrix file.

  • resolution (int) – The resolution to use for binnification.

Returns:

A DataFrame with columns: ‘chrom’, ‘start’, ‘end’.

Return type:

pd.DataFrame

gunz_cm.loaders.get_chrom_infos(fpath: str) dict[str, int][source]

Queries chromosome names and lengths from a .hic or .cool file.

Parameters:

fpath (str) – The path to the contact matrix file.

Returns:

A mapping of chromosome names to their lengths.

Return type:

dict[str, int]

gunz_cm.loaders.get_resolutions(fpath: str) list[int][source]

Gets the resolutions available in a contact matrix file.

Parameters:

fpath (str) – The path to the contact matrix file.

Returns:

A list of available resolutions.

Return type:

list[int]

gunz_cm.loaders.is_file_standard_cm(fpath: str) bool[source]

Checks if the file is a standard contact matrix file format.

gunz_cm.loaders.is_memmap_exists(base_fpath: str | Path) bool[source]

Checks if both the binary and metadata files for a memmap exist.

Parameters:

base_fpath (str | pathlib.Path) – The base path for the memmap to check.

Returns:

True if both the .npdat and .json files exist, False otherwise.

Return type:

bool

Examples

gunz_cm.loaders.load_cm_data(fpath: Path, resolution: int, region1: str | None = None, region2: str | None = None, balancing: Balancing | list[Balancing] | None = None, output_format: DataStructure = DataStructure.DF, fformat: Format | None = None, backend: Backend | None = None, use_fast_hic: bool = False, return_raw_counts: bool = False, **kwargs) DataFrame | tuple[ndarray, ...] | ndarray | tuple[Any, ...][source]

Loads contact matrix data from various file formats.

This function acts as a dispatcher, routing the call to the appropriate format-specific loader based on the file’s extension or the fformat argument.

Parameters:
  • fpath (pathlib.Path) – Path to the contact matrix file.

  • resolution (int) – Resolution of the contact matrix to load.

  • region1 (str, optional) – First genomic region (e.g., ‘chr1’). Defaults to None.

  • region2 (str, optional) – Second genomic region. If None, loads intra-chromosomal data for region1. Defaults to None.

  • balancing (Balancing | list[Balancing], optional) – Balancing (normalization) method(s) to apply. Defaults to None.

  • output_format (DataStructure, optional) – Desired output format (‘df’ or ‘coo’). Defaults to DataStructure.DF.

  • fformat (Format, optional) – Explicitly specify file format, otherwise inferred from extension. Defaults to None.

  • backend (Backend, optional) – Select the underlying backend library for loading. For COOLER: ‘cooler’, ‘hictk’. For HIC: ‘hicstraw’, ‘hictk’, ‘straw’. Defaults to None (uses standard backend for format).

  • use_fast_hic (bool, optional) – If True and the file is .hic, use the faster fast_hic_loader. Equivalent to setting backend=‘straw’. Defaults to False.

  • return_raw_counts (bool, optional) – If True, return raw counts alongside the primary (balanced) counts. Defaults to False.

  • **kwargs – Additional keyword arguments passed to the specific loader (e.g., encoding for CSV files).

Returns:

The loaded contact matrix data in the specified output format.

Return type:

pd.DataFrame | tuple[np.ndarray, …] | np.ndarray | tuple[t.Any, …]

Raises:
  • FormatError – If the file format is not recognized or supported, or if an invalid backend is selected for the format.

  • NotImplementedError – If return_raw_counts is True for unsupported formats.
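
The dispatching behaviour described for load_cm_data can be sketched as a suffix lookup followed by a call into a loader table. The names `SUFFIX_TO_FORMAT`, `infer_format`, and `dispatch`, as well as the suffix table, are hypothetical. The sketch raises ValueError where the real function is documented to raise FormatError.

```python
from pathlib import Path

# Hypothetical suffix table; the real mapping lives inside gunz_cm.
SUFFIX_TO_FORMAT = {
    ".hic": "hic",
    ".cool": "cooler",
    ".mcool": "cooler",
    ".csv": "csv",
    ".tsv": "tsv",
}


def infer_format(fpath, fformat=None):
    """Return the explicit format if given, else infer it from the suffix."""
    if fformat is not None:
        return fformat
    suffix = Path(fpath).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unrecognized contact matrix format: {suffix!r}") from None


def dispatch(fpath, loaders, fformat=None, **kwargs):
    """Route the call to the loader registered for the file's format."""
    return loaders[infer_format(fpath, fformat)](fpath, **kwargs)


# Stand-in loaders that only report what they were asked to do.
loaders = {
    "hic": lambda fpath, **kw: ("hic", kw),
    "cooler": lambda fpath, **kw: ("cooler", kw),
}
result = dispatch("sample.mcool", loaders, resolution=10_000)
```

An explicit `fformat` always wins over the suffix, which mirrors the documented behaviour of inferring the format only when none is given.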

gunz_cm.loaders.load_memmap(base_fpath: str | Path, mode: str = 'r') ContactMatrix[source]

Loads a NumPy array from a memory-mapped file lazily.

Examples