gunz_cm.utils package

Submodules

gunz_cm.utils.intervals module

Genomic interval utilities for binning and set operations. Implemented so that core data-loading tasks do not depend on bioframe.

gunz_cm.utils.intervals.binnify(chromsizes: Dict[str, int], binsize: int) → DataFrame

Divide a genome into evenly sized bins. Matches bioframe.binnify logic.

Parameters:
  • chromsizes (dict) – Dictionary mapping chromosome names to lengths in bp.

  • binsize (int) – Size of bins in bp.

Returns:

DataFrame with columns: ‘chrom’, ‘start’, ‘end’.

Return type:

pd.DataFrame
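
The binning rule described above can be sketched without any dependencies. This is a minimal illustration, not the package's implementation: binnify_sketch is a hypothetical name, it returns plain tuples rather than a DataFrame, and it assumes half-open (start, end) coordinates with the last bin of each chromosome clipped to the chromosome length, as bioframe.binnify does.

```python
from typing import Dict, List, Tuple

# (chrom, start, end); assumed half-open, i.e. end is exclusive.
Interval = Tuple[str, int, int]


def binnify_sketch(chromsizes: Dict[str, int], binsize: int) -> List[Interval]:
    """Split each chromosome into consecutive bins of width `binsize`."""
    # Rejecting non-positive bin sizes is an assumption of this sketch.
    if binsize <= 0:
        raise ValueError("binsize must be a positive integer")
    bins: List[Interval] = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            # The final bin is clipped to the chromosome length.
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


bins = binnify_sketch({"chr1": 250, "chr2": 100}, binsize=100)
# chr1 yields three bins, the last one covering 200..250; chr2 yields one.
```

With real data the three tuple fields correspond to the ‘chrom’, ‘start’ and ‘end’ columns of the returned DataFrame.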

gunz_cm.utils.intervals.subtract(df1: DataFrame, df2: DataFrame) → DataFrame

Remove intervals from df1 that overlap with any interval in df2. This is a simplified form of interval subtraction: an overlapping interval is dropped entirely rather than trimmed.

Parameters:
  • df1 (pd.DataFrame) – Target intervals (e.g., training windows).

  • df2 (pd.DataFrame) – Excluded intervals (e.g., centromeres, blacklisted regions).

Returns:

Filtered df1 containing only intervals that do NOT overlap with df2.

Return type:

pd.DataFrame
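
The filtering behaviour can be illustrated with a short sketch. It is not the package's code: subtract_sketch and overlaps are hypothetical names, intervals are plain tuples instead of DataFrame rows, and coordinates are assumed to be half-open, so intervals that merely touch do not count as overlapping.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end); assumed half-open, i.e. end is exclusive.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def subtract_sketch(targets: Iterable[Interval],
                    excluded: Iterable[Interval]) -> List[Interval]:
    """Keep only the target intervals that overlap none of the excluded ones."""
    excluded = list(excluded)  # allow a generator to be scanned repeatedly
    return [iv for iv in targets
            if not any(overlaps(iv, ex) for ex in excluded)]


windows = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
blacklist = [("chr1", 150, 160)]
kept = subtract_sketch(windows, blacklist)
# The window chr1:100-200 is dropped whole; the other two are kept.
```

The pairwise scan is quadratic in the number of intervals, which is fine for an illustration but would need sorting or an interval index for large inputs.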

gunz_cm.utils.logger module

Centralized logging configuration for the gunz_cm package.

gunz_cm.utils.logger.setup_logging(verbose: bool) → None

Configures logging for the CLI and application.

Parameters:

verbose (bool) – If True, sets the console log level to DEBUG. Otherwise, sets it to INFO.
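
A verbosity switch of this kind is commonly built on the standard logging module. The sketch below shows one way to do it and is not the package's implementation: setup_logging_sketch, the logger name and the message format are all assumptions, and the logger is returned (unlike the documented function, which returns None) only so that the result can be inspected.

```python
import logging
import sys


def setup_logging_sketch(verbose: bool, name: str = "gunz_cm_sketch") -> logging.Logger:
    """Attach one console handler whose level depends on `verbose`."""
    logger = logging.getLogger(name)
    # The logger passes everything through; the handler decides what is shown.
    logger.setLevel(logging.DEBUG)
    # Clear old handlers so repeated calls do not duplicate console output.
    logger.handlers.clear()
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(logging.DEBUG if verbose else logging.INFO)
    console.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(console)
    return logger


log = setup_logging_sketch(verbose=False)
log.debug("hidden at INFO level")
log.info("shown at INFO level")
```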

gunz_cm.utils.matrix module

Module.

Examples

gunz_cm.utils.path module

Utilities for path manipulation and repository root discovery.

gunz_cm.utils.path.append_root_dir() → None

Append the root directory to sys.path.

This function first retrieves the root directory of the Git repository using get_root_dir(). If the root directory is not already in sys.path, it appends it.

gunz_cm.utils.path.get_root_dir() → str

Get the root directory of the Git repository.

This function attempts to find the root directory of the Git repository by searching from the current directory upwards. If no Git repository is found, it raises a RuntimeError.

Returns:

The root directory of the Git repository.

Return type:

str

Raises:

RuntimeError – If a Git repository could not be found in the current or parent directories.
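
The upward search and the sys.path update described for this module can be sketched as follows. How the package actually locates the repository is not stated here, so this assumes the root is the nearest directory containing a .git entry; find_git_root, append_to_sys_path and the start parameter are hypothetical names introduced for illustration.

```python
import sys
from pathlib import Path
from typing import Optional


def find_git_root(start: Optional[Path] = None) -> str:
    """Return the nearest directory at or above `start` that contains `.git`."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        # `.git` is a directory in a normal clone and a file in a worktree.
        if (candidate / ".git").exists():
            return str(candidate)
    raise RuntimeError(f"No Git repository found at or above {current}")


def append_to_sys_path(root: str) -> None:
    """Append `root` to sys.path unless it is already present."""
    if root not in sys.path:
        sys.path.append(root)
```

A caller would typically chain the two, for example append_to_sys_path(find_git_root()), and handle RuntimeError when running outside a repository.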

gunz_cm.utils.resources module

Utilities for fetching genomic resources (centromeres, blacklists).

gunz_cm.utils.resources.fetch_centromeres(genome: str, cache: bool = True, cache_dir: Path = PosixPath('/home/adhisant/.gunz_cm/resources')) → DataFrame

Fetch centromere coordinates for a given genome assembly from UCSC.

Parameters:
  • genome (str) – Genome assembly name (e.g., ‘hg19’, ‘hg38’, ‘mm10’).

  • cache (bool, optional) – Whether to cache the downloaded data. Defaults to True.

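  • cache_dir (pathlib.Path, optional) – Directory to store cached files. Defaults to ~/.gunz_cm/resources; the absolute path in the signature is this default expanded on the machine that built the documentation.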

Returns:

DataFrame with columns: [‘chrom’, ‘start’, ‘end’, ‘name’, ‘gieStain’].

Return type:

pd.DataFrame

gunz_cm.utils.stream module

Utilities for low-level binary stream and byte manipulation.

gunz_cm.utils.stream.bstr2int(payload: bytes, order: str | None = 'big') → int

Converts a byte string to an integer.

Parameters:
  • payload (bytes) – The byte string to convert.

  • order (str, optional) – The byte order to use (e.g., ‘big’ or ‘little’). Defaults to ‘big’.

Returns:

The integer representation of the byte string.

Return type:

int

Raises:

ValueError – If the order is not recognized or the payload is empty.
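
Both conversions in this module (bstr2int here and int2bstr, documented next) map naturally onto Python's built-in int.from_bytes and int.to_bytes. The sketch below shows that mapping under stated assumptions: the function names are hypothetical, values are treated as unsigned, and the error handling for int2bstr_sketch is simply whatever int.to_bytes does, since the package's behaviour for that case is not documented here.

```python
def bstr2int_sketch(payload: bytes, order: str = "big") -> int:
    """Interpret `payload` as an unsigned integer in the given byte order."""
    if not payload:
        raise ValueError("payload must not be empty")
    if order not in ("big", "little"):
        raise ValueError(f"unrecognized byte order: {order!r}")
    return int.from_bytes(payload, order)


def int2bstr_sketch(val: int, len_in_byte: int, order: str = "big") -> bytes:
    """Encode `val` as exactly `len_in_byte` bytes in the given byte order."""
    # int.to_bytes raises OverflowError if val is negative or does not fit.
    return val.to_bytes(len_in_byte, order)


encoded = int2bstr_sketch(258, 4)             # b"\x00\x00\x01\x02"
decoded = bstr2int_sketch(encoded)            # 258
swapped = bstr2int_sketch(encoded, "little")  # same bytes read in reverse order
```

Reading the same bytes with the other byte order gives a different value, which is why writer and reader must agree on the order argument.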

gunz_cm.utils.stream.int2bstr(val: int, len_in_byte: int, order: Literal['big', 'little'] = 'big') → bytes

Converts an integer to a byte string of a specified length and byte order.

Parameters:
  • val (int) – The integer to convert.

  • len_in_byte (int) – The length of the resulting byte string in bytes.

  • order ({'big', 'little'}, optional) – The byte order to use. Defaults to ‘big’.

Returns:

The byte string representation of the integer.

Return type:

bytes

Raises:

gunz_cm.utils.stream.read_str(reader: BinaryIO, encoding: str = 'utf-8') → str

Reads a null-terminated string from a binary reader.

This function reads bytes from the reader until a null byte (b'\x00') is encountered. The read bytes are then decoded using the specified encoding.

Parameters:
  • reader (BinaryIO) – The binary reader to read from.

  • encoding (str, optional) – The encoding to use for decoding the read bytes. Defaults to ‘utf-8’.

Returns:

The decoded string.

Return type:

str
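
Reading a null-terminated string from a stream can be sketched in a few lines. This is an illustration rather than the package's code: read_str_sketch is a hypothetical name, and stopping quietly at end of file when no terminator is found is an assumption, since the documented behaviour for that case is not stated.

```python
import io
from typing import BinaryIO


def read_str_sketch(reader: BinaryIO, encoding: str = "utf-8") -> str:
    """Read bytes up to (and consuming) the next null byte, then decode them."""
    collected = bytearray()
    while True:
        byte = reader.read(1)
        # b"" signals end of file; b"\x00" is the string terminator.
        if byte in (b"", b"\x00"):
            break
        collected += byte
    return collected.decode(encoding)


stream = io.BytesIO(b"chr1\x00chr2\x00")
first = read_str_sketch(stream)   # "chr1"
second = read_str_sketch(stream)  # "chr2"
```

Because the terminator is consumed, successive calls walk through consecutive strings in the same stream.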

Module contents