pyprobound.utils.count_kmers

count_kmers(sequences, kmer_length=3, vocabulary=None)

Counts the k-mers in each sequence of a collection and returns the counts as a sparse matrix, together with the k-mer vocabulary that indexes it.

Parameters:
  • sequences (Iterable[Sequence[Hashable]]) – The sequences in which to count k-mers.

  • kmer_length (int) – The length of the k-mers to count. Defaults to 3.

  • vocabulary (dict[Sequence[Hashable], int] | None) – Mapping of k-mers to their indices in the matrix. Defaults to None.

Return type:

tuple[csc_array, dict[Sequence[Hashable], int]]

Returns:

A tuple (matrix, vocabulary), where matrix is a sparse CSC matrix of the count of each k-mer in each sequence, and vocabulary is the mapping of k-mers to their respective indices in the matrix.
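
Example:

The following is a minimal, standard-library-only sketch of the kind of counting described above, not pyprobound's implementation. The name count_kmers_sketch and its return layout are hypothetical. It assumes that rows correspond to sequences and columns to k-mers, that indices are assigned in order of first appearance when no vocabulary is given, and that k-mers missing from a supplied vocabulary are skipped; none of these behaviors is stated in this reference.

```python
from collections import Counter


def count_kmers_sketch(sequences, kmer_length=3, vocabulary=None):
    """Hypothetical stand-in for count_kmers; returns COO-style triplets."""
    # Assumption: with no vocabulary, new k-mers get the next free index.
    grow = vocabulary is None
    vocab = {} if grow else dict(vocabulary)

    rows, cols, data = [], [], []
    n_rows = 0
    for row, seq in enumerate(sequences):
        n_rows = row + 1
        counts = Counter()
        # Slide a window of width kmer_length over the sequence.
        for start in range(len(seq) - kmer_length + 1):
            kmer = seq[start:start + kmer_length]
            if not isinstance(kmer, str):
                kmer = tuple(kmer)  # make non-string slices hashable
            if kmer not in vocab:
                if not grow:
                    continue  # assumption: ignore k-mers outside the vocabulary
                vocab[kmer] = len(vocab)
            counts[vocab[kmer]] += 1
        # Emit one (row, column, count) triplet per distinct k-mer.
        for col, n in sorted(counts.items()):
            rows.append(row)
            cols.append(col)
            data.append(n)

    n_cols = max(vocab.values(), default=-1) + 1
    return (rows, cols, data, (n_rows, n_cols)), vocab


(rows, cols, data, shape), vocab = count_kmers_sketch(["ACGT", "ACAC"], kmer_length=2)
```

If SciPy is available, the triplets from the sketch can be turned into a CSC matrix with scipy.sparse.csc_array((data, (rows, cols)), shape=shape).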