pyprobound.alphabets.Alphabet

class Alphabet(alphabet, complement=False, monomer_length=1, encoding=None, color_scheme=None)

Bases: object

Stores the alphabet used to encode sequences as tensors.

Assumes that the reverse complement mapping is dict(zip(a, reversed(a))), where a is the alphabet. Three sequence characters are reserved: ' ' (a space) is -infinity (not scored), '*' is an uninformative prior over channels, and '-' is zero.
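The reverse-order pairing can be illustrated with plain Python. This is a minimal sketch, not pyprobound's implementation; the helper name reverse_complement is hypothetical and introduced only for illustration.

```python
# Illustrative sketch only; not taken from pyprobound.
alphabet = ("A", "C", "G", "T")

# Pair each monomer with the one at the mirrored position:
# first <-> last, second <-> second-to-last.
complement = dict(zip(alphabet, reversed(alphabet)))


def reverse_complement(seq):
    """Hypothetical helper: complement each monomer, reading the sequence backwards."""
    return "".join(complement[monomer] for monomer in reversed(seq))


print(complement)                  # {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
print(reverse_complement("AACG"))  # CGTT
```

The pairing is only meaningful when the alphabet is listed so that complementary monomers sit at mirrored positions, as in A, C, G, T.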

alphabet

The monomers of the alphabet.

Type:

tuple[str]

get_index

A mapping of monomers in the alphabet to indices in the embedding matrix.

Type:

dict[str, int]

get_encoding

A mapping of monomers to tuples of indices in the embedding matrix; for example, ‘*’ maps to all indices available in the embedding.

Type:

dict[str, tuple[int, ...]]

get_inv_encoding

Inverse of get_encoding.

Type:

dict[tuple[int, ...], str]
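How the three mappings relate can be sketched with ordinary dictionaries. This is an assumption-laden illustration, not the library's code: it assumes indices follow alphabet order, that index tuples are stored sorted, and it uses the hypothetical degenerate symbols 'R' and 'Y'.

```python
# Illustrative sketch only; index order and tuple ordering are assumptions.
alphabet = ("A", "C", "G", "T")

# Monomer -> index in the embedding matrix (assumed to follow alphabet order).
get_index = {monomer: i for i, monomer in enumerate(alphabet)}

# Hypothetical degenerate symbols, each standing for several monomers.
degenerate = {"R": ["A", "G"], "Y": ["C", "T"]}

# Symbol -> tuple of embedding indices.
get_encoding = {monomer: (i,) for monomer, i in get_index.items()}
for symbol, members in degenerate.items():
    get_encoding[symbol] = tuple(sorted(get_index[m] for m in members))

# '*' maps to every index available in the embedding.
get_encoding["*"] = tuple(range(len(alphabet)))

# Inverse mapping: tuple of indices -> symbol.
get_inv_encoding = {indices: symbol for symbol, indices in get_encoding.items()}

print(get_encoding["R"])         # (0, 2)
print(get_inv_encoding[(1, 3)])  # Y
```

The inverse is only well defined when no two symbols share the same index tuple, which is why this sketch avoids defining a second symbol that covers all four monomers alongside '*'.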

__init__(alphabet, complement=False, monomer_length=1, encoding=None, color_scheme=None)

Initializes the alphabet.

Parameters:
  • alphabet (Iterable[str]) – The monomers of the alphabet.

  • complement (bool) – Whether to take the reverse order of the alphabet as the complement encoding. For example, the complement of ['A', 'C', 'G', 'T'] would be assumed to be ['T', 'G', 'C', 'A'].

  • monomer_length (int) – The length of elements in the alphabet.

  • encoding (Mapping[str, Iterable[str]] | None) – A mapping of monomers to a degenerate list of monomers. For example, 'N' maps to ['A', 'C', 'G', 'T'].

  • color_scheme (str | dict[str, str | list[float]] | None) – Passed to Logomaker.Logo.

Methods

embed(seqs)

Embeds sequences from a dense to a one-hot representation.

pairwise_embed(seqs, dist)

Embeds sequences into a one-hot pairwise representation.

translate(sequence)

Translates a sequence into a tensor.

Non-Inherited Members

translate(sequence)

Translates a sequence into a tensor.

Parameters:

sequence (str) – A string sequence of length \(\text{length}\).

Return type:

Tensor

Returns:

A dense representation of the sequence as an integer tensor of shape \((\text{length},)\).
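The dense representation can be pictured as one integer index per monomer. The following is a minimal sketch under stated assumptions, not the library's method: it returns a plain list rather than a Tensor, assumes indices follow alphabet order, handles only single-character monomers, and ignores the reserved characters.

```python
# Illustrative sketch only; a plain-list stand-in for the integer tensor.
from typing import List

alphabet = ("A", "C", "G", "T")
get_index = {monomer: i for i, monomer in enumerate(alphabet)}  # assumed ordering


def translate_sketch(sequence: str) -> List[int]:
    """Map each monomer to its index, giving a list of shape (length,)."""
    return [get_index[monomer] for monomer in sequence]


dense = translate_sketch("GATTACA")
print(dense)       # [2, 0, 3, 3, 0, 1, 0]
print(len(dense))  # 7
```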

embed(seqs)

Embeds sequences from a dense to a one-hot representation.

Parameters:

seqs (Tensor) – A dense representation of sequences as an integer tensor of shape \((\text{minibatch},\text{length})\).

Return type:

Tensor

Returns:

A one-hot embedding of the sequences as a float tensor of shape \((\text{minibatch},\text{channels},\text{length})\).
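The shape change from (minibatch, length) to (minibatch, channels, length) can be sketched with nested lists. This illustration is not pyprobound's implementation: it assumes every dense value is a single valid channel index and does not model the reserved characters or degenerate symbols.

```python
# Illustrative sketch only; nested lists stand in for tensors.
from typing import List


def embed_sketch(seqs: List[List[int]], channels: int) -> List[List[List[float]]]:
    """Turn dense indices into a one-hot array of shape (minibatch, channels, length)."""
    batch = []
    for seq in seqs:
        # One row per channel, one column per sequence position.
        one_hot = [[0.0] * len(seq) for _ in range(channels)]
        for position, index in enumerate(seq):
            one_hot[index][position] = 1.0
        batch.append(one_hot)
    return batch


seqs = [[0, 1, 3], [2, 2, 0]]  # e.g. "ACT" and "GGA" with A, C, G, T -> 0, 1, 2, 3
embedded = embed_sketch(seqs, channels=4)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 4 3
print(embedded[0])
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

Each column of a sequence's one-hot array sums to 1, since exactly one channel is active per position in this simplified setting.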

pairwise_embed(seqs, dist)

Embeds sequences into a one-hot pairwise representation.

Parameters:
  • seqs (Tensor) – A dense representation of sequences as an integer tensor of shape \((\text{minibatch},\text{length})\).

  • dist (int) – The pairwise distance between two monomers.

Return type:

Tensor

Returns:

A one-hot embedding of the sequences as a float tensor of shape \((\text{minibatch}, \text{channels}^2, \text{length} - \text{dist})\). Each position \(i\) in the last dimension contains the product of the embeddings at positions \(i\) and \(i + \text{dist}\), taken over all pairs of channels.
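The pairwise shape can be illustrated with nested lists. This is a sketch under stated assumptions, not the library's implementation: the row-major ordering of channel pairs (first index times channels plus second index) is an assumption, and reserved characters are not modelled.

```python
# Illustrative sketch only; the channel-pair ordering is an assumption.
from typing import List


def pairwise_embed_sketch(
    seqs: List[List[int]], dist: int, channels: int
) -> List[List[List[float]]]:
    """Build an array of shape (minibatch, channels**2, length - dist)."""
    batch = []
    for seq in seqs:
        n_positions = len(seq) - dist
        pairs = [[0.0] * n_positions for _ in range(channels * channels)]
        for i in range(n_positions):
            first, second = seq[i], seq[i + dist]
            # The product of two one-hot vectors is non-zero for exactly one
            # (first, second) channel pair; assumed row-major flattening.
            pairs[first * channels + second][i] = 1.0
        batch.append(pairs)
    return batch


seqs = [[0, 1, 3, 2]]  # e.g. "ACTG" with A, C, G, T -> 0, 1, 2, 3
pairwise = pairwise_embed_sketch(seqs, dist=1, channels=4)
print(len(pairwise), len(pairwise[0]), len(pairwise[0][0]))  # 1 16 3

# Active channel-pair index at each position: (A,C) -> 1, (C,T) -> 7, (T,G) -> 14
active = [
    [row for row in range(16) if pairwise[0][row][i] == 1.0][0] for i in range(3)
]
print(active)  # [1, 7, 14]
```

The last dimension is shorter than the sequence by dist because the final dist positions have no partner at i + dist.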