pyprobound.alphabets.Alphabet
- class Alphabet(alphabet, complement=False, monomer_length=1, encoding=None, color_scheme=None)
Bases: object
Stores the alphabet encoding of sequences into tensors.
Assumes that the reverse complement mapping is dict(zip(a, reversed(a))), where a is the alphabet. Three sequence characters are reserved: ' ' (space) is -infinity (not scored), '*' is an uninformative prior over channels, and '-' is zero.
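As a minimal sketch of the assumed reverse complement mapping, the snippet below builds dict(zip(a, reversed(a))) for a DNA alphabet in plain Python. The helper name reverse_complement is hypothetical and is not part of the class; it only illustrates how the mapping would be applied to a sequence.

```python
# Pair each monomer with the monomer at the mirrored position in the alphabet.
alphabet = ("A", "C", "G", "T")
complement = dict(zip(alphabet, reversed(alphabet)))


def reverse_complement(sequence):
    """Complement each monomer while reading the sequence backwards.

    Hypothetical helper for illustration only.
    """
    return "".join(complement[monomer] for monomer in reversed(sequence))


print(complement)
print(reverse_complement("AACG"))
```

This only works when the alphabet is ordered so that complementary monomers sit at mirrored positions, as in A, C, G, T.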
- alphabet
The monomers of the alphabet.
- Type:
tuple[str]
- get_index
A mapping of monomers in the alphabet to indices in the embedding matrix.
- Type:
dict[str, int]
- get_encoding
A mapping of monomers to tuples of indices in the embedding matrix; for example, ‘*’ maps to all indices available in the embedding.
- Type:
dict[str, tuple[int,…]]
- get_inv_encoding
Inverse of get_encoding.
- Type:
dict[tuple[int,…], str]
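The three mapping attributes can be pictured with a small plain-Python sketch. It assumes that each monomer's index is its position in the alphabet and uses the IUPAC codes R and Y as example degenerate monomers; both choices are assumptions for illustration, and the dictionaries built by the class itself may differ in order and content.

```python
alphabet = ("A", "C", "G", "T")
# Example degenerate codes (assumed input, in the style of the encoding argument).
encoding = {"R": ["A", "G"], "Y": ["C", "T"]}

# Monomer -> index in the embedding matrix (assumed: position in the alphabet).
get_index = {monomer: i for i, monomer in enumerate(alphabet)}

# Monomer -> tuple of indices; plain monomers map to a single index.
get_encoding = {monomer: (i,) for monomer, i in get_index.items()}
# '*' covers every index available in the embedding.
get_encoding["*"] = tuple(range(len(alphabet)))
# Degenerate codes map to the indices of the monomers they stand for.
for code, monomers in encoding.items():
    get_encoding[code] = tuple(get_index[m] for m in monomers)

# Index tuple -> monomer, the inverse of get_encoding.
get_inv_encoding = {indices: monomer for monomer, indices in get_encoding.items()}

print(get_encoding)
```

Note that a code such as N, which covers all four monomers, would share its index tuple with '*', so a plain dictionary inversion like the one above would keep only one of the two.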
- __init__(alphabet, complement=False, monomer_length=1, encoding=None, color_scheme=None)
Initializes the alphabet.
- Parameters:
alphabet (Iterable[str]) – The monomers of the alphabet.
complement (bool) – Whether to take the reverse order of the alphabet as the complement encoding; for example, the complement of ['A','C','G','T'] would be assumed to be ['T','G','C','A'].
monomer_length (int) – The length of elements in the alphabet.
encoding (Mapping[str, Iterable[str]] | None) – A mapping of monomers to a degenerate list of monomers; for example, 'N' maps to ['A','C','G','T'].
color_scheme (str | dict[str, str | list[float]] | None) – Passed to Logomaker.Logo.
Methods
embed(seqs) – Embeds sequences from a dense to a one-hot representation.
pairwise_embed(seqs, dist) – Embeds sequences into a one-hot pairwise representation.
translate(sequence) – Translates a sequence into a tensor.
Non-Inherited Members
- translate(sequence)
Translates a sequence into a tensor.
- Parameters:
sequence (str) – A string sequence of length \(\text{length}\).
- Return type:
Tensor
- Returns:
A dense representation of the sequence as an integer tensor of shape \((\text{length},)\).
- embed(seqs)
Embeds sequences from a dense to a one-hot representation.
- Parameters:
seqs (Tensor) – A dense representation of sequences as an integer tensor of shape \((\text{minibatch},\text{length})\).
- Return type:
Tensor
- Returns:
A one-hot embedding of the sequences as a float tensor of shape \((\text{minibatch},\text{channels},\text{length})\).
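To make the shape change concrete, here is a minimal sketch of a dense to one-hot embedding that uses nested Python lists in place of tensors. The function below is a stand-in written for illustration, not the library's implementation; it assumes that dense index i corresponds to channel i and does not handle the reserved characters.

```python
def embed(seqs, channels):
    """One-hot embed dense index sequences (illustrative stand-in).

    seqs: equal-length lists of ints, shape (minibatch, length).
    Returns nested lists of floats, shape (minibatch, channels, length).
    """
    return [
        [
            [1.0 if index == channel else 0.0 for index in seq]
            for channel in range(channels)
        ]
        for seq in seqs
    ]


get_index = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed index order
dense = [[get_index[m] for m in "ACGT"], [get_index[m] for m in "GGTA"]]
one_hot = embed(dense, channels=4)

# Shape is (minibatch, channels, length).
print(len(one_hot), len(one_hot[0]), len(one_hot[0][0]))
```

The channel dimension comes before the length dimension, matching the documented output shape.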
- pairwise_embed(seqs, dist)
Embeds sequences into a one-hot pairwise representation.
- Parameters:
seqs (Tensor) – A dense representation of sequences as an integer tensor of shape \((\text{minibatch},\text{length})\).
dist (int) – The pairwise distance between two monomers.
- Return type:
Tensor
- Returns:
A one-hot embedding of the sequences as a float tensor of shape \((\text{minibatch}, \text{channels}^2, \text{length} - \text{dist})\). Each position i in the last dimension contains the product of the embedding of i and i+dist.
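The pairwise product can be sketched the same way with nested lists. This is an illustrative stand-in rather than the library's code: the helper names are hypothetical, and the ordering of the channels^2 pair channels (first monomer times channels plus second monomer) is an assumption, since the ordering is not stated here.

```python
def one_hot(seq, channels):
    """Return a (channels, length) one-hot matrix as nested lists."""
    return [[1.0 if index == c else 0.0 for index in seq] for c in range(channels)]


def pairwise_embed(seqs, dist, channels):
    """Pairwise one-hot embedding (illustrative stand-in).

    Returns nested lists of shape (minibatch, channels**2, length - dist).
    Pair channel a * channels + b is 1.0 at position i when the monomer at
    i has index a and the monomer at i + dist has index b (assumed order).
    """
    out = []
    for seq in seqs:
        emb = one_hot(seq, channels)
        out.append([
            [emb[a][i] * emb[b][i + dist] for i in range(len(seq) - dist)]
            for a in range(channels)
            for b in range(channels)
        ])
    return out


dense = [[0, 1, 2, 3]]  # "ACGT" under the assumed order A=0, C=1, G=2, T=3
pairs = pairwise_embed(dense, dist=1, channels=4)
print(len(pairs), len(pairs[0]), len(pairs[0][0]))
```

With dist=1 each position encodes one adjacent dinucleotide, so exactly one of the 16 pair channels is active per position.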