User Guide

ProBound [1] learns free energy parameters by maximizing the Poisson likelihood given the count of each observed sequence across one or more sequential enrichment rounds. The likelihood is computed by predicting the binding probability of a sequence, and from this probability predicting the sequence count across the different enrichment rounds.

This package [2] implements ProBound in Python using PyTorch, which allows for the analysis of sequences of varying lengths, among other features. To learn more about PyTorch, check out their [Quick Start tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html).

This User Guide covers all of the core features designed into the original implementation of ProBound. A simple example can be found in CTCF: Single Experiment.

Count Table

Alphabet

The first step is to define the alphabet of the sequences being modeled

alphabet = pyprobound.alphabets.DNA()

Other than DNA, options include RNA, Codon, and Protein. Custom alphabets can be used by creating a new instance of the class Alphabet. An example is CEBPγ: EpiSELEX-seq.

Count Table

Each experiment must consist of one or more sequential enrichment rounds that are sampled and sequenced in high-throughput. The sequencing data must then be encoded into a count table that enumerates the count of each sequence in each round. A count table is a matrix \(k\) such that \(k_{i,r}\) is the number of times that a probe \(i\) appears in round \(r\). For example,

Index

Round 1

Round 2

ACGTC

2

1

GAGTT

0

1

GTCGC

1

2

TCCAT

1

0

Rounds 1 and 2 may correspond to successive SELEX rounds. For in-vivo TF binding assays (such as ChIP-seq, ChIP-exo, CUT&Tag, CUT&RUN, ChIP-exo, etc.) Round 1 would be the mock library, and Round 2 would be the assay library.

This count table should be loaded into a Pandas DataFrame, where the index is the sequence label for that row. If each round is stored as a single-column TSV, this can be done with a PyProBound helper function

dataframe = pyprobound.get_dataframe(["round1.tsv.gz", "round2.tsv.gz"])

Once loaded into a dataframe, it can be converted into a CountTable object

count_table = pyprobound.CountTable(dataframe, alphabet=alphabet)

Binding Layer

In its simplest configuration, a binding model will consist of two distinct binding modes: a non-specific binding parameter, and a position-specific affinity matrix (PSAM).

Non-Specific Binding

NonSpecific binding is a distinct binding mode, equivalent to a PSAM without any sequence-specific parameters

nonspecific = pyprobound.layers.NonSpecific(alphabet=alphabet, name="NS")

PSAM

A PSAM represents a binding motif as a matrix in which each element corresponds to the free-energy penalty of binding for a given feature relative to a reference sequence that lacks that feature

\[\Delta\Delta G(\text{sequence}) = \Delta G(\text{sequence}) - \Delta G(\text{reference})\]

For example, the following creates a PSAM of length 16, with a total of (4 bases)*(16 positions) = 64 parameters

psam = pyprobound.layers.PSAM(kernel_size=16, alphabet=alphabet)

PSAMs can be seeded with IUPAC code motifs, and can additionally model pairwise features (such as dinucleotides, as well as non-adjacent letter pairs) and palindromic binding. One example that uses all of these features is CEBPγ: EpiSELEX-seq. For further information, refer to the PSAM API.

A PSAM can also be imported from external resources, such as MotifCentral [1], as well as JASPAR and HOCOMOCO 11, by using the functions import_motif_central, import_jaspar, and import_hocomoco, respectively.

Binding Mode

Once the non-specific binding and PSAM objects are created, they must be wrapped into a binding Mode. For convenience, they will be wrapped into a single iterable

modes = [
    pyprobound.Mode.from_nonspecific(nonspecific, count_table),
    pyprobound.Mode.from_psam(psam, count_table),
]

Here, additional features can also be enabled, such as a position bias modeling, which trains a multiplicative bias for each sliding window of the PSAM along the sequence. One example of this is GR: ChIP-seq. For further information, refer to the from_psam API.

The output of each mode is \(-\log K^{rel}_{\text{D}}\) of that mode, where

\[\frac{1}{K^{rel}_{\text{D}}(\text{sequence})} = \sum_{\text{window}} \frac{K_{\text{D}}(\text{reference})}{K_{\text{D}}(\text{window})} = \sum_{\text{window}} \exp \left( - \frac{\Delta\Delta G(\text{window})}{RT} \right)\]

where window is the sliding window of the PSAM along the sequence. In computational terms, this is equivalent to the LogSumExp of the output of a 1D convolution of the PSAM along the sequence.

Multiple Binding Modes

When multiple binding modes are used to model an experiment, ProBound learns an activity parameter \(\alpha_r\) for each mode, which estimates the ratio \([P_{free}] / K_{\text{D}}(\text{reference})\) for a sequencing round \(r\). Multiple binding modes can then be summed together analogously to the way different sliding windows are summed together to predict \(K^{rel}_{\text{D}}\). For a sequence \(i\), the sum over all modes becomes

\[Z_{i,r} = \sum_{\text{mode}} \frac{\alpha_{\text{mode}, r}}{K^{rel}_{\text{D, mode}}(\text{sequence})}\]

The value \(\log Z_{i,r}\) can be calculated from log_aggregate, a function of a Round object described in the following section.

Assay Layer

To predict the count of a probe across different enrichment rounds, the relationship between each round must first be encoded.

Specifically, if \(f_{i,r}\) is the relative concentration of probe \(i\) in round \(r\), then the enrichment ratio \(f_{i,r} / f_{i,r-1}\) must be defined in terms of \(Z_{i,r}\). A multiplicative sequencing depth parameter \(\eta_{r}\) is also trained, so the final output of a round is the expected log count of the probe in that round, \(\log \left( \eta_r f_{i,r} \right)\), where

\[\eta_r f_{i,r} = \eta_r f_{i,r-1} \text{Enrichment}(Z_{i,r})\]

The assay layer is very flexible, so it must be carefully specified to properly correspond to the experiment being modeled.

Initial Round

Each experiment begins with an InitialRound

initial_round = pyprobound.rounds.InitialRound()

Subsequent Rounds

Each successive round can be described by an enrichment function relative to the preceding round. For example, if a sample from the initial library is enriched for binding to a TF to form the second round of a SELEX experiment, one can define

second_round = pyprobound.rounds.BoundRound.from_binding(modes, reference_round=initial_round)

If a third SELEX round was performed, then it could be created with the flag reference_round=second_round, and so on.

BoundRound encodes the sigmoidal binding function as

\[f_{i,r} = f_{i,r-1} \frac{Z_{i,r}}{1 + Z_{i,r}}\]

Alternative enrichment functions can be specified, such as unsaturated binding, catalytic enrichment, or modeling of the unbound fraction. This can be used for Kinase-seq (see Src: Kinase-seq) or Kd-seq (see Dll: Kd-seq). For further information, refer to the rounds API.

Experiment

Once all of the rounds are created, they can be combined into an Experiment in the order that they appear in the count table

experiment = pyprobound.Experiment([initial_round, second_round])

The output of an experiment is the output of the rounds, normalized over the different rounds

\[\log \frac{\eta_{r} f_{i,r}}{ \sum_{r^\prime} \eta_{r^\prime} f_{i, r^\prime} }\]

Sequencing Layer and Optimization

Loss

Multiple experiments can be trained through joint optimization. First, a MultiExperimentLoss must be created

model = pyprobound.MultiExperimentLoss([experiment])

The output of the model is the sum of the Poisson negative log-likelihoods of each experiment (excluding constant terms), normalized by the total number of observed sequences in their corresponding count tables. Given a count table \(k\), this is

\[\frac{1}{\sum_{i,r} k_{i,r}} \sum_{i,r} k_{i,r} \log \frac{ \eta_{r} f_{i,r} }{ \sum_{r^\prime} \eta_{r^\prime} f_{i, r^\prime} }\]

The loss function may include different regularization penalties. For further information, refer to the MultiExperimentLoss API.

Examples of jointly modeling multiple experiments with shared parameters can be found in CTCF: Multiple Experiments, UbxExdHth: Binding Cooperativity, and CEBPγ: EpiSELEX-seq.

Optimization

To train the model, the model must then be wrapped into an Optimizer

optimizer = pyprobound.Optimizer(
    model, [count_table], device="cpu", checkpoint="checkpoint.pt",
)

The model can then be optimized using the optimization protocol from the original ProBound publication with

optimizer.train_sequential()

The model will be saved to the file specified with the checkpoint keyword. The output of the optimization can also be captured by specifying the output keyword. Additional sampling, optimization, and early stopping parameters can also be provided. One example that utilizes these features is Src: Kinase-seq with Early Stopping. For further information, refer to the Optimizer API.

Additional Features

Cooperativity

ProBound can also model the cooperativity between two transcription factors. This is calculated as the product of the affinities of each factor at their respective binding sites, multiplied by a bias trained for each relative distance between the two binding sites. The relative affinity of the cooperative complex formed by factors A and B is

\[\frac{1}{K^{rel}_{\text{D, complex}}(\text{sequence})} = \sum_{\text{window A}} \sum_{\text{window B}} \frac{\omega_{A:B}(\text{window A}, \text{window B})}{K^{rel}_{\text{D, A}}(\text{window A}) K^{rel}_{\text{D, B}}(\text{window B})}\]

To train a cooperativity model, the bias parameter \(\omega_{A:B}\) must first be created from the two factors through a Spacing object, which can then be wrapped into a Cooperativity object.

spacing =  pyprobound.Spacing([psam_A], [psam_B])
cooperativity = pyprobound.Cooperativity(spacing, mode_A, modes_B)

The Cooperativity object can then be used just like a Mode object. An example of cooperativity modeling can be found in UbxExdHth: Binding Cooperativity.

Kd-seq

The ProBound publication [1] describes the Kd-seq method, in which the input, bound, and unbound libraries of an experiment are all sequenced and modeled jointly to infer absolute binding constants. An example is provided in Dll: Kd-seq.

To implement Kd-seq in PyProBound, the input and bound libraries must be encoded using the InitialRound and BoundRound classes described in Assay Layer. For the bound round, the parameters library_concentration and target_concentration must be specified in from_binding. These are the total concentrations of the DNA library and the TF, respectively.

initial_round = pyprobound.rounds.InitialRound()
bound_round = pyprobound.rounds.BoundRound.from_binding(
    modes, initial_round, target_concentration=100, library_concentration=20
)

Next, the unbound library, which encodes the complement of the sigmoidal binding function

\[f_{i,r} = f_{i,r-1} \frac{1}{1 + Z_{i,r}}\]

must be specified. It can be created directly from the bound round with

unbound_round = pyprobound.rounds.UnboundRound.from_round(bound_round)

If the count table columns correspond to the input, bound, and unbound rounds, in that order, the experiment can then be created with

experiment = pyprobound.Experiment([initial_round, bound_round, unbound_round])

Finally, after training, the free_protein. function of the Experiment object can be used to calculate the free protein concentration. The indices of the input, bound, and unbound rounds must be provided. For example, if the experiment is defined as above, the function call would look like

free_protein = experiment.free_protein(0, 1, 2)

To calculate the free protein concentration in a different condition, such as with a different DNA or TF concentration, the parameters target_concentration and library_concentration can be passed separately to free_protein.

Note that the units of target_concentration and library_concentration must always be consistent, in both from_binding and free_protein.

Now What?

Since parameters and outputs of these components are in terms of biophysical constants, they can be used directly for interpreting experiments and validating against alternative assays.

Re-loading

Any model component can be checkpointed with save, and reloaded from a checkpoint with reload.

If an Optimizer has been previously trained, it can similarly be directly reloaded with reload.

Scoring

Sequences that are already encoded in a CountTable can be directly accessed with the seqs attribute. Sequences represented as a string can be encoded using translate, although scoring requires creating a batch-first dimension with unsqueeze.

Sequences can be directly be scored using the objects created to make the model. For example, the output of a Mode, \(-\log K^{rel}_{\text{D}}\), can be estimated for a sequence as

negative_log_kd = mode(alphabet.translate("ACGTC").unsqueeze(0))

Plotting

PyProBound includes a plotting library, pyprobound.plotting, which must be imported separately. Several plotting functions are provided in the plotting API, which are used throughout the Examples in the sidebar.

In CTCF: Single Experiment, examples include

Validation

To validate a model against a different dataset (for example, to evaluate the performance of a SELEX-derived model on explaining MITOMI measurements), one could directly use the output of a Mode or Round directly, according to their biophysical definitions as described above.

There are, however, situations in which additional parameters need to be defined. For example, if a protein binding microarray (PBM) experiment is used for validation, the TF-DNA recognition model might be accurate at each individual binding site, but there are often positional dependencies along the length of the probe due to the design of the microarray [3].

For this purpose, ProBound also contains a library, pyprobound.fitting, which must be imported separately. It allows for the retraining of experiment-dependent parameters, such as positional dependencies in binding, while keeping experiment-independent parameters, such as PSAM parameters, fixed.

Examples are provided at the bottom of CEBPγ: EpiSELEX-seq, Dll: Kd-seq, and both Src: Kinase-seq and Src: Kinase-seq with Early Stopping.

There are two classes available in pyprobound.fitting, Fit, which fits the function

\[\text{observation} (y) \sim m \times \text{prediction} (\log Z) + b\]

and LogFit, which fits the function

\[\log \left( \text{observation} (y) \right) \sim \log \left( \exp(m) \times \exp \left( \text{prediction} (\log Z) \right) + \exp(b) \right)\]

Here, \(y\) is the observed value for each sequence encoded as a single-column CountTable, while log Z is the output of a log_aggregate as described in Multiple Binding Modes.

\(\text{prediction}\) and \(\text{observation}\) are callables passed by the user. If not specified, \(\text{observation}\) is the identity function by default. \(\text{prediction}\) must always be specified; for example, if the observed value is proportional to binding, \(\text{prediction}\) should be F.sigmoid and F.logsigmoid for Fit and LogFit, respectively (F is a common alias for the torch.nn.functional library).

The constructors for Fit and LogFit contain many parameters. The linear scaling factors \(m\) and \(b\) are trained only if train_offset=True. Additionally, update_construct=True, which updates all experiment-specific parameters, must be passed if the validation sequence length is different than the training sequence length. If positional dependencies must be retrained, train_posbias=True must also be provided. In some cases, avidity may be captured with train_hill=True.

Once the fitting object is created, the fit function can be used to train the linear scaling factors and experiment-specific parameters and the plot function can be used to plot how well the observed and expected values agree.

For further information, refer to the fitting API.

References