pyprobound.table.sample_dataframe

sample_dataframe(dataframe, frac=0.1, random_state=None, n_bin=128)

Randomly samples from a dataframe evenly by enrichment.

To make validation or test splits representative of the training data, bin sequences by their overall enrichment and sample evenly within each bin.

Parameters:
  • dataframe (DataFrame) – The input dataframe to be sampled from.

  • frac (float) – The proportion of reads to be sampled from the dataframe.

  • random_state (int | None) – A seed used to make the output reproducible.

  • n_bin (int) – The size of the bins from which sequences are sampled.

Return type:

tuple[DataFrame, DataFrame]

Returns:

A tuple of two dataframes: the first contains the sampled fraction frac of the original dataframe, and the second contains the remaining 1 - frac.
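
Example:

The following is a minimal sketch of the idea described above (bin by enrichment, then sample the same fraction from each bin), written with plain pandas. It is not the library's implementation. The helper name sample_evenly_by_enrichment, the column names, the enrichment score (last-round over first-round counts with a pseudocount), and the n_bins argument (a number of quantile bins, which may differ from how n_bin is interpreted above) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def sample_evenly_by_enrichment(table, frac=0.1, random_state=None, n_bins=4):
    """Hypothetical sketch: split `table` into (sampled, remainder).

    Assumes one row per sequence (unique index) and one count column per
    round, ordered from first to last round.
    """
    # Assumed enrichment score; the library may define enrichment differently.
    enrichment = (table.iloc[:, -1] + 1) / (table.iloc[:, 0] + 1)

    # Rank first so ties are broken and the quantile bin edges are unique.
    ranks = enrichment.rank(method="first")
    bins = pd.qcut(ranks, q=n_bins, labels=False)

    # Draw the same fraction of rows from every enrichment bin.
    sampled_index = (
        table.groupby(bins, group_keys=False)
        .sample(frac=frac, random_state=random_state)
        .index
    )
    sampled = table.loc[sampled_index]
    remainder = table.drop(index=sampled_index)
    return sampled, remainder


# Toy count table with made-up sequences and counts.
rng = np.random.default_rng(0)
table = pd.DataFrame(
    {
        "round_0": rng.integers(1, 50, size=100),
        "round_1": rng.integers(1, 50, size=100),
    },
    index=[f"seq_{i}" for i in range(100)],
)

validation, train = sample_evenly_by_enrichment(table, frac=0.2, random_state=0)
print(len(validation), len(train))
```

Following the signature documented above, the corresponding call to the real function would take the form sample_dataframe(dataframe, frac=0.1, random_state=0), with the sampled split returned first and the remainder second.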