twinify.napsu_mq package

class twinify.napsu_mq.napsu_mq.NapsuMQModel(required_marginals: Iterable[FrozenSet[str]] = (), use_laplace_approximation: bool = True)

Implementation of the NAPSU-MQ algorithm, a differentially private synthetic data generation method for discrete sensitive data.

Reference: arXiv:2205.14485 “Noise-Aware Statistical Inference with Differentially Private Synthetic Data”, Ossi Räisä, Joonas Jälkö, Samuel Kaski & Antti Honkela

fit(data: pandas.core.frame.DataFrame, rng: chacha.defs.ChaChaState, epsilon: float, delta: float, query_sets: Optional[Iterable] = None, **kwargs) → twinify.napsu_mq.napsu_mq.NapsuMQResult

Fit a differentially private NAPSU-MQ model to the data.

Parameters
  • data (pd.DataFrame) – pandas DataFrame containing discrete categorical data

  • rng (d3p.random.PRNGState) – d3p PRNG key

  • epsilon (float) – Epsilon for differential privacy mechanism

  • delta (float) – Delta for differential privacy mechanism

Returns

Object containing the learned probabilistic model with posterior values

Return type

NapsuMQResult
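As a minimal sketch of how the constructor and fit() signatures above might be used together: the helper names build_required_marginals and fit_napsu_mq are hypothetical, the column names passed in are up to the caller, and the sketch assumes that each frozenset in required_marginals names the columns of one marginal and that d3p.random.PRNGKey(seed) is a valid way to create the rng argument. The twinify and d3p imports are deferred into the function so that the helper for building the marginals can be used on its own.

```python
def build_required_marginals(column_groups):
    """Turn groups of column names into the Iterable[FrozenSet[str]]
    shape expected by the required_marginals argument."""
    return [frozenset(group) for group in column_groups]


def fit_napsu_mq(data, column_groups, epsilon, delta, seed):
    """Hypothetical convenience wrapper around NapsuMQModel.fit.

    data is expected to be a pandas DataFrame of discrete categorical data,
    as described in the parameter list above.
    """
    # Deferred imports: these packages are only needed when fitting.
    import d3p.random
    from twinify.napsu_mq.napsu_mq import NapsuMQModel

    # Assumption: PRNGKey(seed) creates the d3p PRNG state that fit() expects.
    rng = d3p.random.PRNGKey(seed)

    model = NapsuMQModel(
        required_marginals=build_required_marginals(column_groups)
    )
    # Returns a NapsuMQResult according to the signature above.
    return model.fit(data, rng, epsilon=epsilon, delta=delta)
```

Building the marginals separately keeps the privacy parameters (epsilon, delta) and the choice of column groups visible at the call site.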

class twinify.napsu_mq.napsu_mq.NapsuMQResult(dataframe_domain: Dict[str, List[int]], queries: twinify.napsu_mq.marginal_query.FullMarginalQuerySet, posterior_values: jax.Array, data_description: twinify.dataframe_data.DataDescription)

NAPSU-MQ result class containing the differentially private probabilistic model learned from the data. Provides methods to generate differentially private synthetic datasets from the learned model.

generate(rng: chacha.defs.ChaChaState, num_parameter_samples: int, num_data_per_parameter_sample: int = 1, single_dataframe: bool = True) → Union[Iterable[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame]

Draws samples from the parameter posterior (approximation) and generates the given number of data points for each parameter sample.

By default, returns a single data frame sampled from the posterior predictive distribution, i.e., for each data record a parameter value is first drawn from the parameter posterior distribution, and the data record is then sampled from the model conditioned on that parameter value. In this case, num_parameter_samples determines the number of data records in the returned data frame.

This behavior can be customized to sample more than one data record per parameter sample by setting the argument num_data_per_parameter_sample to a value larger than 1, in which case the total number of records returned is num_parameter_samples * num_data_per_parameter_sample.

Setting single_dataframe = False causes the method to return an iterable collection of data frames, each of which contains all data records sampled for a single parameter sample, i.e., in this case the method returns num_parameter_samples data frames, each containing num_data_per_parameter_sample records.

Each of the data frames “looks” like the original data this InferenceResult was obtained from, i.e., it has identical column names and categorical labels (if any).
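The two-stage sampling and the resulting output sizes described above can be illustrated with a toy sketch. This is not twinify's implementation: toy_generate is a hypothetical function, the "posterior" is just a list of probability vectors, plain Python lists stand in for data frames, and the standard library random module replaces the secure d3p generator.

```python
import random


def toy_generate(posterior_samples, categories, num_parameter_samples,
                 num_data_per_parameter_sample=1, single_dataframe=True,
                 seed=0):
    """Toy posterior predictive sampler for a single categorical column."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_parameter_samples):
        # Stage 1: draw one parameter value (a probability vector over
        # the categories) from the toy posterior.
        weights = rng.choice(posterior_samples)
        # Stage 2: sample records conditioned on that parameter value.
        batch = rng.choices(categories, weights=weights,
                            k=num_data_per_parameter_sample)
        batches.append(batch)
    if single_dataframe:
        # One combined collection with
        # num_parameter_samples * num_data_per_parameter_sample records.
        return [record for batch in batches for record in batch]
    # One collection per parameter sample.
    return batches


toy_posterior = [[0.7, 0.3], [0.4, 0.6]]
combined = toy_generate(toy_posterior, ["a", "b"], 10, 3)
separate = toy_generate(toy_posterior, ["a", "b"], 10, 3,
                        single_dataframe=False)
print(len(combined), len(separate), len(separate[0]))
```

With 10 parameter samples and 3 records per sample, the combined output holds 30 records, while the separate output holds 10 collections of 3 records each.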

Parameters
  • rng (chacha.defs.ChaChaState) – A seeded state for the d3p.random secure random number generator.

  • num_parameter_samples (int) – How many samples to draw from the parameter posterior approximation.

  • num_data_per_parameter_sample (int) – How many data points to generate for each parameter sample. Defaults to 1.

  • single_dataframe (bool) – Whether to combine data samples into a single data frame or return separate data frames. Defaults to True.
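Because generate() returns either one data frame or an iterable of data frames depending on single_dataframe, calling code may want a uniform return shape. The following is a small sketch of one way to get that; generate_as_list is a hypothetical helper, not part of twinify, and result is assumed to be any object exposing generate() with the signature documented above.

```python
def generate_as_list(result, rng, num_parameter_samples,
                     num_data_per_parameter_sample=1, single_dataframe=True):
    """Call result.generate and always return a list of data frames.

    With single_dataframe=True the list has exactly one element; otherwise
    it has one element per parameter sample.
    """
    generated = result.generate(
        rng,
        num_parameter_samples=num_parameter_samples,
        num_data_per_parameter_sample=num_data_per_parameter_sample,
        single_dataframe=single_dataframe,
    )
    # Wrap the single frame, or materialize the iterable of frames.
    return [generated] if single_dataframe else list(generated)
```

Downstream code can then loop over the returned list in both modes, for example to run the same analysis on each synthetic dataset.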