pycanon.metrics package

Module contents

Module with functions for calculating utility metrics.

pycanon.metrics.average_ecsize(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray, sup=True) → float

Calculate the average equivalence class size metric.

Parameters:
  • data_raw (pandas dataframe) – dataframe with the raw data under study.

  • data_anon (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.

  • sup (boolean) – defaults to True. If True, suppression is assumed to have been applied to the original dataset (some records may have been deleted).

Returns:

average equivalence class size.

Return type:

float
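
As an illustration of the quantity involved, here is a minimal pandas sketch. It is not pycanon's implementation: the toy table and column names are invented, and dividing the mean class size by k (the smallest class size) is an assumption taken from a common formulation in the k-anonymity literature. How the library treats `sup` is not modelled here.

```python
import pandas as pd

# Toy anonymized table; the column names are illustrative only.
data_anon = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip": ["130**", "130**", "130**", "148**", "148**"],
    "disease": ["flu", "cold", "flu", "asthma", "flu"],
})
quasi_ident = ["age", "zip"]

# Records sharing the same quasi-identifier values form one equivalence class.
ec_sizes = data_anon.groupby(quasi_ident).size()

k = int(ec_sizes.min())                      # smallest class size
mean_size = len(data_anon) / len(ec_sizes)   # plain average class size
normalized = mean_size / k                   # assumed normalization by k

print(mean_size, normalized)
```

With pycanon installed, the corresponding call would follow the signature above: `pycanon.metrics.average_ecsize(data_raw, data_anon, quasi_ident)`.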

pycanon.metrics.average_rir(data_anon: DataFrame, quasi_ident: List | ndarray) → float

Calculate the average re-identification risk metric.

Parameters:
  • data_anon (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.

Returns:

average re-identification risk.

Return type:

float
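
The sketch below shows one way such a risk could be computed with pandas. It assumes that a record's re-identification risk is one over the size of its equivalence class and that the average is taken over records. Both are assumptions of this example, not statements about pycanon's internals, and the data is invented.

```python
import pandas as pd

# Toy anonymized table; the column names are illustrative only.
data_anon = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip": ["130**", "130**", "130**", "148**", "148**"],
})
quasi_ident = ["age", "zip"]

# Size of the equivalence class each record belongs to, aligned with the rows.
class_size = data_anon.groupby(quasi_ident)[quasi_ident[0]].transform("size")

# Assumed per-record risk: one over the size of the record's class.
record_risk = 1.0 / class_size
average_risk = float(record_risk.mean())

print(average_risk)
```

Under this definition the average equals the number of equivalence classes divided by the number of records.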

pycanon.metrics.classification_metric(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray, sens_att: List | ndarray) → float

Calculate the classification metric.

Parameters:
  • data_raw (pandas dataframe) – dataframe with the raw data under study.

  • data_anon (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.

  • sens_att (list of strings) – list with the names of the dataframe columns that are sensitive attributes.

Returns:

classification metric.

Return type:

float
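
One formulation of a classification metric found in the anonymization literature penalizes every record whose label differs from the majority label of its equivalence class, then divides by the number of records. The sketch below follows that formulation for a single label column and ignores suppressed records. Whether pycanon computes the metric this way is an assumption, and the data and the `label` variable are invented for the example.

```python
import pandas as pd

# Toy anonymized table; the column names are illustrative only.
data_anon = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip": ["130**", "130**", "130**", "148**", "148**"],
    "disease": ["flu", "cold", "flu", "asthma", "asthma"],
})
quasi_ident = ["age", "zip"]
label = "disease"  # hypothetical single sensitive attribute

# Majority label in each equivalence class, broadcast back to the rows.
majority = data_anon.groupby(quasi_ident)[label].transform(
    lambda s: s.mode().iloc[0]
)

# A record is penalized when its label differs from its class majority.
penalties = (data_anon[label] != majority).astype(int)
cm = float(penalties.sum() / len(data_anon))

print(cm)
```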

pycanon.metrics.discernability_metric(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray) → float

Calculate the discernability metric.

Parameters:
  • data_raw (pandas dataframe) – dataframe with the raw data under study.

  • data_anon (pandas dataframe) – dataframe with the anonymized data. The metric assumes that all equivalence classes have more than k records, and gives each suppressed record a penalty equal to the size of the input dataset.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.

Returns:

discernability metric.

Return type:

float
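
The penalty scheme described in the parameters can be sketched as follows. This example assumes the usual form of the discernability metric, in which each retained record is charged the size of its equivalence class and each suppressed record is charged the size of the input dataset. It also assumes that suppressed records are exactly those missing from `data_anon`. It is an illustration on invented data, not the library's code.

```python
import pandas as pd

# Toy raw table with six records; the column names are illustrative only.
data_raw = pd.DataFrame({
    "age": [23, 25, 29, 31, 38, 52],
    "zip": ["13001", "13002", "13005", "14801", "14802", "90210"],
})
# Anonymized version: the last raw record was suppressed.
data_anon = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip": ["130**", "130**", "130**", "148**", "148**"],
})
quasi_ident = ["age", "zip"]

ec_sizes = data_anon.groupby(quasi_ident).size()
n_suppressed = len(data_raw) - len(data_anon)

# Retained records contribute |E|^2 per class E;
# each suppressed record contributes the size of the input dataset.
dm = int((ec_sizes ** 2).sum()) + n_suppressed * len(data_raw)

print(dm)
```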

pycanon.metrics.max_rir(data_anon: DataFrame, quasi_ident: List | ndarray) → float

Calculate the maximum re-identification risk (worst case).

Parameters:
  • data_anon (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.

Returns:

maximum re-identification risk.

Return type:

float
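
A worst-case risk can be illustrated with a short pandas sketch. It assumes that the risk of a record is one over the size of its equivalence class, so the maximum is reached in the smallest class. This is an assumption of the example rather than a documented property of pycanon, and the data is invented.

```python
import pandas as pd

# Toy anonymized table; the column names are illustrative only.
data_anon = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip": ["130**", "130**", "130**", "148**", "148**"],
})
quasi_ident = ["age", "zip"]

ec_sizes = data_anon.groupby(quasi_ident).size()

# The smallest equivalence class gives the highest assumed risk.
max_risk = 1.0 / int(ec_sizes.min())

print(max_risk)
```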

pycanon.metrics.sa_entropy(data_anon: DataFrame, sens_attr: str) → float

Calculate Shannon Entropy for a sensitive attribute.

Parameters:
  • data_anon (pandas dataframe) – dataframe with the anonymized data.

  • sens_attr (string) – name of the sensitive attribute for which to calculate the entropy.

Returns:

Shannon entropy for the sensitive attribute.

Return type:

float
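
For reference, Shannon entropy of a column can be computed from the relative frequencies of its values. The sketch below uses base-2 logarithms, which is an assumption, since the logarithm base used by the library is not stated here. The data is invented.

```python
import math

import pandas as pd

# Toy anonymized table with one sensitive column (illustrative only).
data_anon = pd.DataFrame({"disease": ["flu", "flu", "cold", "asthma"]})
sens_attr = "disease"

# Relative frequency of each sensitive value.
probs = data_anon[sens_attr].value_counts(normalize=True)

# Shannon entropy H = -sum(p * log2(p)); base 2 is assumed for this sketch.
entropy = -sum(p * math.log2(p) for p in probs)

print(entropy)
```

With pycanon installed, the corresponding call would be `pycanon.metrics.sa_entropy(data_anon, "disease")`.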

pycanon.metrics.sizes_ec(data: DataFrame, quasi_ident: List | ndarray) → dict

Calculate statistics associated with the equivalence classes.

Parameters:
  • data (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (list of strings) – list with the names of the dataframe columns that are quasi-identifiers.
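
The contents of the returned dictionary are not listed here. As a rough idea of what statistics on equivalence class sizes can look like, the sketch below builds a dictionary with pandas. The keys are hypothetical and chosen for this example only, and the data is invented.

```python
import pandas as pd

# Toy anonymized table with three equivalence classes (illustrative only).
data = pd.DataFrame({
    "age": ["20-30"] * 4 + ["30-40"] * 2 + ["40-50"] * 3,
    "zip": ["130**"] * 4 + ["148**"] * 2 + ["902**"] * 3,
})
quasi_ident = ["age", "zip"]

ec_sizes = data.groupby(quasi_ident).size()

# Hypothetical keys; the actual keys returned by the library may differ.
stats = {
    "n_classes": int(len(ec_sizes)),
    "min": int(ec_sizes.min()),
    "max": int(ec_sizes.max()),
    "mean": float(ec_sizes.mean()),
    "median": float(ec_sizes.median()),
}

print(stats)
```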

pycanon.metrics.stats_quasi_ident(data: DataFrame, quasi_ident: str) → dict

Calculate statistics associated with a given quasi-identifier.

Parameters:
  • data (pandas dataframe) – dataframe with the anonymized data.

  • quasi_ident (string) – name of the quasi-identifier (QI) to be analyzed.
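
Which statistics are returned is not specified here. The sketch below shows the kind of per-column summary that could be derived with pandas for a single quasi-identifier. The dictionary keys are hypothetical and the data is invented.

```python
import pandas as pd

# Toy anonymized table; the column name is illustrative only.
data = pd.DataFrame({
    "age": ["20-30", "20-30", "20-30", "30-40", "30-40"],
})
quasi_ident = "age"

# Frequency of each generalized value of the quasi-identifier.
counts = data[quasi_ident].value_counts()

# Hypothetical keys; the actual keys returned by the library may differ.
qi_stats = {
    "n_unique": int(counts.size),
    "most_frequent": counts.idxmax(),
    "counts": counts.to_dict(),
}

print(qi_stats)
```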