pycanon.metrics package
Module contents
Module with functions for calculating utility metrics.
- pycanon.metrics.average_ecsize(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray, sup=True) → float
Calculate the average equivalence class size metric.
- Parameters:
data_raw (pandas dataframe) – dataframe with the raw data under study.
data_anon (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
sup (boolean) – defaults to True. If True, suppression has been applied to the original dataset (some records may have been deleted).
- Returns:
average equivalence class size.
- Return type:
float
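To make the role of sup concrete, here is a minimal plain-Python sketch of one common definition from the literature, the normalized average equivalence class size: number of records divided by (number of equivalence classes × k). Whether pycanon uses exactly this normalization is an assumption, and the names ec_sizes and avg_ec_size_sketch are hypothetical. Records are plain dicts, such as those produced by DataFrame.to_dict("records").

```python
from collections import Counter

def ec_sizes(records, quasi_ident):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_ident) for r in records)

def avg_ec_size_sketch(records_raw, records_anon, quasi_ident, sup=True):
    # Assumption: with suppression, only the surviving records are counted.
    n = len(records_anon) if sup else len(records_raw)
    sizes = ec_sizes(records_anon, quasi_ident)
    k = min(sizes.values())  # smallest class size, i.e. the k of k-anonymity
    return n / (len(sizes) * k)

raw = [
    {"age": 34, "zip": "39001"},
    {"age": 36, "zip": "39002"},
    {"age": 41, "zip": "39005"},
    {"age": 47, "zip": "39007"},
    {"age": 52, "zip": "39011"},
]
# Generalized data; the fifth raw record was suppressed.
anon = [
    {"age": "30-39", "zip": "3900*"},
    {"age": "30-39", "zip": "3900*"},
    {"age": "40-49", "zip": "3900*"},
    {"age": "40-49", "zip": "3900*"},
]

with_sup = avg_ec_size_sketch(raw, anon, ["age", "zip"])
without_sup = avg_ec_size_sketch(raw, anon, ["age", "zip"], sup=False)
```

Under this definition a value of 1 means every class has exactly k records; larger values indicate classes bigger than strictly necessary.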
- pycanon.metrics.average_rir(data_anon: DataFrame, quasi_ident: List | ndarray) → float
Calculate the average re-identification risk metric.
- Parameters:
data_anon (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
- Returns:
average re-identification risk.
- Return type:
float
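As an illustration of the idea, the sketch below assumes that a record in an equivalence class of size n has re-identification risk 1/n and that the average is taken over records. This is a plain-Python sketch under those assumptions, not pycanon's implementation (which may average differently); average_rir_sketch is a hypothetical name.

```python
from collections import Counter

def average_rir_sketch(records, quasi_ident):
    """Mean of 1 / (size of the record's equivalence class) over all records."""
    def key(r):
        return tuple(r[q] for q in quasi_ident)

    sizes = Counter(key(r) for r in records)
    return sum(1 / sizes[key(r)] for r in records) / len(records)

# Four records share one class; the fifth is unique on its quasi-identifiers.
anon = [{"age": "30-39", "zip": "3900*"} for _ in range(4)]
anon.append({"age": "50-59", "zip": "3901*"})

avg_risk = average_rir_sketch(anon, ["age", "zip"])
```

With a per-record average, each class contributes a total risk of 1, so the result equals the number of classes divided by the number of records (here 2 / 5).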
- pycanon.metrics.classification_metric(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray, sens_att: List | ndarray) → float
Calculate the classification metric.
- Parameters:
data_raw (pandas dataframe) – dataframe with the raw data under study.
data_anon (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
sens_att (list of strings) – list with the names of the columns of the dataframe that are sensitive attributes.
- Returns:
classification metric.
- Return type:
float
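The entry above does not spell out the formula. A minimal sketch, assuming the usual definition of the classification metric from the literature: each record is penalized with 1 if it was suppressed or if its sensitive value is not the majority value of its equivalence class, and the total is divided by the number of raw records. How pycanon treats several sensitive attributes is not stated here, so the sketch simply combines them into one label; classification_metric_sketch is a hypothetical name.

```python
from collections import Counter, defaultdict

def classification_metric_sketch(records_raw, records_anon, quasi_ident, sens_att):
    # Collect the sensitive values (as one combined label) of each equivalence class.
    labels_by_class = defaultdict(list)
    for r in records_anon:
        key = tuple(r[q] for q in quasi_ident)
        labels_by_class[key].append(tuple(r[s] for s in sens_att))
    # Penalty 1 for every record whose label is not the majority label of its class.
    minority = sum(
        len(labels) - Counter(labels).most_common(1)[0][1]
        for labels in labels_by_class.values()
    )
    # Penalty 1 for every suppressed record (assumed: raw rows missing from anon).
    suppressed = len(records_raw) - len(records_anon)
    return (minority + suppressed) / len(records_raw)

raw = [
    {"age": 31, "disease": "flu"},
    {"age": 35, "disease": "flu"},
    {"age": 38, "disease": "cancer"},
    {"age": 44, "disease": "flu"},
    {"age": 59, "disease": "asthma"},
]
# Ages generalized to ranges; the last raw record was suppressed.
anon = [
    {"age": "30-39", "disease": "flu"},
    {"age": "30-39", "disease": "flu"},
    {"age": "30-39", "disease": "cancer"},
    {"age": "40-49", "disease": "flu"},
]

cm = classification_metric_sketch(raw, anon, ["age"], ["disease"])
```

Here one record ("cancer" in the 30-39 class) is in the minority and one record was suppressed, giving (1 + 1) / 5.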
- pycanon.metrics.discernability_metric(data_raw: DataFrame, data_anon: DataFrame, quasi_ident: List | ndarray) → float
Calculate the discernability metric.
- Parameters:
data_raw (pandas dataframe) – dataframe with the raw data under study.
data_anon (pandas dataframe) – dataframe with the anonymized data. The metric assumes that all equivalence classes have more than k records and gives each suppressed record a penalty equal to the size of the input dataset.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
- Returns:
discernability metric.
- Return type:
float
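The penalty scheme described for data_anon can be sketched in plain Python as follows. The sketch assumes the standard form of the discernability metric: every record in an equivalence class of size s is penalized with s (so the class contributes s²), and every suppressed record is penalized with the size of the input dataset. It also assumes that suppressed records are exactly the raw rows missing from the anonymized data. discernability_sketch is a hypothetical name, not pycanon's implementation.

```python
from collections import Counter

def discernability_sketch(records_raw, records_anon, quasi_ident):
    sizes = Counter(tuple(r[q] for q in quasi_ident) for r in records_anon)
    n_raw = len(records_raw)
    suppressed = n_raw - len(records_anon)  # assumed: rows dropped by suppression
    # Each class of size s contributes s * s; each suppressed record contributes n_raw.
    return float(sum(s * s for s in sizes.values()) + suppressed * n_raw)

raw = [{"age": a} for a in (31, 35, 44, 47, 59)]
# Two classes of two records each; the fifth raw record was suppressed.
anon = [{"age": "30-39"}, {"age": "30-39"}, {"age": "40-49"}, {"age": "40-49"}]

dm = discernability_sketch(raw, anon, ["age"])
```

In this example the two classes contribute 4 each and the suppressed record contributes 5. Lower values mean records remain easier to tell apart, that is, less information was lost.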
- pycanon.metrics.max_rir(data_anon: DataFrame, quasi_ident: List | ndarray) → float
Calculate the maximum re-identification risk (worst case).
- Parameters:
data_anon (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
- Returns:
maximum re-identification risk.
- Return type:
float
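A short sketch of the worst case, assuming that a record in an equivalence class of size n has re-identification risk 1/n, so the maximum risk is attained in the smallest class. This is a plain-Python illustration with the hypothetical name max_rir_sketch, not pycanon's code.

```python
from collections import Counter

def max_rir_sketch(records, quasi_ident):
    """Worst-case risk: 1 / (size of the smallest equivalence class)."""
    sizes = Counter(tuple(r[q] for q in quasi_ident) for r in records)
    return 1 / min(sizes.values())

# One class with two records and one class with three records.
anon = (
    [{"age": "30-39", "zip": "3900*"} for _ in range(2)]
    + [{"age": "40-49", "zip": "3900*"} for _ in range(3)]
)

worst = max_rir_sketch(anon, ["age", "zip"])
```

Under this assumption the result is simply 1/k for a k-anonymous dataset (here k = 2).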
- pycanon.metrics.sa_entropy(data_anon: DataFrame, sens_attr: str) → float
Calculate Shannon Entropy for a sensitive attribute.
- Parameters:
data_anon (pandas dataframe) – dataframe with the anonymized data.
sens_attr (string) – name of the sensitive attribute for which the entropy is calculated.
- Returns:
Shannon entropy for the sensitive attribute.
- Return type:
float
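For reference, Shannon entropy is H = −Σ p(v) · log p(v), summed over the distinct values v of the sensitive attribute, where p(v) is the relative frequency of v. The sketch below computes it in plain Python. The logarithm base is an assumption (base 2, giving bits); the entry above does not say which base pycanon uses. sa_entropy_sketch is a hypothetical name.

```python
import math
from collections import Counter

def sa_entropy_sketch(records, sens_attr):
    """Shannon entropy (in bits) of the values in the column sens_attr."""
    counts = Counter(r[sens_attr] for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

anon = [
    {"disease": "flu"},
    {"disease": "flu"},
    {"disease": "cancer"},
    {"disease": "asthma"},
]

h = sa_entropy_sketch(anon, "disease")
```

The frequencies here are 1/2, 1/4 and 1/4, so the entropy is 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits. A column with a single repeated value has entropy 0.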
- pycanon.metrics.sizes_ec(data: DataFrame, quasi_ident: List | ndarray) → dict
Calculate statistics associated with the equivalence classes.
- Parameters:
data (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (list of strings) – list with the names of the columns of the dataframe that are quasi-identifiers.
- pycanon.metrics.stats_quasi_ident(data: DataFrame, quasi_ident: str) → dict
Calculate statistics associated with a given quasi-identifier.
- Parameters:
data (pandas dataframe) – dataframe with the anonymized data.
quasi_ident (string) – name of the quasi-identifier to be analyzed.