...
| Week | Task | Status | Comments |
|---|---|---|---|
| 20-May | Study work: state of the art on the models, optimization, and evaluation | Done | Look for optimization techniques and how anonymization models are evaluated. |
| 27-May | Finalize dataset and libraries to use (suppression, renaming, etc.) | Done | Kubernetes logs/metrics, OpenStack logs/metrics, or any data that contains PII. |
| 3-June | Anonymization impact on the model's utility | | |
| 10-June | | | |
| 17-June | Containerization and the APIs | | |
| 24-June | Automation using Python | | |
| 1-July | Testing of the containerized architecture | | |
| 8-July | NLP model for anonymizing telco data | | |
| 15-July | | | |
| 22-July | | | |
| 29-July | | | |
| 5-Aug | Evaluation of the model | | |
| 12-Aug | Integration of the developed model with the architecture | | |
| 19-Aug | Documentation and release of the code | | |
| 26-Aug | [BUFFER] | | |
...
- https://aclanthology.org/2021.acl-long.323.pdf (Showcases the problems and the evaluation methodology for anonymization models)
- https://www.researchgate.net/publication/347730431_Anonymization_Techniques_for_Privacy_Preserving_Data_Publishing_A_Comprehensive_Survey (A survey for different types of techniques)
Datasets:
Key-points:
- Although log data by itself does not contain much PII, combining it with data of comparable size can yield a dataset well suited to the anonymization problem.
- I found a supermarket dataset containing nearly every common type of PII; another reason for choosing it was the feasibility of evaluating the degradation in the model's predictions and performance: https://data.world/2918diy/global-superstore. Utility can be evaluated via predictive tasks such as:
  - Segmenting and targeting high-value customers.
  - Predicting future sales and optimizing pricing.
  - Recommending products and personalizing the experience.
- I also found several telecommunications and related datasets that could be considered for anonymization, provided certain PIIs are introduced:
  - https://github.com/logpai/loghub/blob/master/Linux/Linux_2k.log_structured.csv, https://www.kaggle.com/datasets/omduggineni/loghub-ssh-log-data, https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs/data: log data; can't be used solely for evaluation
  - https://data.world/city-of-ny/tbgj-tdd6 and https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset: location-specific data
  - https://www.kaggle.com/datasets/stackoverflow/stackoverflow?select=users: name, about_me data, and location
  - https://www.kaggle.com/datasets/uciml/adult-census-income: name, age, relationship, race, education, occupation, income (ideal for evaluation)
Libraries and Methods:
Methods:
Suppression: This removes sensitive information entirely.
- Advantages: Simple, strong anonymization.
- Disadvantages: Data loss, may affect analysis depending on what's removed.
- Impact on Models: Significant degradation, especially if removing features crucial for prediction.
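A minimal suppression sketch (the field names and record layout are illustrative, not taken from any of the datasets above):

```python
# Suppression: drop sensitive fields entirely.
# Assumption: records are plain dicts; the field names are illustrative.
SENSITIVE_FIELDS = {"name", "email", "phone"}

def suppress(record: dict, sensitive: set = SENSITIVE_FIELDS) -> dict:
    """Return a copy of the record with all sensitive fields removed."""
    return {k: v for k, v in record.items() if k not in sensitive}

row = {"name": "Jane Doe", "email": "jane@example.com", "segment": "Consumer", "sales": 261.96}
print(suppress(row))  # {'segment': 'Consumer', 'sales': 261.96}
```

Whatever is dropped here is gone for good, which is exactly why the utility impact depends on which features the downstream model needed.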
Pseudonymization: Replaces sensitive data with fictitious identifiers.
- Advantages: Preserves data structure, allows some analysis.
- Disadvantages: Not truly anonymous, re-identification risk with complex data.
- Impact on Models: Varies depending on replaced data. May require model retraining.
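A sketch of keyed pseudonymization (the secret key and the `user_` prefix are my own illustrative choices; any stable keyed hash works):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative key; must be stored separately from the data

def pseudonymize(value: str) -> str:
    """Replace a value with a stable fictitious identifier.

    A keyed hash keeps the mapping consistent across records, so joins and
    group-bys still work -- but this is pseudonymization, not anonymization:
    whoever holds the key can rebuild the mapping.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:10]}"

print(pseudonymize("jane.doe@example.com"))  # same input always yields the same token
```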
Generalization: Replaces specific details with broader categories (e.g., "John" -> "Male", or an exact age -> an age range).
- Advantages: Balances privacy and usability, less data loss than suppression.
- Disadvantages: May introduce bias or reduce information value for models.
- Impact on Models: Moderate degradation depending on the level of generalization. Retraining might be needed.
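For numeric attributes, generalization is essentially binning; a sketch (the bucket width of 10 is an arbitrary illustrative choice):

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with its bucket, e.g. 34 -> '30-39'.

    Wider buckets give stronger privacy at the cost of more information loss.
    """
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(generalize_age(34))             # 30-39
print(generalize_age(34, width=20))   # 20-39
```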
Tokenization with Masking: Replaces sensitive tokens (words/phrases) with symbols (****).
- Advantages: Easy to implement, protects specific data points.
- Disadvantages: Limited protection for contextual information, may affect readability.
- Impact on Models: Varies depending on masked tokens. May require feature engineering for models.
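A regex-based masking sketch (the email and phone patterns are deliberately simple illustrations, not production-grade PII detectors):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask(text: str, symbol: str = "****") -> str:
    """Replace sensitive tokens with a fixed symbol; surrounding context is kept."""
    return PHONE.sub(symbol, EMAIL.sub(symbol, text))

print(mask("Contact jane.doe@example.com or 555-123-4567 for access."))
# Contact **** or **** for access.
```

Note how the sentence structure survives, which is the "limited contextual protection" trade-off: the remaining context can still leak information about what was masked.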
Differential Privacy: Adds controlled noise to data to achieve statistical protection.
- Advantages: Strong privacy guarantees; allows some analysis with provable privacy bounds.
- Disadvantages: Complex implementation, can significantly impact data utility for models.
- Impact on Models: High potential for degradation due to added noise. Models might require significant adjustments.
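A minimal Laplace-mechanism sketch for a counting query (sensitivity 1 is specific to counts; other query types need their own sensitivity analysis):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-DP.

    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace(1/epsilon) noise suffices for this query.
    """
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(dp_count(1000, epsilon=0.5))  # noisy count near 1000; smaller epsilon => more noise
```

This also makes the utility impact concrete: every released statistic carries noise proportional to 1/epsilon, and a model trained on many such noisy statistics accumulates that error.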
Libraries:
Here are some popular libraries that implement these methods:
- Presidio (Python): Open-source library for identifying and anonymizing entities like names, locations, and dates. (https://github.com/microsoft/presidio)
- spaCy (Python): Powerful NLP library with built-in named entity recognition capabilities for anonymization tasks. (https://spacy.io/)
- Text Anonymizer (Python): Framework offering various anonymization techniques like suppression and generalization. (https://medium.com/@openredact/anonymizer-a-framework-for-text-anonymization-499855f639d4)
- ARX (Java): Open-source suite for anonymizing various data types, including text, with features like k-anonymity. (https://github.com/topics/data-anonymization?o=desc&s=updated) Haven't explored this much yet.
Evaluation:
Evaluating the effectiveness of anonymization will involve a trade-off between:
- Level of Anonymization: How well identities are protected.
- Data Utility: How well the anonymized data retains its usefulness for predictive analysis or models.
- Metrics like precision, recall, and F1-score can be used to assess how well the method identifies sensitive information.
- https://github.com/anonymous-NLP/anonymisation/blob/main/aggregated_annotations.pdf: I also plan to compare our anonymization against these reference annotations, to have an external check on the model's performance.
- However, the impact on models requires domain-specific evaluation. Some approaches that I will follow are:
- Compare model performance: Train and test models on original and anonymized data to see the accuracy drop.
- Evaluate information loss: Measure how much relevant information is lost due to anonymization.
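The precision/recall/F1 assessment mentioned above needs no libraries; a sketch assuming detections are represented as (token_index, label) pairs (real evaluations often use character spans instead):

```python
def prf1(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 for PII detection.

    Each (position, label) pair counts as one decision: present in both sets
    is a true positive, only in `predicted` a false positive, only in `gold`
    a false negative.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, "NAME"), (4, "EMAIL"), (9, "PHONE")}
pred = {(0, "NAME"), (4, "EMAIL"), (7, "PHONE")}  # one miss, one false alarm
print(prf1(pred, gold))  # each metric is 2/3 here
```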
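For the information-loss side, one simple option is an LM-style loss for generalized numeric columns (a sketch; the interval representation and explicit domain bounds are my assumptions):

```python
def generalization_loss(intervals, domain_min: float, domain_max: float) -> float:
    """Average information loss in [0, 1] for a generalized numeric column.

    A value generalized to the interval [lo, hi] loses (hi - lo) / (domain
    width) of the attribute's resolution; a suppressed value (None) loses
    everything (loss 1.0).
    """
    span = domain_max - domain_min
    losses = [1.0 if iv is None else (iv[1] - iv[0]) / span for iv in intervals]
    return sum(losses) / len(losses)

ages = [(30, 39), (30, 39), (50, 59), None]  # None = suppressed
print(generalization_loss(ages, 0, 90))  # (3 * 9/90 + 1.0) / 4 = 0.325
```

Tracking this number alongside the model's accuracy drop makes the privacy/utility trade-off explicit for each anonymization setting.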