...
| Week | Task | Status | Comments |
|---|---|---|---|
| 20-May | Study work: state of the art on the models, optimization, and evaluation | Done | Look for optimization techniques and how anonymization models are evaluated. |
| 27-May | Finalize dataset and libraries to use (suppression, renaming, etc.) | Done | Kubernetes logs/metrics, OpenStack logs/metrics, or any data that contains PII. |
| 3-June | Anonymization impact on the model's utility | | |
| 10-June | | | |
| 17-June | Containerization and the APIs | | |
| 24-June | Automation using Python | | |
| 1-July | Testing of the containerized architecture | | |
| 8-July | NLP model for anonymizing telco data | | |
| 15-July | | | |
| 22-July | | | |
| 29-July | | | |
| 5-Aug | Evaluation of the model | | |
| 12-Aug | Integration of the developed model with the architecture | | |
| 19-Aug | Documentation and release of the code | | |
| 26-Aug | [BUFFER] | | |
...
- https://aclanthology.org/2021.acl-long.323.pdf (Showcases the problems and the evaluation methodology for anonymization models)
- https://www.researchgate.net/publication/347730431_Anonymization_Techniques_for_Privacy_Preserving_Data_Publishing_A_Comprehensive_Survey (A survey for different types of techniques)
Datasets:
Key-points:
- Although log data by itself does not contain much PII, combining it with data of comparable size can yield a dataset well suited to the anonymization problem.
- I found a supermarket dataset containing nearly every common type of PII; another reason for choosing it was the feasibility of evaluating the degradation in the model's predictions and performance: https://data.world/2918diy/global-superstore. Utility can be evaluated via predictive tasks such as:
  - Segmenting and targeting high-value customers.
  - Predicting future sales and optimizing pricing.
  - Recommending products and personalizing the experience.
- I also found several telecommunications and related datasets that could be considered for anonymization, provided certain PIIs are introduced:
  - https://github.com/logpai/loghub/blob/master/Linux/Linux_2k.log_structured.csv, https://www.kaggle.com/datasets/omduggineni/loghub-ssh-log-data, https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs/data: log data; can't be used solely for evaluation
  - https://data.world/city-of-ny/tbgj-tdd6 and https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset: location-specific data
  - https://www.kaggle.com/datasets/stackoverflow/stackoverflow?select=users: name, about_me data, and location
  - https://www.kaggle.com/datasets/uciml/adult-census-income: name, age, relationship, race, education, occupation, income (ideal for evaluation)
Libraries and Methods:
Methods:
Suppression: This removes sensitive information entirely.
- Advantages: Simple, strong anonymization.
- Disadvantages: Data loss, may affect analysis depending on what's removed.
- Impact on Models: Significant degradation, especially if removing features crucial for prediction.
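A minimal suppression sketch (the field names and record layout are illustrative, not taken from any of the datasets above):

```python
# Suppression: drop sensitive fields entirely.
# Assumption: records are plain dicts; the field names are illustrative.
SENSITIVE_FIELDS = {"name", "email", "phone"}

def suppress(record: dict, sensitive: set = SENSITIVE_FIELDS) -> dict:
    """Return a copy of the record with all sensitive fields removed."""
    return {k: v for k, v in record.items() if k not in sensitive}

row = {"name": "Jane Doe", "email": "jane@example.com", "segment": "Consumer", "sales": 261.96}
print(suppress(row))  # {'segment': 'Consumer', 'sales': 261.96}
```

Whatever is dropped here is gone for good, which is exactly why the utility impact depends on which features the downstream model needed.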
Pseudonymization: Replaces sensitive data with fictitious identifiers.
- Advantages: Preserves data structure, allows some analysis.
- Disadvantages: Not truly anonymous, re-identification risk with complex data.
- Impact on Models: Varies depending on replaced data. May require model retraining.
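A sketch of keyed pseudonymization (the secret key and the `user_` prefix are my own illustrative choices; any stable keyed hash works):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative key; must be stored separately from the data

def pseudonymize(value: str) -> str:
    """Replace a value with a stable fictitious identifier.

    A keyed hash keeps the mapping consistent across records, so joins and
    group-bys still work -- but this is pseudonymization, not anonymization:
    whoever holds the key can rebuild the mapping.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:10]}"

print(pseudonymize("jane.doe@example.com"))  # same input always yields the same token
```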
Generalization: Replaces specific details with broader categories (e.g., "John" -> "Male", or an exact age -> an age range).
- Advantages: Balances privacy and usability, less data loss than suppression.
- Disadvantages: May introduce bias or reduce information value for models.
- Impact on Models: Moderate degradation depending on the level of generalization. Retraining might be needed.
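For numeric attributes, generalization is essentially binning; a sketch (the bucket width of 10 is an arbitrary illustrative choice):

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with its bucket, e.g. 34 -> '30-39'.

    Wider buckets give stronger privacy at the cost of more information loss.
    """
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(generalize_age(34))             # 30-39
print(generalize_age(34, width=20))   # 20-39
```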
Tokenization with Masking: Replaces sensitive tokens (words/phrases) with symbols (****).
- Advantages: Easy to implement, protects specific data points.
- Disadvantages: Limited protection for contextual information, may affect readability.
- Impact on Models: Varies depending on masked tokens. May require feature engineering for models.
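A regex-based masking sketch (the email and phone patterns are deliberately simple illustrations, not production-grade PII detectors):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask(text: str, symbol: str = "****") -> str:
    """Replace sensitive tokens with a fixed symbol; surrounding context is kept."""
    return PHONE.sub(symbol, EMAIL.sub(symbol, text))

print(mask("Contact jane.doe@example.com or 555-123-4567 for access."))
# Contact **** or **** for access.
```

Note how the sentence structure survives, which is the "limited contextual protection" trade-off: the remaining context can still leak information about what was masked.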
Differential Privacy: Adds controlled noise to data to achieve statistical protection.
- Advantages: Strong privacy guarantees; allows some analysis with provable privacy bounds.
- Disadvantages: Complex implementation, can significantly impact data utility for models.
- Impact on Models: High potential for degradation due to added noise. Models might require significant adjustments.
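A minimal Laplace-mechanism sketch for a counting query (sensitivity 1 is specific to counts; other query types need their own sensitivity analysis):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-DP.

    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace(1/epsilon) noise suffices for this query.
    """
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(dp_count(1000, epsilon=0.5))  # noisy count near 1000; smaller epsilon => more noise
```

This also makes the utility impact concrete: every released statistic carries noise proportional to 1/epsilon, and a model trained on many such noisy statistics accumulates that error.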
Libraries:
Here are some popular libraries that implement these methods:
- Presidio (Python): Open-source library for identifying and anonymizing entities like names, locations, and dates. (https://github.com/microsoft/presidio)
- spaCy (Python): Powerful NLP library with built-in named entity recognition capabilities for anonymization tasks. (https://spacy.io/)
- Text Anonymizer (Python): Framework offering various anonymization techniques like suppression and generalization. (https://medium.com/@openredact/anonymizer-a-framework-for-text-anonymization-499855f639d4)
- ARX (Java): Open-source suite for anonymizing various data types, including text, with features like k-anonymity. (https://github.com/topics/data-anonymization?o=desc&s=updated) Haven't explored this much yet.
Evaluation:
Evaluating the effectiveness of anonymization will involve a trade-off between:
- Level of Anonymization: How well identities are protected.
- Data Utility: How well the anonymized data retains its usefulness for predictive analysis or models.
- Metrics like precision, recall, and F1-score can be used to assess how well the method identifies sensitive information.
- https://github.com/anonymous-NLP/anonymisation/blob/main/aggregated_annotations.pdf: I also plan to compare our anonymization against these reference annotations, to have an external check on the model's performance.
- However, the impact on models requires domain-specific evaluation. Some approaches that I will follow are:
- Compare model performance: Train and test models on original and anonymized data to see the accuracy drop.
- Evaluate information loss: Measure how much relevant information is lost due to anonymization.
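The precision/recall/F1 assessment mentioned above needs no libraries; a sketch assuming detections are represented as (token_index, label) pairs (real evaluations often use character spans instead):

```python
def prf1(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 for PII detection.

    Each (position, label) pair counts as one decision: present in both sets
    is a true positive, only in `predicted` a false positive, only in `gold`
    a false negative.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, "NAME"), (4, "EMAIL"), (9, "PHONE")}
pred = {(0, "NAME"), (4, "EMAIL"), (7, "PHONE")}  # one miss, one false alarm
print(prf1(pred, gold))  # each metric is 2/3 here
```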
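For the information-loss side, one simple option is an LM-style loss for generalized numeric columns (a sketch; the interval representation and explicit domain bounds are my assumptions):

```python
def generalization_loss(intervals, domain_min: float, domain_max: float) -> float:
    """Average information loss in [0, 1] for a generalized numeric column.

    A value generalized to the interval [lo, hi] loses (hi - lo) / (domain
    width) of the attribute's resolution; a suppressed value (None) loses
    everything (loss 1.0).
    """
    span = domain_max - domain_min
    losses = [1.0 if iv is None else (iv[1] - iv[0]) / span for iv in intervals]
    return sum(losses) / len(losses)

ages = [(30, 39), (30, 39), (50, 59), None]  # None = suppressed
print(generalization_loss(ages, 0, 90))  # (3 * 9/90 + 1.0) / 4 = 0.325
```

Tracking this number alongside the model's accuracy drop makes the privacy/utility trade-off explicit for each anonymization setting.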