...

Week | Task | Status | Comments
20-May | Study work: state of the art on models, optimization, and evaluation | Done | Look for optimization techniques and how anonymization models are evaluated.
27-May | Finalize the dataset and the libraries to use (suppression, rename, etc.) | Done | Kubernetes logs/metrics, OpenStack logs/metrics, or any other data that contains PII
3-June | Anonymization impact on the model's utility | |
10-June | | |
17-June | Containerization and the APIs | |
24-June | Automation using Python | |
1-July | Testing of the containerized architecture | |
8-July | NLP model for anonymizing telco data | |
15-July | | |
22-July | | |
29-July | | |
5-Aug | Evaluation of the model | |
12-Aug | Integration of the developed model with the architecture | |
19-Aug | Documentation and release of the code | |
26-Aug | [BUFFER] | |

...

  1. https://aclanthology.org/2021.acl-long.323.pdf (Showcases the problems and the evaluation methodology for anonymization models)
  2. https://www.researchgate.net/publication/347730431_Anonymization_Techniques_for_Privacy_Preserving_Data_Publishing_A_Comprehensive_Survey (A survey of the different types of anonymization techniques)

Datasets:

Key-points:

Libraries and Methods:

Methods:

  • Suppression: This removes sensitive information entirely.

    • Advantages: Simple, strong anonymization.
    • Disadvantages: Data loss, may affect analysis depending on what's removed.
    • Impact on Models: Significant degradation, especially if removing features crucial for prediction.
  • Pseudonymization: Replaces sensitive data with fictitious identifiers.

    • Advantages: Preserves data structure, allows some analysis.
    • Disadvantages: Not truly anonymous, re-identification risk with complex data.
    • Impact on Models: Varies depending on replaced data. May require model retraining.
  • Generalization: Replaces specific details with broader categories (e.g., "John" -> "Male").

    • Advantages: Balances privacy and usability, less data loss than suppression.
    • Disadvantages: May introduce bias or reduce information value for models.
    • Impact on Models: Moderate degradation depending on the level of generalization. Retraining might be needed.
  • Tokenization with Masking: Replaces sensitive tokens (words/phrases) with symbols (****).

    • Advantages: Easy to implement, protects specific data points.
    • Disadvantages: Limited protection for contextual information, may affect readability.
    • Impact on Models: Varies depending on masked tokens. May require feature engineering for models.
  • Differential Privacy: Adds calibrated random noise to data or query results so that any single record has only a bounded effect on the output.

    • Advantages: Strong, provable privacy guarantees; allows some analysis within formal privacy bounds.
    • Disadvantages: Complex implementation, can significantly impact data utility for models.
    • Impact on Models: High potential for degradation due to added noise. Models might require significant adjustments.
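
To make the five methods above concrete, here is a minimal Python sketch applied to a single log line. Everything in it is an assumption for illustration: the two regular expressions, the sample log line, the `demo-key` secret, and all function names are hypothetical, and a real pipeline would need a proper PII detector rather than two regexes. The differential privacy function uses the Laplace mechanism on one numeric value, which is only one of several ways to add calibrated noise.

```python
import hashlib
import hmac
import random
import re

# Hypothetical patterns covering two kinds of PII in a log line.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def suppress(text):
    """Suppression: remove the sensitive values entirely."""
    return EMAIL.sub("", IPV4.sub("", text))


def pseudonymize(text, key=b"demo-key"):
    """Pseudonymization: replace each value with a stable keyed identifier.

    The same input always maps to the same identifier, so records stay
    linkable. The key is a placeholder and must be kept secret in practice.
    """
    def repl(match):
        digest = hmac.new(key, match.group().encode(), hashlib.sha256)
        return "id_" + digest.hexdigest()[:8]

    return EMAIL.sub(repl, IPV4.sub(repl, text))


def generalize(text):
    """Generalization: keep only the /16 prefix of an IP and the email domain."""
    text = IPV4.sub(
        lambda m: ".".join(m.group().split(".")[:2]) + ".0.0/16", text
    )
    return EMAIL.sub(lambda m: "user@" + m.group().split("@")[1], text)


def mask(text):
    """Masking: overwrite each sensitive token with a fixed symbol run."""
    return EMAIL.sub("****", IPV4.sub("****", text))


def laplace_noise(value, sensitivity, epsilon, rng):
    """Differential privacy: add Laplace noise with scale sensitivity/epsilon.

    The difference of two exponential samples with the same rate follows a
    Laplace distribution centred on zero.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise


# Made-up sample line, not taken from any real dataset.
line = "user alice@example.com logged in from 192.168.1.20"

print(suppress(line))
print(pseudonymize(line))
print(generalize(line))   # user user@example.com logged in from 192.168.0.0/16
print(mask(line))         # user **** logged in from ****
print(laplace_noise(100, sensitivity=1, epsilon=0.5, rng=random.Random(0)))
```

The outputs show the trade-off described in the list: suppression and masking leave nothing for a model to use, pseudonymization keeps records linkable, and generalization keeps a coarse but still informative value.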

Libraries:

Here are some popular libraries that implement these methods:

Evaluation:

Evaluating the effectiveness of anonymization will involve a trade-off between:

  • Level of Anonymization: How well identities are protected.
  • Data Utility: How well the anonymized data retains its usefulness for predictive analysis or models.
  1. Metrics like precision, recall, and F1-score can be used to assess how well the method identifies sensitive information.
  2. However, the impact on models requires domain-specific evaluation. Some approaches that I will follow are:
    1. Compare model performance: Train and test models on both the original and the anonymized data, then measure the drop in accuracy.
    2. Evaluate information loss: Measure how much relevant information is lost due to anonymization.
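
As a rough sketch of how these measurements could be computed in Python: the first function scores detected PII spans against a hand-labelled set using precision, recall, and F1, and the second is one possible proxy for information loss. Both function names, the span format, and the token-retention measure are my own assumptions for illustration, not an established metric from the references above.

```python
from collections import Counter


def detection_scores(predicted, gold):
    """Precision, recall and F1 over sets of (start, end) character spans.

    A predicted span counts as correct only if it matches a gold span exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def token_retention(original, anonymized):
    """Hypothetical information-loss proxy: share of original tokens kept."""
    original_tokens = Counter(original.split())
    anonymized_tokens = Counter(anonymized.split())
    kept = sum((original_tokens & anonymized_tokens).values())
    total = sum(original_tokens.values())
    return kept / total if total else 1.0


# Made-up spans: one of the two predictions matches one of the two gold spans.
gold_spans = {(5, 22), (38, 50)}
predicted_spans = {(5, 22), (30, 34)}
print(detection_scores(predicted_spans, gold_spans))  # (0.5, 0.5, 0.5)

print(token_retention(
    "user alice@example.com logged in from 192.168.1.20",
    "user **** logged in from ****",
))
```

Exact span matching is the strictest choice; partial-overlap scoring would be more forgiving. Token retention ignores whether the lost tokens actually mattered to the model, so it complements the model comparison in step 1 rather than replacing it.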