Using NLP for Anonymizing Telco Data

Detailed workflow

Week	Task	Status	Comments
20-May	Study work: state of the art on the models, optimization, and evaluation	Done	Look for optimization techniques and how anonymization models are evaluated.
27-May	Finalizing dataset and libraries to use (suppression, renaming, etc.)		Kubernetes logs/metrics, OpenStack logs/metrics, or any other data that contains PII
3-June	Anonymization Impact on the Model's utility
10-June	Anonymization Impact on the Model's utility
17-June	Containerization and the APIs
24-June	Automation using Python
1-July	Testing of the containerized Architecture
8-July	NLP Model for anonymizing Telco Data
15-July
22-July
29-July
5-Aug	Evaluation of the Model
12-Aug	Integration of the developed model with the architecture
19-Aug	Documentation and release of the code.
26-Aug	[BUFFER]

Proposed architecture:

API end-points:

1. Tell where the Raw-Data exists (file-path, url, etc.)

2. Start the anonymization process.

3. Tell where to put the anonymized data (file-path, url, etc.)

4. Receive notification once anonymization is complete (SUCCESS or ERROR).

Endpoints 1 and 4 can be handled through configuration alone.
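The four end-points above can be sketched as a small job runner. This is only an illustration under stated assumptions: the names `JobConfig`, `start_anonymization` and the `notify` callback are hypothetical, only local file paths are handled (not URLs), and the anonymizer is passed in as a plain function.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class JobConfig:
    """Hypothetical configuration covering end-points 1, 3 and 4."""
    source: str                          # 1. where the raw data lives
    destination: str                     # 3. where to put the anonymized data
    notify: Callable[[str, str], None]   # 4. called with (status, detail)


def start_anonymization(config: JobConfig, anonymize: Callable[[str], str]) -> str:
    """2. Run the anonymization and report SUCCESS or ERROR."""
    try:
        raw = Path(config.source).read_text(encoding="utf-8")
        Path(config.destination).write_text(anonymize(raw), encoding="utf-8")
    except Exception as exc:  # any failure is reported, not raised
        config.notify("ERROR", str(exc))
        return "ERROR"
    config.notify("SUCCESS", config.destination)
    return "SUCCESS"
```

In a containerized setup the same three configuration values could come from environment variables or a mounted config file, leaving only the start call as a real API end-point.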

Commonly existing data anonymization techniques:

State of the art models for anonymizing textual data:

Named Entity Recognition (NER) based models:
- These models are trained to identify and classify named entities within text, such as people's names, locations, and organizations. Popular frameworks include spaCy and NLTK.
- Once identified, PII entities can be replaced with anonymized tokens (like "[NAME]") or masked with techniques like character-level redaction (e.g., "Jo** ***th").
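As a minimal sketch of the two replacement strategies just described, the snippet below assumes the entity spans have already been produced by an NER model (for example spaCy). The spans are hard-coded here, and the function names `replace_entities` and `redact` are illustrative only.

```python
def replace_entities(text, entities):
    """Replace each (start, end, label) span with a "[LABEL]" token."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def redact(value, keep=2):
    """Mask all but the first and last `keep` characters, preserving spaces."""
    if len(value) <= 2 * keep:
        return "*" * len(value)  # too short to leave anything visible
    return "".join(
        c if i < keep or i >= len(value) - keep or c.isspace() else "*"
        for i, c in enumerate(value)
    )


text = "John Smith called from Helsinki."
# Hypothetical NER output: (start, end, label) character spans.
entities = [(0, 10, "NAME"), (23, 31, "LOCATION")]
print(replace_entities(text, entities))  # [NAME] called from [LOCATION].
print(redact("John Smith"))              # Jo** ***th
```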
Rule-based systems:
- While simpler, rule-based systems can be effective for specific use cases.
- These systems rely on predefined rules and regular expressions to identify PII based on patterns (e.g., phone number formats, email address structures).
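A rule-based pass of this kind can be sketched with the standard `re` module. The patterns below are simplified assumptions rather than production-grade rules: the phone rule only matches numbers written with a leading `+` country code, and the IPv4 rule does not check that each octet is at most 255.

```python
import re

# (label, pattern) pairs, applied in order. Simplified, illustrative rules.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PHONE", re.compile(r"\+\d{1,3}(?:[\s-]?\d{2,4}){2,4}")),
]


def anonymize(text):
    """Replace every match of each rule with a "[LABEL]" token."""
    for label, pattern in RULES:
        text = pattern.sub(f"[{label}]", text)
    return text


line = "2024-05-20 10:00:01 user bob@example.com from 10.0.0.12 called +358 40 123 4567"
print(anonymize(line))
# 2024-05-20 10:00:01 user [EMAIL] from [IPV4] called [PHONE]
```

Requiring the `+` prefix keeps the phone rule from matching timestamps and other digit runs that are common in logs, at the cost of missing locally formatted numbers.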
Presidio:
- An open-source framework from Microsoft that provides a user-friendly interface for defining custom PII analyzers.
- The PII detected by these analyzers can then be anonymized using the pre-built anonymization pipeline.

Dozens of software products and APIs on the market perform anonymization using these three techniques under the hood.

Ways of evaluating anonymization models:

Anonymization models are evaluated along two basic dimensions: the degree of anonymization achieved and the loss in utility of the text.

Precision and Recall: These metrics are commonly used to assess the performance of NLP models in text anonymization. Precision measures the proportion of correctly anonymized information among all the information that the model labeled as sensitive, while recall measures the proportion of correctly anonymized information among all the sensitive information present in the text.
F1 Score: The F1 score is the harmonic mean of precision and recall. Because it accounts for both false positives and false negatives, it gives a single balanced measure of how well the model anonymizes text.
These metrics require ground truth (text in which the sensitive information has already been annotated) to validate the models.
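Given such ground truth, precision, recall and F1 can be computed by comparing annotated spans with predicted spans. The sketch below assumes exact-match scoring on `(start, end, label)` tuples, which is one of several possible matching schemes; the function name `span_metrics` is illustrative.

```python
def span_metrics(gold, predicted):
    """Return (precision, recall, f1) for exact-match PII spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: 4 sensitive spans, 3 predictions, 2 of them correct.
gold = [(0, 10, "NAME"), (23, 31, "LOCATION"), (40, 52, "PHONE"), (60, 75, "EMAIL")]
predicted = [(0, 10, "NAME"), (23, 31, "LOCATION"), (80, 85, "NAME")]
precision, recall, f1 = span_metrics(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```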
To measure the loss in utility, one approach is to train a model on the original text, train it again on the anonymized text, and compare the two results. The smaller the difference in performance, the better the anonymization preserves utility.
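The before/after comparison can be sketched with a deliberately tiny stand-in model. Everything here is an assumption made for illustration: the token-voting classifier, the function names and the toy log lines are placeholders for whatever downstream model and dataset are actually used.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count, for each token, how often it appears with each label."""
    counts = defaultdict(Counter)
    for text, label in samples:
        for token in text.lower().split():
            counts[token][label] += 1
    return counts


def predict(model, text, default):
    """Sum the label counts of all known tokens and pick the top label."""
    votes = Counter()
    for token in text.lower().split():
        votes.update(model.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


def accuracy(train_set, test_set):
    """Train on train_set and return the fraction of test_set predicted correctly."""
    model = train(train_set)
    default = Counter(label for _, label in train_set).most_common(1)[0][0]
    hits = sum(predict(model, text, default) == label for text, label in test_set)
    return hits / len(test_set)


def utility_drop(original, anonymized):
    """Accuracy lost by anonymizing; each argument is (train_set, test_set)."""
    return accuracy(*original) - accuracy(*anonymized)


original = (
    [("login failed for user alice", "error"), ("login ok for user bob", "ok")],
    [("login failed for user carol", "error"), ("login ok for user dave", "ok")],
)
anonymized = (
    [("login failed for user [NAME]", "error"), ("login ok for user [NAME]", "ok")],
    [("login failed for user [NAME]", "error"), ("login ok for user [NAME]", "ok")],
)
print(utility_drop(original, anonymized))  # 0.0
```

In this toy case the user names carry no signal for the task, so replacing them costs nothing; a drop well above zero would indicate that the anonymization removed information the model relied on.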
Human Evaluations: Human evaluations involve experts assessing the anonymized documents for re-identification risks and data utility preservation.

Reference Research papers:

https://aclanthology.org/2021.acl-long.323.pdf (Showcases the problems and the evaluation methodology for anonymization models)
https://www.researchgate.net/publication/347730431_Anonymization_Techniques_for_Privacy_Preserving_Data_Publishing_A_Comprehensive_Survey (A survey for different types of techniques)
