|
2 | 2 |
|
3 | 3 | We provide code to detect Names, Emails, IP addresses, Passwords API/SSH keys in text datasets (in particular datasets of source code). |
4 | 4 | ## NER approach |
5 | | -For the **NER** model based approach go to the `ner_model` folder. |
| 5 | +For the **NER** model based approach (e.g [StarPII](https://huggingface.co/bigcode/starpii)), please go to the `ner` folder. |
| 6 | + |
| 7 | +We provide the code used for training a PII NER model to detect : Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). You will also find the code (and `slurm` scripts) used for running PII Inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB. To replace secrets we used teh following tokens: |
| 8 | +<NAME>, <EMAIL>, <KEY>, <PASSWORD> |
| 9 | +To mask IP addresses, we randomly selected an IP address from 5~synthetic, private, non-internet-facing IP addresses of the same type. |
6 | 10 |
|
7 | 11 | ## Regex approach |
8 | 12 | Below we explain the regex based approach to dectect Emails, IP addresses adn keys only: |
9 | 13 | We use regexes for emails and IP addresses (they are adapted from [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)). And we use [detect-secrets](https://github.com/Yelp/detect-secrets) for finding secrets keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated. |
10 | 14 |
|
11 | | - |
12 | | -We also provide the code used for training and running [StarPII](https://huggingface.co/bigcode/starpii) in `ner_model` and NER model for PII detection on: Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). We provide the code (and `slurm` scripts) used for running Inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB. To replace secrets we used teh following tokens: |
13 | | -<NAME>, <EMAIL>, <KEY>, <PASSWORD> |
14 | | -To mask IP addresses, we randomly selected an IP address from 5~synthetic, private, non-internet-facing IP addresses of the same type. |
15 | | - |
16 | 15 | ## Usage of the regex approach |
17 | 16 | ``` |
18 | 17 | pip install -r requirements.txt |
|
0 commit comments