Commit a97ff5a

rearrange code
1 parent 427f44c commit a97ff5a

25 files changed, with 27 additions and 543 deletions

pii/README.md

Lines changed: 5 additions & 6 deletions
@@ -2,17 +2,16 @@

 We provide code to detect Names, Emails, IP addresses, Passwords, and API/SSH keys in text datasets (in particular datasets of source code).
 ## NER approach
-For the **NER** model based approach go to the `ner_model` folder.
+For the **NER** model based approach (e.g. [StarPII](https://huggingface.co/bigcode/starpii)), please go to the `ner` folder.
+
+We provide the code used for training a PII NER model to detect: Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). You will also find the code (and `slurm` scripts) used for running PII inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata); we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB GPUs. To replace secrets we used the following tokens:
+<NAME>, <EMAIL>, <KEY>, <PASSWORD>
+To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type.

 ## Regex approach
 Below we explain the regex based approach to detect Emails, IP addresses and keys only:
 We use regexes for emails and IP addresses (they are adapted from [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)), and we use [detect-secrets](https://github.com/Yelp/detect-secrets) for finding secret keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated.

-
-We also provide the code used for training and running [StarPII](https://huggingface.co/bigcode/starpii) in `ner_model`, an NER model for PII detection on: Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). We provide the code (and `slurm` scripts) used for running inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata); we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB GPUs. To replace secrets we used the following tokens:
-<NAME>, <EMAIL>, <KEY>, <PASSWORD>
-To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type.
-
 ## Usage of the regex approach
 ```
 pip install -r requirements.txt

pii/ner/README.md

Lines changed: 6 additions & 13 deletions
@@ -1,14 +1,7 @@
-# Fine-tuning Bigcode-Encoder on an NER task for PII detection
+# PII detection and Redaction using an NER model
+Here we provide code to:
+- fine-tune an encoder model (like [StarEncoder](https://huggingface.co/bigcode/starencoder)) for the task of PII detection (NER): see folder `pii_train_ner`
+- run inference with our fine-tuned [StarPII](https://huggingface.co/bigcode/starpii) for PII detection on multiple GPUs: see folder `pii_inference`
+- redact/mask PII detected with the model: see folder `pii_redaction`

-To run the training on the full dataset `bigcode/pii-full-ds`, use the following command:
-```bash
-python -m torch.distributed.launch \
---nproc_per_node number_of_gpus train.py \
---dataset_name bigcode/pii-full-ds \
---debug \
---learning_rate 2e-5 \
---train_batch_size 8 \
---bf16 \
---add_not_curated
-```
-Note that we use a global batch size of 64 (8*8 GPUs). To use only the curated dataset, remove the flag `--add_not_curated`.
+This is the code we used for PII anonymization in the 800GB dataset [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata).
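
For readers of this commit, the redaction rule described in the README text above (placeholder tokens for names, emails, keys and passwords; a randomly chosen synthetic private address of the same type for IPs) can be sketched roughly as below. This is a minimal illustration and not the code in `pii_redaction`: the tag names, the entity dictionary layout, the `redact` and `replacement_for` helpers, and the two address pools are all assumptions made for the example, and detected spans are assumed not to overlap.

```python
import ipaddress
import random

# Hypothetical pools of 5 private, non-internet-facing addresses per IP type.
# These values are illustrative; they are not the ones used for StarCoderData.
SYNTHETIC_IPV4 = ["10.0.0.1", "172.16.31.10", "192.168.3.11",
                  "192.168.127.12", "172.16.58.3"]
SYNTHETIC_IPV6 = ["fd00::1", "fd00::2", "fd00::3", "fd00::4", "fd00::5"]

# Placeholder tokens listed in the README; the dictionary keys are assumed tag names.
PLACEHOLDERS = {
    "NAME": "<NAME>",
    "EMAIL": "<EMAIL>",
    "KEY": "<KEY>",
    "PASSWORD": "<PASSWORD>",
}


def replacement_for(tag, value, rng):
    """Return the text that should replace one detected entity."""
    if tag == "IP_ADDRESS":
        # Keep the address family: IPv4 is replaced by IPv4, IPv6 by IPv6.
        version = ipaddress.ip_address(value).version
        pool = SYNTHETIC_IPV4 if version == 4 else SYNTHETIC_IPV6
        return rng.choice(pool)
    return PLACEHOLDERS[tag]


def redact(text, entities, rng=None):
    """Replace every entity span in `text`.

    `entities` is a list of dicts with 'tag', 'start' and 'end' keys, where
    start/end are character offsets of non-overlapping spans.
    """
    if rng is None:
        rng = random.Random()
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        pieces.append(text[cursor:ent["start"]])  # untouched text before the span
        value = text[ent["start"]:ent["end"]]
        pieces.append(replacement_for(ent["tag"], value, rng))
        cursor = ent["end"]
    pieces.append(text[cursor:])  # untouched tail
    return "".join(pieces)
```

Passing a seeded `random.Random` as `rng` makes the choice of replacement address reproducible, which can help when comparing redacted outputs across runs.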
