Devulgarization of Polish Texts
DEPOTx is a text style transfer framework for replacing vulgar expressions in Polish utterances with their non-vulgar equivalents while preserving the main characteristics of the text. The framework contains three pre-trained language models (GPT-2, GPT-3 and T-5) trained on a newly created parallel corpus of sentences containing vulgar expressions and their equivalents. The resulting models are evaluated by checking style transfer accuracy, content preservation and language quality.
Download
- Data, scripts and examples: http://mozart.ipipan.waw.pl/~cklamra/DEPOTx/DEPOTx.zip (please see the README file for more information)
- Models (GPT-2, T5-base, T5-large): http://mozart.ipipan.waw.pl/~cklamra/DEPOTx/models/
Requirements
Required packages can be installed by running:
pip3 install -r requirements.txt
Additionally, in order to run the evaluation scripts, please install Przetak (https://github.com/MarcinCiura/przetak) and update the value of the GOPATH variable in the evaluation/__init__.py file.
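For orientation, the change amounts to pointing GOPATH at the Go workspace where Przetak is installed; a minimal sketch, assuming the variable is defined at module level in evaluation/__init__.py (the actual layout of the file may differ):
# evaluation/__init__.py -- assumed layout, adjust to the real file
GOPATH = "/home/user/go"  # hypothetical path; set to your own Go workspace containing Przetak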
Usage
The notebooks/ directory contains examples of inference and evaluation.
Evaluation can be run from the command line:
python3 -m evaluation -o <original texts> -t <transferred texts>
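For orientation, inference with one of the downloadable T-5 checkpoints might look roughly as follows; this is a sketch only, and the local path, tokenizer class and generation settings are assumptions rather than the official notebook code:
# Minimal inference sketch (see notebooks/ for the official examples)
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_dir = "models/t5-base"  # hypothetical local path to an unpacked checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

text = "..."  # a Polish sentence containing a vulgar expression
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))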
Training details
All models were trained with the AdamW optimizer on an NVIDIA P100 GPU.
The following hyperparameter values were used to fine-tune the models (an illustrative optimizer setup is sketched after the lists below):
GPT-2
- number of epochs: 10
- batch size: 2
- learning rate: 0.0001
- epsilon: 1e-8
- warmup steps: 100
GPT-3
- batch size: 2
- learning rate multiplier: 0.2
- number of epochs: 5
- prompt loss weight: 0.1
- weight decay: 0
T-5 base
1st step:
- num. of epochs: 6
- batch size: 2
- learning rate: 0.0005
- epsilon: 1e-8
- warmup steps: 100
2nd step:
- num. of epochs: 6
- batch size: 2
- learning rate: 0.00005
- epsilon: 1e-8
- warmup steps: 100
T-5 large
1st step:
- num. of epochs: 10
- batch size: 2
- learning rate: 0.0001
- epsilon: 1e-8
- warmup steps: 100
2nd step:
- num. of epochs: 10
- batch size: 1
- learning rate: 0.00002
- epsilon: 1e-8
- warmup steps: 100
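As an illustration of the GPT-2 configuration above, the optimizer and warmup scheduler could be set up roughly as follows; this is a sketch only, and the base checkpoint name, scheduler type and step count are assumptions, not taken from the released training code:
# Sketch: AdamW with linear warmup matching the GPT-2 hyperparameters listed above
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder; a Polish GPT-2 checkpoint would be used
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-8)  # learning rate 0.0001, epsilon 1e-8
steps_per_epoch = 1000  # placeholder; in practice len(train_dataloader) with batch size 2
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,                      # warmup steps: 100
    num_training_steps=10 * steps_per_epoch,   # number of epochs: 10
)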
Evaluation
The performance of the models was assessed in three categories using automatic metrics.
- Style transfer accuracy (STA) was assessed using Przetak (https://github.com/MarcinCiura/przetak)
- Content preservation: cosine similarity (CS; sentence embeddings obtained using SBERT, https://www.sbert.net), word overlap (WO) and BLEU
- Language quality: perplexity (PPL) computed with papuGaPT-2 (https://huggingface.co/flax-community/papuGaPT2)
Models were evaluated against two baselines:
- Duplicate: a direct copy of the original text
- Delete: letters of words recognized by Przetak as vulgar were replaced with asterisks (except for the first letter)
Overall performance of the models was assessed using the geometric mean of the CS, STA and PPL scores.
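For orientation, the content-preservation and fluency metrics can be approximated as follows; this is a sketch only, the released evaluation module implements the exact procedure, and the SBERT checkpoint name here is an assumption:
# Sketch: cosine similarity via SBERT embeddings and perplexity via papuGaPT-2
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed multilingual SBERT checkpoint
emb = sbert.encode(["original text", "transferred text"], convert_to_tensor=True)
cs = util.cos_sim(emb[0], emb[1]).item()  # cosine similarity between original and output

tok = AutoTokenizer.from_pretrained("flax-community/papuGaPT2")
lm = AutoModelForCausalLM.from_pretrained("flax-community/papuGaPT2")
ids = tok("transferred text", return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss  # average cross-entropy over tokens
ppl = torch.exp(loss).item()         # perplexity of the transferred text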
Results
Method     | STA  | CS   | WO   | BLEU | PPL    | GM
-----------|------|------|------|------|--------|-----
Duplicate  | 0.38 | 1.00 | 1.00 | 1.00 | 146.86 | 1.78
Delete     | 1.00 | 0.93 | 0.84 | 0.92 | 246.80 | 4.14
GPT-2      | 0.90 | 0.86 | 0.71 | 0.86 | 258.44 | 3.71
GPT-3      | 0.88 | 0.92 | 0.79 | 0.92 | 359.12 | 3.58
T-5 base   | 0.90 | 0.97 | 0.85 | 0.95 | 187.03 | 4.10
T-5 large  | 0.93 | 0.97 | 0.86 | 0.95 | 170.02 | 4.31
Licence
CC-BY-NC 4.0
Citation
Klamra C., Wojdyga G., Żurowski S., Rosalska P., Kozłowska M., Ogrodniczuk M. (2022). Devulgarization of Polish Texts Using Pre-trained Language Models. In: Groen, D., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2022. Lecture Notes in Computer Science, vol. 13351, pp. 49–55. Springer, Cham. https://doi.org/10.1007/978-3-031-08754-7_7