Diff for "DEPOTx"

Differences between revisions 4 and 11 (spanning 7 versions)
Revision 4 as of 2022-01-21 11:14:19
Size: 3574
Comment:
Revision 11 as of 2022-01-21 14:23:46
Size: 3629
Comment:
Line 8: Line 8:
=== Download link === 
 * [[https://drive.google.com/file/d/1aG_zhzJCXikv8_dkppLu_rE48CqJC2pQ/view?usp=sharing]] (4.1 GB)
=== Download link ===

 * [[https://drive.google.com/file/d/1aG_zhzJCXikv8_dkppLu_rE48CqJC2pQ/view?usp=sharing|Archive]] (4.1 GB)
Line 15: Line 16:
 * `data/evaluation_corpus.txt` – the corpus used to evaluate the models, originally published by [[Grochowski (2008)|https://ksiegarnia.pwn.pl/Slownik-polskich-przeklenstw-i-wulgaryzmow,917947408,p.html]].
 * `data/evaluation_corpus.txt` – the corpus used to evaluate the models, originally published by [[https://ksiegarnia.pwn.pl/Slownik-polskich-przeklenstw-i-wulgaryzmow,917947408,p.html|Grochowski (2008)]].
Line 21: Line 22:
------------
Line 24: Line 24:
```
{{{
Line 26: Line 26:
```
}}}
Line 28: Line 28:
Additionally, in order to run evaluation scripts, please install [[Przetak|https://github.com/MarcinCiura/przetak]] and update the value of the `GOPATH` variable in `evaluation/__init__.py` file.
Additionally, in order to run evaluation scripts, please install [[https://github.com/MarcinCiura/przetak|Przetak]] and update the value of the `GOPATH` variable in `evaluation/__init__.py` file.
Line 32: Line 32:
-----
Line 37: Line 36:
```
{{{
Line 39: Line 38:
```
}}}
Line 45: Line 44:
* **Style transfer accuracy** (STA) was assessed using [[Przetak|https://github.com/MarcinCiura/przetak]]
* **Preservation of the content**: cosine similarity (CS, sentence embeddings were obtained using [SBERT](https://www.sbert.net)), word overlap (WO) and BLEU
* **Language quality**: perplexity (PPL) using [[papuGaPT-2|https://huggingface.co/flax-community/papuGaPT2]]
 * '''Style transfer accuracy''' (STA) was assessed using [[https://github.com/MarcinCiura/przetak|Przetak]]
 * '''Preservation of the content''': cosine similarity (CS, sentence embeddings were obtained using [[https://www.sbert.net|SBERT]]), word overlap (WO) and BLEU
 * '''Language quality''': perplexity (PPL) using [[https://huggingface.co/flax-community/papuGaPT2|papuGaPT-2]]
Line 52: Line 49:
* **Duplicate**: a direct copy of the original text
* **Delete**: letters of words recognized by Przetak as vulgar were replaced with asterisks (except for their the first letter)
 * '''Duplicate''': a direct copy of the original text
 * '''Delete''': letters of words recognized by Przetak as vulgar were replaced with asterisks (except for their the first letter)
Line 61: Line 58:
||Method ||STA ||CS ||WO ||BLEU ||PPL ||GM ||
||--- ||:---: ||:---: ||:---: ||:---: ||:---: ||:---: ||
||**Duplicate**||0.38 ||1 ||1 ||1 ||146.86||1.78 ||
||**Delete** ||1 ||0.93 ||0.84 ||0.92 ||246.80||4.14 ||
||**GPT-2** ||0.90 ||0.86 ||0.71 ||0.86 ||258.44||3.71 ||
||**GPT-3** ||0.88 ||0.92 ||0.79 ||0.92 ||359.12||3.58 ||
||**T-5 base** ||0.90 ||0.97 ||0.85 ||0.95 ||187.03||4.10 ||
||**T-5 large**||0.93 ||0.97 ||0.86 ||0.95 ||170.02||4.31 ||
||'''Method''' ||'''STA'''||'''CS'''||'''WO'''||'''BLEU'''||'''PPL'''||'''GM'''||
||'''Duplicate'''|| 0.38 || 1.00 || 1.00 || 1.00 || 146.86 || 1.78 ||
||'''Delete''' || 1.00 || 0.93 || 0.84 || 0.92 || 246.80 || 4.14 ||
||'''GPT-2''' || 0.90 || 0.86 || 0.71 || 0.86 || 258.44 || 3.71 ||
||'''GPT-3''' || 0.88 || 0.92 || 0.79 || 0.92 || 359.12 || 3.58 ||
||'''T-5 base''' || 0.90 || 0.97 || 0.85 || 0.95 || 187.03 || 4.10 ||
||'''T-5 large'''|| 0.93 || 0.97 || 0.86 || 0.95 || 170.02 || 4.31 ||
Line 73: Line 69:
CC-BY 4.0
CC-BY-NC 4.0
Line 77: Line 73:
Klamra C., Wojdyga G. Żurowski S., Rosalska P., Kozłowska M., Ogrodniczuk M. (2022). ''Devulgarization of Polish Texts Using Pre-trained Language Models'' (in preparation).
Klamra C., Wojdyga G., Żurowski S., Rosalska P., Kozłowska M., Ogrodniczuk M. (2022). ''Devulgarization of Polish Texts Using Pre-trained Language Models'' (in preparation).

Devulgarization of Polish Texts

DEPOT is a text style transfer framework for replacing vulgar expressions in Polish utterances with their non-vulgar equivalents while preserving the main characteristics of the text. The framework contains three pre-trained language models (GPT-2, GPT-3 and T-5) trained on a newly created parallel corpus of sentences containing vulgar expressions and their equivalents. The resulting models are evaluated by checking style transfer accuracy, content preservation and language quality.

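Download

  • Archive (4.1 GB): https://drive.google.com/file/d/1aG_zhzJCXikv8_dkppLu_rE48CqJC2pQ/view?usp=sharing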

Contents of the archive

  • data/train_corpus_raw.xlsx – the corpus of texts containing vulgar expressions, with their euphemistic substitutes and contexts.

  • data/train_corpus_preprocessed.tsv – the preprocessed parallel corpus of vulgar and non-vulgar texts used to train the models.

  • data/evaluation_corpus.txt – the corpus used to evaluate the models, originally published by Grochowski (2008).

  • data/evaluation_results_sentences.tsv – original texts and texts processed by the models.

  • data/evaluation_results_metrics.tsv – automatic evaluation results.

Requirements

Required packages can be installed by running:

pip3 install -r requirements.txt

Additionally, to run the evaluation scripts, install Przetak (https://github.com/MarcinCiura/przetak) and update the value of the GOPATH variable in the evaluation/__init__.py file.

Usage

The notebooks/ directory contains examples of inference and evaluation.

Evaluation can also be run from the command line:

python3 -m evaluation -o <original texts> -t <transferred texts>

Evaluation

The performance of the models was assessed in three categories using automatic metrics.

  • Style transfer accuracy (STA) was assessed using Przetak

  • Preservation of the content: cosine similarity (CS, sentence embeddings were obtained using SBERT), word overlap (WO) and BLEU

  • Language quality: perplexity (PPL) using papuGaPT-2
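
The content-preservation and language-quality metrics listed above can be illustrated with a minimal, dependency-free sketch. It is not the code shipped in the archive. The function names are hypothetical, the embedding vectors and token log-probabilities are assumed to come from SBERT and papuGaPT-2 respectively, and word overlap is assumed here to be the Jaccard overlap of lowercased whitespace tokens, since the page does not define it.

```python
import math
from collections.abc import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two sentence-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def word_overlap(original: str, transferred: str) -> float:
    """ASSUMED definition: Jaccard overlap of lowercased whitespace tokens."""
    a = set(original.lower().split())
    b = set(transferred.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

In practice the vectors passed to cosine_similarity would be sentence embeddings of the original and transferred text, and the log-probabilities passed to perplexity would be produced by the language model for each token of the transferred text.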

Models were evaluated against two baselines:

  • Duplicate: a direct copy of the original text

  • Delete: letters of words recognized by Przetak as vulgar were replaced with asterisks (except for the first letter)
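
The two baselines can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the is_vulgar argument is a hypothetical stand-in for the check that Przetak performs, and the text is split on whitespace only, so punctuation attached to a word is masked along with it.

```python
from collections.abc import Callable


def mask_word(word: str) -> str:
    """Keep the first character and replace every other character with '*'."""
    if not word:
        return word
    return word[0] + "*" * (len(word) - 1)


def duplicate_baseline(text: str) -> str:
    """Duplicate baseline: return the original text unchanged."""
    return text


def delete_baseline(text: str, is_vulgar: Callable[[str], bool]) -> str:
    """Delete baseline: mask every token that is_vulgar flags.

    is_vulgar is a hypothetical predicate standing in for Przetak.
    Tokens are whitespace-separated, a simplification made for this sketch.
    """
    return " ".join(
        mask_word(token) if is_vulgar(token) else token
        for token in text.split()
    )
```

For example, with a toy predicate that flags only the placeholder word "foobar", delete_baseline("this foobar thing", lambda w: w == "foobar") yields "this f***** thing".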

Overall performance of the models was assessed using the geometric mean of the CS, STA and PPL scores.
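
For reference, a geometric mean of n positive scores is the n-th root of their product. The sketch below shows that general computation only. The page does not say how PPL is scaled or inverted before being combined with CS and STA, so applying this function to the raw scores is not expected to reproduce the reported GM values; the function name is hypothetical.

```python
import math
from collections.abc import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean is defined here for positive values only")
    # Averaging logarithms avoids overflow when the values are large.
    return math.exp(sum(math.log(v) for v in values) / len(values))
```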

Results

Method      STA    CS     WO     BLEU   PPL      GM
Duplicate   0.38   1.00   1.00   1.00   146.86   1.78
Delete      1.00   0.93   0.84   0.92   246.80   4.14
GPT-2       0.90   0.86   0.71   0.86   258.44   3.71
GPT-3       0.88   0.92   0.79   0.92   359.12   3.58
T-5 base    0.90   0.97   0.85   0.95   187.03   4.10
T-5 large   0.93   0.97   0.86   0.95   170.02   4.31

Licence

CC-BY-NC 4.0

Citation

Klamra C., Wojdyga G., Żurowski S., Rosalska P., Kozłowska M., Ogrodniczuk M. (2022). Devulgarization of Polish Texts Using Pre-trained Language Models (in preparation).