Implementation of "AtTGen: Attribute Tree Generation for Real-World Attribute Joint Extraction", ACL 2023.
A lightweight attribute extraction model that achieves above 96% F1-score on the MEPAVE dataset ;)
Please install the dependencies first:

```bash
pip install -r requirements.txt
```

The command-line usage of `main.py` is as follows:

```
usage: main.py [-h] [--name NAME] [--do_train] [--do_eval]
               [--data_dir DATA_DIR] [--word_vocab WORD_VOCAB]
               [--ontology_vocab ONTOLOGY_VOCAB] [--tokenizer TOKENIZER]
               [--seed SEED] [--gpu_ids GPU_IDS] [--batch_size BATCH_SIZE]
               [--lr LR] [--epoch EPOCH] [--emb_dim EMB_DIM]
               [--encode_dim ENCODE_DIM] [--skip_subject SKIP_SUBJECT]
configuration
optional arguments:
  -h, --help            show this help message and exit
  --name NAME           Experiment name, for logging and saving models
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the test set.
  --data_dir DATA_DIR   The input data dir.
  --word_vocab WORD_VOCAB
                        The vocabulary file.
  --ontology_vocab ONTOLOGY_VOCAB
                        The ontology class file.
  --tokenizer TOKENIZER
                        The tokenizer type.
  --seed SEED           The random seed for initialization
  --gpu_ids GPU_IDS     The GPU ids
  --batch_size BATCH_SIZE
                        Total batch size for training.
  --lr LR               The initial learning rate for Adam.
  --epoch EPOCH         Total number of training epochs to perform.
  --emb_dim EMB_DIM     The dimension of the embedding
  --encode_dim ENCODE_DIM
                        The dimension of the encoding
  --skip_subject SKIP_SUBJECT
                        Whether to skip the subject
```
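For example, a training run that overrides some of the default hyper-parameters might look like this (a sketch: the flags come from the help text above, but the values and the experiment name `jave_tuned` are illustrative only):

```bash
# Illustrative values only; tune them for your own setup.
python3 main.py --do_train --gpu_ids=0 --data_dir=./data/jave/ \
    --ontology_vocab=attribute_vocab.json --tokenizer=char \
    --name=jave_tuned --batch_size=64 --lr=1e-3 --epoch=20 --seed=42
```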
Download the dataset to the `raw_data` folder, and run `python3 preprocess.py --dataset=xxxx` to preprocess the data.
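For example, for the MEPAVE/JAVE data described below (a sketch; the output location and vocabulary file name are inferred from the training commands later in this README):

```bash
# Convert the raw files in raw_data/jave/ into model-ready files under data/jave/
python3 preprocess.py --dataset=jave

# The training commands below expect, e.g., attribute_vocab.json in there
ls data/jave/
```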
Use the argument `--subject_guild True` to enable the subject guild function.
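For instance, to train with it enabled (a sketch combining the flag with the MEPAVE training command shown below; the flag spelling follows the repository text, and the experiment name is hypothetical):

```bash
python3 main.py --do_train --gpu_ids=0 --data_dir=./data/jave/ \
    --ontology_vocab=attribute_vocab.json --tokenizer=char \
    --name=jave_guild --subject_guild True
```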
A pre-processed NYT dataset is included in the `data` folder and can be used directly.
Benefiting from the parameter efficiency of this model, we can easily train it, run inference, and evaluate the trained model weights.
The trained model weights, obtained with the default hyper-parameters, are provided in `runs/jave_best`.
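A minimal sketch for evaluating that checkpoint, assuming `--name` selects the run directory under `runs/` (so `jave_best` would point at the bundled weights):

```bash
# Evaluate the bundled checkpoint in runs/jave_best on the test set
python3 main.py --do_eval --gpu_ids=0 --data_dir=./data/jave/ \
    --ontology_vocab=attribute_vocab.json --tokenizer=char --name=jave_best
```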
We use the sample data from MEPAVE to demonstrate the usage of AtTGen:
- You can check the samples in the `data/jave_sample` folder.
- You can try this demonstration by directly running `python3 playground.py`.
- Prepare the MEPAVE dataset

  Due to licensing restrictions, we cannot provide this dataset directly. Please apply for a license here, download the whole dataset, and put the `*.txt` files in the `raw_data/jave` folder.

- Preprocess the data

  ```bash
  python3 preprocess.py --dataset=jave
  ```

- Train the model

  ```bash
  python3 main.py --do_train --gpu_ids=0 --data_dir=./data/jave/ --ontology_vocab=attribute_vocab.json --tokenizer=char --name=jave
  ```

- Evaluate the model

  ```bash
  python3 main.py --do_eval --gpu_ids=0 --data_dir=./data/jave/ --ontology_vocab=attribute_vocab.json --tokenizer=char --name=jave
  ```

To train on the CNShipNet dataset:

```bash
python3 main.py --gpu_ids=0 --data_dir=./data/CNShipNet/ --word_vocab=word_vocab.json --ontology_vocab=attribute_vocab.json --tokenizer=chn --do_train
```

To train on the pre-processed NYT dataset:

```bash
python3 main.py --gpu_ids=0 --data_dir=./data/nyt/ --ontology_vocab=relation_vocab.json --tokenizer=base --do_train
```

If you found this work useful, please cite it as follows:

```bibtex
@inproceedings{li-etal-2023-attgen,
    title = "AtTGen: Attribute Tree Generation for Real-World Attribute Joint Extraction",
    author = "Li, Yanzeng  and
      Xue, Bingcong  and
      Zhang, Ruoyu   and
      Zou, Lei",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada"
}
```
