Scikit-learn style model finetuning for NLP
Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide variety of downstream tasks.
Finetune currently supports TensorFlow implementations of the following models:
- BERT, from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- RoBERTa, from "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
- GPT, from "Improving Language Understanding by Generative Pre-Training"
- GPT2, from "Language Models are Unsupervised Multitask Learners"
- TextCNN, from "Convolutional Neural Networks for Sentence Classification"
- Temporal Convolution Network, from "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling"
- DistilBERT, from "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT"
| Section | Description | 
|---|---|
| API Tour | Base models, configurables, and more | 
| Installation | How to install using pip or directly from source | 
| Finetune with Docker | Finetuning and inference within a Docker container | 
| Documentation | Full API documentation | 
Finetuning the base language model is as easy as calling Classifier.fit:
from finetune import Classifier
model = Classifier()               # Load base model
model.fit(trainX, trainY)          # Finetune base model on custom data
model.save(path)                   # Serialize the model to disk
...
model = Classifier.load(path)      # Reload models from disk at any time
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
Choose your desired base model from finetune.base_models:
from finetune.base_models import BERT, RoBERTa, GPT, GPT2, TextCNN, TCN
model = Classifier(base_model=BERT)
Optimize your model with a variety of configurables. A detailed list of all config items can be found in the finetune docs.
model = Classifier(low_memory_mode=True, lr_schedule="warmup_linear", max_length=512, l2_reg=0.01, oversample=True, ...)
The library supports finetuning for a number of tasks. A detailed description of all target models can be found in the finetune API reference.
from finetune import *
models = (Classifier, MultiLabelClassifier, MultiFieldClassifier, MultipleChoice, # Classify one or more inputs into one or more classes
          Regressor, OrdinalRegressor, MultifieldRegressor,                       # Regress on one or more inputs
          SequenceLabeler, Association,                                           # Extract tokens from a given class, or infer relationships between them
          Comparison, ComparisonRegressor, ComparisonOrdinalRegressor,            # Compare two documents for a given task
          LanguageModel, MultiTask,                                               # Further pretrain your base models
          DeploymentModel                                                         # Wrapper to optimize your serialized models for a production environment
          )
For example usage of each of these target types, see the finetune/datasets directory. To keep them simple and quick to run, these examples use smaller versions of the published datasets.
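The inline comments in the snippets above suggest that `predict` returns one dictionary of class probabilities per input. Assuming that output shape (check it against the finetune docs for your version), the following is a minimal sketch of reducing those dictionaries to a single label per input. The helper names `top_labels` and `confident_labels`, the threshold value, and the sample data are hypothetical and not part of finetune.

```python
def top_labels(predictions):
    """Return the highest-probability class name for each prediction dict."""
    return [max(probs, key=probs.get) for probs in predictions]


def confident_labels(predictions, threshold=0.5):
    """Like top_labels, but return None where no class reaches the threshold."""
    labels = []
    for probs in predictions:
        label = max(probs, key=probs.get)
        labels.append(label if probs[label] >= threshold else None)
    return labels


# Hypothetical data in the shape shown in the comments above;
# it is not real model output.
predictions = [
    {"class_1": 0.23, "class_2": 0.54, "class_3": 0.23},
    {"class_1": 0.40, "class_2": 0.35, "class_3": 0.25},
]

print(top_labels(predictions))        # ['class_2', 'class_1']
print(confident_labels(predictions))  # ['class_2', None]
```

Returning `None` for low-confidence inputs is one possible design choice; it lets downstream code route uncertain documents to manual review rather than forcing a label.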
If you have large amounts of unlabeled training data and only a small amount of labeled training data, you can finetune in two steps for best performance.
model = Classifier()               # Load base model
model.fit(unlabeledX)              # Finetune base model on unlabeled training data
model.fit(trainX, trainY)          # Continue finetuning with a smaller amount of labeled data
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
model.save(path)                   # Serialize the model to disk
Finetune can be installed directly from PyPI by using pip:
pip3 install finetune
or installed directly from source:
git clone -b master https://github.com/IndicoDataSolutions/finetune && cd finetune
python3 setup.py develop              # symlinks the git directory to your python path
pip3 install tensorflow-gpu --upgrade # or tensorflow-cpu
python3 -m spacy download en          # download spacy tokenizer
To run finetune on your host, you'll need a working install of tensorflow-gpu >= 1.14.0 and an up-to-date NVIDIA driver.
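As a rough way to confirm the TensorFlow requirement above before running anything heavier, the installed version can be compared against 1.14.0. This is only a sketch using the Python standard library (Python 3.8+). The helper names are hypothetical, and the list of distribution names to probe is an assumption about how TensorFlow was installed.

```python
import re
from importlib import metadata

MIN_TF = (1, 14, 0)  # minimum version stated in the requirement above


def parse_version(text):
    """Return the first three numeric components of a version string."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that "2.3" compares like "2.3.0"
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_TF):
    """True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


def installed_tensorflow_version():
    """Return the installed TensorFlow version string, or None if not found."""
    # Assumed distribution names; adjust to match your environment.
    for dist in ("tensorflow-gpu", "tensorflow", "tensorflow-cpu"):
        try:
            return metadata.version(dist)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    found = installed_tensorflow_version()
    if found is None:
        print("TensorFlow does not appear to be installed")
    elif meets_minimum(found):
        print(f"TensorFlow {found} satisfies >= 1.14.0")
    else:
        print(f"TensorFlow {found} is older than 1.14.0")
```

This checks only the Python package version. It says nothing about the GPU driver, which still needs to be verified separately.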
You can optionally run the provided test suite to ensure installation completed successfully.
pip3 install pytest
pytest
If you'd prefer, you can also run finetune in a Docker container. The provided bash scripts assume you have a functional install of docker and nvidia-docker.
git clone https://github.com/IndicoDataSolutions/finetune && cd finetune
# For usage with NVIDIA GPUs
./docker/build_gpu_docker.sh      # builds a docker image
./docker/start_gpu_docker.sh      # starts a docker container in the background, forwards $PWD to /finetune
docker exec -it finetune bash # starts a bash session in the docker container
For CPU-only usage:
./docker/build_cpu_docker.sh
./docker/start_cpu_docker.sh
Full documentation and an API reference for finetune are available at finetune.indico.io.
