NERDA Models

This section covers the interface for NERDA models, which is implemented as its own Python class, NERDA.models.NERDA.

The interface enables you to easily

  • specify your own NERDA.models.NERDA model
  • train it
  • evaluate it
  • use it to predict entities in new texts.

NERDA

NERDA model

A NERDA model object containing a complete model configuration. The model can be trained with the train method. Afterwards, entities in new observations can be predicted with the predict and predict_text methods. The performance of the model can be evaluated on a set of new observations with the evaluate_performance method.

Examples:

Model for a VERY small subset (5 observations) of English NER data

>>> from NERDA.models import NERDA
>>> from NERDA.datasets import get_conll_data
>>> trn = get_conll_data('train', 5)
>>> valid = get_conll_data('valid', 5)
>>> tag_scheme = ['B-PER', 'I-PER', 'B-LOC', 'I-LOC',
                  'B-ORG', 'I-ORG', 'B-MISC', 'I-MISC']
>>> tag_outside = 'O'
>>> transformer = 'bert-base-multilingual-uncased'
>>> model = NERDA(transformer = transformer,
                  tag_scheme = tag_scheme,
                  tag_outside = tag_outside,
                  dataset_training = trn,
                  dataset_validation = valid)

Model for complete English NER data set CoNLL-2003 with modified hyperparameters

>>> trn = get_conll_data('train')
>>> valid = get_conll_data('valid')
>>> transformer = 'bert-base-multilingual-uncased'
>>> hyperparameters = {'epochs' : 3,
                       'warmup_steps' : 400,
                       'train_batch_size': 16,
                       'learning_rate': 0.0001}
>>> model = NERDA(transformer = transformer,
                  dataset_training = trn,
                  dataset_validation = valid,
                  tag_scheme = tag_scheme,
                  tag_outside = tag_outside,
                  dropout = 0.1,
                  hyperparameters = hyperparameters)
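
Once specified, the model can be trained, evaluated and used for prediction. A minimal sketch of the full workflow, continuing the objects defined in the examples above (the text passed to predict_text is purely illustrative):

>>> model.train()
>>> model.evaluate_performance(valid)
>>> model.predict_text('Jack Nicholson lives in Los Angeles')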

Attributes:

  • network (torch.nn.Module): the network for the Named Entity Recognition task.
  • tag_encoder (sklearn.preprocessing.LabelEncoder): encoder for the NER labels/tags.
  • transformer_model (transformers.PreTrainedModel): (Auto)Model derived from the transformer.
  • transformer_tokenizer (transformers.PreTrainedTokenizer): (Auto)Tokenizer derived from the transformer.
  • transformer_config (transformers.PretrainedConfig): (Auto)Config derived from the transformer.
  • train_losses (list): holds the training losses, once the model has been trained.
  • valid_loss (float): holds the validation loss, once the model has been trained.
__init__(self, transformer='bert-base-multilingual-uncased', device=None, tag_scheme=['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], tag_outside='O', dataset_training=None, dataset_validation=None, max_len=128, network=<class 'NERDA.networks.NERDANetwork'>, dropout=0.1, hyperparameters={'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 'learning_rate': 0.0001}, tokenizer_parameters={'do_lower_case': True}, validation_batch_size=8, num_workers=1) special

Initialize NERDA model

Parameters:

  • transformer (str): which pretrained 'huggingface' transformer to use. Default: 'bert-base-multilingual-uncased'.
  • device (str): the desired device to use for computation. If not provided by the user, we take a guess. Default: None.
  • tag_scheme (List[str]): all available NER tags for the given data set EXCLUDING the special outside tag, which is handled separately. Default: ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'].
  • tag_outside (str): the value of the special outside tag. Default: 'O'.
  • dataset_training (dict): the training data. Must consist of 'sentences' (word-tokenized sentences) and 'tags' (corresponding NER tags). You can see examples of how the data set should look by invoking get_dane_data() or get_conll_data(); a sketch of the expected format follows this list. Default: None, in which case the English CoNLL-2003 data set is used.
  • dataset_validation (dict): the validation data. Must consist of 'sentences' (word-tokenized sentences) and 'tags' (corresponding NER tags); same format as dataset_training. Default: None, in which case the English CoNLL-2003 data set is used.
  • max_len (int): the maximum sentence length (number of tokens after applying the transformer tokenizer). Sentences are truncated accordingly. Look at your data to get an impression of what a meaningful setting could be, and be aware that many transformers have a maximum accepted length. Default: 128.
  • network (torch.nn.Module): the network to be trained. Defaults to the generic NERDANetwork; it can be replaced with your own customized network architecture, which must however take the same arguments as NERDANetwork. Default: NERDA.networks.NERDANetwork.
  • dropout (float): dropout probability. Default: 0.1.
  • hyperparameters (dict): hyperparameters for model training. Default: {'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 'learning_rate': 0.0001}.
  • tokenizer_parameters (dict): parameters for the transformer tokenizer. Default: {'do_lower_case': True}.
  • validation_batch_size (int): batch size for validation. Default: 8.
  • num_workers (int): number of workers for the data loader. Default: 1.
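
For illustration, a minimal sketch of the expected data format (hypothetical toy data, not taken from an actual data set):

>>> dataset_training = {'sentences': [['Jens', 'lives', 'in', 'Copenhagen']],
                        'tags': [['B-PER', 'O', 'O', 'B-LOC']]}
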
Source code in NERDA/models.py
def __init__(self, 
             transformer: str = 'bert-base-multilingual-uncased',
             device: str = None, 
             tag_scheme: List[str] = [
                        'B-PER',
                        'I-PER', 
                        'B-ORG', 
                        'I-ORG', 
                        'B-LOC', 
                        'I-LOC', 
                        'B-MISC', 
                        'I-MISC'
                        ],
             tag_outside: str = 'O',
             dataset_training: dict = None,
             dataset_validation: dict = None,
             max_len: int = 128,
             network: torch.nn.Module = NERDANetwork,
             dropout: float = 0.1,
             hyperparameters: dict = {'epochs' : 4,
                                      'warmup_steps' : 500,
                                      'train_batch_size': 13,
                                      'learning_rate': 0.0001},
             tokenizer_parameters: dict = {'do_lower_case' : True},
             validation_batch_size: int = 8,
             num_workers: int = 1) -> None:
    """Initialize NERDA model

    Args:
        transformer (str, optional): which pretrained 'huggingface' 
            transformer to use. 
        device (str, optional): the desired device to use for computation. 
            If not provided by the user, we take a guess.
        tag_scheme (List[str], optional): All available NER 
            tags for the given data set EXCLUDING the special outside tag, 
            which is handled separately.
        tag_outside (str, optional): the value of the special outside tag. 
            Defaults to 'O'.
        dataset_training (dict, optional): the training data. Must consist 
            of 'sentences': word-tokenized sentences and 'tags': corresponding 
            NER tags. You can see examples of how the data set should look 
            by invoking get_dane_data() or get_conll_data().
            Defaults to None, in which case the English CoNLL-2003 data set is used. 
        dataset_validation (dict, optional): the validation data. Must consist
            of 'sentences': word-tokenized sentences and 'tags': corresponding 
            NER tags. You can see examples of how the data set should look 
            by invoking get_dane_data() or get_conll_data().
            Defaults to None, in which case the English CoNLL-2003 data set 
            is used.
        max_len (int, optional): the maximum sentence length (number of 
            tokens after applying the transformer tokenizer) for the transformer. 
            Sentences are truncated accordingly. Look at your data to get an 
            impression of what a meaningful setting could be. Also be aware 
            that many transformers have a maximum accepted length. Defaults 
            to 128. 
        network (torch.nn.Module, optional): network to be trained. Defaults
            to the generic `NERDANetwork`. Can be replaced with your own 
            customized network architecture. It must however take the same 
            arguments as `NERDANetwork`.
        dropout (float, optional): dropout probability. Defaults to 0.1.
        hyperparameters (dict, optional): Hyperparameters for the model. Defaults
            to {'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 
            'learning_rate': 0.0001}.
        tokenizer_parameters (dict, optional): parameters for the transformer 
            tokenizer. Defaults to {'do_lower_case' : True}.
        validation_batch_size (int, optional): batch size for validation. Defaults
            to 8.
        num_workers (int, optional): number of workers for data loader.
    """

    # set device automatically if not provided by user.
    if device is None:
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Device automatically set to:", self.device)
    else:
        self.device = device
        print("Device set to:", self.device)
    self.tag_scheme = tag_scheme
    self.tag_outside = tag_outside
    self.transformer = transformer  
    self.dataset_training = dataset_training
    self.dataset_validation = dataset_validation
    self.hyperparameters = hyperparameters
    tag_complete = [tag_outside] + tag_scheme
    self.max_len = max_len
    # fit encoder to _all_ possible tags.
    self.tag_encoder = sklearn.preprocessing.LabelEncoder()
    self.tag_encoder.fit(tag_complete)
    self.transformer_model = AutoModel.from_pretrained(transformer)
    self.transformer_tokenizer = AutoTokenizer.from_pretrained(transformer, **tokenizer_parameters)
    self.transformer_config = AutoConfig.from_pretrained(transformer)  
    self.network = network(self.transformer_model, self.device, len(tag_complete), dropout = dropout)
    self.network.to(self.device)
    self.validation_batch_size = validation_batch_size
    self.num_workers = num_workers
    self.train_losses = []
    self.valid_loss = np.nan
    self.quantized = False
    self.halved = False

evaluate_performance(self, dataset, return_accuracy=False, **kwargs)

Evaluate Performance

Evaluates the performance of the model on an arbitrary data set.

Parameters:

  • dataset (dict): data set that must consist of 'sentences' and NER 'tags'. You can see examples of how the data set should look by invoking get_dane_data() or get_conll_data(). Required.
  • kwargs: arbitrary keyword arguments for predict, for instance 'batch_size' and 'num_workers'.
  • return_accuracy (bool): return accuracy as well? Default: False.

Returns:

  DataFrame: DataFrame with performance numbers - F1-scores, Precision and Recall. Returns a dictionary with this DataFrame AND accuracy, if return_accuracy is set to True.

Source code in NERDA/models.py
def evaluate_performance(self, dataset: dict, 
                         return_accuracy: bool=False,
                         **kwargs) -> pd.DataFrame:
    """Evaluate Performance

    Evaluates the performance of the model on an arbitrary
    data set.

    Args:
        dataset (dict): Data set that must consist of
            'sentences' and NER 'tags'. You can see examples
            of how the data set should look by invoking
            get_dane_data() or get_conll_data().
        kwargs: arbitrary keyword arguments for predict. For
            instance 'batch_size' and 'num_workers'.
        return_accuracy (bool): Return accuracy
            as well? Defaults to False.


    Returns:
        DataFrame with performance numbers, F1-scores,
        Precision and Recall. Returns dictionary with
        this AND accuracy, if return_accuracy is set to
        True.
    """

    tags_predicted = self.predict(dataset.get('sentences'), 
                                  **kwargs)

    # compute F1 scores by entity type
    f1 = compute_f1_scores(y_pred = tags_predicted, 
                           y_true = dataset.get('tags'),
                           labels = self.tag_scheme,
                           average = None)

    # create DataFrame with performance scores (=F1)
    df = list(zip(self.tag_scheme, f1[2], f1[0], f1[1]))
    df = pd.DataFrame(df, columns = ['Level', 'F1-Score', 'Precision', 'Recall'])    

    # compute MICRO-averaged F1-scores and add to table.
    f1_micro = compute_f1_scores(y_pred = tags_predicted, 
                                 y_true = dataset.get('tags'),
                                 labels = self.tag_scheme,
                                 average = 'micro')
    f1_micro = pd.DataFrame({'Level' : ['AVG_MICRO'], 
                             'F1-Score': [f1_micro[2]],
                             'Precision': [np.nan],
                             'Recall': [np.nan]})
    df = pd.concat([df, f1_micro], ignore_index=True)

    # compute MACRO-averaged F1-scores and add to table.
    f1_macro = compute_f1_scores(y_pred = tags_predicted, 
                                 y_true = dataset.get('tags'),
                                 labels = self.tag_scheme,
                                 average = 'macro')
    f1_macro = pd.DataFrame({'Level' : ['AVG_MACRO'], 
                             'F1-Score': [f1_macro[2]],
                             'Precision': [np.nan],
                             'Recall': [np.nan]})
    df = pd.concat([df, f1_macro], ignore_index=True)

    # compute and return accuracy if desired
    if return_accuracy:
        accuracy = accuracy_score(y_pred = flatten(tags_predicted), 
                                  y_true = flatten(dataset.get('tags')))
        return {'f1':df, 'accuracy': accuracy}

    return df
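
A usage sketch, assuming a trained model and that a 'test' split is available through get_conll_data():

>>> test = get_conll_data('test')
>>> model.evaluate_performance(test)
>>> model.evaluate_performance(test, return_accuracy = True)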

half(self)

Convert weights from Float32 to Float16 to increase performance

Quantization and half precision inference are mutually exclusive.

Read more: https://pytorch.org/docs/master/generated/torch.nn.Module.html?highlight=half#torch.nn.Module.half

Returns: Nothing. Model is "halved" as a side-effect.

Source code in NERDA/models.py
def half(self):
    """Convert weights from Float32 to Float16 to increase performance

    Quantization and half precision inference are mutually exclusive.

    Read more: https://pytorch.org/docs/master/generated/torch.nn.Module.html?highlight=half#torch.nn.Module.half

    Returns: 
        Nothing. Model is "halved" as a side-effect.
    """
    assert not (self.halved), "Half precision already applied"
    assert not (self.quantized), "Can't run both quantization and half precision"

    self.network.half()
    self.halved = True

load_network_from_file(self, model_path='model.bin')

Load Pretrained NERDA Network from file

Loads weights for a pretrained NERDA Network from file.

Parameters:

  • model_path (str): path to the model file. Default: 'model.bin'.

Returns:

  str: message telling whether the weights for the network were loaded successfully.

Source code in NERDA/models.py
def load_network_from_file(self, model_path = "model.bin") -> str:
    """Load Pretrained NERDA Network from file

    Loads weights for a pretrained NERDA Network from file.

    Args:
        model_path (str, optional): Path for model file. 
            Defaults to "model.bin".

    Returns:
        str: message telling if weights for network were
        loaded successfully.
    """
    # TODO: change assert to Raise.
    assert os.path.exists(model_path), "File does not exist. You can download network with download_network()"
    self.network.load_state_dict(torch.load(model_path, map_location = torch.device(self.device)))
    self.network.device = self.device
    return f'Weights for network loaded from {model_path}'

predict(self, sentences, return_confidence=False, **kwargs)

Predict Named Entities in Word-Tokenized Sentences

Predicts word-tokenized sentences with trained model.

Parameters:

  • sentences (List[List[str]]): word-tokenized sentences. Required.
  • kwargs: arbitrary keyword arguments, for instance 'batch_size' and 'num_workers'.
  • return_confidence (bool): if True, return confidence scores for all predicted tokens. Default: False.

Returns:

  List[List[str]]: predicted tags for the sentences - one predicted tag/entity per word token.

Source code in NERDA/models.py
def predict(self, sentences: List[List[str]],
            return_confidence: bool = False,
            **kwargs) -> List[List[str]]:
    """Predict Named Entities in Word-Tokenized Sentences

    Predicts word-tokenized sentences with trained model.

    Args:
        sentences (List[List[str]]): word-tokenized sentences.
        kwargs: arbitrary keyword arguments. For instance
            'batch_size' and 'num_workers'.
        return_confidence (bool, optional): if True, return
            confidence scores for all predicted tokens. Defaults
            to False.

    Returns:
        List[List[str]]: Predicted tags for sentences - one
        predicted tag/entity per word token.
    """
    return predict(network = self.network, 
                   sentences = sentences,
                   transformer_tokenizer = self.transformer_tokenizer,
                   transformer_config = self.transformer_config,
                   max_len = self.max_len,
                   device = self.device,
                   tag_encoder = self.tag_encoder,
                   tag_outside = self.tag_outside,
                   return_confidence = return_confidence,
                   **kwargs)
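
A usage sketch, assuming a trained model; the input sentence and the shown output are purely illustrative:

>>> model.predict([['Arnold', 'Schwarzenegger', 'is', 'the', 'governor', 'of', 'California']])
[['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-LOC']]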

predict_text(self, text, return_confidence=False, **kwargs)

Predict Named Entities in a Text

Parameters:

  • text (str): text to predict entities in. Required.
  • kwargs: arbitrary keyword arguments, for instance 'batch_size' and 'num_workers'.
  • return_confidence (bool): if True, return confidence scores for all predicted tokens. Default: False.

Returns:

  tuple: word-tokenized sentences and predicted tags/entities.

Source code in NERDA/models.py
def predict_text(self, text: str, 
                 return_confidence:bool = False, **kwargs) -> list:
    """Predict Named Entities in a Text

    Args:
        text (str): text to predict entities in.
        kwargs: arbitrary keyword arguments. For instance
            'batch_size' and 'num_workers'.
        return_confidence (bool, optional): if True, return
            confidence scores for all predicted tokens. Defaults
            to False.

    Returns:
        tuple: word-tokenized sentences and predicted 
        tags/entities.
    """
    return predict_text(network = self.network, 
                        text = text,
                        transformer_tokenizer = self.transformer_tokenizer,
                        transformer_config = self.transformer_config,
                        max_len = self.max_len,
                        device = self.device,
                        tag_encoder = self.tag_encoder,
                        tag_outside = self.tag_outside,
                        return_confidence=return_confidence,
                        **kwargs)
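
A usage sketch, assuming a trained model; the input text is purely illustrative:

>>> sentences, predictions = model.predict_text('Cristiano Ronaldo plays for Juventus FC')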

quantize(self)

Apply dynamic quantization to increase performance.

Quantization and half precision inference are mutually exclusive.

Read more: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

Returns:

  Nothing. Applies dynamic quantization to the network as a side-effect.

Source code in NERDA/models.py
def quantize(self):
    """Apply dynamic quantization to increase performance.

    Quantization and half precision inference are mutually exclusive.

    Read more: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

    Returns:
        Nothing. Applies dynamic quantization to Network as a side-effect.
    """
    assert not (self.quantized), "Dynamic quantization already applied"
    assert not (self.halved), "Can't run both quantization and half precision"

    self.network = torch.quantization.quantize_dynamic(
        self.network, {torch.nn.Linear}, dtype=torch.qint8
    )
    self.quantized = True
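
A usage sketch of the two mutually exclusive inference speed-ups; apply at most one of them to a given model:

>>> model.quantize()
>>> # or, alternatively, on a model that has NOT been quantized:
>>> # model.half()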

save_network(self, model_path='model.bin')

Save Weights of NERDA Network

Saves weights for a fine-tuned NERDA Network to file.

Parameters:

  • model_path (str): path to the model file. Default: 'model.bin'.

Returns:

  None: Nothing. Saves the model to file as a side-effect.

Source code in NERDA/models.py
def save_network(self, model_path:str = "model.bin") -> None:
    """Save Weights of NERDA Network

    Saves weights for a fine-tuned NERDA Network to file.

    Args:
        model_path (str, optional): Path for model file. 
            Defaults to "model.bin".

    Returns:
        Nothing. Saves model to file as a side-effect.
    """
    torch.save(self.network.state_dict(), model_path)
    print(f"Network written to file {model_path}")

train(self)

Train Network

Trains the network from the NERDA model specification.

Returns:

  str: a message saying whether the model was trained successfully. The network in the 'network' attribute is trained as a side-effect. Training losses and validation loss are saved in the 'train_losses' and 'valid_loss' attributes respectively as side-effects.

Source code in NERDA/models.py
def train(self) -> str:
    """Train Network

    Trains the network from the NERDA model specification.

    Returns:
        str: a message saying if the model was trained successfully.
        The network in the 'network' attribute is trained as a 
        side-effect. Training losses and validation loss are saved 
        in the 'train_losses' and 'valid_loss' 
        attributes respectively as side-effects.
    """
    network, train_losses, valid_loss = train_model(network = self.network,
                                                    tag_encoder = self.tag_encoder,
                                                    tag_outside = self.tag_outside,
                                                    transformer_tokenizer = self.transformer_tokenizer,
                                                    transformer_config = self.transformer_config,
                                                    dataset_training = self.dataset_training,
                                                    dataset_validation = self.dataset_validation,
                                                    validation_batch_size = self.validation_batch_size,
                                                    max_len = self.max_len,
                                                    device = self.device,
                                                    num_workers = self.num_workers,
                                                    **self.hyperparameters)

    # attach as attributes to class
    setattr(self, "network", network)
    setattr(self, "train_losses", train_losses)
    setattr(self, "valid_loss", valid_loss)

    return "Model trained successfully"