NERDA Models

This section covers the interface for NERDA models, which is implemented as its own Python class, NERDA.models.NERDA.

The interface enables you to easily:

- specify your own NERDA.models.NERDA model
- train it
- evaluate it
- use it to predict entities in new texts.
NERDA

NERDA model.

A NERDA model object containing a complete model configuration. The model can be trained with the train method. Afterwards, new observations can be predicted with the predict and predict_text methods. The performance of the model can be evaluated on a set of new observations with the evaluate_performance method.
Examples:

Model for a VERY small subset (5 observations) of English NER data:

>>> from NERDA.models import NERDA
>>> from NERDA.datasets import get_conll_data
>>> trn = get_conll_data('train', 5)
>>> valid = get_conll_data('valid', 5)
>>> tag_scheme = ['B-PER', 'I-PER', 'B-LOC', 'I-LOC',
...               'B-ORG', 'I-ORG', 'B-MISC', 'I-MISC']
>>> tag_outside = 'O'
>>> transformer = 'bert-base-multilingual-uncased'
>>> model = NERDA(transformer = transformer,
...               tag_scheme = tag_scheme,
...               tag_outside = tag_outside,
...               dataset_training = trn,
...               dataset_validation = valid)
Model for the complete English NER data set CoNLL-2003 with modified hyperparameters:

>>> trn = get_conll_data('train')
>>> valid = get_conll_data('valid')
>>> transformer = 'bert-base-multilingual-uncased'
>>> hyperparameters = {'epochs' : 3,
...                    'warmup_steps' : 400,
...                    'train_batch_size': 16,
...                    'learning_rate': 0.0001}
>>> model = NERDA(transformer = transformer,
...               dataset_training = trn,
...               dataset_validation = valid,
...               tag_scheme = tag_scheme,
...               tag_outside = tag_outside,
...               dropout = 0.1,
...               hyperparameters = hyperparameters)
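After specification, the typical workflow is to train the model and then predict entities in new text. A minimal sketch continuing the examples above (the input sentence is arbitrary; the return message of train is the literal string produced by the method, while predictions depend on the trained weights):

>>> model.train()
'Model trained successfully'
>>> sentences, predictions = model.predict_text('Cristiano Ronaldo plays for Juventus FC')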
Attributes:

Name | Type | Description |
---|---|---|
network | torch.nn.Module | network for the Named Entity Recognition task. |
tag_encoder | sklearn.preprocessing.LabelEncoder | encoder for the NER labels/tags. |
transformer_model | transformers.PreTrainedModel | (Auto)Model derived from the transformer. |
transformer_tokenizer | transformers.PreTrainedTokenizer | (Auto)Tokenizer derived from the transformer. |
transformer_config | transformers.PretrainedConfig | (Auto)Config derived from the transformer. |
train_losses | list | holds training losses, once the model has been trained. |
valid_loss | float | holds validation loss, once the model has been trained. |
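For example, after a call to train, the two loss attributes can be read straight off the model object (a sketch; the actual values depend on your run):

>>> model.train_losses   # list of training losses
>>> model.valid_loss     # validation loss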
__init__(self, transformer='bert-base-multilingual-uncased', device=None, tag_scheme=['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], tag_outside='O', dataset_training=None, dataset_validation=None, max_len=128, network=<class 'NERDA.networks.NERDANetwork'>, dropout=0.1, hyperparameters={'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 'learning_rate': 0.0001}, tokenizer_parameters={'do_lower_case': True}, validation_batch_size=8, num_workers=1)
Initialize NERDA model
Parameters:

Name | Type | Description | Default |
---|---|---|---|
transformer | str | which pretrained 'huggingface' transformer to use. | 'bert-base-multilingual-uncased' |
device | str | the desired device to use for computation. If not provided by the user, we take a guess. | None |
tag_scheme | List[str] | all available NER tags for the given data set EXCLUDING the special outside tag, which is handled separately. | ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'] |
tag_outside | str | the value of the special outside tag. Defaults to 'O'. | 'O' |
dataset_training | dict | the training data. Must consist of 'sentences': word-tokenized sentences and 'tags': corresponding NER tags. You can see examples of how the dataset should look by invoking get_dane_data() or get_conll_data(). Defaults to None, in which case the English CoNLL-2003 data set is used. | None |
dataset_validation | dict | the validation data. Must consist of 'sentences': word-tokenized sentences and 'tags': corresponding NER tags. You can see examples of how the dataset should look by invoking get_dane_data() or get_conll_data(). Defaults to None, in which case the English CoNLL-2003 data set is used. | None |
max_len | int | the maximum sentence length (number of tokens after applying the transformer tokenizer) for the transformer. Sentences are truncated accordingly. Look at your data to get an impression of what a meaningful setting could be. Also be aware that many transformers have a maximum accepted length. Defaults to 128. | 128 |
network | torch.nn.Module | network to be trained. Defaults to the generic NERDANetwork. Can be replaced with your own customized network architecture; it must, however, take the same arguments as NERDANetwork. | <class 'NERDA.networks.NERDANetwork'> |
dropout | float | dropout probability. Defaults to 0.1. | 0.1 |
hyperparameters | dict | hyperparameters for the model. Defaults to {'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 'learning_rate': 0.0001}. | {'epochs': 4, 'warmup_steps': 500, 'train_batch_size': 13, 'learning_rate': 0.0001} |
tokenizer_parameters | dict | parameters for the transformer tokenizer. Defaults to {'do_lower_case': True}. | {'do_lower_case': True} |
validation_batch_size | int | batch size for validation. Defaults to 8. | 8 |
num_workers | int | number of workers for the data loader. Defaults to 1. | 1 |
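The dataset_training and dataset_validation dictionaries hold two parallel lists, 'sentences' and 'tags', with exactly one tag per word token. A minimal hand-rolled sketch (the sentence and tags are invented for illustration):

>>> dataset_training = {'sentences': [['Jens', 'lives', 'in', 'Copenhagen']],
...                     'tags': [['B-PER', 'O', 'O', 'B-LOC']]}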
Source code in NERDA/models.py
def __init__(self,
             transformer: str = 'bert-base-multilingual-uncased',
             device: str = None,
             tag_scheme: List[str] = [
                 'B-PER',
                 'I-PER',
                 'B-ORG',
                 'I-ORG',
                 'B-LOC',
                 'I-LOC',
                 'B-MISC',
                 'I-MISC'
             ],
             tag_outside: str = 'O',
             dataset_training: dict = None,
             dataset_validation: dict = None,
             max_len: int = 128,
             network: torch.nn.Module = NERDANetwork,
             dropout: float = 0.1,
             hyperparameters: dict = {'epochs': 4,
                                      'warmup_steps': 500,
                                      'train_batch_size': 13,
                                      'learning_rate': 0.0001},
             tokenizer_parameters: dict = {'do_lower_case': True},
             validation_batch_size: int = 8,
             num_workers: int = 1) -> None:
    """Initialize NERDA model

    Args:
        transformer (str, optional): which pretrained 'huggingface'
            transformer to use.
        device (str, optional): the desired device to use for computation.
            If not provided by the user, we take a guess.
        tag_scheme (List[str], optional): all available NER
            tags for the given data set EXCLUDING the special outside tag,
            which is handled separately.
        tag_outside (str, optional): the value of the special outside tag.
            Defaults to 'O'.
        dataset_training (dict, optional): the training data. Must consist
            of 'sentences': word-tokenized sentences and 'tags': corresponding
            NER tags. You can see examples of how the dataset should
            look by invoking get_dane_data() or get_conll_data().
            Defaults to None, in which case the English CoNLL-2003 data set
            is used.
        dataset_validation (dict, optional): the validation data. Must consist
            of 'sentences': word-tokenized sentences and 'tags': corresponding
            NER tags. You can see examples of how the dataset should
            look by invoking get_dane_data() or get_conll_data().
            Defaults to None, in which case the English CoNLL-2003 data set
            is used.
        max_len (int, optional): the maximum sentence length (number of
            tokens after applying the transformer tokenizer) for the
            transformer. Sentences are truncated accordingly. Look at your
            data to get an impression of what a meaningful setting could be.
            Also be aware that many transformers have a maximum accepted
            length. Defaults to 128.
        network (torch.nn.Module, optional): network to be trained. Defaults
            to the generic `NERDANetwork`. Can be replaced with your own
            customized network architecture. It must however take the same
            arguments as `NERDANetwork`.
        dropout (float, optional): dropout probability. Defaults to 0.1.
        hyperparameters (dict, optional): hyperparameters for the model.
            Defaults to {'epochs': 4, 'warmup_steps': 500,
            'train_batch_size': 13, 'learning_rate': 0.0001}.
        tokenizer_parameters (dict, optional): parameters for the transformer
            tokenizer. Defaults to {'do_lower_case': True}.
        validation_batch_size (int, optional): batch size for validation.
            Defaults to 8.
        num_workers (int, optional): number of workers for the data loader.
            Defaults to 1.
    """
    # set device automatically if not provided by user.
    if device is None:
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Device automatically set to:", self.device)
    else:
        self.device = device
        print("Device set to:", self.device)
    self.tag_scheme = tag_scheme
    self.tag_outside = tag_outside
    self.transformer = transformer
    self.dataset_training = dataset_training
    self.dataset_validation = dataset_validation
    self.hyperparameters = hyperparameters
    self.max_len = max_len
    # fit encoder to _all_ possible tags, i.e. the scheme plus the outside tag.
    tag_complete = [tag_outside] + tag_scheme
    self.tag_encoder = sklearn.preprocessing.LabelEncoder()
    self.tag_encoder.fit(tag_complete)
    self.transformer_model = AutoModel.from_pretrained(transformer)
    self.transformer_tokenizer = AutoTokenizer.from_pretrained(transformer, **tokenizer_parameters)
    self.transformer_config = AutoConfig.from_pretrained(transformer)
    # instantiate the (possibly user-supplied) network class.
    self.network = network(self.transformer_model, self.device, len(tag_complete), dropout = dropout)
    self.network.to(self.device)
    self.validation_batch_size = validation_batch_size
    self.num_workers = num_workers
    self.train_losses = []
    self.valid_loss = np.nan
    self.quantized = False
    self.halved = False
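Since network is passed as a class and instantiated internally with the transformer model, the device, the number of tags and the dropout rate, one way to supply a custom architecture is to subclass NERDANetwork. A sketch under that assumption (MyNetwork and its parameter names are hypothetical):

>>> from NERDA.networks import NERDANetwork
>>> class MyNetwork(NERDANetwork):
...     # hypothetical variant: same architecture, heavier default dropout.
...     def __init__(self, transformer_model, device, n_tags, dropout = 0.3):
...         super().__init__(transformer_model, device, n_tags, dropout = dropout)
>>> model = NERDA(network = MyNetwork)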
evaluate_performance(self, dataset, return_accuracy=False, **kwargs)
Evaluate Performance
Evaluates the performance of the model on an arbitrary data set.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset | dict | data set that must consist of 'sentences' and NER 'tags'. You can see examples of how the dataset should look by invoking get_dane_data() or get_conll_data(). | required |
return_accuracy | bool | return accuracy as well? Defaults to False. | False |
kwargs | | arbitrary keyword arguments for predict, for instance 'batch_size' and 'num_workers'. | {} |

Returns:

Type | Description |
---|---|
DataFrame | DataFrame with performance numbers: F1-scores, Precision and Recall. Returns a dictionary with this AND accuracy, if return_accuracy is set to True. |
Source code in NERDA/models.py
def evaluate_performance(self, dataset: dict,
                         return_accuracy: bool = False,
                         **kwargs) -> pd.DataFrame:
    """Evaluate Performance

    Evaluates the performance of the model on an arbitrary
    data set.

    Args:
        dataset (dict): data set that must consist of
            'sentences' and NER 'tags'. You can see examples
            of how the dataset should look by invoking
            get_dane_data() or get_conll_data().
        return_accuracy (bool): return accuracy
            as well? Defaults to False.
        kwargs: arbitrary keyword arguments for predict. For
            instance 'batch_size' and 'num_workers'.

    Returns:
        DataFrame with performance numbers, F1-scores,
        Precision and Recall. Returns a dictionary with
        this AND accuracy, if return_accuracy is set to
        True.
    """
    tags_predicted = self.predict(dataset.get('sentences'),
                                  **kwargs)

    # compute F1 scores by entity type
    f1 = compute_f1_scores(y_pred = tags_predicted,
                           y_true = dataset.get('tags'),
                           labels = self.tag_scheme,
                           average = None)

    # create DataFrame with performance scores (=F1)
    df = list(zip(self.tag_scheme, f1[2], f1[0], f1[1]))
    df = pd.DataFrame(df, columns = ['Level', 'F1-Score', 'Precision', 'Recall'])

    # compute MICRO-averaged F1-scores and add to table.
    f1_micro = compute_f1_scores(y_pred = tags_predicted,
                                 y_true = dataset.get('tags'),
                                 labels = self.tag_scheme,
                                 average = 'micro')
    f1_micro = pd.DataFrame({'Level': ['AVG_MICRO'],
                             'F1-Score': [f1_micro[2]],
                             'Precision': [np.nan],
                             'Recall': [np.nan]})
    df = pd.concat([df, f1_micro], ignore_index = True)

    # compute MACRO-averaged F1-scores and add to table.
    f1_macro = compute_f1_scores(y_pred = tags_predicted,
                                 y_true = dataset.get('tags'),
                                 labels = self.tag_scheme,
                                 average = 'macro')
    f1_macro = pd.DataFrame({'Level': ['AVG_MACRO'],
                             'F1-Score': [f1_macro[2]],
                             'Precision': [np.nan],
                             'Recall': [np.nan]})
    df = pd.concat([df, f1_macro], ignore_index = True)

    # compute and return accuracy if desired
    if return_accuracy:
        accuracy = accuracy_score(y_pred = flatten(tags_predicted),
                                  y_true = flatten(dataset.get('tags')))
        return {'f1': df, 'accuracy': accuracy}
    return df
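A sketch of evaluating a trained model, assuming get_conll_data also exposes a 'test' split:

>>> test = get_conll_data('test', 5)
>>> model.evaluate_performance(test)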
half(self)
Convert weights from Float32 to Float16 to increase performance
Quantization and half precision inference are mutually exclusive.
Read more: https://pytorch.org/docs/master/generated/torch.nn.Module.html?highlight=half#torch.nn.Module.half
Returns: Nothing. Model is "halved" as a side-effect.
Source code in NERDA/models.py
def half(self):
    """Convert weights from Float32 to Float16 to increase performance

    Quantization and half precision inference are mutually exclusive.

    Read more: https://pytorch.org/docs/master/generated/torch.nn.Module.html?highlight=half#torch.nn.Module.half

    Returns:
        Nothing. Model is "halved" as a side-effect.
    """
    assert not self.halved, "Half precision already applied"
    assert not self.quantized, "Can't run both quantization and half precision"

    self.network.half()
    self.halved = True
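A sketch of typical use, applied in place before inference (remember that half and quantize are mutually exclusive); the halved flag is set as a side-effect:

>>> model.half()
>>> model.halved
True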
load_network_from_file(self, model_path='model.bin')
Load Pretrained NERDA Network from file
Loads weights for a pretrained NERDA Network from file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_path | str | Path for model file. Defaults to "model.bin". | 'model.bin' |

Returns:

Type | Description |
---|---|
str | message telling if weights for network were loaded successfully. |
Source code in NERDA/models.py
def load_network_from_file(self, model_path = "model.bin") -> str:
    """Load Pretrained NERDA Network from file

    Loads weights for a pretrained NERDA Network from file.

    Args:
        model_path (str, optional): Path for model file.
            Defaults to "model.bin".

    Returns:
        str: message telling if weights for network were
        loaded successfully.
    """
    # TODO: change assert to Raise.
    assert os.path.exists(model_path), "File does not exist. You can download network with download_network()"
    self.network.load_state_dict(torch.load(model_path, map_location = torch.device(self.device)))
    self.network.device = self.device
    return f'Weights for network loaded from {model_path}'
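A sketch of restoring saved weights into a freshly specified model; the return message below is the one produced by the method:

>>> model.load_network_from_file('model.bin')
'Weights for network loaded from model.bin'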
predict(self, sentences, return_confidence=False, **kwargs)
Predict Named Entities in Word-Tokenized Sentences
Predicts word-tokenized sentences with trained model.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
sentences | List[List[str]] | word-tokenized sentences. | required |
return_confidence | bool | if True, return confidence scores for all predicted tokens. Defaults to False. | False |
kwargs | | arbitrary keyword arguments, for instance 'batch_size' and 'num_workers'. | {} |

Returns:

Type | Description |
---|---|
List[List[str]] | predicted tags for sentences - one predicted tag/entity per word token. |
Source code in NERDA/models.py
def predict(self, sentences: List[List[str]],
            return_confidence: bool = False,
            **kwargs) -> List[List[str]]:
    """Predict Named Entities in Word-Tokenized Sentences

    Predicts word-tokenized sentences with trained model.

    Args:
        sentences (List[List[str]]): word-tokenized sentences.
        return_confidence (bool, optional): if True, return
            confidence scores for all predicted tokens. Defaults
            to False.
        kwargs: arbitrary keyword arguments. For instance
            'batch_size' and 'num_workers'.

    Returns:
        List[List[str]]: Predicted tags for sentences - one
        predicted tag/entity per word token.
    """
    return predict(network = self.network,
                   sentences = sentences,
                   transformer_tokenizer = self.transformer_tokenizer,
                   transformer_config = self.transformer_config,
                   max_len = self.max_len,
                   device = self.device,
                   tag_encoder = self.tag_encoder,
                   tag_outside = self.tag_outside,
                   return_confidence = return_confidence,
                   **kwargs)
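For example, with the trained model from the examples above (the sentence is arbitrary and the output depends on the trained weights, so none is shown):

>>> model.predict([['Jens', 'Hansen', 'lives', 'in', 'Copenhagen']])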
predict_text(self, text, return_confidence=False, **kwargs)
Predict Named Entities in a Text
Parameters:

Name | Type | Description | Default |
---|---|---|---|
text | str | text to predict entities in. | required |
return_confidence | bool | if True, return confidence scores for all predicted tokens. Defaults to False. | False |
kwargs | | arbitrary keyword arguments, for instance 'batch_size' and 'num_workers'. | {} |

Returns:

Type | Description |
---|---|
tuple | word-tokenized sentences and predicted tags/entities. |
Source code in NERDA/models.py
def predict_text(self, text: str,
                 return_confidence: bool = False, **kwargs) -> tuple:
    """Predict Named Entities in a Text

    Args:
        text (str): text to predict entities in.
        return_confidence (bool, optional): if True, return
            confidence scores for all predicted tokens. Defaults
            to False.
        kwargs: arbitrary keyword arguments. For instance
            'batch_size' and 'num_workers'.

    Returns:
        tuple: word-tokenized sentences and predicted
        tags/entities.
    """
    return predict_text(network = self.network,
                        text = text,
                        transformer_tokenizer = self.transformer_tokenizer,
                        transformer_config = self.transformer_config,
                        max_len = self.max_len,
                        device = self.device,
                        tag_encoder = self.tag_encoder,
                        tag_outside = self.tag_outside,
                        return_confidence = return_confidence,
                        **kwargs)
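Unlike predict, predict_text takes raw text and word-tokenizes it for you. A sketch:

>>> sentences, predictions = model.predict_text('Jens Hansen lives in Copenhagen')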
quantize(self)
Apply dynamic quantization to increase performance.
Quantization and half precision inference are mutually exclusive.
Read more: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
Returns:

Type | Description |
---|---|
None | Nothing. Applies dynamic quantization to the network as a side-effect. |
Source code in NERDA/models.py
def quantize(self):
    """Apply dynamic quantization to increase performance.

    Quantization and half precision inference are mutually exclusive.

    Read more: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

    Returns:
        Nothing. Applies dynamic quantization to Network as a side-effect.
    """
    assert not self.quantized, "Dynamic quantization already applied"
    assert not self.halved, "Can't run both quantization and half precision"

    self.network = torch.quantization.quantize_dynamic(
        self.network, {torch.nn.Linear}, dtype = torch.qint8
    )
    self.quantized = True
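A sketch of typical use; the quantized flag is set as a side-effect:

>>> model.quantize()
>>> model.quantized
True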
save_network(self, model_path='model.bin')
Save Weights of NERDA Network
Saves weights for a fine-tuned NERDA Network to file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_path | str | Path for model file. Defaults to "model.bin". | 'model.bin' |

Returns:

Type | Description |
---|---|
None | Nothing. Saves model to file as a side-effect. |
Source code in NERDA/models.py
def save_network(self, model_path: str = "model.bin") -> None:
    """Save Weights of NERDA Network

    Saves weights for a fine-tuned NERDA Network to file.

    Args:
        model_path (str, optional): Path for model file.
            Defaults to "model.bin".

    Returns:
        Nothing. Saves model to file as a side-effect.
    """
    torch.save(self.network.state_dict(), model_path)
    print(f"Network written to file {model_path}")
train(self)
Train Network
Trains the network from the NERDA model specification.
Returns:

Type | Description |
---|---|
str | a message saying if the model was trained successfully. The network in the 'network' attribute is trained as a side-effect. Training losses and validation loss are saved in the 'train_losses' and 'valid_loss' attributes respectively as side-effects. |
Source code in NERDA/models.py
def train(self) -> str:
    """Train Network

    Trains the network from the NERDA model specification.

    Returns:
        str: a message saying if the model was trained successfully.
        The network in the 'network' attribute is trained as a
        side-effect. Training losses and validation loss are saved
        in the 'train_losses' and 'valid_loss'
        attributes respectively as side-effects.
    """
    network, train_losses, valid_loss = train_model(network = self.network,
                                                    tag_encoder = self.tag_encoder,
                                                    tag_outside = self.tag_outside,
                                                    transformer_tokenizer = self.transformer_tokenizer,
                                                    transformer_config = self.transformer_config,
                                                    dataset_training = self.dataset_training,
                                                    dataset_validation = self.dataset_validation,
                                                    validation_batch_size = self.validation_batch_size,
                                                    max_len = self.max_len,
                                                    device = self.device,
                                                    num_workers = self.num_workers,
                                                    **self.hyperparameters)

    # attach as attributes to class
    setattr(self, "network", network)
    setattr(self, "train_losses", train_losses)
    setattr(self, "valid_loss", valid_loss)

    return "Model trained successfully"