Predictions

This section covers functionality for computing predictions with a NERDA.models.NERDA model.

predict(network, sentences, transformer_tokenizer, transformer_config, max_len, device, tag_encoder, tag_outside, batch_size=8, num_workers=1, return_tensors=False, return_confidence=False, pad_sequences=True)

Compute predictions.

Computes predictions for a list of word-tokenized sentences with a NERDA model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| network | Module | Network. | required |
| sentences | List[List[str]] | List of lists with word-tokenized sentences. | required |
| transformer_tokenizer | PreTrainedTokenizer | Tokenizer for transformer model. | required |
| transformer_config | PretrainedConfig | Config for transformer model. | required |
| max_len | int | Maximum length of sentence after applying transformer tokenizer. | required |
| device | str | Computational device. | required |
| tag_encoder | LabelEncoder | Encoder for Named-Entity tags. | required |
| tag_outside | str | Special 'outside' NER tag. | required |
| batch_size | int | Batch size for DataLoader. Defaults to 8. | 8 |
| num_workers | int | Number of workers. Defaults to 1. | 1 |
| return_tensors | bool | If True, return tensors. Defaults to False. | False |
| return_confidence | bool | If True, return confidence scores for all predicted tokens. Defaults to False. | False |
| pad_sequences | bool | If True, pad sequences. Defaults to True. | True |

Returns:

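| Type | Description |
|---|---|
| List[List[str]] | List of lists with predicted entity tags. If return_confidence is True, a tuple (predictions, probabilities) is returned instead. If return_tensors is True and return_confidence is False, the list of output tensors is returned. |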
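The input format that predict expects can be checked before the call. The following is a minimal sketch: `check_sentences` and `make_dummy_tags` are hypothetical helpers, not part of NERDA. They mirror the assertions and the placeholder-tag step in the source listing for predict, and additionally guard against empty input, which the original assertions do not handle.

```python
from typing import List


def check_sentences(sentences: List[List[str]]) -> None:
    """Raise ValueError unless `sentences` is a non-empty list of lists of str.

    Hypothetical helper: the assertions in predict only inspect the first
    sentence and its first token.
    """
    if not isinstance(sentences, list) or not sentences:
        raise ValueError("'sentences' must be a non-empty list")
    for sent in sentences:
        if not isinstance(sent, list) or not all(isinstance(w, str) for w in sent):
            raise ValueError("each sentence must be a list of str tokens")


def make_dummy_tags(sentences: List[List[str]], fill_tag: str) -> List[List[str]]:
    """Build one placeholder tag per word token (the dataloader expects tags)."""
    return [[fill_tag] * len(sent) for sent in sentences]


sentences = [["Ole", "lives", "in", "Aarhus"], ["Hello"]]
check_sentences(sentences)
dummy = make_dummy_tags(sentences, fill_tag="O")  # "O" is an assumed tag name
```

The placeholder tags carry no information; they only give every word token a label so the sentences can pass through the same dataloader that is used for training.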

Source code in NERDA/predictions.py
def predict(network: torch.nn.Module, 
            sentences: List[List[str]],
            transformer_tokenizer: transformers.PreTrainedTokenizer,
            transformer_config: transformers.PretrainedConfig,
            max_len: int,
            device: str,
            tag_encoder: sklearn.preprocessing.LabelEncoder,
            tag_outside: str,
            batch_size: int = 8,
            num_workers: int = 1,
            return_tensors: bool = False,
            return_confidence: bool = False,
            pad_sequences: bool = True) -> List[List[str]]:
    """Compute predictions.

    Computes predictions for a list with word-tokenized sentences 
    with a `NERDA` model.

    Args:
        network (torch.nn.Module): Network.
        sentences (List[List[str]]): List of lists with word-tokenized
            sentences.
        transformer_tokenizer (transformers.PreTrainedTokenizer): 
            tokenizer for transformer model.
        transformer_config (transformers.PretrainedConfig): config
            for transformer model.
        max_len (int): Maximum length of sentence after applying 
            transformer tokenizer.
        device (str): Computational device.
        tag_encoder (sklearn.preprocessing.LabelEncoder): Encoder
            for Named-Entity tags.
        tag_outside (str): Special 'outside' NER tag.
        batch_size (int, optional): Batch Size for DataLoader. 
            Defaults to 8.
        num_workers (int, optional): Number of workers. Defaults
            to 1.
        return_tensors (bool, optional): if True, return tensors.
        return_confidence (bool, optional): if True, return
            confidence scores for all predicted tokens. Defaults
            to False.
        pad_sequences (bool, optional): if True, pad sequences. 
            Defaults to True.

    Returns:
        List[List[str]]: List of lists with predicted Entity
        tags.
    """
    # make sure, that input has the correct format. 
    assert isinstance(sentences, list), "'sentences' must be a list of list of word-tokens"
    assert isinstance(sentences[0], list), "'sentences' must be a list of list of word-tokens"
    assert isinstance(sentences[0][0], str), "'sentences' must be a list of list of word-tokens"

    # set network to appropriate mode.
    network.eval()

    # fill 'dummy' tags (expected input for dataloader).
    tag_fill = [tag_encoder.classes_[0]]
    tags_dummy = [tag_fill * len(sent) for sent in sentences]

    dl = create_dataloader(sentences = sentences,
                           tags = tags_dummy, 
                           transformer_tokenizer = transformer_tokenizer,
                           transformer_config = transformer_config,
                           max_len = max_len, 
                           batch_size = batch_size, 
                           tag_encoder = tag_encoder,
                           tag_outside = tag_outside,
                           num_workers = num_workers,
                           pad_sequences = pad_sequences)

    predictions = []
    probabilities = []
    tensors = []

    with torch.no_grad():
        # 'batch' avoids shadowing the dataloader 'dl'.
        for batch in dl:

            outputs = network(**batch)

            # conduct operations on sentence level.
            for i in range(outputs.shape[0]):

                # extract prediction and transform.

                # find max by row.
                values, indices = outputs[i].max(dim=1)

                preds = tag_encoder.inverse_transform(indices.cpu().numpy())
                probs = values.cpu().numpy()

                if return_tensors:
                    tensors.append(outputs)

                # subset predictions for original word tokens.
                preds = [prediction for prediction, offset in zip(preds.tolist(), batch.get('offsets')[i]) if offset]
                if return_confidence:
                    probs = [prob for prob, offset in zip(probs.tolist(), batch.get('offsets')[i]) if offset]

                # Remove special tokens ('CLS' + 'SEP').
                preds = preds[1:-1]
                if return_confidence:
                    probs = probs[1:-1]

                # make sure resulting predictions have same length as
                # original sentence.

                # TODO: Move assert statement to unit tests. Does not work 
                # in boundary.
                # assert len(preds) == len(sentences[i])            
                predictions.append(preds)
                if return_confidence:
                    probabilities.append(probs)

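    # return only after every batch has been processed; returning inside
    # the batch loop would drop all batches after the first.
    if return_confidence:
        return predictions, probabilities

    if return_tensors:
        return tensors

    return predictions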
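The per-sentence post-processing in predict (row-wise maximum, filtering by offsets, dropping the two special tokens) can be illustrated with toy data in plain Python. This is a sketch under assumptions: `argmax_rows` and `keep_word_level` are hypothetical helpers, the scores and label set are invented, and the offsets are assumed to be 1 for the first piece of each word and for the two special tokens and 0 elsewhere, as the filtering and slicing in the source suggest.

```python
from typing import List, Sequence, Tuple


def argmax_rows(scores: Sequence[Sequence[float]]) -> Tuple[List[float], List[int]]:
    """Return the maximum value and its index for every row."""
    values, indices = [], []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        indices.append(best)
        values.append(row[best])
    return values, indices


def keep_word_level(items: Sequence, offsets: Sequence[int]) -> list:
    """Keep entries with a truthy offset, then drop the first and last entry
    (the positions of the two special tokens)."""
    kept = [item for item, offset in zip(items, offsets) if offset]
    return kept[1:-1]


classes = ["B-PER", "O"]  # toy label set
# Rows correspond to: [CLS], "Ole", "##sen", "sleeps", [SEP], [PAD]
scores = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]
offsets = [1, 1, 0, 1, 1, 0]

values, indices = argmax_rows(scores)
tags = keep_word_level([classes[i] for i in indices], offsets)
confidences = keep_word_level(values, offsets)
print(tags)         # ['B-PER', 'O']
print(confidences)  # [0.8, 0.8]
```

Two word tokens go in ("Olesen", "sleeps") and two tags come out: the continuation piece and the padding position are filtered out by their offsets, and the slice removes the special tokens.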

predict_text(network, text, transformer_tokenizer, transformer_config, max_len, device, tag_encoder, tag_outside, batch_size=8, num_workers=1, pad_sequences=True, return_confidence=False, sent_tokenize=sent_tokenize, word_tokenize=word_tokenize)

Compute Predictions for Text.

Computes predictions for a text with a NERDA model. The text is split into sentences, and each sentence into word tokens, before predictions are computed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| network | Module | Network. | required |
| text | str | Text to predict entities in. | required |
| transformer_tokenizer | PreTrainedTokenizer | Tokenizer for transformer model. | required |
| transformer_config | PretrainedConfig | Config for transformer model. | required |
| max_len | int | Maximum length of sentence after applying transformer tokenizer. | required |
| device | str | Computational device. | required |
| tag_encoder | LabelEncoder | Encoder for Named-Entity tags. | required |
| tag_outside | str | Special 'outside' NER tag. | required |
| batch_size | int | Batch size for DataLoader. Defaults to 8. | 8 |
| num_workers | int | Number of workers. Defaults to 1. | 1 |
| pad_sequences | bool | If True, pad sequences. Defaults to True. | True |
| return_confidence | bool | If True, return confidence scores for predicted tokens. Defaults to False. | False |
| sent_tokenize | Callable | Function that splits the text into sentences. | sent_tokenize |
| word_tokenize | Callable | Function that splits a sentence into word tokens. | word_tokenize |

Returns:

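| Type | Description |
|---|---|
| tuple | (sentences, predictions): the sentence- and word-tokenized text and the corresponding predicted named-entity tags. If return_confidence is True, the second element is itself a tuple (predictions, probabilities). |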
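A common next step is to line up each word token with its predicted tag. The sketch below uses invented values in place of real model output, and `pair_tokens_with_tags` is a hypothetical helper rather than a NERDA function. It also shows how the returned tuple could be unpacked when return_confidence is True, based on the return statements in the source.

```python
from typing import List, Tuple

Tagged = List[List[Tuple[str, str]]]


def pair_tokens_with_tags(sentences: List[List[str]],
                          predictions: List[List[str]]) -> Tagged:
    """Zip each word token with its predicted tag, sentence by sentence.

    zip stops at the shorter sequence, so a sentence with fewer predictions
    than tokens yields fewer pairs.
    """
    return [list(zip(sent, tags)) for sent, tags in zip(sentences, predictions)]


# Toy values standing in for what predict_text returns (not real model output).
result = ([["Ole", "lives", "in", "Aarhus"]], [["B-PER", "O", "O", "B-LOC"]])
sentences, predictions = result
pairs = pair_tokens_with_tags(sentences, predictions)

# With return_confidence=True the second element is (predictions, probabilities).
result_conf = (sentences, (predictions, [[0.99, 0.98, 0.97, 0.95]]))
sentences, (predictions, probabilities) = result_conf
```

Because the source notes that prediction and sentence lengths may differ in boundary cases, it may be worth comparing `len(sent)` and `len(tags)` before relying on the pairs.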

Source code in NERDA/predictions.py
def predict_text(network: torch.nn.Module, 
                 text: str,
                 transformer_tokenizer: transformers.PreTrainedTokenizer,
                 transformer_config: transformers.PretrainedConfig,
                 max_len: int,
                 device: str,
                 tag_encoder: sklearn.preprocessing.LabelEncoder,
                 tag_outside: str,
                 batch_size: int = 8,
                 num_workers: int = 1,
                 pad_sequences: bool = True,
                 return_confidence: bool = False,
                 sent_tokenize: Callable = sent_tokenize,
                 word_tokenize: Callable = word_tokenize) -> tuple:
    """Compute Predictions for Text.

    Computes predictions for a text with `NERDA` model. 
    Text is tokenized into sentences before computing predictions.

    Args:
        network (torch.nn.Module): Network.
        text (str): text to predict entities in.
        transformer_tokenizer (transformers.PreTrainedTokenizer): 
            tokenizer for transformer model.
        transformer_config (transformers.PretrainedConfig): config
            for transformer model.
        max_len (int): Maximum length of sentence after applying 
            transformer tokenizer.
        device (str): Computational device.
        tag_encoder (sklearn.preprocessing.LabelEncoder): Encoder
            for Named-Entity tags.
        tag_outside (str): Special 'outside' NER tag.
        batch_size (int, optional): Batch Size for DataLoader. 
            Defaults to 8.
        num_workers (int, optional): Number of workers. Defaults
            to 1.
        pad_sequences (bool, optional): if True, pad sequences. 
            Defaults to True.
        return_confidence (bool, optional): if True, return 
            confidence scores for predicted tokens. Defaults
            to False.

    Returns:
        tuple: sentence- and word-tokenized text with corresponding
        predicted named-entity tags.
    """
    assert isinstance(text, str), "'text' must be a string."
    sentences = sent_tokenize(text)

    sentences = [word_tokenize(sentence) for sentence in sentences]

    predictions = predict(network = network, 
                          sentences = sentences,
                          transformer_tokenizer = transformer_tokenizer,
                          transformer_config = transformer_config,
                          max_len = max_len,
                          device = device,
                          return_confidence = return_confidence,
                          batch_size = batch_size,
                          num_workers = num_workers,
                          pad_sequences = pad_sequences,
                          tag_encoder = tag_encoder,
                          tag_outside = tag_outside)

    return sentences, predictions
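Since `sent_tokenize` and `word_tokenize` are plain callables, custom tokenizers can be supplied. The following is a minimal sketch of two regex-based stand-ins; `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical names, and their splitting rules are deliberately naive compared with a real tokenizer.

```python
import re
from typing import List


def simple_sent_tokenize(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(sentence: str) -> List[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Ole lives in Aarhus. He works at DTU!"
# Same two-step tokenization that predict_text applies before calling predict.
sentences = [simple_word_tokenize(s) for s in simple_sent_tokenize(text)]
print(sentences)
# [['Ole', 'lives', 'in', 'Aarhus', '.'], ['He', 'works', 'at', 'DTU', '!']]
```

Such functions would be passed as `sent_tokenize=simple_sent_tokenize, word_tokenize=simple_word_tokenize`. The word tokenizer should split text the same way the training data was tokenized, otherwise predictions may degrade.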