fuzzup Showcase

fuzzup offers (1) a simple approach for clustering string entities based on Levenshtein distance, using fuzzy matching in conjunction with a simple rule-based clustering method.

fuzzup also provides (2) functions for computing the prominence of
entity clusters resulting from (1).

In this section we will go through the nuts and bolts of fuzzup by applying it to a realistic setting.

Designed for Handling Output from NER

An important use case for fuzzup is organizing, structuring and analyzing output from Named-Entity Recognition (NER).

For this reason, fuzzup has been tailored specifically to fit the output of NER predictions from the Hugging Face transformers NER pipeline.
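
For reference, the Hugging Face transformers NER pipeline (with an aggregation strategy) returns records shaped like the ones fuzzup expects. A minimal sketch (the model is downloaded on first use; the scores shown are illustrative):

from transformers import pipeline

# aggregate token-level predictions into whole entities
ner = pipeline("ner", aggregation_strategy="simple")
ner("Donald Trump met Joe Biden in Washington.")
# [{'entity_group': 'PER', 'score': 0.99, 'word': 'Donald Trump', 'start': 0, 'end': 12},
#  {'entity_group': 'PER', 'score': 0.99, 'word': 'Joe Biden', 'start': 17, 'end': 26},
#  {'entity_group': 'LOC', 'score': 0.99, 'word': 'Washington', 'start': 30, 'end': 40}]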

Use-case

First off, import the dependencies needed later.

from rapidfuzz.fuzz import partial_token_set_ratio
import pandas as pd
import numpy as np

from fuzzup.datasets import simulate_ner_data
from fuzzup.fuzz import (
    fuzzy_cluster, 
    compute_prominence, 
    compute_fuzzy_matrix,
)
from fuzzup.whitelists import match_whitelist

Say we have used a Hugging Face transformers NER pipeline to identify names of persons in a news article. The output from the pipeline is a list of string entities and looks like this (simulated data).

PERSONS_NER = simulate_ner_data()
pd.DataFrame.from_records(PERSONS_NER)
word entity_group score start end placement
0 Donald Trump PER 0.293968 96 13 lead
1 Donald Trump PER 0.981178 45 22 body
2 J. biden PER 0.084998 10 54 body
3 joe biden PER 0.607045 83 66 title
4 Biden PER 0.535421 45 58 lead
5 Bide PER 0.304502 67 77 lead
6 mark esper PER 0.949895 43 60 body
7 Christopher c . miller PER 0.388053 76 97 title
8 jim mattis PER 0.365383 78 72 title
9 Nancy Pelosi PER 0.313847 34 49 lead
10 trumps PER 0.557752 36 24 title
11 Trump PER 0.048489 24 6 body
12 Donald PER 0.285257 80 46 title
13 miller PER 0.078196 46 56 title

As you can see, the output is rather messy, partly due to the stochastic nature of the algorithm. Another reason is that a person such as Joe Biden is mentioned several times but in different ways, e.g. 'joe biden', 'J. biden' and 'Biden'.

We want to organize these string entities by forming meaningful clusters, in which the entities are closely related based on their pairwise edit distances.

Workflow

fuzzup offers functionality for:

  1. Computing all of the mutual string distances (Levenshtein distances/fuzzy ratios) between the string entities
  2. Forming clusters of string entities based on the distances from (1)
  3. Computing prominence of the clusters from (2) based on the number of entity occurrences, their positions in the text etc.
  4. Matching entities (clusters) with entity whitelists

Together these steps constitute an end-to-end approach for organizing and structuring the output from NER. Below we go through a simple example of the fuzzup workflow.

Step 1: Compute Pairwise Edit Distances

First, fuzzup computes pairwise fuzzy ratios for all pairs of string entities.

Fuzzy ratios are numbers between 0 and 100 that measure the similarity between strings. They are derived from the Levenshtein distance, a string metric that measures the distance between two strings.

In short, the Levenshtein distance (also known as the 'edit distance') between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
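
As a quick illustration, here is a minimal sketch using rapidfuzz, the library fuzzup builds on (assumes a recent rapidfuzz version):

from rapidfuzz.distance import Levenshtein
from rapidfuzz.fuzz import partial_token_set_ratio

# three single-character edits: kitten -> sitten -> sittin -> sitting
Levenshtein.distance("kitten", "sitting")
# 3

# fuzzy ratios normalize such distances to a 0-100 similarity scale
partial_token_set_ratio("Donald Trump", "Trump")
# 100.0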

fuzzup has a dedicated function for this, compute_fuzzy_matrix, which presents the output, i.e. the mutual fuzzy ratios, as a cross-tabular matrix.

persons = [x.get('word') for x in PERSONS_NER]
compute_fuzzy_matrix(persons, scorer=partial_token_set_ratio)
Nancy Pelosi Bide mark esper Christopher c . miller Donald Trump Donald J. biden jim mattis trumps joe biden Trump miller Biden
Nancy Pelosi 100.000000 40.000000 26.666666 30.000000 23.529411 25.000000 26.666666 23.529411 25.000000 35.294117 0.000000 28.571428 33.333332
Bide 40.000000 100.000000 40.000000 50.000000 25.000000 40.000000 75.000000 33.333332 0.000000 75.000000 0.000000 50.000000 100.000000
mark esper 26.666666 40.000000 100.000000 50.000000 28.571428 22.222221 22.222221 37.500000 33.333332 30.769230 40.000000 40.000000 33.333332
Christopher c . miller 30.000000 50.000000 50.000000 100.000000 27.272728 22.222221 42.857143 35.294117 33.333332 33.333332 40.000000 100.000000 40.000000
Donald Trump 22.222221 25.000000 28.571428 27.272728 100.000000 100.000000 18.181818 25.000000 80.000000 25.000000 100.000000 33.333332 25.000000
Donald 25.000000 40.000000 22.222221 22.222221 100.000000 100.000000 28.571428 18.181818 0.000000 25.000000 0.000000 22.222221 33.333332
J. biden 26.666666 75.000000 22.222221 42.857143 18.181818 28.571428 100.000000 26.666666 0.000000 100.000000 0.000000 40.000000 88.888885
jim mattis 23.529411 33.333332 40.000000 35.294117 25.000000 18.181818 26.666666 100.000000 44.444443 30.769230 25.000000 33.333332 28.571428
trumps 25.000000 0.000000 33.333332 33.333332 80.000000 0.000000 0.000000 44.444443 100.000000 0.000000 80.000000 28.571428 0.000000
joe biden 35.294117 75.000000 30.769230 33.333332 25.000000 25.000000 100.000000 30.769230 0.000000 100.000000 0.000000 40.000000 80.000000
Trump 0.000000 0.000000 40.000000 40.000000 100.000000 0.000000 0.000000 25.000000 80.000000 0.000000 100.000000 33.333332 0.000000
miller 28.571428 50.000000 40.000000 100.000000 33.333332 25.000000 40.000000 33.333332 25.000000 40.000000 33.333332 100.000000 40.000000
Biden 33.333332 100.000000 33.333332 40.000000 25.000000 33.333332 88.888885 28.571428 0.000000 80.000000 0.000000 40.000000 100.000000

The different string representations of e.g. Donald Trump and Joe Biden have high mutual fuzzy ratios. In comparison, representations of different persons have relatively low fuzzy ratios.

You can think of this matrix as analogous to a correlation matrix: it shows how similar each pair of strings is.

Step 2: Forming Clusters

Clusters of entities can be formed from the output of (1) using a naive approach: two string entities are clustered together if their mutual fuzzy ratio exceeds a certain threshold.
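
To build intuition, here is a minimal sketch of that rule (not fuzzup's actual implementation): treat the strings as nodes, connect two nodes whenever their fuzzy ratio exceeds the cutoff, and take the connected components as clusters.

from itertools import combinations
from rapidfuzz.fuzz import partial_token_set_ratio

def naive_clusters(strings, cutoff=70, scorer=partial_token_set_ratio):
    # every (unique) string starts in its own cluster
    parent = {s: s for s in strings}

    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s

    # link any two strings whose mutual fuzzy ratio exceeds the cutoff
    for a, b in combinations(strings, 2):
        if scorer(a, b) >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for s in strings:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())

naive_clusters(["Donald Trump", "Trump", "joe biden", "Biden"])
# [['Donald Trump', 'Trump'], ['joe biden', 'Biden']]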

Computing the pairwise fuzzy ratios and forming the clusters can be done in one go simply by invoking the fuzzy_cluster function.

clusters = fuzzy_cluster(PERSONS_NER, 
                         scorer=partial_token_set_ratio, 
                         cutoff=70,
                         merge_output=True)
pd.DataFrame.from_records(clusters)
word entity_group score start end placement cluster_id
0 Donald Trump PER 0.293968 96 13 lead Donald Trump
1 Donald Trump PER 0.981178 45 22 body Donald Trump
2 J. biden PER 0.084998 10 54 body joe biden
3 joe biden PER 0.607045 83 66 title joe biden
4 Biden PER 0.535421 45 58 lead joe biden
5 Bide PER 0.304502 67 77 lead joe biden
6 mark esper PER 0.949895 43 60 body mark esper
7 Christopher c . miller PER 0.388053 76 97 title Christopher c . miller
8 jim mattis PER 0.365383 78 72 title jim mattis
9 Nancy Pelosi PER 0.313847 34 49 lead Nancy Pelosi
10 trumps PER 0.557752 36 24 title Donald Trump
11 Trump PER 0.048489 24 6 body Donald Trump
12 Donald PER 0.285257 80 46 title Donald Trump
13 miller PER 0.078196 46 56 title Christopher c . miller

Note that the original entities are now equipped with a 'cluster_id', assigning each of the entities to an entity cluster.

We see from the results that different string representations of e.g. 'Donald Trump' have been clustered together. Note that the 'cluster_id' of each cluster is the longest string within that cluster.

In this case we applied the partial_token_set_ratio scorer and a cutoff threshold of 70 on the pairwise fuzzy ratios. Depending on your use case, you should choose an appropriate scorer from rapidfuzz.fuzz and 'fine-tune' the cutoff threshold on your own data.
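
To see why the choice of scorer matters, compare how different rapidfuzz scorers judge the same pair of strings (a quick sketch; the exact values are indicative):

from rapidfuzz.fuzz import (
    ratio,
    partial_ratio,
    token_set_ratio,
    partial_token_set_ratio,
)

pair = ("Donald Trump", "Trump")
for scorer in (ratio, partial_ratio, token_set_ratio, partial_token_set_ratio):
    print(scorer.__name__, round(scorer(*pair), 1))
# ratio 58.8
# partial_ratio 100.0
# token_set_ratio 100.0
# partial_token_set_ratio 100.0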

Step 3: Compute Prominence of Entity Clusters

A naive approach for computing the 'prominence' of the different string clusters is to simply count the number of nodes/strings in each cluster. This is the default behaviour of compute_prominence().

clusters = compute_prominence(clusters,
                              merge_output=True)
pd.DataFrame.from_records(clusters).sort_values('prominence_rank', ascending=True)
word entity_group score start end placement cluster_id prominence_score prominence_rank
0 Donald Trump PER 0.293968 96 13 lead Donald Trump 5.0 1
1 Donald Trump PER 0.981178 45 22 body Donald Trump 5.0 1
10 trumps PER 0.557752 36 24 title Donald Trump 5.0 1
11 Trump PER 0.048489 24 6 body Donald Trump 5.0 1
12 Donald PER 0.285257 80 46 title Donald Trump 5.0 1
2 J. biden PER 0.084998 10 54 body joe biden 4.0 2
3 joe biden PER 0.607045 83 66 title joe biden 4.0 2
4 Biden PER 0.535421 45 58 lead joe biden 4.0 2
5 Bide PER 0.304502 67 77 lead joe biden 4.0 2
7 Christopher c . miller PER 0.388053 76 97 title Christopher c . miller 2.0 3
13 miller PER 0.078196 46 56 title Christopher c . miller 2.0 3
6 mark esper PER 0.949895 43 60 body mark esper 1.0 4
8 jim mattis PER 0.365383 78 72 title jim mattis 1.0 4
9 Nancy Pelosi PER 0.313847 34 49 lead Nancy Pelosi 1.0 4

In this case, the prominence score of the 'Donald Trump' entity cluster is 5, because Donald Trump is mentioned 5 times in different variations. This is the highest frequency among the clusters, and the 'Donald Trump' cluster is therefore scored as the most prominent.

The clusters are ranked by their prominence scores in the 'prominence_rank' column.
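
As a sanity check, the default prominence score can be reproduced by simply counting rows per cluster with pandas:

df = pd.DataFrame.from_records(clusters)
df.groupby('cluster_id').size().sort_values(ascending=False)
# cluster_id
# Donald Trump              5
# joe biden                 4
# Christopher c . miller    2
# jim mattis                1
# mark esper                1
# Nancy Pelosi              1
# dtype: int64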

Step 4: Matching with Whitelists

When analyzing output from NER, it can be useful to have one or more whitelists of specific entities of interest. Assume that we are only interested in Donald Trump and Joe Biden.

We construct a minimal whitelist.

whitelist = ['Donald Trump', 'Joe Biden']

Now we can match it with our predicted entities using the match_whitelist function.

match_whitelist(clusters,
                whitelist,
                scorer=partial_token_set_ratio,
                score_cutoff=80,
                aggregate_cluster=True,
                to_dataframe=True).sort_values('prominence_rank', ascending=True)
word entity_group score start end placement cluster_id prominence_score prominence_rank matches
0 Donald Trump PER 0.293968 96 13 lead Donald Trump 5.0 1 [Donald Trump]
1 Donald Trump PER 0.981178 45 22 body Donald Trump 5.0 1 [Donald Trump]
10 trumps PER 0.557752 36 24 title Donald Trump 5.0 1 [Donald Trump]
11 Trump PER 0.048489 24 6 body Donald Trump 5.0 1 [Donald Trump]
12 Donald PER 0.285257 80 46 title Donald Trump 5.0 1 [Donald Trump]
2 J. biden PER 0.084998 10 54 body joe biden 4.0 2 [Joe Biden]
3 joe biden PER 0.607045 83 66 title joe biden 4.0 2 [Joe Biden]
4 Biden PER 0.535421 45 58 lead joe biden 4.0 2 [Joe Biden]
5 Bide PER 0.304502 67 77 lead joe biden 4.0 2 [Joe Biden]
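
Under the hood, this style of matching amounts to a fuzzy lookup of each entity against the whitelist. A minimal sketch of the idea using rapidfuzz's process module (fuzzup's own implementation may differ):

from rapidfuzz import process

process.extract("Donald Trump", whitelist,
                scorer=partial_token_set_ratio,
                score_cutoff=80)
# [('Donald Trump', 100.0, 0)]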

Whitelist matching can also be conducted using Whitelist subclasses. In the example below, NER output is matched against a whitelist consisting of cities, using the Cities subclass.

from fuzzup.whitelists import Cities

LOCATIONS = [{'word': 'Viborg', 'entity_group': 'LOC', 'cluster_id' : 'Viborg'}, 
             {'word': 'Uldum', 'entity_group': 'ORG', 'cluster_id' : 'Uldum' }]

# initialize whitelist
cities = Cities()

# clustering and whitelist matching
clusters = fuzzy_cluster(LOCATIONS)
matches = cities(clusters,
                 score_cutoff=90)

matches
INFO:fuzzup.whitelists:Loading whitelist: city
INFO:fuzzup.whitelists:Done loading.

[{'word': 'Viborg',
  'entity_group': 'LOC',
  'cluster_id': 'Viborg',
  'matches': ['Visborg', 'Viborg'],
  'versions': {None},
  'mappings': [{'municipality': 'Mariagerfjord Kommune',
    'eblocal_id': 10887,
    'dawa_id': '12337669-ca46-6b98-e053-d480220a5a3f',
    'lon_lat': (10.15028458, 56.73531236)},
   {'municipality': 'Silkeborg Kommune',
    'eblocal_id': 10790,
    'dawa_id': nan,
    'lon_lat': (9.36897659, 56.41681148)}]}]