# fuzzup

## Showcase

`fuzzup` offers (1) a simple approach for clustering string entities based on Levenshtein distance using fuzzy matching in conjunction with a simple rule-based clustering method. `fuzzup` also provides (2) functions for computing the prominence of the entity clusters resulting from (1).

In this section we will go through the nuts and bolts of `fuzzup` by applying it to a realistic setting.
## Designed for Handling Output from NER

An important use case for `fuzzup` is organizing, structuring and analyzing output from Named-Entity Recognition (NER). For this reason `fuzzup` has been tailored specifically to fit the output of NER predictions from the Hugging Face transformers NER pipeline.
## Use-case

First off, import the dependencies needed later.

```python
from rapidfuzz.fuzz import partial_token_set_ratio
import pandas as pd
import numpy as np
from fuzzup.datasets import simulate_ner_data
from fuzzup.fuzz import (
    fuzzy_cluster,
    compute_prominence,
    compute_fuzzy_matrix,
)
from fuzzup.whitelists import match_whitelist
```
Say we have used a Hugging Face transformers NER pipeline to identify names of persons in a news article. The output from the pipeline is a list of string entities and looks like this (simulated data):

```python
PERSONS_NER = simulate_ner_data()
pd.DataFrame.from_records(PERSONS_NER)
```
As you can see, the output is rather messy, partly due to the stochastic nature of the algorithm. Another reason is that, for instance, 'Joe Biden' has been mentioned many times but in different ways, e.g. 'Joe Biden', 'J. Biden' and 'Biden'.

We want to organize these string entities by forming meaningful clusters from them, in which the entities are closely related based on their pairwise edit distances.
## Workflow

`fuzzup` offers functionality for:

1. Computing all of the mutual string distances (Levenshtein distances/fuzzy ratios) between the string entities
2. Forming clusters of string entities based on the distances from (1)
3. Computing the prominence of the clusters from (2) based on the number of entity occurrences, their positions in the text etc.
4. Matching entities (clusters) with entity whitelists

Together these steps constitute an end-to-end approach for organizing and structuring the output from NER. Below we go through a simple example of the `fuzzup` workflow.
### Step 1: Compute Pairwise Edit Distances

First, `fuzzup` computes pairwise fuzzy ratios for all pairs of string entities.

Fuzzy ratios are numbers between 0 and 100 that measure the similarity between strings. They are derived from the Levenshtein distance, a string metric that measures the distance between two strings. In short, the Levenshtein distance (also known as 'edit distance') between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
`fuzzup` has a separate function for this, `compute_fuzzy_matrix`, which presents the output - the mutual fuzzy ratios - as a cross-tabular matrix with all ratios.

```python
persons = [x.get('word') for x in PERSONS_NER]
compute_fuzzy_matrix(persons, scorer=partial_token_set_ratio)
```
The different string representations of e.g. Donald Trump and Joe Biden have high mutual fuzzy ratios. In comparison, representations of different persons have relatively small fuzzy ratios.

You can think of this matrix as a correlation matrix, showing how similar each pair of strings is.
### Step 2: Forming Clusters

Clusters of entities can be formed from the output of (1) using a naive approach: two string entities are clustered together if their mutual fuzzy ratio exceeds a certain threshold.
Computing the pairwise fuzzy ratios and forming the clusters can be done in one go by simply invoking the `fuzzy_cluster` function.
```python
clusters = fuzzy_cluster(PERSONS_NER,
                         scorer=partial_token_set_ratio,
                         cutoff=70,
                         merge_output=True)
pd.DataFrame.from_records(clusters)
```
Note that the original entities are now equipped with a 'cluster_id', assigning each entity to an entity cluster.

We see from the results that different string representations of e.g. 'Donald Trump' have been clustered together. As you can see, the 'cluster_id' of each cluster is the longest string within the entity cluster.

In this case we applied `partial_token_set_ratio` as scorer and a cutoff threshold of 70 on the pairwise fuzzy ratios. Depending on your use case, you should choose an appropriate scorer from `rapidfuzz.fuzz` and 'fine-tune' the cutoff threshold on your own data.
### Step 3: Compute Prominence of Entity Clusters

A naive approach to computing the 'prominence' of the different string clusters is to simply count the number of nodes/strings in each cluster. This is the default behaviour of `compute_prominence()`.

```python
clusters = compute_prominence(clusters,
                              merge_output=True)
pd.DataFrame.from_records(clusters).sort_values('prominence_rank', ascending=True)
```
In this case, the 'prominence score' of the 'Donald Trump' entity cluster is 5, because Donald Trump is mentioned 5 times, albeit in different variations. This is the highest frequency among the clusters, and therefore the 'Donald Trump' cluster is scored as the most prominent.

The clusters are ranked by their prominence scores in the 'prominence_rank' column.
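Conceptually, the default prominence score is just a frequency count per cluster. A minimal sketch of that idea (illustrative, not `fuzzup`'s code), where each NER hit carries the id of the cluster it was assigned to:

```python
from collections import Counter

# cluster ids of the individual NER hits after clustering
hits = ["Donald Trump", "Donald Trump", "Donald Trump", "Joe Biden", "Joe Biden"]

# prominence as plain frequency: bigger cluster -> more prominent
prominence = Counter(hits)
print(prominence.most_common())  # [('Donald Trump', 3), ('Joe Biden', 2)]
```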
### Step 4: Matching with Whitelists

When analyzing the output from NER, it can be useful to have one or more whitelists with specific entities of interest. Assume that we are only interested in Donald Trump and Joe Biden.

We construct a minimal whitelist:

```python
whitelist = ['Donald Trump', 'Joe Biden']
```

Now we can match it with our predicted entities using the function `match_whitelist`.
```python
match_whitelist(clusters,
                whitelist,
                scorer=partial_token_set_ratio,
                score_cutoff=80,
                aggregate_cluster=True,
                to_dataframe=True).sort_values('prominence_rank', ascending=True)
```
Whitelist matching can also be conducted using `Whitelist` subclasses. In the example below, NER output is compared to a `Whitelist` consisting of `Cities`.

```python
from fuzzup.whitelists import Cities

LOCATIONS = [{'word': 'Viborg', 'entity_group': 'LOC', 'cluster_id': 'Viborg'},
             {'word': 'Uldum', 'entity_group': 'ORG', 'cluster_id': 'Uldum'}]

# initialize whitelist
cities = Cities()

# clustering and whitelist matching
clusters = fuzzy_cluster(LOCATIONS)
matches = cities(clusters,
                 score_cutoff=90)
matches
```