# fuzzup

## Showcase

`fuzzup` offers (1) a simple approach for clustering string entities based on Levenshtein distance using fuzzy matching in conjunction with a simple rule-based clustering method. `fuzzup` also provides (2) functions for computing the prominence of the entity clusters resulting from (1).

In this section we will go through the nuts and bolts of `fuzzup` by applying it to a realistic setting.
## Designed for Handling Output from NER

An important use case for `fuzzup` is organizing, structuring and analyzing output from Named-Entity Recognition (NER). For this reason `fuzzup` has been tailored specifically to fit the output of NER predictions from the Hugging Face transformers NER pipeline.
## Use-case

First off, import the dependencies needed later.

```python
from rapidfuzz.fuzz import partial_token_set_ratio
import pandas as pd
import numpy as np
from fuzzup.datasets import simulate_ner_data
from fuzzup.fuzz import (
    fuzzy_cluster,
    compute_prominence,
    compute_fuzzy_matrix,
)
from fuzzup.whitelists import match_whitelist
```
Say we have used a Hugging Face transformers NER pipeline to identify names of persons in a news article. The output from the pipeline is a list of string entities and looks like this (simulated data):

```python
PERSONS_NER = simulate_ner_data()
pd.DataFrame.from_records(PERSONS_NER)
```
As you can see, the output is rather messy, partly due to the stochastic nature of the algorithm. Another reason is that, for instance, 'Joe Biden' has been mentioned many times but in different ways, e.g. 'Joe Biden', 'J. Biden' and 'Biden'.

We want to organize these string entities by forming meaningful clusters from them, in which the entities are closely related based on their pairwise edit distances.
## Workflow

`fuzzup` offers functionality for:

1. Computing all of the mutual string distances (Levenshtein distances/fuzzy ratios) between the string entities
2. Forming clusters of string entities based on the distances from (1)
3. Computing the prominence of the clusters from (2) based on the number of entity occurrences, their positions in the text etc.
4. Matching entities (clusters) with entity whitelists

Together these steps constitute an end-to-end approach for organizing and structuring the output from NER. Below we go through a simple example of the `fuzzup` workflow.
### Step 1: Compute Pairwise Edit Distances

First, `fuzzup` computes pairwise fuzzy ratios for all pairs of string entities.

Fuzzy ratios are numbers between 0 and 100 that measure the similarity between strings. They are derived from the Levenshtein distance, a string metric that measures the distance between two strings. In short, the Levenshtein distance (also known as 'edit distance') between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
`fuzzup` has a separate function for this, `compute_fuzzy_matrix`, which presents the output - the mutual fuzzy ratios - as a cross-tabular matrix with all ratios.

```python
persons = [x.get('word') for x in PERSONS_NER]
compute_fuzzy_matrix(persons, scorer=partial_token_set_ratio)
```
The different string representations of e.g. Donald Trump and Joe Biden have high mutual fuzzy ratios. In comparison, representations of different persons have relatively small fuzzy ratios.

You can think of this matrix as a correlation matrix, showing how similar each pair of strings is.
### Step 2: Forming Clusters

Clusters of entities can be formed from the output of (1) using a naive approach: two string entities are clustered together if their mutual fuzzy ratio exceeds a certain threshold.
Computing the pairwise fuzzy ratios and forming the clusters can be done in one go by simply invoking the `fuzzy_cluster` function.
```python
clusters = fuzzy_cluster(PERSONS_NER,
                         scorer=partial_token_set_ratio,
                         cutoff=70,
                         merge_output=True)
pd.DataFrame.from_records(clusters)
```
Note that the original entities are now equipped with a 'cluster_id', assigning each entity to an entity cluster.

We see from the results that different string representations of e.g. 'Donald Trump' have been clustered together. As you can see, the 'cluster_id' of each cluster is the longest string within the entity cluster.

In this case we applied `partial_token_set_ratio` as scorer and a cutoff threshold of 70 on the pairwise fuzzy ratios. Depending on your use case, you should choose an appropriate scorer from `rapidfuzz.fuzz` and 'fine-tune' the cutoff threshold on your own data.
### Step 3: Compute Prominence of Entity Clusters

A naive approach to computing the 'prominence' of the different string clusters is to simply count the number of nodes/strings in each cluster. This is the default behaviour of `compute_prominence()`.

```python
clusters = compute_prominence(clusters,
                              merge_output=True)
pd.DataFrame.from_records(clusters).sort_values('prominence_rank', ascending=True)
```
In this case, the 'prominence score' of the 'Donald Trump' entity cluster is 5, because Donald Trump is mentioned 5 times, albeit in different variations. This is the highest frequency among the clusters, and therefore the 'Donald Trump' cluster is scored as the most prominent.

The clusters are ranked by their prominence scores in the 'prominence_rank' column.
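Conceptually, the default prominence score is just a frequency count per cluster. A minimal sketch of that idea (illustrative, not `fuzzup`'s code), where each NER hit carries the id of the cluster it was assigned to:

```python
from collections import Counter

# cluster ids of the individual NER hits after clustering
hits = ["Donald Trump", "Donald Trump", "Donald Trump", "Joe Biden", "Joe Biden"]

# prominence as plain frequency: bigger cluster -> more prominent
prominence = Counter(hits)
print(prominence.most_common())  # [('Donald Trump', 3), ('Joe Biden', 2)]
```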
### Step 4: Matching with Whitelists

When analyzing the output from NER, it can be useful to have one or more whitelists with specific entities of interest. Assume that we are only interested in Donald Trump and Joe Biden.

We construct a minimal whitelist:

```python
whitelist = ['Donald Trump', 'Joe Biden']
```

Now we can match it with our predicted entities using the function `match_whitelist`.
```python
match_whitelist(clusters,
                whitelist,
                scorer=partial_token_set_ratio,
                score_cutoff=80,
                aggregate_cluster=True,
                to_dataframe=True).sort_values('prominence_rank', ascending=True)
```
Whitelist matching can also be conducted using `Whitelist` subclasses. In the example below, NER output is compared to a `Whitelist` consisting of `Cities`.

```python
from fuzzup.whitelists import Cities

LOCATIONS = [{'word': 'Viborg', 'entity_group': 'LOC', 'cluster_id': 'Viborg'},
             {'word': 'Uldum', 'entity_group': 'ORG', 'cluster_id': 'Uldum'}]

# initialize whitelist
cities = Cities()

# clustering and whitelist matching
clusters = fuzzy_cluster(LOCATIONS)
matches = cities(clusters,
                 score_cutoff=90)
matches
```