Datasets
This section covers functionality for (down)loading Named Entity Recognition data sets.
download_conll_data(dir=None)
Download CoNLL-2003 English data set.
Downloads the CoNLL-2003 English data set annotated for Named Entity Recognition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dir | str | Directory where CoNLL-2003 datasets will be saved. If no directory is provided, data will be saved to a hidden folder '.conll' in your home directory. | None |
Returns:
| Type | Description |
|---|---|
| str | A message telling whether the archive was successfully extracted. The CoNLL datasets are extracted to the given directory as a side effect. |
Examples:
>>> download_conll_data()
>>> download_conll_data(dir = 'conll')
Source code in NERDA/datasets.py
```python
def download_conll_data(dir: str = None) -> str:
    """Download CoNLL-2003 English data set.

    Downloads the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/)
    English data set annotated for Named Entity Recognition.

    Args:
        dir (str, optional): Directory where CoNLL-2003 datasets will be saved.
            If no directory is provided, data will be saved to a hidden folder
            '.conll' in your home directory.

    Returns:
        str: a message telling whether the archive was successfully extracted.
            The CoNLL datasets are extracted to the desired directory as a
            side effect.

    Examples:
        >>> download_conll_data()
        >>> download_conll_data(dir = 'conll')
    """
    # set to default directory if nothing else has been provided by user.
    if dir is None:
        dir = os.path.join(str(Path.home()), '.conll')

    return download_unzip(url_zip = 'https://data.deepai.org/conll2003.zip',
                          dir_extract = dir)
```
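A typical workflow is to call download_conll_data() once and then read splits with get_conll_data() (documented below). The snippet below is an illustrative sketch, assuming the package is installed, network access is available, and the default cache location '~/.conll' is writable:

```python
from NERDA.datasets import download_conll_data, get_conll_data

# download and extract the CoNLL-2003 archive to the default '~/.conll' folder.
download_conll_data()

# load the first 10 observations of the training split from the cache.
training = get_conll_data('train', limit = 10)
print(len(training['sentences']))  # 10 sentences, each with a matching tag list
```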
download_dane_data(dir=None)
Download DaNE data set.
Downloads the 'DaNE' data set annotated for Named Entity Recognition, developed and hosted by the Alexandra Institute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dir | str | Directory where DaNE datasets will be saved. If no directory is provided, data will be saved to a hidden folder '.dane' in your home directory. | None |
Returns:
| Type | Description |
|---|---|
| str | A message telling whether the archive was successfully extracted. The DaNE datasets are extracted to the given directory as a side effect. |
Examples:
>>> download_dane_data()
>>> download_dane_data(dir = 'DaNE')
Source code in NERDA/datasets.py
```python
def download_dane_data(dir: str = None) -> str:
    """Download DaNE data set.

    Downloads the 'DaNE' data set annotated for Named Entity
    Recognition, developed and hosted by
    [Alexandra Institute](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dane).

    Args:
        dir (str, optional): Directory where DaNE datasets will be saved.
            If no directory is provided, data will be saved to a hidden folder
            '.dane' in your home directory.

    Returns:
        str: a message telling whether the archive was successfully extracted.
            The DaNE datasets are extracted to the desired directory as a
            side effect.

    Examples:
        >>> download_dane_data()
        >>> download_dane_data(dir = 'DaNE')
    """
    # set to default directory if nothing else has been provided by user.
    if dir is None:
        dir = os.path.join(str(Path.home()), '.dane')

    return download_unzip(url_zip = 'http://danlp-downloads.alexandra.dk/datasets/ddt.zip',
                          dir_extract = dir)
```
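The DaNE workflow mirrors the CoNLL one: download the archive once, then load splits with get_dane_data() (documented below). A minimal sketch, assuming network access and a writable home directory:

```python
from NERDA.datasets import download_dane_data, get_dane_data

# download and extract DaNE to the default '~/.dane' folder.
download_dane_data()

# load the first 5 observations of the 'dev' split from the cache.
dane_dev = get_dane_data('dev', limit = 5)
print(len(dane_dev['sentences']))  # 5
```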
download_unzip(url_zip, dir_extract)
Download and unzip a ZIP archive to folder.
Loads a ZIP file from URL and extracts all of the files to a given folder. Does not save the ZIP file itself.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url_zip | str | URL to ZIP file. | required |
| dir_extract | str | Directory where files are extracted. | required |
Returns:
| Type | Description |
|---|---|
| str | A message telling whether the archive was successfully extracted. The files in the ZIP archive are extracted to the given directory as a side effect. |
Source code in NERDA/datasets.py
```python
def download_unzip(url_zip: str,
                   dir_extract: str) -> str:
    """Download and unzip a ZIP archive to folder.

    Loads a ZIP file from URL and extracts all of the files to a
    given folder. Does not save the ZIP file itself.

    Args:
        url_zip (str): URL to ZIP file.
        dir_extract (str): Directory where files are extracted.

    Returns:
        str: a message telling whether the archive was successfully extracted.
            The files in the ZIP archive are extracted to the desired
            directory as a side effect.
    """
    # suppress SSL certificate verification.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    print(f'Reading {url_zip}')
    with urlopen(url_zip, context=ctx) as zipresp:
        with ZipFile(BytesIO(zipresp.read())) as zfile:
            zfile.extractall(dir_extract)

    return f'archive extracted to {dir_extract}'
```
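download_unzip is the helper shared by the two download functions above, but it can also be called directly. As a rough usage sketch, reusing the CoNLL-2003 URL from download_conll_data and a hypothetical target folder 'my_data':

```python
from NERDA.datasets import download_unzip

# stream the archive into memory and extract its contents to 'my_data'
# ('my_data' is a hypothetical folder name used only for illustration).
msg = download_unzip(url_zip = 'https://data.deepai.org/conll2003.zip',
                     dir_extract = 'my_data')
print(msg)  # 'archive extracted to my_data'
```

Note that the helper deliberately disables SSL certificate verification, so it should only be pointed at URLs you trust.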
get_conll_data(split='train', limit=None, dir=None)
Load CoNLL-2003 (English) data split.
Loads a single data split from the CoNLL-2003 (English) data set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| split | str | Choose which split to load. Choose from 'train', 'valid' and 'test'. Defaults to 'train'. | 'train' |
| limit | int | Limit the number of observations to be returned from a given split. Defaults to None, which implies that the entire data split is returned. | None |
| dir | str | Directory where data is cached. If set to None, the function will try to look for files in the '.conll' folder in your home directory. | None |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with word-tokenized 'sentences' and named entity 'tags' in IOB format. |
Examples:
Get test split
>>> get_conll_data('test')
Get first 5 observations from training split
>>> get_conll_data('train', limit = 5)
Source code in NERDA/datasets.py
```python
def get_conll_data(split: str = 'train',
                   limit: int = None,
                   dir: str = None) -> dict:
    """Load CoNLL-2003 (English) data split.

    Loads a single data split from the
    [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/)
    (English) data set.

    Args:
        split (str, optional): Choose which split to load. Choose
            from 'train', 'valid' and 'test'. Defaults to 'train'.
        limit (int, optional): Limit the number of observations to be
            returned from a given split. Defaults to None, which implies
            that the entire data split is returned.
        dir (str, optional): Directory where data is cached. If set to
            None, the function will try to look for files in the '.conll'
            folder in your home directory.

    Returns:
        dict: Dictionary with word-tokenized 'sentences' and named
            entity 'tags' in IOB format.

    Examples:
        Get test split
        >>> get_conll_data('test')

        Get first 5 observations from training split
        >>> get_conll_data('train', limit = 5)
    """
    assert isinstance(split, str)
    splits = ['train', 'valid', 'test']
    assert split in splits, f'Choose between the following splits: {splits}'

    # set to default directory if nothing else has been provided by user.
    if dir is None:
        dir = os.path.join(str(Path.home()), '.conll')
    assert os.path.isdir(dir), f'Directory {dir} does not exist. Try downloading CoNLL-2003 data with download_conll_data()'
    file_path = os.path.join(dir, f'{split}.txt')
    assert os.path.isfile(file_path), f'File {file_path} does not exist. Try downloading CoNLL-2003 data with download_conll_data()'

    # read data from file.
    data = []
    with open(file_path, 'r') as file:
        reader = csv.reader(file, delimiter = ' ')
        for row in reader:
            data.append([row])

    sentences = []
    sentence = []
    entities = []
    tags = []

    for row in data:
        # extract first element of list.
        row = row[0]
        # TO DO: move to data reader.
        if len(row) > 0 and row[0] != '-DOCSTART-':
            sentence.append(row[0])
            tags.append(row[-1])
        if len(row) == 0 and len(sentence) > 0:
            # clean up sentence/tags.
            # remove white spaces.
            selector = [word != ' ' for word in sentence]
            sentence = list(compress(sentence, selector))
            tags = list(compress(tags, selector))
            # append if sentence length is still greater than zero.
            if len(sentence) > 0:
                sentences.append(sentence)
                entities.append(tags)
            sentence = []
            tags = []

    if limit is not None:
        sentences = sentences[:limit]
        entities = entities[:limit]

    return {'sentences': sentences, 'tags': entities}
```
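The returned dictionary holds two parallel lists: each word-tokenized sentence in 'sentences' is aligned with a list of IOB tags of the same length in 'tags'. A small sketch of inspecting the first observation, assuming the data has already been downloaded to the default cache:

```python
from NERDA.datasets import get_conll_data

data = get_conll_data('test', limit = 1)
sentence, tags = data['sentences'][0], data['tags'][0]

# tokens and tags are aligned one-to-one.
assert len(sentence) == len(tags)
print(list(zip(sentence, tags))[:5])  # first five (token, IOB tag) pairs
```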
get_dane_data(split='train', limit=None, dir=None)
Load DaNE data split.
Loads a single data split from the DaNE data set kindly hosted by Alexandra Institute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| split | str | Choose which split to load. Choose from 'train', 'dev' and 'test'. Defaults to 'train'. | 'train' |
| limit | int | Limit the number of observations to be returned from a given split. Defaults to None, which implies that the entire data split is returned. | None |
| dir | str | Directory where data is cached. If set to None, the function will try to look for files in the '.dane' folder in your home directory. | None |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with word-tokenized 'sentences' and named entity 'tags' in IOB format. |
Examples:
Get test split
>>> get_dane_data('test')
Get first 5 observations from training split
>>> get_dane_data('train', limit = 5)
Source code in NERDA/datasets.py
```python
def get_dane_data(split: str = 'train',
                  limit: int = None,
                  dir: str = None) -> dict:
    """Load DaNE data split.

    Loads a single data split from the DaNE data set kindly hosted
    by [Alexandra Institute](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dane).

    Args:
        split (str, optional): Choose which split to load. Choose
            from 'train', 'dev' and 'test'. Defaults to 'train'.
        limit (int, optional): Limit the number of observations to be
            returned from a given split. Defaults to None, which implies
            that the entire data split is returned.
        dir (str, optional): Directory where data is cached. If set to
            None, the function will try to look for files in the '.dane'
            folder in your home directory.

    Returns:
        dict: Dictionary with word-tokenized 'sentences' and named
            entity 'tags' in IOB format.

    Examples:
        Get test split
        >>> get_dane_data('test')

        Get first 5 observations from training split
        >>> get_dane_data('train', limit = 5)
    """
    assert isinstance(split, str)
    splits = ['train', 'dev', 'test']
    assert split in splits, f'Choose between the following splits: {splits}'

    # set to default directory if nothing else has been provided by user.
    if dir is None:
        dir = os.path.join(str(Path.home()), '.dane')
    assert os.path.isdir(dir), f'Directory {dir} does not exist. Try downloading DaNE data with download_dane_data()'
    file_path = os.path.join(dir, f'ddt.{split}.conllu')
    assert os.path.isfile(file_path), f'File {file_path} does not exist. Try downloading DaNE data with download_dane_data()'

    split = pyconll.load_from_file(file_path)

    sentences = []
    entities = []
    for sent in split:
        sentences.append([token.form for token in sent._tokens])
        entities.append([token.misc['name'].pop() for token in sent._tokens])

    if limit is not None:
        sentences = sentences[:limit]
        entities = entities[:limit]

    return {'sentences': sentences, 'tags': entities}
```
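get_dane_data returns the same structure as get_conll_data: aligned lists of tokens and IOB tags. A small sketch that collects the distinct tags seen in a handful of observations, assuming the DaNE files are already cached in '~/.dane':

```python
from NERDA.datasets import get_dane_data

dane = get_dane_data('test', limit = 25)

# collect the distinct IOB tags occurring in the first 25 observations.
tag_set = {tag for tags in dane['tags'] for tag in tags}
print(sorted(tag_set))  # e.g. 'O' plus B-/I- prefixed entity tags
```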