NAMES: Dutch corpus of person name variants
Spelling variation, variants and digitization errors in person names are serious obstacles for search operations in historical documents. One solution is to standardize the spelling of surnames and given names. However, ambiguities and alternative interpretations make this a non-trivial task that requires expert evaluation assisted by automatic analysis.
The NAMES project aimed to standardize 564,000 different surnames and 190,113 different given names from 19th-century sources containing 52.5 million tokens, with the help of the Clariah tool TICCL. A subset of these names had already been automatically related to a standard, because they could be identified as having been used for the same individual. Experts reviewed this subset, which resulted in 127,154 surnames associated with 11,278 standards and 49,804 given names associated with 782 gender-independent standards. Unfortunately, TICCL did not succeed in supporting the extension of this set. Instead, brute-force comparison of the remaining names to names that already had a standard, combined with an increase in the number of standards, raised the coverage of standardized tokens to 99.43% for given names and 98.51% for surnames.
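The brute-force comparison and the token coverage figure can be illustrated with a minimal sketch. The project description does not say which similarity measure was used, so the Levenshtein edit distance, the `max_distance` threshold, the function names and the sample names below are all assumptions made for illustration, not the project's actual method or data.

```python
from typing import Dict, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (assumed similarity measure)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_standard(name: str,
                     variant_to_standard: Dict[str, str],
                     max_distance: int = 1) -> Optional[str]:
    """Compare `name` against every known variant (brute force) and return
    the standard of the nearest one, or None if nothing is close enough."""
    best: Optional[Tuple[int, str]] = None
    for variant, standard in variant_to_standard.items():
        distance = levenshtein(name.lower(), variant.lower())
        if distance <= max_distance and (best is None or distance < best[0]):
            best = (distance, standard)
    return best[1] if best else None


def token_coverage(token_counts: Dict[str, int],
                   variant_to_standard: Dict[str, str]) -> float:
    """Share of name tokens (occurrences) whose spelling has a standard."""
    total = sum(token_counts.values())
    covered = sum(count for name, count in token_counts.items()
                  if name in variant_to_standard)
    return covered / total if total else 0.0


# Hypothetical sample data: variants already reviewed by experts.
known = {"Jansen": "Jansen", "Janssen": "Jansen",
         "Visser": "Visser", "Visscher": "Visser"}
# Hypothetical token counts per spelling in the sources.
counts = {"Jansen": 6, "Janssen": 3, "Jannsen": 1}

before = token_coverage(counts, known)
suggestion = closest_standard("Jannsen", known)
if suggestion is not None:
    known["Jannsen"] = suggestion  # in practice, an expert would confirm this
after = token_coverage(counts, known)
```

Coverage is counted over tokens rather than distinct names, which is why a small number of frequent spellings can account for most of the coverage. A candidate match found this way is only a suggestion; as the text notes, ambiguous cases still require expert evaluation.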
Data will be made available in RDF format for linked open data and as a Lexicon service. In addition, digital versions of name dictionaries will be made accessible.
Project info
Researchers
Assistant Professor & Researcher, Utrecht University