In the absence of a shared unique key, an ensemble of non-unique personal attributes
such as names and addresses is often used to link data from disparate sources.
Such data matching is widely used when assembling data warehouses and business
mailing lists, and is a foundation of many longitudinal epidemiological and
other health-related studies. Unfortunately, names and addresses are often captured
in non-standard and varying formats, usually with some degree of spelling and
typographical errors. It is therefore important that such data is transformed
into a clean and standardised format before it is further processed. Traditional
approaches for cleaning and standardisation of personal information have been
based on domain-specific rules that need considerable configuration by highly
skilled end users. In this paper we describe an alternative approach based on
probabilistic hidden Markov models. Experiments on various health-related administrative
data sets show that, compared to a rules-based approach, the probabilistic system
is less cumbersome to use and, for more complex data, produces more
accurate results.