Supervised learning methods on distance-based record linkage
Speaker: 
Daniel Abril
Institution: 
IIIA-CSIC
Date: 
26 April 2011 - 12:00pm

Record linkage is the task of identifying records corresponding to the same
entity from one or more data sources. In the data privacy context, it can be
used to evaluate the disclosure risk of protected data by counting the
records that can be linked between a data set and its protected version.
New methods are needed to improve on standard record linkage and to provide
insight into
the re-identification risk of variables and their interactions. To do that,
we propose a supervised approach with different parametrized distances for
linking records with numerical attributes. These alternatives are the
Euclidean distance, the Mahalanobis distance and the Choquet integral.
In all cases, the parametrization is determined by solving an optimization
problem. We evaluate and compare our proposals with standard
distance-based record linkage, which does not rely on the parametrization
of a distance function.
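
As a rough illustration of the idea, the following minimal sketch links
each protected record to its nearest original record under a weighted
Euclidean distance and counts the correct links. It is not the speaker's
implementation: the function names and the toy data are hypothetical,
records are assumed to be aligned by index (protected record i comes from
original record i), and the weights are fixed by hand here rather than
obtained from an optimization problem.

```python
def weighted_sq_distance(a, b, weights):
    """Weighted squared Euclidean distance between two numerical records."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def link_records(original, protected, weights):
    """For each protected record, return the index of the closest original record."""
    links = []
    for p in protected:
        distances = [weighted_sq_distance(o, p, weights) for o in original]
        links.append(distances.index(min(distances)))
    return links


def count_correct_links(links):
    """Count re-identifications, assuming protected record i came from original record i."""
    return sum(1 for i, j in enumerate(links) if i == j)


# Toy data (invented for illustration): the first attribute is only lightly
# perturbed, while the second has been swapped between the first two records.
original = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
protected = [(1.2, 19.0), (2.1, 11.0), (2.9, 29.0)]

# Equal weights correspond to the standard, unparametrized linkage.
uniform = link_records(original, protected, (0.5, 0.5))
# Hand-picked weights that put all the mass on the first attribute.
first_only = link_records(original, protected, (1.0, 0.0))

print(count_correct_links(uniform), count_correct_links(first_only))
```

On this toy data, concentrating the weight on the first attribute links
more records correctly than equal weights do, which is the kind of signal a
learned parametrization could give about the re-identification risk of
individual variables.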