

Hector Garcia-Molina
Stanford University
Generic Entity Resolution
Abstract
Entity resolution (ER) is a problem that arises in many information
integration scenarios: we have two or more sources containing records
on the same set of real-world entities (e.g., customers). However,
there are no unique identifiers that tell us which records from one
source correspond to those in the other sources. Furthermore, the
records representing the same entity may contain differing information;
for example, one record may have the address misspelled, while another
may be missing some fields. An ER algorithm attempts to identify the
matching records from multiple sources (i.e., those corresponding to
the same real-world entity), and merges the matching records as best
it can.
In this talk I will describe a "generic" ER approach where the
functions for comparing and merging records are black boxes, invoked
on pairs of records. I will describe a set of important properties
that should be satisfied by the black-box functions to enable
efficient and deterministic ER algorithms, and I will present an
algorithm, Swoosh, that significantly reduces the number of calls to
these functions. I will also discuss how ER can be performed
when "confidences" are associated with the input records and with the
match and merge functions.
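
To make the black-box idea concrete, the following is a minimal Python sketch of a generic match-and-merge loop. It is an illustration written for this abstract, not the Swoosh algorithm as presented in the talk. The record layout (attribute name mapped to a set of values), the match rule (shared email or phone), and the merge rule (union of values) are all assumptions chosen for the example; the resolve function only ever calls match and merge on pairs of records and knows nothing about their internals.

```python
from typing import Callable, Dict, FrozenSet, List

# Assumed record layout: attribute name -> set of observed values.
Record = Dict[str, FrozenSet[str]]


def match(a: Record, b: Record) -> bool:
    """Hypothetical match rule: two records match if they share an email or a phone."""
    return any(a.get(k, frozenset()) & b.get(k, frozenset()) for k in ("email", "phone"))


def merge(a: Record, b: Record) -> Record:
    """Hypothetical merge rule: union the values of every attribute."""
    return {k: a.get(k, frozenset()) | b.get(k, frozenset()) for k in a.keys() | b.keys()}


def resolve(
    records: List[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly compare a pending record against the resolved set.

    A merged record goes back to the pending list, because it may now
    match records that neither of its parents matched on its own.
    """
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        hit = next((i for i, r in enumerate(resolved) if match_fn(current, r)), None)
        if hit is None:
            resolved.append(current)
        else:
            partner = resolved.pop(hit)
            pending.append(merge_fn(current, partner))
    return resolved


# Made-up example data: r1 and r2 share no identifier, but both match r3.
r1 = {"name": frozenset({"J. Smith"}), "email": frozenset({"js@example.com"})}
r2 = {"name": frozenset({"John Smith"}), "phone": frozenset({"555-0100"})}
r3 = {"email": frozenset({"js@example.com"}), "phone": frozenset({"555-0100"})}
r4 = {"name": frozenset({"Ann Lee"}), "email": frozenset({"ann@example.com"})}

result = resolve([r1, r2, r3, r4], match, merge)
```

In this example r1 and r2 end up in the same merged record even though they never match each other directly, because each matches r3. The union-based merge used here gives the same result regardless of the order in which records are combined, which is the kind of behavior that makes the outcome independent of processing order.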