Friday, October 8, 2010

matching names on Low-Key registration: Lingua::EN::MatchNames

One issue with the Low-Key Hillclimbs this year is we've gone to an on-line "RSVP" system where riders fill in a form indicating their plan to be at the climb. In this way, we can make sure the number attending doesn't exceed the capacity of the roads to support our "traffic". We can also better plan for how much food we need to purchase, and there's less opportunity for errors in recording names and numbers in a rushed registration area. It also saves a lot of time entering rider info: the riders do that for us so we simply export the Google spreadsheet where the data are stored. All good.

But on the downside riders register by name, rather than by number, as was our previous model. Sure, we could have riders simply enter their number, but it is too easy to make a mistake with a number. People tend to be fairly well practiced at typing their names, on the other hand.

But names aren't always typed exactly the same. For example, Steve may become Steven, or even Stephen. Middle initials may be present or absent. "O'Neill" might be cut short to "ONeill".

Since I tend to be careless, I prefer automating things as much as possible, so I use a Perl script to match names to my existing rider database. If a rider has already participated this year, I use his existing number; otherwise I assign him an unused number. Additionally, I want to compare the rider's results this year to those from previous years, so I need to maintain a map of numbers associated with each rider for each year. All of this requires name-matching. Obviously a simple string comparison fails here. I want "Steven P. Smith" to match "Steve Smith", assuming nobody else is signed up as "Steven P. Smith".

Fortunately, CPAN saves the day. CPAN is the source of choice for Perl libraries: sections of code people have written to solve difficult problems and posted there so others don't need to re-solve the same problems. In this case, there's a wonderful library called "Lingua::EN::MatchNames". The name is hierarchical: Lingua refers to language libraries, EN refers to an English-specific library, and MatchNames is the particular library which matches names.

First I split rider names into first and last components. First names are everything other than the last name. So "Steven S. Smith Jr." has a last name of "Smith Jr." and a first name of "Steven S." Another tricky example: "Yves St. Laurent" has a last name of "St. Laurent" and a first name of "Yves". MatchNames requires separate first and last names, so this step is important. This takes a bit of code. I wrote this code myself, and I won't bore you with it, as it's somewhat brute-force.
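For the curious, here is a minimal sketch of how such a split could be done. This is not my actual code: the helper names split_name and _key, and the short suffix and particle lists, are assumptions made up for illustration, and real names will surely need longer lists.

```perl
use strict;
use warnings;

# Hypothetical lookup tables (illustrative, far from complete).
my %SUFFIX   = map { $_ => 1 } qw(jr sr ii iii iv);
my %PARTICLE = map { $_ => 1 } qw(st de la van von der);

# Hypothetical helper: split a full name into (first, last).
sub split_name {
    my ($full) = @_;
    my @words = split ' ', $full;
    return ('', '') unless @words;

    # A single word is treated as a last name only.
    return ('', $words[0]) if @words == 1;

    # Start with the final word; if it is a suffix such as "Jr.",
    # also take the word before it.
    my @last = (pop @words);
    if (@words > 1 && $SUFFIX{ _key($last[0]) }) {
        unshift @last, pop @words;
    }

    # Pull in particles such as "St." that belong to the surname,
    # always leaving at least one word for the first name.
    while (@words > 1 && $PARTICLE{ _key($words[-1]) }) {
        unshift @last, pop @words;
    }
    return (join(' ', @words), join(' ', @last));
}

# Lowercase a word and strip punctuation for table lookups.
sub _key {
    my ($w) = @_;
    (my $k = lc $w) =~ s/[^a-z]//g;
    return $k;
}

my ($first, $last) = split_name('Yves St. Laurent');
print "first='$first' last='$last'\n";   # first='Yves' last='St. Laurent'
```

Everything that isn't claimed by the last name ends up in the first name, so middle initials ride along with the first name.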

Then I build a hash table of known rider names indexed by the first letter of their last names. This helps speed up comparisons: I assume any variant on a rider's name will include at least the same first letter of last name. So for Mr. Smith, I need only compare his name with known names also beginning with S.
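A rough sketch of that index, assuming a made-up %riders table keyed by rider number (the riders and numbers here are invented for illustration). The %firstname and %lastname hashes are laid out the same way as in the name_eq call: first by letter, then by rider number.

```perl
use strict;
use warnings;

# Hypothetical rider database: rider number => [first name, last name].
my %riders = (
    101 => ['Steve', 'Smith'],
    102 => ['Sara',  'Stone'],
    103 => ['Yves',  'St. Laurent'],
    104 => ['Ann',   "O'Neill"],
);

# Index names by the uppercased first letter of the last name,
# then by rider number.
my (%firstname, %lastname);
for my $n (keys %riders) {
    my ($first, $last) = @{ $riders{$n} };
    my $l = uc substr($last, 0, 1);
    $firstname{$l}->{$n} = $first;
    $lastname{$l}->{$n}  = $last;
}

# Only riders in the "S" bucket need to be compared against a new "Smith".
my @candidates = sort { $a <=> $b } keys %{ $lastname{S} };
print "@candidates\n";   # 101 102 103
```

Note the assumption baked into this scheme: a typo in the very first letter of the last name puts the rider in the wrong bucket, and he will never be matched.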

The actual call to the MatchNames library looks like this:
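  use Lingua::EN::MatchNames;  # provides name_eq

  # Score the registered name against known rider $n0 in bucket $l.
  my $score =
    name_eq(
      $first,
      $last,
      $firstname{$l}->{$n0},
      $lastname{$l}->{$n0},
    );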

where $first and $last are the names of the rider on the registration list, $l is the first letter of the rider's last name, and $firstname{$l}->{$n0} and $lastname{$l}->{$n0} are the first and last name of rider $n0 already in the database. The procedure name_eq returns a score giving a goodness of match. The clever thing is it doesn't do a simple string comparison, but rather it uses intelligence about nicknames and abbreviations to assess the quality of the match.

A score of 100 is a perfect match. I accept anything of at least 50 as a match, but if there are multiple matches of at least 50, I take the one scoring the highest. So far this seems to work quite well. If I catch any errors there is always manual editing to make sure things end up correct.
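The selection rule can be sketched roughly as follows. The names best_match and toy_score are invented for this example, and toy_score is only a stand-in so the snippet runs without the CPAN module; it is not how MatchNames scores names. In real use the code reference passed in would wrap name_eq.

```perl
use strict;
use warnings;

# Return the rider number with the highest score at or above the threshold,
# or undef if no candidate qualifies. $score_fn is a code reference taking
# (first, last, candidate first, candidate last).
sub best_match {
    my ($first, $last, $candidates, $score_fn, $threshold) = @_;
    $threshold = 50 unless defined $threshold;
    my ($best_n, $best_score);
    for my $n (sort { $a <=> $b } keys %$candidates) {
        my ($cfirst, $clast) = @{ $candidates->{$n} };
        my $score = $score_fn->($first, $last, $cfirst, $clast);
        next unless defined $score && $score >= $threshold;
        if (!defined $best_score || $score > $best_score) {
            ($best_n, $best_score) = ($n, $score);
        }
    }
    return $best_n;
}

# Toy scorer, for illustration only: 100 for identical names, 60 if the
# last names match and one first name is a prefix of the other, else 0.
sub toy_score {
    my ($f0, $l0, $f1, $l1) = map { lc $_ } @_;
    return 0   unless $l0 eq $l1;
    return 100 if $f0 eq $f1;
    return 60  if index($f0, $f1) == 0 || index($f1, $f0) == 0;
    return 0;
}

# Hypothetical candidates from one letter bucket.
my %candidates = (
    101 => ['Steve',  'Smith'],
    102 => ['Steven', 'Smith'],
    103 => ['Sara',   'Smith'],
);
my $n = best_match('Steven', 'Smith', \%candidates, \&toy_score);
print defined $n ? "matched rider $n\n" : "no match\n";   # matched rider 102
```

In this sketch a tie goes to the lowest rider number, which is an arbitrary choice; ties are exactly the cases worth checking by hand.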
