For the past few days I’ve been building a baseball database of everyone who played from 1871 to 2008. The tricky part of building such a database is gathering statistics for the current season and merging them with the Lahman baseball database. A book called Baseball Hacks shows you how to gather current-season statistics by pulling data from http://mlb.mlb.com and inserting it into a MySQL database.
One of the difficulties in merging this data is finding a way to cross-reference a player’s playerID in the Lahman database with his mlb.com ID. A playerID is built from the first five letters of a player’s last name and the first two letters of his first name, followed by a number that keeps the ID unique in case of duplicates. The playerID for Chipper Jones, for example, is jonesch06; his mlb.com ID is 116706. Since I know the pattern behind the Lahman playerIDs, I thought I could use it to link the Lahman data to the mlb.com data, but this method could end up being too inaccurate, because the number at the end can’t be worked out from a name alone.
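To make the pattern concrete, here is a rough Python sketch of how a playerID could be put together. The function names are my own, the two-digit zero-padded suffix is only inferred from the jonesch06 example, and dropping punctuation and spaces from names is an assumption rather than something the Lahman documentation confirms here.

```python
def lahman_id_prefix(first_name: str, last_name: str) -> str:
    """Return the name-based part of a playerID (hypothetical helper)."""
    # Assumption: keep letters only, lowercased, before truncating.
    last = "".join(c for c in last_name.lower() if c.isalpha())
    first = "".join(c for c in first_name.lower() if c.isalpha())
    # First five letters of the last name, first two of the first name.
    return last[:5] + first[:2]


def candidate_player_id(first_name: str, last_name: str, sequence: int) -> str:
    """Append the uniqueness number, assumed to be two zero-padded digits."""
    return f"{lahman_id_prefix(first_name, last_name)}{sequence:02d}"


print(candidate_player_id("Chipper", "Jones", 6))  # jonesch06
```

Notice that the sequence number has to be passed in. Given only a name from mlb.com there is no way to know which number the Lahman database assigned, which is why guessing IDs from the pattern is unreliable.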
Luckily, I stumbled upon the forums at http://www.baseball-fever.com. In the Statistics, Analysis, & Sabermetrics area, some people were asking how to link these IDs together. The author of The Book: Playing the Percentages in Baseball posted a file that contains the playerIDs and mlb.com IDs of all players. I should be able to use this information to merge the Lahman database with the current season! Hopefully, I’ll have a working database of past seasons and the current season soon.
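As a sketch of how such a file could be used, the Python below loads an ID crosswalk into a dictionary and translates an mlb.com ID into a Lahman playerID. The CSV layout and the column names playerID and mlb_id are assumptions on my part, since the actual file may be organized differently. The only row shown is the Chipper Jones pair mentioned above.

```python
import csv
import io

# Stand-in for the downloaded file; the real layout may differ (assumption).
crosswalk_csv = io.StringIO("playerID,mlb_id\njonesch06,116706\n")


def load_crosswalk(handle):
    """Map each mlb.com ID (int) to its Lahman playerID (str)."""
    return {int(row["mlb_id"]): row["playerID"] for row in csv.DictReader(handle)}


def to_lahman_id(mlb_id, crosswalk):
    """Return the Lahman playerID, or None when the mlb.com ID is unknown."""
    return crosswalk.get(mlb_id)


crosswalk = load_crosswalk(crosswalk_csv)
print(to_lahman_id(116706, crosswalk))  # jonesch06
print(to_lahman_id(0, crosswalk))       # None
```

Returning None for unknown IDs makes it easy to list the players who still need to be matched by hand before loading current-season rows into the database.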