From the category archives:

Database

New baseball database

July 9, 2009

In April, I wrote about building a baseball database of statistics for every player including those playing in the current 2009 season.  After several weeks working on it I have finally completed the database.   I had no idea it would be so complicated to build such a database but I learned a lot.

My first hurdle was learning about .htaccess .  I’ve forgotten exactly what .htaccess is for but I used it to redirect query results of players so each player will have their own unique URL page.

I then had to determine a way to separate the batting and pitching queries so when a pitcher is being searched you will get his pitching stats rather than batting stats.   Later, I’ll should be able to have all of these stats on each player page

I thought it was important to be able to access Babe Ruth’s batting and pitching stats.  So, I added pitching stats to position players who also had pitched in their career.  Check out Jose Canseco and Steve Finley’s pitching stats.

If you are familiar with databases you know that every record must have a unique ID.  When dealing with the current 2009 season, one of the challenges was trying to find a way to generate a unique ID for all rookie players.  First, I needed to find a way to determine if a player is indeed a rookie.  Then, I had to find an automated way to generate a unique ID.  I spent a few days working on a stored procedure.  This stored procedure will scan the entire master database for duplicates and if duplicates are found, the number at the end of the ID will be incremented.

The method I was using for determining rookies had a flaw because I was searching for players who were playing in the 2009 season but hadn’t played in 2008.  The drawback to this was that some players, like Kris Benson, had been out of baseball for a few years so they weren’t actually rookies.  I had to go back and find the IDs for these players and ensure they weren’t considered new rookie players.

Each day I work on the database I always seem to find a mistake somewhere.  The other day, I noticed Scott Rolen’s 2009 AB totals were a little off.  For some reason, I missed the stats from one day in June so I had to go back and fix that.

Another thing I noticed was Miguel Tejada’s batting average in my database didn’t match the batting average from other sites.  After some research, I found that if a game was suspended due to rain I wasn’t updating the database.

Another glitch is I don’t have any 2009 stats for HBP, BK, and WP categories for pitchers.  I still need to find a way to gather this data.

I’m sure I’ve overlooked some other things.  If anyone notices any errors please let me know.


http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/digg_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/stumbleupon_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/delicious_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/blinklist_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/blogmarks_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/technorati_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/google_32.png

{ 1 comment }

Baseball hacks database

April 7, 2009

For the past few days I’ve been working on building a baseball database of all of the players who have played baseball from 1871 to 2008.  The tricky part in building such a database is gathering statistics of the current season and merging it with the Lahman baseball database.  A book called Baseball Hacks shows you how to gather statistics from the current season by using  data from http://mlb.mlb.com and inserting it into a MySQL database.

One of the drawbacks in merging this data is trying to find a way to cross-reference a player’s playerID in the Lahman database with his mlb.com ID.  A playerID is generated by using the first five letters of a player’s last name and first two letters of his first name.  A number is added to the end of the ID to make it unique in case of duplicates. The playerID for Chipper Jones, for example, is jonesch06.  His mlb.com ID is 116706.  I was thinking since I know the pattern of how the playerIDs are generated in the Lahman database I could somehow use that to link the Lahman database data to the mlb.com data but this method could end up being too inaccurate.

Luckily, I stumbled upon the forums at http://www.baseball-fever.com.  In the Statistics, Analysis, & Sabermetrics area there are some individuals asking how to link these IDs together.   The author of THE BOOK — Playing The Percentages In Baseball posted a file that contains the playerIDs mlb.com IDs of all players.  I should be able to use this information to merge the Lahman database with this current season!  Hopefully, I’ll have a working database of past seasons and the current season soon.

http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/digg_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/stumbleupon_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/delicious_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/blinklist_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/blogmarks_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/technorati_32.png http://baseballsimulator.com/blog/wp-content/plugins/sociofluid/images/google_32.png

{ 2 comments }