Last year about this time I wrote about building a baseball player database using information provided by a book called Baseball Hacks. I’ve finally managed to write up a list of steps required to create such a database of player statistics and keep it up-to-date as the season progresses.
I’m by no means a SQL expert so the steps I took to get this far in building an automated database may not have been the best but it seems to be fully functional and working. Please feel free to comment if improvements can be made. Take a look at my steps below:
1. Each year since 1995, Sean Lahman has provided a complete database of baseball statistics for every team and player in Major League history. You can download the 2009 version here: http://www.baseball1.com.
2. MySQL will be used as the database. Download and install MySQL (http://dev.mysql.com/downloads/).
3. Download and create 5 tables which will be used to store data from the current season. Download: current_season.sql.
4. Download and install Perl: http://www.perl.org/get.html
5. A few things on the MLB.com Gameday servers have changed since the Baseball Hacks book was released in 2006 so I’ve made some changes to the Perl scripts provided by the book. Download modified scripts: (load_db_all_days.pl, update_db.pl, save_to_db.pm).
6. These scripts require some Perl modules in order for them to work. Most of these modules are already installed when you install Perl but you may need to add the DBD-mysql model. You can do this by running the Perl Package Manager from Start>Programs>Perl.
Here’s a list of modules used by these scripts:
DBI
XML::Simple
LWP
Time::Local
Data::Dumper
DBD-mysql
7. Run load_db_all_days.pl to gather data from opening day to games that were played yesterday. When you run load_db_all_days.pl it will begin inserting data from games that were played on April 5th and will complete after data from all games that were played yesterday have been inserted.
8. This year opening day actually started on April 4th, 2010 (http://gdx.mlb.com/components/game/mlb/year_2010/month_04/day_04/) with the Yankees and Red Sox. There was also an exhibition played between the Mariners and Giants. In order to gather data for the game between the Yankees and Red Sox we’ll need to modify load_db_all_days.pl.
The starting dates in the script will need to be changed each season. In load_db_all_days.pl.
Change the following lines below:
my $mintime = timegm(0,0,0,4,3,110); my $mintimestr = gmtime(timegm(0,0,0,4,3,110));
The $mintime variable is set to timegm(second, minute, hour, day, month, year). The number 4 is the day and 3 is the month. In Perl, months range from 0-11 (0=January, 11=December). So, the number 3 = April. The year is from 1900 (e.g., 1900 + 110 = 2010).
The dates for $mintime and $mintimestr should always match.
To force the script to run just for one game uncomment line 63 so it looks like this:
die;
If for some reason the exhibition game was processed you can always go into your database and delete data for the exhibition game from the 5 tables that were created in step 3.
To run a Perl script in Windows open a command prompt and change the directory to the location where load_db_all_days.pl resides and run “Perl load_db_all_days.pl”
9. Create a new table for batters. This table will store a combination of statistics from the Lahman database and current season statistics.
CREATE TABLE batting_combined ( playerID varchar(9) NOT NULL DEFAULT '', stint smallint(2) NOT NULL, teamID char(3) DEFAULT NULL, lgID char(2) DEFAULT NULL, nameFirst varchar(50) DEFAULT NULL, nameLast varchar(50) DEFAULT NULL, yearID smallint(4) NOT NULL DEFAULT '0', g smallint(3) DEFAULT NULL, ab smallint(3) DEFAULT NULL, r smallint(3) DEFAULT NULL, h smallint(3) DEFAULT NULL, 2B smallint(3) DEFAULT NULL, 3B smallint(3) DEFAULT NULL, hr smallint(3) DEFAULT NULL, rbi smallint(3) DEFAULT NULL, sb smallint(3) DEFAULT NULL, cs smallint(3) DEFAULT NULL, bb smallint(3) DEFAULT NULL, so smallint(3) DEFAULT NULL, sf smallint(3) DEFAULT NULL, hbp smallint(3) DEFAULT NULL, title varchar(50) DEFAULT NULL, url varchar(255) DEFAULT NULL, message varchar(255) DEFAULT NULL, sponsor text, PRIMARY KEY (stint,playerID,yearID) );
10. Create a new table for pitchers.
CREATE TABLE pitching_combined ( playerID varchar(9) NOT NULL DEFAULT '', stint char(2) NOT NULL DEFAULT '', teamID char(3) DEFAULT NULL, lgID char(2) DEFAULT NULL, nameFirst varchar(50) DEFAULT NULL, nameLast varchar(50) DEFAULT NULL, yearID smallint(4) NOT NULL DEFAULT '0', g smallint(3) unsigned DEFAULT NULL, gs smallint(2) DEFAULT NULL, cg smallint(2) DEFAULT NULL, sho smallint(2) DEFAULT NULL, bf smallint(2) DEFAULT NULL, outs smallint(2) DEFAULT NULL, hr smallint(2) DEFAULT NULL, bb smallint(2) DEFAULT NULL, so smallint(2) DEFAULT NULL, er smallint(2) DEFAULT NULL, runs smallint(2) DEFAULT NULL, hits smallint(2) DEFAULT NULL, wins smallint(2) DEFAULT NULL, losses smallint(2) DEFAULT NULL, holds smallint(2) DEFAULT NULL, hopps smallint(2) DEFAULT NULL, saves smallint(2) DEFAULT NULL, sopps smallint(2) DEFAULT NULL, HBP smallint(3) DEFAULT NULL, BK smallint(3) DEFAULT NULL, WP smallint(3) DEFAULT NULL, title varchar(50) DEFAULT NULL, url varchar(255) DEFAULT NULL, message varchar(255) DEFAULT NULL, sponsor text, PRIMARY KEY (playerID,yearID,stint) );
11. Run following insert statement for the batting_combined table:
insert into batting_combined (playerID,yearID, teamID,stint,lgID ,nameFirst, nameLast, g, ab, r, h, `2B`, `3B`, hr, rbi, sb, cs, bb, so, sf, hbp) SELECT ma.playerID, ba.yearID, ba.teamID, ba.stint, ba.lgID, ma.nameFirst, ma.nameLast, ba.G, ba.AB, ba.R, ba.H, ba.`2b`, ba.`3b`, ba.HR, ba.RBI, ba.SB, ba.CS, ba.BB, ba.SO, ba.sf, ba.hbp FROM master AS ma , batting AS ba WHERE ma.playerID = ba.playerID;
12. Run following insert statement for the pitching_combined table:
insert into pitching_combined (playerID,stint,teamID,lgID,nameFirst, nameLast, yearID, g,gs,cg,sho, bf,outs, hr, bb, so,hbp,bk,wp, er, runs, hits, wins, losses, saves) SELECT ma.playerID, pi.stint, pi.teamID, pi.lgID, ma.nameFirst, ma.nameLast, pi.yearID, pi.G, pi.gs, pi.cg, pi.sho, pi.BFP, pi.ipouts, pi.HR, pi.BB, pi.SO, pi.hbp, pi.bk, pi.wp, pi.ER, pi.R, pi.H, pi.W, pi.L, pi.SV FROM master AS ma , pitching AS pi WHERE ma.playerID = pi.playerID;
13. Tangotiger of http://www.insidethebook.com has provided a list of cross referenced player ids for each player . This list allows us to map a player’s ‘Lahman’ database id with his ‘mlb.com’ id.
Create the following table:
CREATE TABLE export_id_map ( MLBAM_ID int(11) DEFAULT NULL, retro_id varchar(11) DEFAULT NULL, bdb_id varchar(11) DEFAULT NULL, stats_id int(11) DEFAULT NULL, bis_id int(11) DEFAULT NULL, source_id int(11) DEFAULT NULL );
You can refer to the example below to see how to insert the csv file from Tangotiger’s site into the export_id_map table:
load data local infile 'C:\\Files\\EXPORT_ID_MAP\\EXPORT_ID_MAP.txt' into table export_id_map fields terminated by ',' enclosed by '"' lines terminated by '\r\n' (MLBAM_ID, retro_id, bdb_id, stats_id, bis_id, source_id);
14. Create following table export_id_map2. In this table I added two columns nameFirst and nameLast:
CREATE TABLE export_id_map2 ( MLBAM_ID int(11) NOT NULL, retro_id varchar(11) DEFAULT NULL, bdb_id varchar(11) DEFAULT NULL, stats_id int(11) DEFAULT NULL, bis_id int(11) DEFAULT NULL, source_id int(11) DEFAULT NULL, nameLast varchar(50) DEFAULT NULL, nameFirst varchar(50) DEFAULT NULL, PRIMARY KEY (MLBAM_ID) );
15. Step 14 was necessary because the list provided by Tangotiger last year did have player names. I haven’t adapted to this year’s list which did not have columns for player names. If I decided to leave out the names I probably would have had to build a 3 table join which can get complicated especially with my lack of thorough knowledge of MySQL joins. Run the following insert:
insert into export_id_map2 (MLBAM_ID,retro_id, bdb_id,stats_id,bis_id ,source_id, nameLast, nameFirst) SELECT ex.MLBAM_ID, ex.retro_id, ex.bdb_id, ex.stats_id, ex.bis_id, ex.source_id, ma.nameLast, ma.nameFirst FROM master AS ma , export_id_map AS ex WHERE ma.playerID = ex.bdb_id group by ex.bdb_id
16. Create following table for new players that are not currently found in the Lahman database:
CREATE TABLE rookies ( playerID varchar(10) NOT NULL, gameDate date DEFAULT NULL, teamID char(3) DEFAULT NULL, eliasID int(6) NOT NULL, nameFirst varchar(32) DEFAULT NULL, nameLast varchar(32) DEFAULT NULL, nickName varchar(32) DEFAULT NULL, jersey smallint(2) DEFAULT NULL, throws char(1) DEFAULT NULL, PRIMARY KEY (playerID) );
17. Create a view (vw_lastplayed) which will return a list of players that didn’t play last year:
CREATE VIEW vw_lastplayed AS select ma.nameFirst AS nameFirst, ma.nameLast AS nameLast, max(ba.yearID) AS myMax, ba.G AS G, ba.AB AS AB, ba.R AS R, ba.H AS H, ba.2B AS 2b, ba.3B AS 3b, ba.HR AS HR, ba.RBI AS RBI, ba.SB AS SB, ba.CS AS CS, ba.BB AS BB, ba.SO AS SO, ma.playerID AS playerID from (`master` ma join batting ba) where (ma.playerID = ba.playerID) group by ma.playerID having (max(ba.yearID) = '2008');
18. Create following view to determine which batters have played for the last 20 years. This helps determine whether or not a player should be considered a rookie:
CREATE VIEW vw_20years AS select concat(ma.nameFirst,' ',ma.nameLast) AS `CONCAT(ma.nameFirst, ' ', ma.nameLast)` from (`master` ma join batting ba) where ((ma.playerID = ba.playerID) and (ba.yearID between 1988 and 2008)) group by concat(ma.nameFirst,' ',ma.nameLast);
19. Create following view to determine all pitchers that have played within the last 20 years:
CREATE VIEW vw_pitching_20years AS select concat(ma.nameFirst,' ',ma.nameLast) AS `CONCAT(ma.nameFirst, ' ', ma.nameLast)` from (`master` ma join batting ba) where ((ma.playerID = ba.playerID) and (ba.yearID between 1988 and 2008)) group by concat(ma.nameFirst,' ',ma.nameLast);
20. Create following procedure which will scan ids in the master table and compare ids in rookies table for duplicates. If duplicates are found, the number at the end of the ID will be incremented.
CREATE PROCEDURE `idGen`()
BEGIN
DECLARE counter INT DEFAULT 1;
simple_loop: LOOP
SET counter=counter+1;
select counter;
update rookies set playerid = lower(concat(left(nameLast,5), left(nameFirst,2),concat('0', counter))) where playerid in (Select playerid FROM master);
IF counter=20 THEN
LEAVE simple_loop;
END IF;
END LOOP simple_loop;
update rookies set playerid = lower(playerid);
END
21. Run update_db.pl nightly. I have it scheduled to run at about 5 am Pacific time each day.
The update_db.pl script will gather data for games that were played the day before. This script also does the following:
- Updates the batting_combined & pitching_combined tables to combine current season statistics with all-time stats.
- Inserts data into rookies table so we have a list of all known rookies.
- Inserts data into batting_combined & pitching_combined of all players who exist in the rookies table list but aren’t really rookies (i.e. they missed the 2009 season).
- Deletes non-rookies from the rookies table.
- Runs procedure created in step 19.
That’s it! Here is the baseball database. Please leave a comment if you have any questions.
{ 3 comments }








