How to maintain an up-to-date baseball database

April 15, 2010

Last year about this time I wrote about building a baseball player database using information provided by a book called Baseball Hacks. I’ve finally managed to write up a list of steps required to create such a database of player statistics and keep it up-to-date as the season progresses.

I’m by no means a SQL expert so the steps I took to get this far in building an automated database may not have been the best but it seems to be fully functional and working.  Please feel free to comment if improvements can be made.   Take a look at my steps below:

1.  Each year since 1995, Sean Lahman has provided a complete database of baseball statistics for every team and player in Major League history.  You can download the 2009 version here:  http://www.baseball1.com.

2.  MySQL will be used as the database.  Download and install MySQL (http://dev.mysql.com/downloads/).

3.  Download and create 5 tables which will be used to store data from the current season. Download: current_season.sql.

4. Download and install Perl: http://www.perl.org/get.html

5. A few things on the MLB.com Gameday servers have changed since the Baseball Hacks book was released in 2006 so I’ve made some changes to the Perl scripts provided by the book. Download modified scripts: (load_db_all_days.pl, update_db.pl, save_to_db.pm).

6. These scripts require some Perl modules in order for them to work. Most of these modules are already installed when you install Perl but you may need to add the DBD-mysql model. You can do this by running the Perl Package Manager from Start>Programs>Perl.

Here’s a list of modules used by these scripts:

DBI
XML::Simple
LWP
Time::Local
Data::Dumper
DBD-mysql

7. Run load_db_all_days.pl to gather data from opening day to games that were played yesterday. When you run load_db_all_days.pl it will begin inserting data from games that were played on April 5th and will complete after data from all games that were played yesterday have been inserted.

8. This year opening day actually started on April 4th, 2010 (http://gdx.mlb.com/components/game/mlb/year_2010/month_04/day_04/) with the Yankees and Red Sox. There was also an exhibition played between the Mariners and Giants. In order to gather data for the game between the Yankees and Red Sox we’ll need to modify load_db_all_days.pl.

The starting dates in the script will need to be changed each season. In load_db_all_days.pl.

Change the following lines below:

my $mintime = timegm(0,0,0,4,3,110);
my $mintimestr = gmtime(timegm(0,0,0,4,3,110));

The $mintime variable is set to timegm(second, minute, hour, day, month, year). The number 4 is the day and 3 is the month. In Perl, months range from 0-11 (0=January, 11=December). So, the number 3 = April. The year is from 1900 (e.g., 1900 + 110 = 2010).

The dates for $mintime and $mintimestr should always match.

To force the script to run just for one game uncomment line 63 so it looks like this:

die;

If for some reason the exhibition game was processed you can always go into your database and delete data for the exhibition game from the 5 tables that were created in step 3.

To run a Perl script in Windows open a command prompt and change the directory to the location where load_db_all_days.pl resides and run “Perl load_db_all_days.pl

9. Create a new table for batters. This table will store a combination of statistics from the Lahman database and current season statistics.

CREATE TABLE batting_combined (
  playerID varchar(9) NOT NULL DEFAULT '',
  stint smallint(2) NOT NULL,
  teamID char(3) DEFAULT NULL,
  lgID char(2) DEFAULT NULL,
  nameFirst varchar(50) DEFAULT NULL,
  nameLast varchar(50) DEFAULT NULL,
  yearID smallint(4) NOT NULL DEFAULT '0',
  g smallint(3) DEFAULT NULL,
  ab smallint(3) DEFAULT NULL,
  r smallint(3) DEFAULT NULL,
  h smallint(3) DEFAULT NULL,
  2B smallint(3) DEFAULT NULL,
  3B smallint(3) DEFAULT NULL,
  hr smallint(3) DEFAULT NULL,
  rbi smallint(3) DEFAULT NULL,
  sb smallint(3) DEFAULT NULL,
  cs smallint(3) DEFAULT NULL,
  bb smallint(3) DEFAULT NULL,
  so smallint(3) DEFAULT NULL,
  sf smallint(3) DEFAULT NULL,
  hbp smallint(3) DEFAULT NULL,
  title varchar(50) DEFAULT NULL,
  url varchar(255) DEFAULT NULL,
  message varchar(255) DEFAULT NULL,
  sponsor text,
  PRIMARY KEY (stint,playerID,yearID)
);

10. Create a new table for pitchers.

CREATE TABLE pitching_combined (
  playerID varchar(9) NOT NULL DEFAULT '',
  stint char(2) NOT NULL DEFAULT '',
  teamID char(3) DEFAULT NULL,
  lgID char(2) DEFAULT NULL,
  nameFirst varchar(50) DEFAULT NULL,
  nameLast varchar(50) DEFAULT NULL,
  yearID smallint(4) NOT NULL DEFAULT '0',
  g smallint(3) unsigned DEFAULT NULL,
  gs smallint(2) DEFAULT NULL,
  cg smallint(2) DEFAULT NULL,
  sho smallint(2) DEFAULT NULL,
  bf smallint(2) DEFAULT NULL,
  outs smallint(2) DEFAULT NULL,
  hr smallint(2) DEFAULT NULL,
  bb smallint(2) DEFAULT NULL,
  so smallint(2) DEFAULT NULL,
  er smallint(2) DEFAULT NULL,
  runs smallint(2) DEFAULT NULL,
  hits smallint(2) DEFAULT NULL,
  wins smallint(2) DEFAULT NULL,
  losses smallint(2) DEFAULT NULL,
  holds smallint(2) DEFAULT NULL,
  hopps smallint(2) DEFAULT NULL,
  saves smallint(2) DEFAULT NULL,
  sopps smallint(2) DEFAULT NULL,
  HBP smallint(3) DEFAULT NULL,
  BK smallint(3) DEFAULT NULL,
  WP smallint(3) DEFAULT NULL,
  title varchar(50) DEFAULT NULL,
  url varchar(255) DEFAULT NULL,
  message varchar(255) DEFAULT NULL,
  sponsor text,
  PRIMARY KEY (playerID,yearID,stint)
);

11. Run following insert statement for the batting_combined table:

insert into batting_combined
(playerID,yearID, teamID,stint,lgID ,nameFirst, nameLast, g, ab, r, h, `2B`, `3B`, hr, rbi, sb, cs, bb, so, sf, hbp)
SELECT
ma.playerID,
ba.yearID,
ba.teamID,
ba.stint,
ba.lgID,
ma.nameFirst,
ma.nameLast,
ba.G,
ba.AB,
ba.R,
ba.H,
ba.`2b`,
ba.`3b`,
ba.HR,
ba.RBI,
ba.SB,
ba.CS,
ba.BB,
ba.SO,
ba.sf,
ba.hbp
FROM master AS ma , batting AS ba
WHERE ma.playerID = ba.playerID;

12. Run following insert statement for the pitching_combined table:

insert into pitching_combined
(playerID,stint,teamID,lgID,nameFirst, nameLast, yearID, g,gs,cg,sho, bf,outs, hr, bb, so,hbp,bk,wp, er, runs, hits, wins, losses, saves)
SELECT ma.playerID,
pi.stint,
pi.teamID,
pi.lgID,
ma.nameFirst,
ma.nameLast,
pi.yearID,
pi.G,
pi.gs,
pi.cg,
pi.sho,
pi.BFP,
pi.ipouts,
pi.HR,
pi.BB,
pi.SO,
pi.hbp,
pi.bk,
pi.wp,
pi.ER,
pi.R,
pi.H,
pi.W,
pi.L,
pi.SV
FROM master AS ma , pitching AS pi
WHERE ma.playerID = pi.playerID;

13. Tangotiger of http://www.insidethebook.com has provided a list of cross referenced player ids for each player . This list allows us to map a player’s ‘Lahman’ database id with his ‘mlb.com’ id.

Create the following table:

CREATE TABLE export_id_map (
  MLBAM_ID int(11) DEFAULT NULL,
  retro_id varchar(11) DEFAULT NULL,
  bdb_id varchar(11) DEFAULT NULL,
  stats_id int(11) DEFAULT NULL,
  bis_id int(11) DEFAULT NULL,
  source_id int(11) DEFAULT NULL
);

You can refer to the example below to see how to insert the csv file from Tangotiger’s site into the export_id_map table:

load data local infile 'C:\\Files\\EXPORT_ID_MAP\\EXPORT_ID_MAP.txt' into table export_id_map
fields terminated by ','
enclosed by '"'
lines terminated by '\r\n'
(MLBAM_ID, retro_id, bdb_id, stats_id, bis_id, source_id);

14. Create following table export_id_map2. In this table I added two columns nameFirst and nameLast:

CREATE TABLE export_id_map2 (
  MLBAM_ID int(11) NOT NULL,
  retro_id varchar(11) DEFAULT NULL,
  bdb_id varchar(11) DEFAULT NULL,
  stats_id int(11) DEFAULT NULL,
  bis_id int(11) DEFAULT NULL,
  source_id int(11) DEFAULT NULL,
  nameLast varchar(50) DEFAULT NULL,
  nameFirst varchar(50) DEFAULT NULL,
  PRIMARY KEY (MLBAM_ID)
);

15. Step 14 was necessary because the list provided by Tangotiger last year did have player names.  I haven’t adapted to this year’s list which did not have columns for player names.  If I decided to leave out the names I probably would have had to build a 3 table join which can get complicated especially with my lack of thorough knowledge of MySQL joins. Run the following insert:

insert into export_id_map2
(MLBAM_ID,retro_id, bdb_id,stats_id,bis_id ,source_id, nameLast, nameFirst)
SELECT
ex.MLBAM_ID,
ex.retro_id,
ex.bdb_id,
ex.stats_id,
ex.bis_id,
ex.source_id,
ma.nameLast,
ma.nameFirst
FROM master AS ma , export_id_map AS ex
WHERE ma.playerID = ex.bdb_id
group by ex.bdb_id

16. Create following table for new players that are not currently found in the Lahman database:

CREATE TABLE rookies (
  playerID varchar(10) NOT NULL,
  gameDate date DEFAULT NULL,
  teamID char(3) DEFAULT NULL,
  eliasID int(6) NOT NULL,
  nameFirst varchar(32) DEFAULT NULL,
  nameLast varchar(32) DEFAULT NULL,
  nickName varchar(32) DEFAULT NULL,
  jersey smallint(2) DEFAULT NULL,
  throws char(1) DEFAULT NULL,
  PRIMARY KEY (playerID)
);

17. Create a view (vw_lastplayed) which will return a list of players that didn’t play last year:

CREATE VIEW vw_lastplayed AS select
ma.nameFirst AS nameFirst,
ma.nameLast AS nameLast,
max(ba.yearID) AS myMax,
ba.G AS G,
ba.AB AS AB,
ba.R AS R,
ba.H AS H,
ba.2B AS 2b,
ba.3B AS 3b,
ba.HR AS HR,
ba.RBI AS RBI,
ba.SB AS SB,
ba.CS AS CS,
ba.BB AS BB,
ba.SO AS SO,
ma.playerID AS playerID
from (`master` ma join batting ba) where (ma.playerID = ba.playerID) group by ma.playerID having (max(ba.yearID) = '2008');

18. Create following view to determine which batters have played for the last 20 years.  This helps determine whether or not a player should be considered a rookie:

CREATE VIEW vw_20years AS select
concat(ma.nameFirst,' ',ma.nameLast) AS `CONCAT(ma.nameFirst, ' ', ma.nameLast)`
from (`master` ma join batting ba) where ((ma.playerID = ba.playerID) and (ba.yearID between 1988 and 2008)) group by concat(ma.nameFirst,' ',ma.nameLast);

19. Create following view to determine all pitchers that have played within the last 20 years:

CREATE VIEW vw_pitching_20years AS select
concat(ma.nameFirst,' ',ma.nameLast) AS `CONCAT(ma.nameFirst, ' ', ma.nameLast)`
from (`master` ma join batting ba) where ((ma.playerID = ba.playerID) and (ba.yearID between 1988 and 2008)) group by concat(ma.nameFirst,' ',ma.nameLast);

20. Create following procedure which will scan ids in the master table and compare ids in rookies table for duplicates. If duplicates are found, the number at the end of the ID will be incremented.

CREATE  PROCEDURE `idGen`()
BEGIN
   DECLARE counter INT DEFAULT 1;
            simple_loop: LOOP
              SET counter=counter+1;
              select counter;
              update rookies set playerid = lower(concat(left(nameLast,5), left(nameFirst,2),concat('0', counter))) where playerid in (Select playerid FROM master);
              IF counter=20 THEN
                 LEAVE simple_loop;
              END IF;
            END LOOP simple_loop;
            update rookies set playerid = lower(playerid);
END

21. Run update_db.pl nightly. I have it scheduled to run at about 5 am Pacific time each day.

The update_db.pl script will gather data for games that were played the day before.  This script also does the following:

  • Updates the batting_combined & pitching_combined tables to combine current season statistics with all-time stats.
  • Inserts data into rookies table so we have a list of all known rookies.
  • Inserts data into batting_combined & pitching_combined of all players who exist in the rookies table list but aren’t really rookies (i.e. they missed the 2009 season).
  • Deletes non-rookies from the rookies table.
  • Runs procedure created in step 19.

That’s it!  Here is the baseball database.  Please leave a comment if you have any questions.

{ 6 comments… read them below or add one }

Tim Blaker 08.23.10 at 12:27 am

Thanks for posting this information. I was struggling a bit to deal with the changes in using DBI and DBD-mysql and the information you posted made the difference.

fwiw, I found one error in your scripts: in save_to_db you need to declare

$gameid2 = $gameid;

I put this in right after:

$gamedate2 = $gamedate; in sub_save_game

and then in $roster_query you need to change

$gameid to $gameid2 otherwise the field is blank because you are retrieving the data from players.xml

Again, thanks very much.

Shaun Root 08.26.11 at 3:43 am

Thank you for this article, it was fantastic!

I do have a question though, how are you coming up with team records? Are you just combing data from the pitching current table? It’s doesn’t seem to be accurate because players get traded. For example, when you run the following query, it looks like PHI is credited with both a win and a loss:

SELECT * FROM `pitching_current` WHERE gameID LIKE ‘%phimlb%’ and gameDate = ‘2010-08-25’ ORDER BY gameDate

If you run this next query, the Phillies aren’t credited with a win at all:

SELECT * FROM `pitching_current` WHERE gameID LIKE ‘%phimlb%’ and gameDate = ‘2010-08-22’ ORDER BY gameDate

Also, is there a place where all the field names are explained? For example, what does status_ind mean in the games_current table?

Shaun Root 09.01.11 at 1:33 am

I ran load_db_all_days.pl for 2009-current. Now my batting_combined and pitching_combined stats are crazy for 2009 and 2010. It shows some players as having over 400 games played in 2009! Can you think of any reason why this might happen and how I might fix it without starting over from scratch?

Note: I have been running update_db.pl everyday since the day I first ran load_db_all_days.pl.

Thanks again for the great tutorial.

Miguel 06.13.12 at 1:40 am

Hi,

Do you know if there is a database of MLB players where I can find the number they had hold on their jersey?

Thanks.

Marianne Pelletier 06.21.15 at 3:34 pm

My script doesn’t seem to populate the gameID on Rosters. Have you noticed that or did I put a flaw in the script? Thanks for this additional syntax.

Marianne Pelletier 06.28.15 at 11:35 am

Got it! See Tim’s note, above.

Leave a Comment

You can use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>