Racing Database Redesign - Part 13 - Automated Data Import

At the bottom of this post you'll find a video demonstrating the automated data download and importation process in action. I don't go into great detail in the video, so it is worth flagging up a few pointers here; if you are trying to do something similar, you'll know what to look out for.

Obviously some of the information may not be directly related to your situation if you are not using MySQL and PHP. Although I have no intention of posting lots of PHP script code, hopefully you will be able to glean something from it in any case.

Firstly, let me describe the process that the automation script follows:

1. Assuming no dates have been specified, the script heads off to the Betfair site and downloads the results files for the UK Win and Place markets based on the current system date, i.e. the files that contain the previous day's racing. Although files are available for racing in other countries and the database can handle them, I just want to concentrate on UK racing.
2. It imports that data into temporary tables - one for the win data and the other for the place data.
3. It then analyses those tables on a race-by-race basis and, using internal MySQL commands, it splits that data and inserts it into the corresponding tables. It also calculates other information from that same data.
4. Once it has gone through all the races, it empties the aforementioned temporary tables, produces a summary of what it has done by writing everything to a log file, then quits.

If the script was provided with a date range, steps 1-2 are repeated for each and every day within that range.

The script makes use of the PHP 'cURL' library to download each data file, stick it into a variable, then save the variable's contents to a file on my server. It then uses the MySQL 'LOAD DATA INFILE' command to quickly insert the contents of that file into temporary holding tables, from where the data is analysed and manipulated using internal MySQL commands.
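As a concrete sketch of that download-and-load step: the URL pattern, file name and table names below are my own assumptions, not the real Betfair naming scheme, so adjust them to the actual files and your own schema.

```php
<?php
// Sketch of the cURL download plus LOAD DATA INFILE import.
// The file naming scheme and table name are illustrative assumptions.

// Build a hypothetical results file name for a given market and date.
function resultsFileName(string $market, string $date): string {
    return "bfresults_" . $market . "_" . date("Ymd", strtotime($date)) . ".csv";
}

// Download a file into a variable, then save it to disk.
function fetchToFile(string $url, string $localPath): bool {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body as a string
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data === false) {
        return false;                                // download failed; log and retry later
    }
    return file_put_contents($localPath, $data) !== false;
}

// Bulk-load the saved file into a temporary holding table.
function loadIntoTemp(mysqli $db, string $localPath, string $table): bool {
    $sql = "LOAD DATA INFILE '" . $db->real_escape_string($localPath) . "' " .
           "INTO TABLE " . $table . " " .
           "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' " .
           "IGNORE 1 LINES";                         // skip the header row
    return (bool) $db->query($sql);
}
```

Note that LOAD DATA INFILE reads from the database server's filesystem, so the saved file needs to live somewhere MySQL can see it (or use LOAD DATA LOCAL INFILE).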

While it is perfectly feasible to use PHP to manipulate and modify the data on a line-by-line basis before using the MySQL 'INSERT' command to insert the data into the corresponding tables, the method I'm using is significantly faster. As you'll see from the video below, it takes 1-2 seconds to process each day's worth of data. If you are not using MySQL, I strongly advise you to investigate what bulk importation functionality your software supports.

Finally, let me flag up a few things that you might want to consider:

1. Although the data files holding the results for the day are generally available by 10am the following morning, this is not always the case. Sometimes there is a delay so you might want to put off the download process until later that evening.
2. Very occasionally, Betfair have issues and the files aren't uploaded for a number of days, so you may want to set up a mechanism that records that fact so you can go back at some point to correct it, either manually or automatically. I find a text-based log file suits me fine, though you could have errors written to a database table instead.
3. Once you've downloaded the datafiles, there's no need to keep them after the data is imported into your database so you could set up an automated delete mechanism once the database is up-to-date.
4. If you configure the script to run automatically without any user input, you might want to consider having the log file emailed to you if you are away.
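Points 3 and 4 amount to a little housekeeping at the end of each run. A minimal sketch, assuming a plain-text log file and PHP's built-in mail() function (the recipient address is a placeholder):

```php
<?php
// Delete the downloaded data files once the import has succeeded.
// Returns the number of files actually removed.
function deleteImported(array $files): int {
    $removed = 0;
    foreach ($files as $f) {
        if (is_file($f) && unlink($f)) {
            $removed++;                   // data is in the database now
        }
    }
    return $removed;
}

// Mail the log file out at the end of an unattended run.
function mailLog(string $logPath, string $mailTo): bool {
    return mail($mailTo, "Import log " . date("Y-m-d"),
                file_get_contents($logPath));
}
```

mail() relies on the server having a working sendmail setup; if yours doesn't, a library such as PHPMailer does the same job over SMTP.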

Now that everything is up and running, it is time to concentrate on the interface so I'll draw this series to a close. At some later stage I'll come back to introduce the interface but that is likely to be a while.

In the meantime, I hope you have found some of my ramblings of use and if you have any questions, just leave a comment below.


Alistair said...

UPDATE: I had fun yesterday downloading and importing the Betfair data files.

In the video I underestimated the time it would take to do nearly seven years' worth of data so I tackled it in yearly batches. Unfortunately there was a problem with 2010.

The data import into the EVENT table failed. The relationship between the EVENT and VENUE tables forces the need for a BFShortName for the course to exist in the VENUE table before an entry can be made in the EVENT table. Unfortunately, on a couple of files from Betfair this information was missing for some reason. As a result, the whole importation process for the year failed.

To resolve the issue I created a dummy entry in the VENUE table with a BFShortName of 'Unknown' and then modified my script to check for any missing data and replace it with the same value of 'Unknown'. If missing course name data exists, an entry is written to the logfile allowing me to manually update the info.
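The check itself can sit right on the import path. A sketch, assuming the course name arrives as a possibly empty string and that the dummy VENUE row already exists (created once with something like INSERT INTO VENUE (BFShortName) VALUES ('Unknown')); the field and variable names are illustrative, not Alistair's exact script:

```php
<?php
// Substitute 'Unknown' for a missing course short name and record the
// event in a log array for manual follow-up later.
function fixCourseName(?string $bfShortName, string $eventID, array &$log): string {
    if ($bfShortName === null || trim($bfShortName) === "") {
        $log[] = "Missing BFShortName for event " . $eventID;
        return "Unknown";                 // matches the dummy VENUE entry
    }
    return $bfShortName;
}
```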

This is always the danger when relying on data from a third party: one has no control over what it contains. Consequently, it is important that the importation process is monitored frequently in case other, unforeseen issues raise their heads.

Nigel Waters said...

Hi Alistair,

I want to say a big thank you for this series as it is something I have been wanting to do for a very long time but never really knew where to start.

I have jumped in and am trying to learn Python and MySQL as I go along, but I have hit a brick wall. I have spent hours upon hours trying to input IPMin1...etc. but I just cannot figure out how to do it. IPMin1 is easy enough, but trying to input the nth minimum is where I am really struggling. I know you are a busy man, but any help or pointers would be greatly appreciated.

Alistair said...

Hi Nigel,

Good to hear you have taken the plunge and are having fun in the process.

To get the n minimum IPMIN figures into the EVENT table, you need to transpose that data (effectively held in a single column of the downloadable data files) into a single row in the EVENT table.

All the data in the EVENT table is worked out at the same time, as follows:

1. You need to analyse each race individually by looping through the events listed in the datafile and executing an SQL statement like:

"SELECT * FROM horsewinprices WHERE EVENTID = ".$eventID." ORDER BY IPMIN"

This will give you a set of results listed in increasing order of IPMIN. You can now loop through this list to extract the IPMIN figures, but I'm coming to that...

2. For each event extract the corresponding event info such as EventID, EventDT, BFShortName, RaceDist etc and assign them to variables in your code.

3. Now, assign a variable to hold an expanding string of comma-delimited IPMIN figures in preparation for using the LOAD DATA INFILE function. Loop through the result list as many times as the number of IPMIN values you want to record, appending each new IPMIN value to the previous. In PHP I use something like: $ipmin = $ipmin.",".$mysqlEventRunnerRow["IPMIN"];

4. Now, build a string that contains a comma-delimited list of all the items from steps 2 and 3 IN THE SAME ORDER AS THE EVENT TABLE FIELD LIST. Make sure you include an extra comma for data fields in the table that steps 2 and 3 don't cover, e.g. the Going. You'll end up with a string a little like this, though note that for space and clarity I've missed some fields out:

$eventresultsStr = $eventID.",".$eventdt.",".$courseName.",".$raceDist.",".$ipmin;

5. As you loop through each event, build up a data string that will hold ALL the data you are calculating:

$dataStr = $dataStr.$eventresultsStr."\n";

6. Once you've looped through all the EventIDs, that variable, $dataStr in my case, will hold all the calculated in-play data, one event per row. You can now save it to a temporary file on your system, then upload it en masse using the LOAD DATA INFILE command.
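Pulled together, the six steps look something like this. The row-building part is a working function; the query and file handling are only sketched in comments, with table and field names following the snippets above rather than the exact script:

```php
<?php
// Steps 3 and 4: event fields first (in EVENT-table order), then the
// n smallest IPMIN values. A real row would also need placeholder
// commas for fields such as the Going, as step 4 explains.
function buildEventRow(array $eventInfo, array $ipmins, int $nMins): string {
    $ipminStr = implode(",", array_slice($ipmins, 0, $nMins));
    return implode(",", $eventInfo) . "," . $ipminStr;
}

// Steps 1, 5 and 6, sketched:
// $dataStr = "";
// foreach ($eventIDs as $eventID) {
//     $res = $db->query("SELECT * FROM horsewinprices WHERE EVENTID = "
//                       . (int)$eventID . " ORDER BY IPMIN");
//     // ...gather $eventInfo and the sorted $ipmins from $res...
//     $dataStr .= buildEventRow($eventInfo, $ipmins, 5) . "\n";  // step 5
// }
// file_put_contents("/tmp/eventdata.csv", $dataStr);             // step 6
// $db->query("LOAD DATA INFILE '/tmp/eventdata.csv' INTO TABLE EVENT ...");
```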

Note that steps 5-6 can be skipped if you use the SQL INSERT function on a line-by-line basis at the end of step 4. However, as I've already flagged up elsewhere, for loading lots of records en masse I much prefer the LOAD DATA INFILE function, which I find to be much quicker.

There's no harm in trying both methods for yourself, though, if for no other reason than academic curiosity.

I hope that is enough to get you past your current stumbling block. I may ask you to reciprocate in the future if I decide to go down the Python route. At the moment I'm having to revise my plans vis-a-vis using the API/scraping Timeform.

It seems I was completely wrong in my assumption that the various Betfair ID numbers were the same across the two systems. As a result, using the API looks like it will be forced on me.

Nigel Waters said...

Hi Alistair,

Thanks for taking the time to reply.

They are great pointers in what is a huge project but well worth the effort. I am currently populating my tables via Python using MySQL queries, which is working very well other than these IP stats, and I have added extra fields which I feel will be of great use in the future. I know it is possible and won't give up until I have exhausted every avenue, as it is lightning quick this way, but if I can't then I will, as you say, have to start stripping the information out first.

Again thanks for such a fantastic blog and if I can be any help in the future don't hesitate to get in touch.

Keep up the excellent work.

Tony said...

Hi, but as you find the race card for the following days?

Alistair said...

Sorry Tony, I'm not sure I understand your question.

Matt said...

Hi Alistair,

I've recently found your blog and have been reading some key posts regarding a number of things. Initially I stumbled upon it due to FairBot - which has to be one of the "nicest" auto-betting bots for BetFair that I've found so far, much nicer and easier to use than Gruss' bot, and with more features than GeeksToy. It's not perfect, but I'm finding its easy Dutching options and quick Strategy Editor quite good.

Secondly, I like the way that you're going about creating a race database. I too have started harvesting bet data from BetFair to help me pick viable Dutching bets. One thing that annoyed me about the BetFair data (although I'm yet to import it into MySQL) is that the runners only have info on whether they won or not. I thought it would be more useful for analysis and Dutching if I knew whether a horse came 2nd/3rd/4th etc.

After searching for ages, I finally found a free site that offers it: I signed up for a free trial and downloaded all of the UK race data in CSV format. It looks like a useful addition to your database as it covers the actual placement of the runners, as well as extra data like the distances by which the horses were beaten and the official race duration.

As a grand plan for my strategy, I was trying to get to the point where, given an upcoming race, I can use the placement data to work out the average position a horse comes in at, and use the BetFair info to see how the odds shorten or lengthen from the start into InPlay, giving me an idea of whether I can Dutch the horses going InPlay at a more favourable set of prices.

Looking forward to reading more posts from your blog - good luck!

Alistair said...

Thanks for your comments Matt, and welcome to the blog.

My database endeavours are on hold at the moment. I prefer to leave those types of projects for the longer nights of winter. I've got too many things I need to do outside while the weather is good - not that we've had much of that this summer.

Matt said...

Hi Alistair - One question about FairBot (apologies as it's a little off topic from the databases) - are there any good forums/sites dedicated to discussing FairBot strategies? I can't seem to find many good ones compared to the various forums dedicated to Gruss & GeeksToy etc.

Alistair said...

I am not aware of any dedicated Fairbot forums Matt. Even the software developers, Binteko, do not appear to have one.

Alan Wilson said...

Many thanks, this information is just what I've been looking for and will save me a lot of time.