Racing Database Redesign - Part 14 - Manual Error Checking

I thought I'd take a moment to discuss the importance of periodic checking of your database to ensure that your automated systems are doing what you expect them to do. This is all the more critical when your database pulls in data from a third party to which you have no control over - as is the case for my horse racing database.

I've already mentioned a case in point where a couple of data files from Betfair had empty MENUHINT hint fields. This resulted in importation of data failing because the BFShortName for the venue could not be ascertained.

In that instance, I was able to programmatically check for that missing data and substitute a dummy value to enable the data importation process to continue. I now still have to periodically run the following SQL query to identify any events where the BFShortName is unknown. I can then go in and manually create it.

SELECT * FROM 'event' WHERE 'bfshortname' = 'Unknown';

This weekend, and purely by chance, I discovered another issue which wasn't trapped by the error checking in my script.

The TimeForm site does not seem to use the same values for eventid and selectionid for it's historical results pages as those that are contained within the data files downloaded from Betfair. For some reason, I had always assumed they did. I was in the process of confirming that when I discovered that the race data from the data file released on 22 May 2015, i.e. races held on the 21st, hadn't been placed into the EVENTDETAILS table. All the race data like MinIP odds had all been calulated correctly and entered into the EVENT table, but the data for each individual runner had not been entered into the EVENTDETAILS table.

This had not been trapped by the error checking in my script, yet clearly there was an issue. Take a look at the image below which shows the data file in question as loaded into a spreadsheet:

The highlighted rows illustrate the problem. That data shouldn't be in the file. I've yet to ascertain why the error checking in my script didn't catch it but now I know it is there I can search for it and remove it before the data is sent to the corresponding tables.

This is the issue with third party data. It can be changed without notice or can contain errors the nature of which are impossible to predict. Once identified, code can be put in place to trap those conditions but one can never be certain that some unforeseen condition isn't getting through.

So, like the case of the missing BFShortName above, I will have to periodically check that every race listed in the EVENT table actually has some corresponding runners listed in the EVENTDETAILS table by running the following query:

SELECT * FROM 'event' WHERE 'eventid' NOT IN (SELECT 'eventid' FROM 'eventdetail');

It is manual checking like this that needs to be done periodically to maintain confidence in the integrity not of the database design but of the data itself. The trouble is, if you don't know what to look for, how do you query your database to find it?

That's a difficult question to answer, but hopefully the intimate knowledge that you have with your database design and the data it is meant to contain will guide you in the process.


Harry Haller said...

Assuming you haven't done this already..

May I suggest you should automate this kind of testing so you don't have to do it manually and can even get notified when any of the tests fail?

There are lots of testing frameworks around, don't know any for PHP/Python since I'm not fluent in those languages, but any Unit Testing framework will work, and you could have a background/cron process in your server executing the tests say, everyday.

This series is really interesting, thinking of doing something like this myself, so some tips on writing the code on Python would be awesome.

Good luck!

Alistair said...

Hi Harry and thanks for the comment.

You are quite correct and I should have suggested the very same in my post. I have/will implement such checking through an automated script run from a cron job.

Not everyone is doing things on a LAMP server like me so it it worth flagging up to people the importance of doing the checks irrespective of what technologies one is using. Certainly, if you are able to execute such checks on an automated basis so much the better, but you'll need to investigate how to do that within your own particular set up.

Thanks again for the prompt Harry.