Racing Database Redesign - Part 9 - Tables (Final Design?)

The table design is drawing to a conclusion now, but there's one more important thing to do before reviewing the final design for any omissions or errors.

When setting up a database, it is not enough to establish the tables and the relationships between them. One must also consider the data that is being stored; what 'type' it is and what 'size' it is. This is important as it helps to ensure that the amount of storage space required is minimised and only the correct type of information is stored in the corresponding fields. To illustrate this point, let me take a simple example...

Imagine you create a table field that is intended to hold the names of Jockeys. As you don't know how big peoples' names may be you create that field so that it can hold 200 characters. As your table fills up over time you may find that the maximum length of any Jockey's name is 25 characters. This means that that one field in your database is wasting 175 characters worth of data storage for EVERY record in that table.

If you've done the same type of thing in all fields, over all tables throughout the database I'm sure you'll see that that wasted space can mount up very quickly.

Also, specifying the type of data each field will hold helps to eliminate data input errors. So, if a field should only contain numbers any attempt to enter text will result in an error.

For a full list of data types available reference should be made to the specific database software you are using. In any event, you can use the ERD to document what data types and settings are required for each and every field within your database.

I use the following format:

datatype(size if applicable), NULL [yes|no], DefaultValue

So, using the EventID field as an example, its datatype is set to 'int' (that's an integer) with a size of 20 characters; NULL is set to 'no' - in otherwords it cannot be empty; and finally there is no default value set - in otherwords its value must be specified manually or programmatically.

The final ERD below lists all the settings required, though some will need further explanation.


Generally speaking, it is best to make sure each field cannot be NULL, i.e. empty. This is not possible with this database simply due to the nature of the data that is being recorded and where it is coming from. So you'll see all of the tables except EVENTDETAIL have at least one field specified to allow a NULL value. Unfortunate, but it cannot really be helped in this case - at least not until I can find sources for that historical information.

Also note the use of the 'varchar' data type. I'm sure MS Access has a similar data type but I cannot remember what it is called. When using MySQL as your database you will find that indexing fields that contain text performs much quicker if the field is configured as a 'varchar' rather than 'text'. I'm not privy to the technical reasons this is so though I'm sure there will be some explanation on the Internet for those who are interested. Suffice to say, if there's any chance you may wish to perform any form of querying on a text field, configure it as a 'varchar'.

I should flag up one important point that you need to consider with regard to the data types that you specify:

Virtually all of this data is being provided by a third party, namely Betfair. As such, neither you or I have control over that data. Currently, Betfair provide internal IDs such as EventID as a numerical field. Though unlikely, there is nothing to stop them changing that to an alphanumeric field for example. In that event, any attempt to store such EventIDs into the current design would fail.

I don't think Betfair would implement such changes, but I feel it is important that you are aware of such a possibility. In other words, be aware of where your data is coming from, how much control you have over it and the potential impact it may have on your design.

Table and Design Review:

At this point it is worth doing a final review of the design to ascertain if everything is covered or indeed if anything has been missed not only from the design but the original specification. In no particular order...

1. You should note from the ERD I've changed my mind about having separate date and time fields in the EVENT table. I'm not sure why I suggested it in the first place as, when I checked my old database, I had indeed stored them as one in 'datetime' format.

2. Note that I've included two fields in the VENUE table that ostensibly hold the same information, namely Name and BFShortName. The Name field will hold the full name of the venue whereas BFShortName will hold the abbreviated venue name as contained within the downloadable BF results files as well as the menus you might have seen in trading software. A full list of these pairings can be downloaded from Betfair as an MS Excel file. Using this file, the VENUE table can be pre-populated though you still need to find the corresponding VenueID values.

3. Although the place market downloadable result files indicate whether a runner was placed they do not indicate its finishing position. Consequently there is no way to ascertain a horse's past form figures from those files. The form figures are available from the Betfair API or numerous websites and the best place to place that information would be in a text field stored in the HORSE table. This field would need updating every time the horse ran.

4. A general rule of thumb within database design is this: if something can be calculated from existing data then there is no need to store that result within the database. Depending on the complexity of the calculation required, this can increase processing time for every executed query. If storing that info upfront reduces the overall processing time then it may be worth ignoring this general rule.

Take for example the well known Lay The Field strategy. As an exercise, consider how you would go about analysing the historical record using the ERD design above in order to ascertain which venues, race distances, field sizes, race types, etc were statistically good for implementing the LTF strategy - answers/suggestions in the comment section please.

5. It is worthwhile reviewing other sources of information or websites to assess whether other suitable information exists that might be usefully included. For example, at the bottom of every card on the Timeform site is listed the 'Betfair betting forecast' - a list of projected BSP for each runner. Could that be used to develop any future strategy? If you do not record it you'll never be able to analyse the data so you'll never know.

6. Note that the design is generic. While I am principally focussing on UK horse racing, you'll see that the downloadable datafiles all have the same format. Consequently, races from Australia, Ireland, South Africa and USA could be included. As an added bonus, UK greyhound races (granted jockey info would be irrelevant here) could also be recorded within the same database - or more suitably a database with similar structure. Don't say I don't give you value for money!


As you can see from the above points, not only can a final review flag up any omissions but it can lead to expansion beyond the remit of the original specification. The normalisation process which I discussed in an earlier post helps to facilitate this but the database designer must be able to draw the line somewhere and for me, this is it. I'll make a few tweaks as outlined in the review above but the design is now essentially finished.

I hope you've found this series of some use and feel able to have a go yourself. I'm going to take a break for a couple of weeks as I start to implement the design. I'll return with a few posts to discuss some of the automation side of things. In the meantime, let me wish you good luck with your endeavours.

3 comments:

Alistair said...

UPDATE:

With regard to Betfair data types, I came across this data type document when I was looking for something else.

It would be worth mapping your database datatypes to these, where applicable.

The thing to note is that Betfair uses a 'string' for their internal IDs, which was what I was alluding to in the post. If using MySQL, I'd still be tempted to configure those fields as 'varchar' if there is any possibility of doing regular queries based on those fields.

Ray said...

Many Thanks for the Data Type and Venue links. Because I'm not building a retrieval program at present, I never thought of looking into the API documentation for such info. You've stopped me building Venue/Country Mappings the hard way.

Cheers

Alistair said...

Glad to be of assistance Ray. It is worthwhile browsing through the API documentation even if, like me, you want to avoid using it using the API if possible.

The API does provide a lot of information that may, depending on what you are trying to do, persuade you to use it as it makes things easier.