Racing Database Redesign - Part 11 - Design And Admin Software

Just a quick note about design and admin software you can use on your project.

All the modelling I've talked about so far does not require any use of design software. Indeed, you don't need to go anywhere near your database application until you have to physically create the tables.

If you are using MS Access as your database, you can graphically create the relationships between your tables and the software will prompt you for the various settings. However, the tables have to be created beforehand - or they did in the last version of Access I used in anger. So you would have to do the modelling up front before creating the tables. You can, of course, invest in ERD design software, but for small projects like this it is unnecessary.

If you are using a MySQL server based database system like I am, particularly hosted on a Linux box, you have a number of options available to you. Note that MySQL is also available for Windows servers:

1. Webmin - A web based Linux server administration tool that lives on your server and comes with a MySQL administration module. Its primary role is administering the server rather than creating databases, but it does allow basic tasks such as creating/deleting tables, running SQL queries and the like. There is no ERD tool, however.

2. phpMyAdmin - Another web based tool that lives on your server (Linux or Windows) but this is specifically designed for MySQL server admin and database creation. As such, it is much more sophisticated and offers complete control over your database design process, including an ERD module in which relationships can be graphically created and configured.

3. MySQL Workbench - A cross-platform client administration and database design tool giving you complete control over your MySQL server and database design process. The modelling feature is particularly useful in that it allows the design of the database to be done, including ERD, without the need to create the tables (the table fields and settings still have to be defined as part of the process). Once you are happy with the design the tables can be automatically created on the server with just a few mouse clicks. It's particularly useful for designing large databases over multiple servers.

While I like MySQL Workbench, I use phpMyAdmin as my tool of choice. Not only am I more familiar with it, but given the scale of databases I tend to create it suits my purposes perfectly. Also, because it is web based and hosted on my server, I can access it from any device without having to worry about installing software on each client.

The image above was created using phpMyAdmin's designer tool. You'll notice that there are a few additions to the EVENT table from last time. These are to allow me to analyse some of the in-running detail on a race-by-race basis. More of that nearer the time.

Back Or Lay Picks 25th-26th April

An unusually quiet weekend this time around...


These picks should be backed if decimal odds at KO are 2.00 or over.
They should be laid if their odds are at 1.90 or under.

Racing Database Redesign - Part 10 - Tables (Update)

So much for taking some time out from blogging to create the database! As soon as I started I noticed an issue with the Betfair data that would upset things slightly.

The VENUE table is, for all intents and purposes, pretty static in that the data it contains isn't going to change very much if at all. As a result, this table can be easily pre-populated as part of the design process using the information available from TimeForm and the spreadsheet file I flagged up in my last post.

If you look back to the ERD, you'll see that the VenueID is the key field. This identifier is not directly available from the downloadable Betfair results files, but I wanted to use it as the key (i) because I assumed it was unique and (ii) because I could use it within any future interface to call up the Timeform race card by programmatically creating the URL.

The image below illustrates why I intended to use it. You'll see the CourseID of 1.9 in the URL which identifies Catterick Bridge.

Each course will have a corresponding CourseID and it would be a trivial process to retrieve that from the VENUE table and build the corresponding URL.
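The URL building described above could be sketched as follows in Python. Note that the URL pattern and parameter name here are assumptions for illustration only; the real link format should be copied from the Timeform site itself.

```python
# Sketch of building a Timeform race-card URL from a stored CourseID.
# The path and query parameter are hypothetical placeholders.

def timeform_url(course_id: str,
                 base: str = "https://www.timeform.com/horse-racing/racecards") -> str:
    """Build a (hypothetical) race-card URL from a venue's CourseID."""
    return f"{base}?courseId={course_id}"

# CourseID 1.9 identifies Catterick Bridge, as per the image above.
venue = {"Name": "Catterick Bridge", "VenueID": "1.9"}
print(timeform_url(venue["VenueID"]))
```

Retrieving the stored VenueID and passing it through a helper like this is all the "trivial process" amounts to.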

However, all is not as straightforward as it seems. There are a number of issues I discovered when populating the aforementioned spreadsheet file with these identifiers in preparation for importing them into the database:

1. Timeform treats Newmarket as two separate courses, namely Newmarket (Rowley) and Newmarket (July), with CourseIDs of 1.36 and 1.60 respectively. Yet the downloadable results files, the aforementioned spreadsheet and, as far as I can make out, the API software all treat them as one, using the short form name of 'Newm'. This makes it difficult to marry the results files to the correct venue in the case of Newmarket.

2. After the short-lived Great Leighs course arose phoenix-like from the ashes under the new name of Chelmsford City, Timeform simply reassigned the same CourseID to the Chelmsford City name. A reasonable thing to do, you might think, except that if you change the dates on the Timeform site to 8th Jan 2009 you'll see that a meeting held at Great Leighs was apparently held at Chelmsford City - long before such an entity existed. You and I know that they are one and the same location, but software doesn't.

While this may not necessarily be an issue in and of itself, the historical results files for the time that Great Leighs was operational refer to its original name, not its reincarnation. If this issue isn't taken care of, the import of all those races at Great Leighs will fail.

3. It is not an operational issue but you'll also find the venue long name in the spreadsheet file differs from the long name used by Timeform. For example, Epsom v Epsom Downs; Fontwell v Fontwell Park; Stratford v Stratford-on-Avon and so on. You may want to edit the spreadsheet accordingly before importing into your database.

So, what do I do about it? Fortunately there is a relatively simple solution that doesn't involve a major redesign.

If you look at the short name forms in the spreadsheet and the downloadable results file you'll see they are the same. More importantly, they are unique. Consequently, the BFShortName field in the VENUE table can be used as the table key and the VenueID field can be demoted to a normal field containing the code needed to programmatically build the URL to Timeform.

This means that the Great Leighs/Chelmsford City schizophrenia is taken care of by having two entries, each with the same VenueID, and it turns the Newmarket issue into a minor anomaly where creating its Timeform URL presents a small but not insurmountable problem.
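The revised keying can be sketched in Python as a lookup keyed on BFShortName. The short names and the shared Chelmsford/Great Leighs VenueID below are invented placeholders; only 'Newm' and the Newmarket CourseIDs 1.36/1.60 come from the data discussed above.

```python
# VENUE rows keyed on BFShortName; VenueID is now just a normal attribute.
# Short names and the shared VenueID are illustrative placeholders.
VENUE = {
    "Catt":  {"Name": "Catterick Bridge", "VenueID": "1.9"},
    # Great Leighs and Chelmsford City: two rows, one shared VenueID.
    "GtLei": {"Name": "Great Leighs",     "VenueID": "1.99"},   # placeholder
    "Chelm": {"Name": "Chelmsford City",  "VenueID": "1.99"},   # placeholder
    # Newmarket: one Betfair short name but two Timeform courses
    # (1.36 Rowley, 1.60 July), so the URL code must special-case it.
    "Newm":  {"Name": "Newmarket",        "VenueID": None},
}

def venue_id(bf_short_name: str):
    """Look up the Timeform CourseID for a Betfair short name."""
    return VENUE[bf_short_name]["VenueID"]
```

Imports against either the Great Leighs or Chelmsford City short name now resolve cleanly, while Newmarket's missing VenueID flags the one case needing special handling.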

I won't present an updated ERD at this point just in case I come across other issues with the data that force a change on me. Instead, I'll hold back on that until I've created all the tables and I'm satisfied there aren't any more issues.

Racing Database Redesign - Part 9 - Tables (Final Design?)

The table design is drawing to a conclusion now, but there's one more important thing to do before reviewing the final design for any omissions or errors.

When setting up a database, it is not enough to establish the tables and the relationships between them. One must also consider the data that is being stored; what 'type' it is and what 'size' it is. This is important as it helps to ensure that the amount of storage space required is minimised and only the correct type of information is stored in the corresponding fields. To illustrate this point, let me take a simple example...

Imagine you create a table field that is intended to hold the names of jockeys. As you don't know how long people's names may be, you create that field so that it can hold 200 characters. As your table fills up over time you may find that the maximum length of any jockey's name is 25 characters. This means that one field in your database is wasting 175 characters' worth of data storage for EVERY record in that table.

If you've done the same type of thing in all fields, over all tables throughout the database I'm sure you'll see that that wasted space can mount up very quickly.
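A quick back-of-the-envelope calculation shows how it mounts up (the row count is illustrative). To be fair, MySQL's variable-length types such as 'varchar' only store what each value actually needs, so the waste applies most directly to fixed-length types; over-sized fields can still inflate indexes and in-memory buffers, though.

```python
# Back-of-the-envelope: the jockey-name example scaled up.
allocated = 200      # characters reserved per name
longest   = 25       # longest name actually seen
records   = 100_000  # illustrative number of rows

wasted_per_record = allocated - longest       # 175 characters
wasted_total = wasted_per_record * records    # 17,500,000 characters
print(f"{wasted_total:,} characters (~{wasted_total // 1_000_000} MB) wasted")
```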

Also, specifying the type of data each field will hold helps to eliminate data input errors. So, if a field should only contain numbers any attempt to enter text will result in an error.

For a full list of data types available reference should be made to the specific database software you are using. In any event, you can use the ERD to document what data types and settings are required for each and every field within your database.

I use the following format:

datatype(size if applicable), NULL [yes|no], DefaultValue

So, using the EventID field as an example, its datatype is set to 'int' (that's an integer) with a size of 20 (in MySQL this is a display width rather than a storage limit); NULL is set to 'no' - in other words it cannot be empty; and finally there is no default value set - in other words its value must be specified manually or programmatically.
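To see what the NULL setting means in practice, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL (the table and values are invented for illustration): a field declared NOT NULL rejects any record that leaves it empty.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# EventID declared NOT NULL, mirroring a 'NULL: no' setting in the ERD.
con.execute("CREATE TABLE EVENT (EventID INTEGER NOT NULL, Name TEXT)")

con.execute("INSERT INTO EVENT VALUES (101, 'Catterick 14:20')")   # accepted

try:
    # EventID omitted, so it would be NULL - the database refuses it.
    con.execute("INSERT INTO EVENT (Name) VALUES ('no id given')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```

MySQL behaves the same way at insert time, which is exactly why the NULL setting is worth documenting field by field on the ERD.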

The final ERD below lists all the settings required, though some will need further explanation.

Generally speaking, it is best to make sure each field cannot be NULL, i.e. empty. This is not possible with this database simply due to the nature of the data that is being recorded and where it is coming from. So you'll see all of the tables except EVENTDETAIL have at least one field specified to allow a NULL value. Unfortunate, but it cannot really be helped in this case - at least not until I can find sources for that historical information.

Also note the use of the 'varchar' data type. I'm sure MS Access has a similar data type but I cannot remember what it is called. When using MySQL as your database you will find that indexing fields that contain text performs much quicker if the field is configured as a 'varchar' rather than 'text'. I'm not privy to the technical reasons this is so though I'm sure there will be some explanation on the Internet for those who are interested. Suffice to say, if there's any chance you may wish to perform any form of querying on a text field, configure it as a 'varchar'.

I should flag up one important point that you need to consider with regard to the data types that you specify:

Virtually all of this data is being provided by a third party, namely Betfair. As such, neither you nor I have control over that data. Currently, Betfair provide internal IDs such as EventID as numerical fields. Though unlikely, there is nothing to stop them changing that to an alphanumeric field, for example. In that event, any attempt to store such EventIDs in the current design would fail.

I don't think Betfair would implement such changes, but I feel it is important that you are aware of such a possibility. In other words, be aware of where your data is coming from, how much control you have over it and the potential impact it may have on your design.

Table and Design Review:

At this point it is worth doing a final review of the design to ascertain whether everything is covered, or indeed whether anything has been missed, not only from the design but also from the original specification. In no particular order...

1. You should note from the ERD I've changed my mind about having separate date and time fields in the EVENT table. I'm not sure why I suggested it in the first place as, when I checked my old database, I had indeed stored them as one in 'datetime' format.

2. Note that I've included two fields in the VENUE table that ostensibly hold the same information, namely Name and BFShortName. The Name field will hold the full name of the venue whereas BFShortName will hold the abbreviated venue name as contained within the downloadable BF results files as well as the menus you might have seen in trading software. A full list of these pairings can be downloaded from Betfair as an MS Excel file. Using this file, the VENUE table can be pre-populated though you still need to find the corresponding VenueID values.

3. Although the place market downloadable result files indicate whether a runner was placed, they do not indicate its finishing position. Consequently there is no way to ascertain a horse's past form figures from those files. The form figures are available from the Betfair API or numerous websites, and the best place to store that information would be in a text field in the HORSE table. This field would need updating every time the horse ran.
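As one possible sketch of that updating process, assuming the usual racecard conventions (most recent run on the right, positions of 10th or worse shown as '0', last six runs kept - none of which is dictated by the design itself):

```python
# Maintain a form-figures string: append the latest finishing position
# on the right and keep only the most recent six runs.
def update_form(form: str, position: int, keep: int = 6) -> str:
    figure = str(position) if position <= 9 else "0"  # 10th+ shown as '0'
    return (form + figure)[-keep:]

form = "31245"
form = update_form(form, 1)    # horse wins -> "312451"
form = update_form(form, 12)   # finishes 12th -> "124510"
```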

4. A general rule of thumb within database design is this: if something can be calculated from existing data then there is no need to store the result within the database. However, depending on the complexity of the calculation, computing it on the fly can increase processing time for every executed query. If storing that info up front reduces the overall processing time then it may be worth ignoring this general rule.

Take for example the well known Lay The Field strategy. As an exercise, consider how you would go about analysing the historical record using the ERD design above in order to ascertain which venues, race distances, field sizes, race types, etc were statistically good for implementing the LTF strategy - answers/suggestions in the comment section please.

5. It is worthwhile reviewing other sources of information or websites to assess whether other suitable information exists that might be usefully included. For example, at the bottom of every card on the Timeform site is listed the 'Betfair betting forecast' - a list of projected BSP for each runner. Could that be used to develop any future strategy? If you do not record it you'll never be able to analyse the data so you'll never know.

6. Note that the design is generic. While I am principally focussing on UK horse racing, you'll see that the downloadable datafiles all have the same format. Consequently, races from Australia, Ireland, South Africa and USA could be included. As an added bonus, UK greyhound races (granted jockey info would be irrelevant here) could also be recorded within the same database - or more suitably a database with similar structure. Don't say I don't give you value for money!

As you can see from the above points, not only can a final review flag up any omissions but it can lead to expansion beyond the remit of the original specification. The normalisation process which I discussed in an earlier post helps to facilitate this but the database designer must be able to draw the line somewhere and for me, this is it. I'll make a few tweaks as outlined in the review above but the design is now essentially finished.

I hope you've found this series of some use and feel able to have a go yourself. I'm going to take a break for a couple of weeks as I start to implement the design. I'll return with a few posts to discuss some of the automation side of things. In the meantime, let me wish you good luck with your endeavours.