zaterdag, februari 06, 2010

Distributed version control for Databases (technical story)

I'm using Mercurial for some time now and I really enjoy using it. I only code by myself but I want my code to be versioned. One of the biggest advantages for me is the build in pull/push mechanism which can update repositories over the network.

I have a development repository on my computer. And with a simple click I can upload my code changes to the beta environment. Also: if I need to quickly fix a bug, I can clone the live environment to a new folder on my development laptop, make some changes and push it right back to the production server. And be sure that my change is tracked.

There is however one more problem: my web applications consist of both code and data. And the data is located in a database which is described by a data model. For some changes I need a data model change or certain data in the database. And this is not yet manageable by Mercurial. But there has to be a solution. And this is what kept me busy for weeks already.

If we take Mercurial as an example we have three things: The repository, the current working dir and the buffer.

The repository holds all the checked-in versions of the code. It also holds the history. It also holds all the information to recreate a certain snapshot.

The current working dir holds the so called "HEAD" version of the repository and the changes that have not yet been committed to the repository.

The buffer is used to track changes in the current working dir and how to apply them to the repository. For example if two files are modified and then committed. They are written to the buffer first. And from there a transaction is started if for example a change can not be committed because of a merge conflict the corresponding file will be marked and the other will be written to the buffer.

How does this apply to databases?

It took a lot of thinking to come up with the simplest idea. In a database the current working dir is the database itself.

The repository should be an exact copy of the database but then in a different format because we need a different kind of metadata.

There are however some problems:

Tracking changes to data:

Where changes to files are checked by generating and comparing hashed versions of files. This also can be done with database-tables or records. However this could be very costly for large recordsets. I would propose to use two fields: "created" & "modified". When a record is inserted into the database the "created" field is set to the current timestamp.  When a record is updated the "modified" field is set to the current timestamp.

Getting changes since the last "commit" is simple querying the database records that have created or modified fields later than the last commit time.

(This may introduce a problem when times are inconsistent between different database servers, however I guess this will workout as long as time is consistent on the server by itself)

Tracking changes to the data model:

It can be possible for the data model to change, maybe a table field will be changed, added or deleted. During a commit the system must check the data model against the repository and save changes accordingly.

Updating a live data on a database:

In mercurial we execute the "pull" command to get the latest changes from another repository. Updating the working dir is a different command "update". With databases there may be a lot of data to change, taking a long time. Also we don't want to get the database into an inconsistent state. The key here seems to be transactions. However it might be better to create a shadow database, change the data model and data there and when all changes are applied correctly the whole database is copied to the live database.

Unique identifiers:

One mayor problem with databases are primary keys, they need to be unique. The problem is here is that we have separate servers inserting records into the database. We could use GUID's for primary and foreign keys . But I really dislike GUID's for keys. Especially from a performance point (in mySQL at least). My solution would be to use GUIDs in conjunction with integer primary keys. Every record would have a primary key an auto increment int and a GUID. The GUID is created with a server unique key and a unique hash over the record. The GUID for the record will never change.

When a record is saved to the repository all primary keys and foreign keys will be replaced by GUIDs. When merging records also the GUIDs will be used.

Not tracking all data:

Just as with code you may not need some data not to be in your repository like compiled DLL's for example. This can also be true for databases. For cached data, logs, sessiondata etc you may want to track the datamodel but not the data. So there must be a way to not track the data. (or stop saving this data in the database. The data is just filling up your backups!)

Keeping the repository small:

A repository will be at least the size of the original database. But was we track more metadata and delta's the repository can become quite large. Smart use of compression should make it possible to keep the repository quite small.

What's next?

This solution depends heavily on a certain database structure. I would really like to have a solution that works on any database, but that would mean I had to get a HASH on every record in a database. I imagine this will become very costly quite fast

The best way to implement this would probably be a mercurial commit hook. So the database changes get committed at the same time as the code changes.

I'm going to start to implement this as PHP files, as this is the environment I am really needing it for at the moment. I might even learn python so I'll be able to implement the solution as a proper mercurial extension.

I'm not convinced this will be a solution for all possible scenarios, but it seems to be a quite elegant solution for my scenarios at my scale. It would really solve a great annoyance I have with working on and deploying webapplications. It there are better alternatives please let me know!

(Or perhaps should I use a noSQL solution for my data storage needs?)

Geen opmerkingen: