The definition of a semaphore used in programming is “a technique for coordinating or synchronizing activities in which multiple processes compete for the same operating system resources.” Simply put, it is a mechanism for locking resources that are in use and telling other processes to wait. Using semaphores is good engineering practice, keeping in mind that a lot of the time we need to be pragmatic and just reload the data during the weekend.
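As a quick illustration of that general definition, here is a minimal sketch in Python. The names (resource_slots, worker) are made up for the example; the point is simply that a semaphore guards a shared resource and any process or thread that cannot acquire it waits.

```python
import threading
import time

# Pretend the shared resource only tolerates two concurrent users.
resource_slots = threading.Semaphore(2)

def worker(name: str) -> None:
    with resource_slots:      # acquire a slot, or wait until one is free
        print(f"{name} is using the resource")
        time.sleep(0.1)       # simulate work on the shared resource
    print(f"{name} has released its slot")

threads = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```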
In particular, semaphores can be a useful approach when operating on database tables that are frequently updated and loaded at the same time. There are several situations that can cause a deadlock, even when only reading from the table, and that is something we want to avoid. A table can also receive new data while we are reading from it; data that should have been included are omitted because the query has already been sent to the database and results are being fed back to the requesting process. Imagine loading all data for the current date and advancing this date when the load is finished: we miss any data that arrives while we are loading. Even worse, the rows we are loading can be updated while we are reading them.
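To make that gap concrete, here is a minimal sketch of the naive date-watermark load described above, using Python's sqlite3 module. The table and column names (orders, load_control, last_loaded_date) are assumptions for illustration; the comments mark where rows slip through.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
conn.execute("CREATE TABLE load_control (last_loaded_date TEXT)")
conn.execute("INSERT INTO load_control VALUES ('2024-01-01')")

def naive_load(conn):
    (watermark,) = conn.execute("SELECT last_loaded_date FROM load_control").fetchone()
    today = date.today().isoformat()
    # 1. Fetch everything between the old watermark and "now".
    rows = conn.execute(
        "SELECT id, order_date, amount FROM orders "
        "WHERE order_date > ? AND order_date <= ?",
        (watermark, today),
    ).fetchall()
    # ... rows are pushed to the target system here, which can take a while ...
    # 2. Advance the watermark. A row with today's date that arrives after the
    #    SELECT above is never fetched: the next run only looks past `today`.
    conn.execute("UPDATE load_control SET last_loaded_date = ?", (today,))
    conn.commit()
    return rows

naive_load(conn)
```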
If we only want to load the same data once, how do we do this? The idea is to use a single binary column in the shared table as a semaphore column. Prior to fetching data, this column is set to a flag value for the designated rows, indicating that those rows are locked by the process in question. Since we have tagged the rows we are going to operate on, it is easy to untag them again using the same flag as a filter. This way we can at least make sure that we are not missing any data between fetches, even though the table is being updated while we are trying to read from it.
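A minimal sketch of the semaphore-column pattern, again using sqlite3 and hypothetical names (a staging table with a semaphore column), might look like this: tag the rows you are about to read, read only the tagged rows, then clear them using the same flag as the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT, semaphore INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO staging (payload) VALUES (?)", [("a",), ("b",)])

def fetch_batch(conn):
    # 1. Tag: claim every currently visible, unclaimed row in one statement.
    conn.execute("UPDATE staging SET semaphore = 1 WHERE semaphore = 0")
    conn.commit()
    # 2. Read only the rows we tagged. Rows inserted after the UPDATE stay
    #    untagged and are simply picked up by the next run, so nothing is
    #    lost between fetches.
    rows = conn.execute("SELECT id, payload FROM staging WHERE semaphore = 1").fetchall()
    # ... deliver rows to the target system ...
    # 3. On success, remove (or mark as processed) exactly the rows we
    #    delivered, using the same flag as the filter. On failure we could
    #    instead release the lock: UPDATE staging SET semaphore = 0 WHERE semaphore = 1.
    conn.execute("DELETE FROM staging WHERE semaphore = 1")
    conn.commit()
    return rows

print(fetch_batch(conn))
```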
However, preventing the rows that are being fetched from being updated is trickier. You could refrain from updating the locked rows, but how would the loading routine know which ones were not updated the next time around? You could store the locked rows in a separate table, but you would only want to do that if you were sure you could not work around the timetable for when to fetch and load the table. You would be better off raising an event when something is being fetched and halting the whole loading of the table. This can work if the overall loading plan isn’t too hectic. Nevertheless, you are not protected from a fetch and a load starting at exactly the same time, but for the most part it isn’t worth the effort of guarding against traps that are rare when other controls can notify us that they have occurred.
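If the event-and-halt idea fits the loading plan, one simple way to sketch it is a control table that the fetching routine flags and the loading routine checks before it runs. The names below (load_events, fetch_in_progress) are assumptions for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_events (table_name TEXT PRIMARY KEY, fetch_in_progress INTEGER)")
conn.execute("INSERT INTO load_events VALUES ('staging', 0)")

def begin_fetch(conn):
    # The fetching routine raises the flag before it starts reading.
    conn.execute("UPDATE load_events SET fetch_in_progress = 1 WHERE table_name = 'staging'")
    conn.commit()

def end_fetch(conn):
    conn.execute("UPDATE load_events SET fetch_in_progress = 0 WHERE table_name = 'staging'")
    conn.commit()

def load_table(conn):
    # The loading routine checks the flag and backs off instead of colliding
    # with a fetch that is already under way.
    (busy,) = conn.execute(
        "SELECT fetch_in_progress FROM load_events WHERE table_name = 'staging'"
    ).fetchone()
    if busy:
        print("Fetch in progress; skipping this load cycle.")
        return
    # ... perform the load of the shared table here ...
```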
To make sure all of the data is loaded, we, the data integrators, use several techniques. It is nice to be elegant, but sometimes we need to take a pragmatic approach by reloading parts of the data on the weekend or lending a hand with reconciliation. Which locking paradigm you choose depends on the immediacy, the frequency, and the amount of data that is being updated rather than inserted.