Strategies For Dividing And Sharing Source Data

When handling raw data from an external or internal source, we are disposed to several choices:

  • Should we leave the data unaltered, a mere and true replicate of the source to make comparisons between the two more readable?

  • There are times it makes sense to compile data into a singular table - like when input files are plenty, small and but the same format or when the source systems has been normalized into many slim tables with few columns which clearly relate to one another.

  • We can choose to divide large inputs into smaller chunks. This is especially viable if parts of the content is clearly distinguishable, like when you have both customer and supplier orders in the same input stream.

  • In some cases, we can also choose to operate on the incoming data to make it more fertile and ready for consumption. This can happen in any number of ways depending on both the quality of the data and how we will use the data in later stages. Here we will not discover these plentitude of opportunities.

So which strategy should we adopt when viewing a set of new input that is destined to be our new stream of input?

I am inclined to use the restaurant analogy and how they treat their raw material, which is delivered by different vendors. Do they store the material of equal kind in the same or separate containers? Do they treat the goods - cut the cucumbers in two or more slices - before they put it into tiny boxes? Maybe they have a unique cartridge system with equal-size boxes that dictate how it will be partitioned, divided and shared. Circumstances, skills, tools and historical reasons will guide your choices whether you like it or not. Can we somehow come up with some rule of thumb of what to do?

In my experience, the approach of dividing the data from the same source is the most overlooked. It is something we do at a later stage. Even though most source tables are built with no redundancy in mind, it does not mean that it cannot contain diversable information, which appear to be the same, like customer and supplier order or invoices. As the source system is built to handle many world problems and the underlying database design has to cater for that, sometimes we end up with big tables containing data that covers at least two business processes. It can be hard to distinguish from the fact that these tables are often the core, the main artery of that system and here we are with our scalpels trying to carve those data apart. Again, the properties of a customer and supplier invoice are pretty much the same and hence the core of the source can be put in one single source table. From a data modeling perspective, an invoice is an invoice. It is actually quite clever, and 3NFish, to put the data in the same table. It is almost ironic – we are trying to build an information model of the business in question and then we go ahead and put two, quite different, processes in the same heap.

Another example is when an ERP system is using generic tables to record different kind of calculations defined by the business user. These calculations can vary a great deal and the outcome can ultimately support different needs and business processes. For instance, a calculation can find the aggregated performance of a portfolio on a daily basis. Another calculation can tell how the different assets in that portfolio attributed to the overall performance. The results of these will be stored in the same tables as they use the same calculation engine. The use of it can be rather reporting the performance and analysis of the attribution. The reason we want to divide these data into separate tables is that we would want to load it at different times as the reporting needs vary from the analysis’ needs.

Therefore, what we should do is to make those two arteries run in two different veins embracing our user’s needs. And, I do believe in information streams. Streams that build information highways. An information stream is a current of data that flows from the source to a designated end user. We want to separate the different business processes to make theme unconnected and able to run on different times independently of the underlying structure. We do not want them to interrelate as this atomic level. Yes, customer consumption directs the supply of goods we need. It does not mean we need to put in the same table!

So, when should we gather data into a single table? We want maintain the control of the input. We want to make sure that we can explain every possibility of what happened during data transfers - most of all to ourselves. Maintaining hundreds and thousands of inputs when we can collate parts of it makes sense, because having too many tables will make us loose overview and that inevitable control. Especially in cases where there are tables correlating because of a strong 3NF structure in the source, having a one-to-one connection to a main table, we should include those data in that table. If these data are stand-alone and parse in a single table it makes sense to just put them in the same box when gathering the input. When we have many sources gathering the same type of data with the same fields, but different content - it also makes sense to gather these and use a column as a source descriptor to distinguish the data vendors.

In conclusion, divide data from a single source table when it splits into separate business processes. Gather data when it belongs together and there is no other time-related loading reasons for not doing so.