Data warehouse containerization

We were asked to deliver a proposal for the data architecture and modeling to be used in my employer's new data warehouse. The organization recently merged with another one, and the merger partner's systems and processes are still separate from the main data warehouse.

In the new environment, the following must be taken into account:

– the different cultures of the merger partners.

– the organization that is spread over several locations and departments.

– the existing landscape that must continue to deliver; the show must go on.

– the technology and the environment, which require a flexible set-up that can last for several years.

With regard to the cultures:

– The largest merger partner consists of a self-managing team of experienced employees with a more ad hoc, demand-driven way of working; speed of delivery is central.

– The smallest merger partner consists of two teams organized according to the principles of a demand-supply organization, relying heavily on external hiring. They also provide the new management.

To take all these factors into account, we propose the following:

– Leave the current data warehouse intact, “the show must go on.”

– Add the data of the smallest merger partner as quickly as possible, so that a single, unambiguous version of the truth is created.

– Organize the total data landscape in containers, each with its own staff and its own freedom to organize its work processes without affecting other containers. Working with an OTAP (development, test, acceptance, production) street also becomes easier if there are no interfaces between those containers, or at least only clearly defined ones. Experience shows that OTAP testing processes that touch everything at once can hardly be carried out anymore, so we want to reduce that complexity. The linking pin between the containers is formed by the metadata repository & job control center.
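As a rough sketch of what "clearly defined interfaces" could look like in practice, the container boundaries could be recorded declaratively and checked automatically. This is a minimal, hypothetical example; the container names, team names, and the rule that the metadata repository is the only shared touch point are assumptions for illustration, not an existing design.

```python
# Hypothetical sketch: declare each container and its allowed interfaces,
# then verify that containers only touch each other via the metadata
# repository & job control center (the "linking pin").

CONTAINERS = {
    "self_service_bi":     {"team": "BI team",      "interfaces": ["metadata_repository"]},
    "data_science":        {"team": "Data science", "interfaces": ["metadata_repository"]},
    "applications":        {"team": "Apps team",    "interfaces": ["metadata_repository"]},
    "metadata_repository": {"team": "DWH core",     "interfaces": []},
}

def check_boundaries(containers):
    """Return a list of direct container-to-container dependencies that
    bypass the metadata repository & job control center."""
    violations = []
    for name, spec in containers.items():
        for dep in spec["interfaces"]:
            if dep != "metadata_repository" and dep in containers:
                violations.append(f"{name} -> {dep}")
    return violations

if __name__ == "__main__":
    print(check_boundaries(CONTAINERS) or "All container interfaces go via the linking pin.")
```

A check like this could run as part of the OTAP street, so that an unintended dependency between containers is flagged before it reaches testing.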

The containers we envisage are:

– Self-service BI / dashboards based on star models, which can perhaps to a large extent simply be copied from the models already present. Use an OTAP street as standard for this self-service BI container.

– Data science. This container is easily forgotten but must be named. It seems to me that it should also be arranged with a development and a production part, unless nothing is put into production, in which case that must also be made explicit. In principle this concerns research and long-term studies without any guarantee of results.

– Applications. There is currently a shift towards real-time information provision, because the production system supplier no longer supports any regular reports and there are major problems with validating the registrations. Operational information provision and ad hoc data earn the organization money, so this is important. For real-time use only flat tables matter; star modelling should not be used here. Timeliness and lead time are the key drivers in this container.

– Metadata repository & job control center.

An automated repository that records metadata, performance, and usage of all components of the data warehouse, for auditing, maintenance, and prioritizing data warehouse processes.
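A minimal sketch of what such a job control record could look like, assuming a simple relational store; the table and column names are illustrative only, not the actual repository design.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative sketch: one table that records metadata, performance and usage
# of data warehouse jobs, so runs can be audited and prioritized per container.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_runs (
        job_name      TEXT,
        container     TEXT,      -- e.g. self_service_bi, applications
        started_at    TEXT,
        duration_sec  REAL,
        rows_loaded   INTEGER,
        status        TEXT       -- ok / failed
    )
""")

def log_run(job_name, container, duration_sec, rows_loaded, status):
    """Record one job run in the repository."""
    conn.execute(
        "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?, ?)",
        (job_name, container, datetime.now(timezone.utc).isoformat(),
         duration_sec, rows_loaded, status),
    )

# Example usage: log two runs and report average duration per container,
# which could feed the prioritization of data warehouse processes.
log_run("load_customers", "applications", 42.0, 10_000, "ok")
log_run("refresh_sales_star", "self_service_bi", 300.0, 250_000, "ok")
for row in conn.execute(
    "SELECT container, AVG(duration_sec) FROM job_runs GROUP BY container"
):
    print(row)
```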

In principle, the boundaries between the containers are not fixed. The core of the split is between an "old", unmanaged part and a "new", managed container landscape that is still to be built. Depending on the success and degree of use of these containers, and their shrinkage or growth, including the allocated FTE, the boundaries between them can be moved.

So what is a container? 😉

The inspiration for the container setup comes from our practical experience within a large, complex data warehouse environment and from developments in cloud architecture (see, for example, docker.com). Basically, it means putting a certain group of data warehouse activities into one environment, including all their dependencies, so that people and resources do not get in each other's way.

Step 1 of applying "containerization" is to put down boundary markers ("picket posts") within the existing (largest) data warehouse environment.
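As an illustration of what those first boundary markers could be, one could start by simply assigning every existing schema or job in the current warehouse to a candidate container, leaving everything else in the old, unmanaged part for now. The schema and container names below are made up for the sketch.

```python
# Hypothetical first inventory: map existing schemas/jobs of the current
# data warehouse onto candidate containers ("picket posts").
ASSIGNMENT = {
    "dm_sales_star":   "self_service_bi",
    "dm_finance_star": "self_service_bi",
    "lab_churn_model": "data_science",
    "ops_realtime":    "applications",
    "etl_scheduler":   "metadata_repository",
}

def container_of(schema):
    """Everything not yet assigned stays in the 'old', unmanaged part."""
    return ASSIGNMENT.get(schema, "legacy_unmanaged")

print(container_of("dm_sales_star"))   # self_service_bi
print(container_of("old_adhoc_dump"))  # legacy_unmanaged
```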