Silo to Pipeline

In the recent O’Reilly whitepaper, “The Path to Predictive Analytics and Machine Learning,” the authors, Conor Doherty, Steven Camina, Kevin White and Gary Orenstein, point out a key issue facing many businesses: data silos.

What is a data silo?

Traditional data architectures use a siloed Online Transaction Processing (OLTP) model for Customer Relationship Management (sales, returns, queries) and a completely separate data store for analysis. If these analytical stores are accessible online, they are called Online Analytical Processing (OLAP) warehouses.

OLAP-optimized data warehouses cannot handle one-off inserts and updates. Instead, data must be organized and loaded all at once —as a large batch—which results in an offline operation that runs overnight or during off-hours. The tradeoff with this approach is that streaming data cannot be queried by the analytical database until a batch load runs. With such an architecture, standing up a real-time application or enabling analyst to query your freshest dataset cannot be achieved.

Doherty, Camina, White, Orenstein
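The tradeoff in that passage can be made concrete with a toy model. The sketch below is only an illustration: `BatchWarehouse`, `StreamingStore`, and the order records are hypothetical names invented here, not part of the whitepaper or of any real product. It shows that records written to a batch-loaded store stay invisible to queries until the load job runs, while a store that ingests record by record can be queried right away.

```python
class BatchWarehouse:
    """Toy analytical store that only sees data after a batch load."""

    def __init__(self):
        self.staged = []     # records waiting for the next batch load
        self.queryable = []  # records visible to analysts

    def write(self, record):
        self.staged.append(record)

    def run_batch_load(self):
        # Simulates the overnight job: move everything staged into the warehouse.
        self.queryable.extend(self.staged)
        self.staged.clear()

    def count(self):
        return len(self.queryable)


class StreamingStore:
    """Toy store where each record is queryable as soon as it is written."""

    def __init__(self):
        self.queryable = []

    def write(self, record):
        self.queryable.append(record)

    def count(self):
        return len(self.queryable)


batch, stream = BatchWarehouse(), StreamingStore()
for order_id in range(3):
    record = {"order_id": order_id}  # made-up sample data
    batch.write(record)
    stream.write(record)

before_load = (batch.count(), stream.count())
batch.run_batch_load()
after_load = (batch.count(), stream.count())
print(before_load, after_load)  # (0, 3) (3, 3)
```

Before the load runs, an analyst querying the batch warehouse sees none of the three new orders; that gap is the staleness the authors describe.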

The O’Reilly authors suggest pairing the somewhat alarmingly named Apache Kafka with Apache Spark. Kafka is a high-throughput, distributed messaging system that “acts as a broker between producers (processes that publish their records to a topic) and consumers (processes that subscribe to one or more topics)”; Spark is a distributed, memory-optimized system for data transformation.
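To make the producer, topic, and consumer roles concrete, here is a minimal in-memory sketch. It is not Kafka’s or Spark’s real API: `ToyBroker`, the "shipments" topic, and the sample events are all assumptions made up for illustration. The final loop stands in, very loosely, for the kind of transformation step a system like Spark would perform on consumed records.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker; NOT Kafka's real API."""

    def __init__(self):
        # topic name -> append-only list of records
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def read(self, topic, offset):
        # Return the records at or after `offset`, plus the next offset to read.
        records = self.topics[topic][offset:]
        return records, offset + len(records)


broker = ToyBroker()

# Producer side: publish events to a hypothetical "shipments" topic.
broker.publish("shipments", {"route": "A", "minutes": 42})
broker.publish("shipments", {"route": "B", "minutes": 57})

# Consumer side: the consumer remembers its offset, so a later read
# returns only the records that arrived since the previous read.
first_batch, offset = broker.read("shipments", 0)
broker.publish("shipments", {"route": "A", "minutes": 38})
second_batch, offset = broker.read("shipments", offset)

# Transformation step (hypothetical): average delivery time per route.
minutes_by_route = defaultdict(list)
for event in first_batch + second_batch:
    minutes_by_route[event["route"]].append(event["minutes"])
average_minutes = {
    route: sum(values) / len(values)
    for route, values in minutes_by_route.items()
}
print(average_minutes)  # {'A': 40.0, 'B': 57.0}
```

The point of the design is decoupling: the producer never needs to know who reads the topic, and a consumer can pick up new records as they arrive instead of waiting for a batch load.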

Envision a shipping network in which the schedules and routes are determined programmatically by using predictive models. The models might take weather and traffic data and combine them with past shipping logs to predict the time and route that will result in the most efficient delivery. In this case, day-to-day operations are contingent on the results of analytic predictive models. This kind of on-the-fly automated optimization is not possible when transactions and analytics happen in separate siloes.

Doherty, Camina, White, Orenstein

While your company might not be ready for real-time analytics, it’s worth thinking about building a system that is ready for that transformation. And there is good news for your bottom line: Apache Kafka and Apache Spark are both free and open source.