There is a lot of talk these days about predictive analytics, big data, real-time analytics, dashboards, and active data warehousing. These topics are related in a fairly straightforward way. Further, there are new claims about in-memory database processing that blend these issues into a promise of real-time predictive analytics. Let's tease the topic apart…
Predictive analytics is really composed of two parts: modeling and scoring.
Modeling requires big data to discover which information from the enterprise predicts some interesting event. Big data is both broad and deep: broad because no one knows in advance which data elements will be predictive… “which elements” has to be discovered in the modeling exercise… and deep because it takes history to detect a trend.
The model that results represents a rule: if the rule’s conditions are true then we can predict the outcome within some statistical boundaries. For example: “if a payment has not been received within 120 days and the customer’s account balance has dropped over 40% in the last year then there is an 83.469271346% chance that the customer will default on the payment”. Note that you can create a rule without predictive modeling and without a statistical boundary… for example: “if payment has not been received within 120 days then the customer will likely default”. We have been creating these heuristic rules since time began… and the point of predictive analytics is to discover rules that are more accurate than heuristics. You might say then that a model defines, or predicts with some certainty, an event of interest. The definition may be described as a set of rules.
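To make the difference between a heuristic rule and a modeled rule concrete, here is a minimal sketch in Python. The names `Account`, `heuristic_rule`, and `modeled_rule` are hypothetical, and the 0.83 probability is simply the figure from the example above, rounded; nothing here comes from a real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    days_since_payment_due: int      # days the payment has been outstanding
    balance_change_last_year: float  # -0.45 means the balance fell 45%


def heuristic_rule(account: Account) -> bool:
    """Hand-written rule: flags a likely default, with no probability attached."""
    return account.days_since_payment_due > 120


def modeled_rule(account: Account) -> Optional[float]:
    """Rule discovered by modeling: returns an estimated default probability
    when its conditions hold, otherwise None (the rule says nothing)."""
    if (account.days_since_payment_due > 120
            and account.balance_change_last_year < -0.40):
        return 0.83  # illustrative figure, rounded from the example in the text
    return None


late_and_shrinking = Account(days_since_payment_due=130, balance_change_last_year=-0.45)
late_only = Account(days_since_payment_due=130, balance_change_last_year=0.10)
```

The heuristic fires for both accounts, while the modeled rule fires only for `late_and_shrinking` and attaches a probability to the prediction.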
Scoring requires only the elements discovered by the modeling exercise… and may or may not require big, deep data. Big data is required if any of the discovered elements represents a trend. For example, suppose that predicting a stock price requires elements that represent the average price over the last 30 days, 90 days, 180 days, etc., plus an element that calculates the difference between the 90 day number and the 30 day number to show the trend. Then either the data has to be aggregated on-the-fly from the detail… or it must be pre-aggregated. This distinction is important… the result of a modeling exercise may require the creation of some new aggregated data. Note that we are suggesting that a score depicts an interesting event.
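The two options for the trend element might look something like this sketch. The function names are hypothetical, the price list is made up, and the sign convention (30 day average minus 90 day average, so a positive number means a rising price) is an assumption chosen for illustration.

```python
from typing import Dict, List


def trailing_average(daily_prices: List[float], window: int) -> float:
    """Average of the most recent `window` prices, computed from the detail."""
    if len(daily_prices) < window:
        raise ValueError("not enough history for this window")
    return sum(daily_prices[-window:]) / window


def trend_on_the_fly(daily_prices: List[float]) -> float:
    """Option 1: aggregate from the detail rows every time we score."""
    return trailing_average(daily_prices, 30) - trailing_average(daily_prices, 90)


def trend_from_aggregates(aggregates: Dict[str, float]) -> float:
    """Option 2: read averages that some earlier job has already computed."""
    return aggregates["avg_30"] - aggregates["avg_90"]


# Made-up history: 60 days at 10.0 followed by 30 days at 20.0.
prices = [10.0] * 60 + [20.0] * 30
pre_aggregated = {
    "avg_30": trailing_average(prices, 30),
    "avg_90": trailing_average(prices, 90),
}
```

Both options return the same trend value; the difference is where the 90 days of detail have to live at scoring time. Option 1 needs the deep data on hand, while option 2 needs only the two pre-aggregated numbers.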
Real-time analytics, or more fairly, near-real-time analytics, requires these rules to be checked ASAP after new information is available.
A dashboard can provide one of two features: either the dashboard applies the rules and presents alerts when some rule triggers… or the dashboard may present raw data for evaluation by a human. For example, the speedometer on your car dashboard presents raw data and it is up to you to apply rules based on the input. Note that the speedometer on your car provides a real-time display. Sometimes BI dashboards use real-time displays like a speedometer to display static data. For example, I have seen daily metrics displayed using a speedometer widget… but since the speedometer updates just once a day this is clearly metaphorical.
Active data warehousing implies some sort of rule-based activity. The activity may be triggered in near-real-time or as a batch process.
But in any consideration of real-time processing there is an issue if the rules cross data input boundaries. By this I mean… it is simpler to build a speedometer that reads from one input, the rotation of the axle, than to build a meter that incorporates multiple inputs… for example the meter that displays how many miles/meters you have left before you run out of petrol. To provide this in real-time your car needs real-time access to two inputs… and an embedded processor is required to integrate the data and derive the data display. Near-real-time displays of data warehouse data have the same constraints… if the display requires data from more than one source then that data has to be acquired, integrated, and calculated/scored in near-real-time. This is a daunting problem if the sources cross application system boundaries.
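A rough sketch of the petrol meter shows where the difficulty comes from. The `Reading` type, the two-second freshness limit, and the choice of inputs (fuel in the tank and current consumption) are all assumptions made for illustration, not a description of how any real car computes its range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    value: float
    timestamp: float  # seconds, on a clock shared by both sensors


def range_remaining_km(fuel_litres: Reading,
                       litres_per_100km: Reading,
                       now: float,
                       max_age_seconds: float = 2.0) -> Optional[float]:
    """Integrate two inputs into one derived number.

    Returns None when either input is too old to count as real-time,
    or when the consumption reading cannot be used as a divisor.
    """
    for reading in (fuel_litres, litres_per_100km):
        if now - reading.timestamp > max_age_seconds:
            return None
    if litres_per_100km.value <= 0:
        return None
    return fuel_litres.value / litres_per_100km.value * 100.0


fresh = range_remaining_km(Reading(30.0, 99.5), Reading(6.0, 99.8), now=100.0)
stale = range_remaining_km(Reading(30.0, 50.0), Reading(6.0, 99.8), now=100.0)
```

The arithmetic is trivial. The hard part is the freshness check: the derived number is only as current as its oldest input, which is exactly the constraint a near-real-time display faces when its inputs live in different source systems.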
In the book “In-memory Data Management” Plattner and Zeier promise near-real-time analytics from an in-memory DBMS, HANA. But there is no discussion of how this really works for a data warehouse built on data integrated across source systems. Near-real-time predictive modeling requires broad data that will cross these system boundaries. It may be possible to develop a system in which near-real-time data acquisition and integration can occur… and rules may be applied immediately to identify interesting events. But this sort of data acquisition is very advanced… and an in-memory database does not inherently solve the problem… not every application in the enterprise will be in the same memory space.
I do not see it. SAP may live in a single memory space… and maybe every possible application of that data can live there as well. But as long as there is relevant data outside of that space, data integration is required and the argument for real-time weakens.