While being able to build predictive models on mountains of data without moving it out of the database is pretty cool in itself, I feel analysis without action is pretty much pointless. Tom Davenport describes this common data mining conundrum in Competing on Analytics.
Many firms are able to segment their customers and determine which ones are most profitable or which are most likely to defect. However, they are reluctant to treat different customers differently—out of tradition or egalitarianism or whatever. With such compunctions, they will have a very difficult time becoming successful analytical competitors—yet it is surprising how often companies initiate analyses without ever acting on them. The “action” stage of any analytical effort is, of course, the only one that ultimately counts.
The OBE tutorial describes a scenario in which a business wants to identify customers who are most likely to purchase insurance. Through a set of simple steps, a (decision tree) classification model is built on historical data; it can then be used to predict whether a particular customer is likely to purchase.
In a classical data mining approach, the predictions of this model would be written to some OUTPUT_TABLE where they would be available for subsequent processing. Growing staler every minute—and soon forgotten when its newer sibling OUTPUT_TABLE_NEW_FINAL_2 is inevitably created—our precious business intelligence slowly withers away in a disregarded section of the database until ultimately dropped by a careless DBA.
Output tables are where analytical insight goes to die.
If all we were interested in was building models, we’d be better off glueing choo-choos. It is the new ways in which we can utilise these database-resident models that make this technology really interesting. With a few simple additional steps, this same model can be used in real time to provide inline predictions based on up-to-date customer data, as well as for new customers.
All we need is a view
and a join.
Update (October 3rd, 2012): as Marcos points out in the comments, I was making things far too complicated. No need for a separate join; simply select the output columns you need and pass everything directly to the view.
The join operation glues the original data and the prediction model together; the view allows us to look at the harmonised results directly. When a customer record is selected from the view, the source data for this record is passed to the model to generate the predicted values in real time. When the source data changes, so does the prediction. When new source records are added, they are automatically processed in the same way.
    -- Create a new customer.
    INSERT INTO INSUR_CUST_LTV_SAMPLE (CUSTOMER_ID, LAST, FIRST)
    VALUES ('CU123', 'VERMEER', 'LUKAS');

    1 rows inserted.
    Elapsed: 00:00:00.003

    -- Get prediction and probability for the new customer.
    SELECT CUSTOMER_ID, insur_pred, insur_prob
    FROM insur_cust_ltv_prediction
    WHERE CUSTOMER_ID = 'CU123';

    CUSTOMER_ID INSUR_PRED INSUR_PROB
    ----------- ---------- ----------
    CU123       No         0.7262813

    Elapsed: 00:00:00.004

    -- Update customer data.
    UPDATE INSUR_CUST_LTV_SAMPLE
    SET bank_funds = 500, checking_amount = 100
    WHERE CUSTOMER_ID = 'CU123';

    1 rows updated.
    Elapsed: 00:00:00.003

    -- Get prediction and probability for the updated customer.
    SELECT CUSTOMER_ID, insur_pred, insur_prob
    FROM insur_cust_ltv_prediction
    WHERE CUSTOMER_ID = 'CU123';

    CUSTOMER_ID INSUR_PRED INSUR_PROB
    ----------- ---------- ----------
    CU123       Yes        0.6261398

    Elapsed: 00:00:00.004
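The mechanism behind that session (a view that scores each row at the moment it is read) can be mimicked outside Oracle. What follows is only a rough sketch, assuming SQLite in place of Oracle and a hypothetical `predict_insurance` function with a made-up threshold rule standing in for the trained decision tree. In Oracle the view would call the built-in `PREDICTION` and `PREDICTION_PROBABILITY` SQL functions on the mined model instead.

```python
import sqlite3


def predict_insurance(bank_funds, checking_amount):
    """Hypothetical stand-in for a trained classifier (made-up rule)."""
    funds = bank_funds or 0          # NULL columns arrive as None
    checking = checking_amount or 0
    return "Yes" if funds + checking >= 500 else "No"


conn = sqlite3.connect(":memory:")
# Let views call application-defined functions (already the default in most builds).
conn.execute("PRAGMA trusted_schema = ON")
# Register the stand-in model as a SQL function so the view can call it per row.
conn.create_function("predict_insurance", 2, predict_insurance)

conn.executescript("""
    CREATE TABLE insur_cust_ltv_sample (
        customer_id     TEXT PRIMARY KEY,
        bank_funds      REAL,
        checking_amount REAL
    );

    -- The view stores no predictions; it scores each row when it is read.
    CREATE VIEW insur_cust_ltv_prediction AS
    SELECT customer_id,
           predict_insurance(bank_funds, checking_amount) AS insur_pred
    FROM insur_cust_ltv_sample;
""")


def prediction_for(customer_id):
    """Read the current prediction for one customer through the view."""
    row = conn.execute(
        "SELECT insur_pred FROM insur_cust_ltv_prediction WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


# A new customer is scored without any extra step.
conn.execute("INSERT INTO insur_cust_ltv_sample (customer_id) VALUES (?)", ("CU123",))
before = prediction_for("CU123")

# Changing the source data changes the prediction on the next read.
conn.execute(
    "UPDATE insur_cust_ltv_sample SET bank_funds = 500, checking_amount = 100 "
    "WHERE customer_id = ?",
    ("CU123",),
)
after = prediction_for("CU123")
print(before, after)
```

Nothing is written to an output table here: the prediction exists only for as long as someone is looking at it, which is the whole point.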
Seamless. Any system that can read data from an Oracle database can now utilise Oracle Data Mining models. No need to move your data. No need to build new applications.
Applications reading data from the view need never know the difference between the original source data and machine-generated predictions. Oracle Business Intelligence Publisher can easily display this data in forecasting reports, or use it to power proactive alerts. In Oracle Real-Time Decisions, rules can be built around the outcomes of these models, or predictions from multiple sources can be fed into combined likelihood models for increased accuracy.
This is huge. Trust me. Stop over-analysing and start taking action. After all, that’s the only step that ultimately counts.
The Wall Street Journal has an interesting article explaining how companies are starting to use (big) data to support their recruiting efforts. It provides a good example of the more general trend in businesses towards evidence-based decisioning and data science, but it also shows how some crucial aspects of these techniques are easily overlooked or oversimplified.
My big-data-science-bogus-alarm started ringing upon reading the last sentence in this short paragraph.
Applicants for the job take a 30-minute test that screens them for personality traits and puts them through scenarios they might encounter on the job. Then the program spits out a score: red for low potential, yellow for medium potential or green for high potential. Xerox accepts some yellows if it thinks it can train them, but mostly hires greens.
Sounds smart, right? Well, maybe.
If Xerox never hires any “reds” and only very few “yellows”, how will they know the program is actually working? How will they know that all that complicated math is doing something more than simply returning random colour values? An evidence-based approach should always include some form of scientific control. If it doesn’t, it might as well be snake oil.
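To make the point concrete, here is a small sketch with made-up hiring outcomes. The data layout, the notion of a hire having "worked out", and both function names are assumptions for illustration; nothing here reflects how Xerox or its vendor actually evaluates the test.

```python
from collections import defaultdict


def success_rate_by_colour(hires):
    """Return {colour: share of hires that worked out} for each colour hired.

    `hires` is an iterable of (colour, worked_out) pairs.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for colour, worked_out in hires:
        totals[colour] += 1
        successes[colour] += int(worked_out)
    return {colour: successes[colour] / totals[colour] for colour in totals}


def score_lift(hires, high="green", low="red"):
    """Difference in success rate between high and low scorers.

    Returns None when one of the groups was never hired, because then
    there is nothing to compare against.
    """
    rates = success_rate_by_colour(hires)
    if high not in rates or low not in rates:
        return None
    return rates[high] - rates[low]


# Made-up outcomes: hiring greens only.
greens_only = [("green", True), ("green", True), ("green", False), ("green", True)]
# The same hires plus a small control group hired regardless of score.
with_control = greens_only + [
    ("red", True), ("red", True), ("red", False), ("red", True),
]

print(score_lift(greens_only))   # None
print(score_lift(with_control))  # 0.0
```

In this invented data the greens look respectable on their own (three out of four worked out), yet the control group shows the score adds nothing. Without the reds there is no way to tell the difference.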
Of course, this is probably just a simple journalistic crime of omission of a trivial implementation detail, but it reminded me of that old chestnut “the tiger repellant”. For your convenience, this blogpost has been equipped with some very strong Tiger Repellant tonic. If you do not see any tigers around you right now, you will know it is working.
See? No tigers?
Proven to work like a charm. Order yours today! Great prices! Limited availability! Now taking applications in the comments.
[ Disclaimer: Tiger Repellant is not certified for use in South-East Asia or zoological parks. Tiger Repellant Inc. and its employees and subsidiaries cannot be held liable for any damage caused to your person in the event of being eaten by a tiger. ]
My girlfriend has been struggling with an interesting little problem lately. She was asked to determine the optimal distribution of medicine boxes and bottles over a set of adaptable cabinets, under volume as well as weight constraints. Not an easy task for a computer scientist, let alone for a hospital pharmacist in training.
After she described the problem to me last night, I (unhelpfully) mumbled that “this sounds like a variable sized bin packing problem to me; you can’t solve that kind of thing in Excel, you probably need an LP solver”.
Apparently I was wrong. It already seemed obvious to me that Excel suffers from a severe case of feature bloat, but this is just absurd.
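For the curious, this is roughly what the problem looks like in code. It is a minimal sketch of a first-fit decreasing heuristic with two constraints per cabinet, not an exact solver and not what was done in Excel; the `Cabinet` class, the item names and all the numbers are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cabinet:
    name: str
    max_volume: float
    max_weight: float
    items: list = field(default_factory=list)
    used_volume: float = 0.0
    used_weight: float = 0.0

    def fits(self, volume, weight):
        """True if the item respects both the volume and the weight limit."""
        return (self.used_volume + volume <= self.max_volume
                and self.used_weight + weight <= self.max_weight)

    def add(self, name, volume, weight):
        self.items.append(name)
        self.used_volume += volume
        self.used_weight += weight


def first_fit_decreasing(items, cabinets):
    """Place each (name, volume, weight) item in the first cabinet with room.

    Items are tried largest volume first. Returns the names of items that
    fit nowhere. This is a heuristic: it can miss packings that an exact
    solver would find.
    """
    unplaced = []
    for name, volume, weight in sorted(items, key=lambda i: i[1], reverse=True):
        for cabinet in cabinets:
            if cabinet.fits(volume, weight):
                cabinet.add(name, volume, weight)
                break
        else:
            unplaced.append(name)
    return unplaced


# Cabinets of different sizes: this is the "variable sized" part.
cabinets = [
    Cabinet("A", max_volume=10, max_weight=5),
    Cabinet("B", max_volume=6, max_weight=20),
]
items = [
    ("saline bottles", 6, 8),
    ("tablet boxes", 5, 1),
    ("syrup bottles", 4, 3),
    ("ointment tubes", 1, 1),
]

unplaced = first_fit_decreasing(items, cabinets)
for cabinet in cabinets:
    print(cabinet.name, cabinet.items)
```

The heavy saline bottles are pushed to cabinet B by the weight limit even though A has the volume for them, which is exactly the kind of interaction that makes the two-constraint version awkward to do by hand.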
As the beautiful old car cruised in almost perfect silence under the guidance of its automatic controls, Duncan tried to see something of the terrain through which he was passing. The spaceport was fifty kilometers from the city—no one had yet invented a noiseless rocket—and the four-lane highway bore a surprising amount of traffic. Duncan could count at least twenty vehicles of various types, and even though they were all moving in the same direction, the spectacle was somewhat alarming.
“I hope all those other cars are on automatic,” he said anxiously.
Washington looked a little shocked. “Of course,” he said. “It’s been a criminal offence for—oh, at least a hundred years—to drive manually on a public highway. Though we still have occasional psychopaths who kill themselves and other people.”
The future sounds fascinating, but I want my Google Driverless Car now.
Derek Jones posits that “success does not require understanding”.
In my line of work I am constantly trying to understand what is going on (the purpose of this understanding is to control and make things better) and consider anybody who uses machine learning as being clueless, dim witted or just plain lazy; the problem with machine learning is that it gives answers without explanations (ok decision trees do provide some insights).
Problem solving versus solving problems.
As one who specializes in using machine learning, I obviously resent being called “clueless, dim witted or just plain lazy”. However, I feel a larger point should be made here. Success does most definitely require understanding, but not necessarily of how one particular instance of a solution came about.
To be successful in any machine learning effort, one needs to have an intricate understanding of what the problem is and how techniques can be applied to find solutions. This is a more general form of understanding, which puts more emphasis on the process of finding workable models than on applying these models to individual instances of a problem. Comprehension of problem solving over understanding a particular solution.
Driving a black box.
Consider the following example. To me, the engine of my car is a black box; I have very little idea how it works. My mechanic does know how engines work in general, but he is unable to know the exact internal state of the engine in my car as I am cruising down the highway at 100 miles per hour. None of this “lack of understanding” prevents me from getting from A to B. I turn the wheel, I push the pedal and off we go.
In essence, my mechanic and I have different levels of understanding of my car. But importantly, at different levels of precision, the thing becomes a black box to each of us; in the sense that there is a point where our otherwise perfectly practical models break down and no longer are able to reflect reality. In the end, it’s black boxes all the way down.
Models are merely tools to help you navigate a vastly complex world. Very much like a machine learning model, a scientific model might work in many cases and still be wrong. Newton’s law of universal gravitation works in many cases too, yet we know for a fact that that particular model is definitely wrong; and I sincerely hope many others are just as incorrect.
There will always be limits to our understanding. The fact that we have a model that can help us predict does not necessarily mean we have correctly understood the nature of the universe. All models are wrong, but some are useful.
Reality is simply much too complicated to be captured in a manageable set of rules, but even incomplete (or incorrect) models can provide insight and help us better navigate this world. Machine learning is successful, precisely because it can help us find such models.
[ Peter Norvig has written an excellent piece on this subject in relation to language models. ]
Marketing catchphrases like “recommended by experts” (an appeal to authority), “world-renowned bestseller” (candidly claiming consensus) and “limited supply only” (suggesting scarcity) are widely used to promote many different types of products. To a marketeer, these persuasion tactics are like universally coaxing super supplements that can make just about any offer seem more enticing.
But not all these advertising additives are created equal; and neither, apparently, are all consumers.
In a fascinating (at least, to scientific advertising geeks like me) study titled “Heterogeneity in the Effects of Online Persuasion”, social scientists Maurits Kaptein and Dean Eckles looked at the differences in susceptibility to varying influence tactics between individuals. What they found may change the way we think about recommendation engines and marketing personalization in general.
It is striking how large the heterogeneity is relative to the average effects of each of the influence strategies. Even though the overall effects of both the authority and consensus strategies were significantly positive, the estimates of the effects of these strategies was negative for many participants. […] Employing the “wrong” strategy for an individual can have negative effects compared with no strategy at all; and the present results suggest there are many people for whom the included strategies have negative effects.
Our advertising additives can have adverse side-effects. Some people don’t respond well to authority; others don’t feel much for the majority rule. If you pick the wrong strategy for a particular individual, you may actually hurt your marketing efforts; independent of what product you are actually trying to sell.
Kaptein also collaborated on another publication, “Means Based Adaptive Persuasive Systems”, which looked at the combined effects of multiple persuasion strategies.
Contrary to intuition, having multiple sources of advice agree on the recommendation had not only no positive impact on compliance levels but actually had a slightly negative effect when compared to the preferred strategy. This is a fascinating discovery since one would assume two agreeing opinions would be stronger than one.
As strange as it may seem, in the case of combined cajolery, the whole is not only less than the sum of its parts; it is less than the single best bit.
1 + 3 = 2
Eckles and Kaptein conclude that personalization is key; and I couldn’t agree more.
To use the results presented above influencers will have to create implementations of distinct influence strategies to support product representations or customer calls to action. As in the two studies presented above, multiple implementations of influence strategies can be created and presented separately. Thus, one can support a product presentation on an e-commerce website by an implementation of the scarcity strategy (“This product is almost out of stock”) or by an implementation of the consensus strategy (“Over a million copies sold”). If technically one is able to represent these different strategies together with the product presentations, identify distinct customers, and measure the effect of the influence strategy on the customer, then one can dynamically select an influence strategy for each customer.
The good news is that we can do this today. Using Oracle Real-Time Decisions, choosing the best influence strategy for a particular customer can easily be implemented as a separate decision to be optimized for conversion. Alternatively, these strategies could simply be considered as another facet of your assets in RTD; similar to the way we would utilize product category metadata to share learnings across promotions.
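Outside of RTD, the idea of learning which strategy works for which customer can be sketched in a few lines. The example below is an illustration only, assuming per-customer Thompson sampling with a Beta prior; it is not how Oracle Real-Time Decisions works internally, nor the method used in either paper, and the `StrategySelector` class and its method names are hypothetical.

```python
import random
from collections import defaultdict

STRATEGIES = ("authority", "consensus", "scarcity")


class StrategySelector:
    """Per-customer Thompson sampling over influence strategies."""

    def __init__(self, strategies=STRATEGIES, seed=None):
        self.strategies = tuple(strategies)
        self.rng = random.Random(seed)
        # counts[customer][strategy] = [conversions, non-conversions]
        self.counts = defaultdict(lambda: {s: [0, 0] for s in self.strategies})

    def choose(self, customer_id):
        """Sample a plausible conversion rate per strategy; pick the highest."""
        stats = self.counts[customer_id]
        draws = {
            s: self.rng.betavariate(1 + wins, 1 + losses)
            for s, (wins, losses) in stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, customer_id, strategy, converted):
        """Update the customer's tally after showing `strategy`."""
        self.counts[customer_id][strategy][0 if converted else 1] += 1


# Made-up history: this customer responds to scarcity, never to consensus.
selector = StrategySelector(strategies=("consensus", "scarcity"), seed=42)
for _ in range(50):
    selector.record("CU123", "scarcity", True)
    selector.record("CU123", "consensus", False)

choices = {selector.choose("CU123") for _ in range(100)}
print(choices)
```

A customer with no history gets an essentially random strategy, and the selector keeps exploring until the evidence for one strategy piles up. Keeping a separate tally per customer is the simplest way to respect the heterogeneity the study describes, though it learns slowly; sharing information between similar customers would be the obvious next step.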
Personalization is about more than just deciding what you want to sell. This research clearly shows that a recommendation engine that can only select the “best” product is simply not good enough.
Because conversion sometimes requires a little persuasion.
It almost seems like everyone has their head in the cloud these days. And it’s not all just hot air and water vapor. Infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS) are truly revolutionizing the corporate computing industry.
That is why, for the past few months, my good friend Matt Feigal and I have been collaborating with budding startup Cloudular to bring you the next logical evolutionary step in cloud computing. Inspired by the skyward ascent of hardware, middleware and software, we are proud to bring you vaporized wetware; or “artificial intelligence as a service” (AIaaS).
I’m mostly kidding, of course, but here is something I cooked up over the weekend: a web service that plays Connect Four (based on an earlier post) and is looking for worthy sparring partners.
If you think you can code a better Connect Four algorithm (and you probably can, especially since I’ve deliberately lobotomized this particular version of my implementation), head on over to GitHub and build your own to compete against mine. All the code and an interface description are available there. The service itself is available on Google App Engine.
I’ve got (part of) my head in the cloud, what about you?