Archive for the ‘Database’ Category
While being able to build predictive models on mountains of data without moving it out of the database is pretty cool in itself, I feel analysis without action is pretty much pointless. Tom Davenport describes this common data mining conundrum in Competing on Analytics.
Many firms are able to segment their customers and determine which ones are most profitable or which are most likely to defect. However, they are reluctant to treat different customers differently—out of tradition or egalitarianism or whatever. With such compunctions, they will have a very difficult time becoming successful analytical competitors—yet it is surprising how often companies initiate analyses without ever acting on them. The “action” stage of any analytical effort is, of course, the only one that ultimately counts.
The OBE tutorial describes a scenario in which a business wants to identify customers who are most likely to purchase insurance. Through a set of simple steps, a (decision tree) classification model is built that can be used to predict whether a particular customer is likely to purchase, based on historical data.
In a classical data mining approach, the predictions of this model would be written to some OUTPUT_TABLE where they would be available for subsequent processing. Growing staler every minute—and soon forgotten when its newer sibling OUTPUT_TABLE_NEW_FINAL_2 is inevitably created—our precious business intelligence slowly withers away in a disregarded section of the database until ultimately dropped by a careless DBA.
Output tables are where analytical insight goes to die.
If all we were interested in was building models, we’d be better off gluing choo-choos. It is the new ways in which we can utilise these database-resident models that make this technology really interesting. With a few simple additional steps, this same model can be used in real time to provide inline predictions based on up-to-date customer data, as well as for new customers.
All we need is a view
and a join.
Update (October 3rd, 2012): as Marcos points out in the comments, I was making things far too complicated. No need for a separate join; simply select the output columns you need and pass everything directly to the view.
The join operation glues the original data and the prediction model together; the view allows us to look at the harmonised results directly. When a customer record is selected from the view, the source data for that record is passed to the model to generate the predicted values in real time. When the source data changes, so does the prediction. When new source records are added, they are automatically processed in the same way.
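Following Marcos’s simplification, the whole setup can be sketched as a single view built on Oracle’s SQL scoring functions. This is a minimal sketch, not the tutorial’s exact code: the model name INSUR_BUY_MODEL is an assumption (use whatever name your model was created under), while the table, view, and column names match the transcript below.

```sql
-- Sketch: a view that scores each customer row in real time.
-- INSUR_BUY_MODEL is a placeholder for your trained classification model.
CREATE OR REPLACE VIEW insur_cust_ltv_prediction AS
SELECT customer_id,
       -- Predicted class ('Yes'/'No') for the current row's data.
       PREDICTION(insur_buy_model USING *)             AS insur_pred,
       -- Probability associated with that prediction.
       PREDICTION_PROBABILITY(insur_buy_model USING *) AS insur_prob
FROM   insur_cust_ltv_sample;
```

Because the scoring happens at query time, there is no output table to go stale: every SELECT against the view reflects the source rows as they are at that moment.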
-- Create a new customer.
INSERT INTO INSUR_CUST_LTV_SAMPLE (CUSTOMER_ID, LAST, FIRST)
VALUES ('CU123', 'VERMEER', 'LUKAS');

1 rows inserted.
Elapsed: 00:00:00.003

-- Get prediction and probability for the new customer.
SELECT CUSTOMER_ID, insur_pred, insur_prob
FROM insur_cust_ltv_prediction
WHERE CUSTOMER_ID = 'CU123';

CUSTOMER_ID INSUR_PRED INSUR_PROB
----------- ---------- ----------
CU123       No          0.7262813

Elapsed: 00:00:00.004

-- Update customer data.
UPDATE INSUR_CUST_LTV_SAMPLE
SET bank_funds = 500, checking_amount = 100
WHERE CUSTOMER_ID = 'CU123';

1 rows updated.
Elapsed: 00:00:00.003

-- Get prediction and probability for the updated customer.
SELECT CUSTOMER_ID, insur_pred, insur_prob
FROM insur_cust_ltv_prediction
WHERE CUSTOMER_ID = 'CU123';

CUSTOMER_ID INSUR_PRED INSUR_PROB
----------- ---------- ----------
CU123       Yes         0.6261398

Elapsed: 00:00:00.004
Seamless. Any system that can read data from an Oracle database can now utilise Oracle Data Mining models. No need to move your data. No need to build new applications.
Applications reading data from the view need never know the difference between the original source data and machine generated predictions. Oracle Business Intelligence Publisher can easily display this data in forecasting reports; or use it to power pro-active alerts. In Oracle Real-Time Decisions, rules can be built around the outcomes of these models; or predictions from multiple sources can be fed into combined likelihood models for increased accuracy.
This is huge. Trust me. Stop over-analysing and start taking action. After all, that’s the only step that ultimately counts.
A representative of a large company (which shall remain unnamed) recently called on me for some advice. They had accidentally loaded polluted data into their data warehouse and wanted to know if there was anything they could do to get rid of it. I told them that restoring their most recent backup and reloading any more recent, unpolluted data was probably the simplest solution. They concurred, but regretted to inform me that they had not made a single backup of their production system since I helped them set it up, many, many months ago.
Let me repeat that for you. Large company. Production system. No backups. Ever.
Now, you’re probably thinking it is pretty silly for a large company not to have backups of a production system. Well, wait until you hear what happened next.
A few days later, a different representative from the same company demanded to know where exactly in the documentation we had explained that they were expected to create backups of their production system. How were they supposed to know that anything could possibly go wrong? It wasn’t their fault that they had not anticipated this disaster, right?
I’m glad to report that, in the end, everything worked out and everyone lived happily ever after. But for future reference, here’s a free (as in beer, not as in speech) tip.
Backup. Your. Production. Database.
[Also, do not attempt to dry your poodle in a microwave oven]
I’ve seen people using and building upon databases in lots of different ways; many of them surprisingly original. However, it seems to me that some people do not fully grasp the concept of a database. Using a database is clearly not the same as understanding how one works.
I think we need to take a step back and consider the basic question “what is a database?”. You might think you know the answer. You might be one of those people who thinks the answer is not important as long as you know how to use it. I disagree, and I’m going to explain why. If you are one of those people, this article is for you.
A database is not a box of data.
A database is not a file drawer you put stuff in for later retrieval. A database is not a spreadsheet. A database is not a collection of arrays. Of course you know this, but many programmers forget this when they start coding. They treat the data in the database as if they need to manage it; as if it were their responsibility.
Surprise! It’s not.
A database is more like a library. Access to data (books) is the most important thing for a library; it is its reason for existence. And although the purpose of a library is to give you easy access to books, you are not expected to know exactly where each book is stored, nor are you expected to put them back on a shelf when you return them. You are not, in other words, the one who decides how the books are organized. That’s the librarian’s job.
A database also has a librarian. We call her “the Optimizer”.
Databases, just like libraries, are built to facilitate fast and easy access to data. In later posts I will discuss some of the techniques used to improve performance; for now, let us simply note that a database is quite different from an array or any other basic data structure. Just like a librarian in a library, the objective of the optimizer is to help you find what you are looking for faster. It does so by creating a plan to find the data every time you run a query.
As the Oracle9i Database Performance Tuning Guide and Reference puts it:
The optimizer determines the most efficient way to execute a SQL statement after considering many factors related to the objects referenced and the conditions specified in the query. This determination is an important step in the processing of any SQL statement and can greatly affect execution time.
Every query you send to the database goes through this process. There is no way to go around it; and you should not want to. The trick is to work together with the optimizer to get the best results.
I hope that working together will be easier now that you know she exists. Next time, I will discuss some of the decisions the optimizer can make.
[This article is based on a presentation I prepared for Ortec, the company where I completed my Master’s thesis.]