Forever Learning

Forever learning and helping machines do the same.

Archive for May 2010

Benford’s Law

with 2 comments

In a (not so) recent episode of Radio Lab (one of my favorite podcasts) I was introduced to Benford’s Law. I’d never heard of this phenomenon, which in hindsight is rather strange, because it’s effects are so profound. Let me explain.

Take a some numerical data from a real-life source, bank transaction amounts for instance, and tally the number of transactions whose amount starts with a one (i.e. whose first digit is a ‘1’ and not ‘2’ or ‘9’). What percentage of the transactions would you expect to match this criterion? What about the number of transactions whose amount starts with a three?

I’d never really though about this before. Implicitly I had always assumed that in large sets of data the distribution of first digits was equal amongst the numbers one through nine. This would result in the answer to the two questions above being “eleven percent” in both cases (zero is not considered, so there are nine possible first digits).

Benford’s Law predicts that I was wrong, and it turns out he is right.

The answers according to Benford’s Law are “probably about thirty percent” and “probably about twelve percent”. And using my own bank transaction amounts and some javascript gimmickry we can see that he is closer to the actual numbers than I was.

Benford's Law

Tally of the first digits of bank transaction amounts

Real-life data is obviously not random data, and when you think about it there are perfectly logical explanations for this result. Still, for me, this is something I had to see in order to believe.

I encourage you to try it out yourself on your own data. I’m curious to know what you find.

Update (april 2nd 2011): Code for this project is now available on Github.


Written by Lukas Vermeer

May 24, 2010 at 14:56

a Database is not a Box

with 3 comments

I’ve seen people using and building upon databases in lots of different ways; many of them surprisingly original. However, it seems to me that some people do not fully grasp the concept of a database. Using a database is clearly not the same as understanding how one works.

Not a Database

Not a Database (image by Jared Tarbell)

I think we need to take a step back and consider the basic question ”what is a database”. You might think you know the answer. You might be one of those people who thinks the answer is not important as long as you know how to use it. I disagree, and I’m going to explain why. If you are one of those people, this article is for you.

A database is not a box of data.

A database is not a file drawer you put stuff in for later retrieval. A database is not a spreadsheet. A database is not a collection of arrays. Of course you know this, but many programmers forget this when they start coding. They treat the data in the database as if they need to manage it; as if it were their responsibility.

Surprise! It’s not.

A better metaphor

A better metaphor (image from The U.S. National Archives)

A database is more like a library. Access to data (books) is the most important thing for a library; it is its reason for existence. And although the purpose of a library is to give you easy access to books, you are not expected to know exactly where each book is stored, nor are you expected to put them back on a shelve when you return them. You are not, in other words, the one who decides how the books are organized. That’s the librarians job.

A database also has a librarian. We call her ”the Optimizer”.

Databases, just like libraries, are built to facilitate fast and easy access to data. In later posts I will discuss some of the techniques used to improve performance. For now, let us assume that the database is quite different from a normal array or any other basic data structure. Just like a librarian in a library, the objective of the optimizer is thus to help you find what you are looking for faster. It does so by creating a plan to find the data every time you run a query.

As the Oracle9i Database Performance Tuning Guide and Reference puts it:

The optimizer determines the most efficient way to execute a SQL statement after considering many factors related to the objects referenced and the conditions specified in the query. This determination is an important step in the processing of any SQL statement and can greatly affect execution time.

Every query you send to the database goes through this process. There is no way to go around it; and you should not want to. The trick is to work together with the optimizer to get the best results.

I hope that working together will be easier now you know she exists. Next time, I will discuss some of the decisions the optimizer can take.

[This article is based on a presentation I prepared for Ortec, the company where I completed my Master thesis.]

Written by Lukas Vermeer

May 14, 2010 at 15:16

Posted in Database, Oracle

I am a Corporate Ninja

with 3 comments

At first, the term Business Intelligence (abbr. BI) had me a bit confused. Contrary to my initial interpretation, BI does not concern the mental acuity of corporations or their employees.

Business, as in ‘From that moment on I had a much better idea as to how I should go about my business.

Intelligence, as in ‘Feudal Japan often used ninja to gather intelligence.

Rather than ‘Corporate IQ‘ Business Intelligence should thus be interpreted more like ‘Decision Information‘.

We are not trying to make business people smart. We are simply working to give business people the information they need to make informed decisions.

We are Corporate Ninjas.

[Disclaimer: I have no intention of implying that BI specialists or business people lack mental skills, but simply trying to explain how I think the term BI should be interpreted; and consequently what my focus as a consultant should be.]

Written by Lukas Vermeer

May 7, 2010 at 14:47

Posted in BI, Oracle

Estimating Pi

with 4 comments

[I’ve tweeted about this before, but I feel the need to explain.]

I want to share with you my favorite algorithm. It may not be pretty, efficient or even practical, but I love it for it’s misleading simplicity and hidden complexity. It is a Probabilistic (Randomized) Algorithm that combines statistics and Euclidean geometry by using the Pythagorean theorem and the Law of Large Numbers to estimate Pi. I first learned about this algorithm when I was at Utrecht University studying Computing Science.

Don’t worry, this is not some complicated computer-gizmo-thing that only super-nerds can translate. I’ll take you through it nice and slow. You will need to recall some basic arithmetic you learned in high-school.

To estimate Pi, we rely on the following truths.

  1. The surface area of a circle is equal to πr² (Pi times the radius of the circle squared, definition of Pi).
  2. We can make a square in which the circle fits with sides that are of length 2r (easy, just try it).
  3. The surface of that square (that just encompasses the circle) is (2r)² (the width of the square squared).
  4. The ratio between the surface of the circle and the surface of the square is πr² / (2r)²; which equals π / 4.
  5. This ratio reflects the percentage of the square that is occupied by the circle (something close to 78,5%).
  6. The odds that a random point in the square is also in the circle is equal to the ratio of the two surface areas (statistics).
  7. Any point in the square is also in the circle when it is less than r distant from the center of the circle.
  8. The distance between two points squared is a² + b² (Pythagorean theorem).
  9. If you select enough random points in the square, the ratio between those points that are also in the circle and those that are not in the circle will approximate the ratio between the two surface areas (law of large numbers).

Estimating Pi

Based on all this, all we have to do is the following.

  1. Pick a a bunch of random points in the square (say N number of points where N is a pretty big number).
  2. Determine how many of those points are in the circle using the Pythagorean theorem (say C of them are in the circle).
  3. Calculate the ratio between C and N and multiply by 4.
  4. Presto! You have now estimated Pi.

Written down in Perl and compressed to fit into a single tweet this [1] can be expressed as:

perl -e 'for(;$i<999999;++$i){rand()**2+rand()**2>1?0:++$c}print 4*$c/$i." is (probably) pretty damn close to Pi!\n"'

Which probably looks daunting, but does nothing more than what I’ve just described.

  • for (;$i<999999:++$i) just means “do the following 999999 times”.
  • rand() is a random number between 0 and 1.
  • N**2 is N squared.
  • something ? this : that translates to “if something is true, do this, else, do that”.
  • 0 does nothing.
  • ++$c adds one to the variable C.
  • print does exactly what you expect.
  • \n prints a new line (solid return).

When executed, the Perl script outputs something like the following.

3.14152714152714 is (probably) pretty damn close to Pi!

Which is obviously wrong, but still pretty close to the actual number. Also note, that since we select random points the estimated number will be different every time we execute the code. To increase the accuracy of the number, simply use more random points. As the professor who explained this to an astonished class commented:

For maximum accuracy, you should simply select enough points so that the odds of a technical computational error are higher then the odds of an estimation error.

Isn’t computing science cool!?

[1] As my friend Daan pointed out before my original tweet was not quite optimal; hence the different code in this post.

Written by Lukas Vermeer

May 6, 2010 at 19:29

It’s about time I started a blog.

with one comment

It crossed my mind before, but I never felt I had anything to say. I want that to change.

I’ve been posting pictures to an abandoned Livejournal account and Twittering for some time now, but lately I’m having trouble compressing all I have to say into just one picture or 140 characters. Time to expand my arsenal of forms of public expression. Time to get bloggin’.

Written by Lukas Vermeer

May 4, 2010 at 11:27

Posted in Meta

%d bloggers like this: