Everything, Everything | I\'m A Geek (And I Can\'t Sleep)

I\'m A Geek (And I Can\'t Sleep)

Thursday 27th April, 2006 02:05 Comments: 0

A friend made a post on another friend's forum asking for suggestions for an SQL statement to match a telephone number against a list of possible charging bands for telephone numbers. He said for the example number of 4470115XXXXX there are 2 possible matches: 447011 and 4470115. In these cases he wanted to make sure it matched the most digits possible. One guy replied quickly with a suggestion, but it didn\'t work (his last post just said "bugger"), so I posted my suggestion:

How about a loop around the number as a string? If the number of rows returned == 0 then do the same query, but knock a number off the end

4470115XXXXX
0 rows returned
4470115XXXX
0 rows returned
4470115XXX
0 rows returned
4470115XX
0 rows returned
4470115X
0 rows returned
4470115
1 row returned so stop here and that's your value.

Then the most number of queries you\'re doing is the length of the string, and as there's no recursion it should be quick and simple to execute - and work with any SQL database.

It's not entirely elegant, as it relies on you keeping track of what's returned within the program, rather than doing it all with one single SQL query, but it's simple to code and it definitely works.

The other guy (the one that said bugger) then came up with a pretty clever suggestion that did the trick:

SELECT FIRST 1 * FROM BAND WHERE LOCATE(MSD,\'4470115XXXXX\') = 1 ORDER BY LENGTH(MSD) DESC

The original poster replied saying "My concern here is that the rate table currently has 2,179 entries in it, and the average CDR file I run against it (log of calls made) is about 3000 calls / month. That's a *lot* of querying. And it needs to be done sufficiently quickly that a webpage doesn\'t load at an unacceptably slow rate."

And the other guy then replied, and made my day, as no one else had even thanked me for my suggestion (I added the bold text myself for emphasis):

The last of my suggested lines above has been shown to work properly. I suspect the one above it, or something like it, might actually be slightly faster - the expensive stage is narrowing the list, not sorting it, since it has to check through every record in the database. The approach of shortening the number and comparing it exactly will be quicker than that of locating just where in the whole number the MSD may lie - less work being done then thrown away. Essentially, it's asking "starts with?" rather than "index of == 1?"

If performance turns out to be a real bitch, you could try SilentBob's approach. In which case, be sure to index the first column to make it quick to check if a record exists or not, although you probably won\'t need to do it by hand: just mark is as NOT NULL UNIQUE PRIMARY KEY and any database worth its salt says "okay, okay, i get the hint already. here, have an index".

So I think I\'ll accept that as a compliment. The other guy is definitely much better than me at writing SQL queries, but I generally like to keep things simple - I know from playing with ColdFusion and Access databases, it was sometimes quicker to grab a result with a single simple query and then manipulate the data in ColdFusion than it was to perform multiple and/or complicated queries. Also, a simple query often works on all databases, in this case I beleve the "FIRST 1" part near the start of his query would need to be removed and "LIMIT 1" would need to be appended to the query if the database was a MySQL one. My query should work on any database.

And yes, I can still think of a way to further reduce the number of queries. If you know the maximum length of the string is, for example, never going to be more than 8, you could chop the number down to 8 to start with, so searching for 4470115XXXXX would instantly be cut down to 4470115X before doing the loop, which would match on the second query with 4470115 - a total of 2 queries instead of the 6 shown in the original example. So it\'ll take 1/3rd of the time. That's a pretty good saving. The overhead is that whenever you update the database you\'ll have to make sure the maximum length is also updated to keep everything in sync so you get accurate results. And then you either need to write it to a variable in a file, or (more likely?) store it in the database and retrieve it with an extremely quick query.