Kaggle has me

One of the people I work with heard of a site called Kaggle, and said he immediately thought of me (not sure if that was a compliment), anyway not I am not so sure if I should thank him or get some sort of addiction help. Certainly all of the other things I had in the pipeline are on hold as I try to see if I can rack the formula for the next competition.

So what’s Kaggle? Kaggle is the place to go if you love creating algorithms to solve real world problems, the fun part is, it’s not pretend, they are real problems with real data. If you’re lucky you win a prize (yes – cash $) for the best algorithm, but either way you can have a load of fun whether it’s your own custom code or something in R or your other favourite BI / data mining tool.

So go to http://kaggle.com and have a go.

Disclaimer: you might get addicted too, so do this at your own risk.

MDS Regular expressions – awesome!

Had a report to generate for a client, got a database backup and needed to extract state abbreviations such as NSW, SA, QLD, VIC etc. from within one of 2 different records. Also needed to extract account numbers in a variety of formats, “HL Acc”, “H/L Acc:”, “HL Acc No:” and so on.

Had a fairly reasonable query running, took 57 seconds with like, charindex, substring and so on, when thought I’d try and use regular expressions to help out.

After tossing between fiddling getting CLR regex wrappers loaded up and off-the-shelf Master Data Services (see previous post on MDS and regular expressions in SQL) .. installed MDS, took about 5 minutes, and then found the same queries rewritten to use MDS running in under 1 second.

you can use regular expressions in your query such as:

and dbo.RegexIsMatch(d2.Description,N'(?<=\b.*)NSW(?=[^A-Za-z]|\s|$)’, 0)= 1)> 0 then‘NSW’

or even add named expressions

THEN dbo.RegexExtract(dp.Description,N'(?<=\b.*HO?I?/?L:?.*)(?<account>\d\d\d\d?\d?\d?)(?=.*\b)’,N’account’,1)

etc.

Data cleansing

some links that may be useful:

http://msdn.microsoft.com/en-us/library/bb190163.aspx

http://daim.idi.ntnu.no/masteroppgaver/IME/IDI/2007/3375/masteroppgave.pdf

http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE

String matching, soundex, pattern matching, fuzzy matching

quick bookmark for me – we’re considering the benefits of pattern matching – to match or not to match that’s the question. Found that Sam Chapman has a great library might save some time, details here:
it covers these algorithms:

Hamming distance, Levenshtein distance, Needleman-Wunch distance or Sellers Algorithm, Smith-Waterman distance, Gotoh Distance or Smith-Waterman-Gotoh distance, Block distance or L1 distance or City block distance, Monge Elkan distance, Jaro distance metric, Jaro Winkler, SoundEx distance metric, Matching Coefficient, Dice’s Coefficient, Jaccard Similarity or Jaccard Coefficient or Tanimoto coefficient, Overlap Coefficient, Euclidean distance or L2 distance, Cosine similarity, Variational distance, Hellinger distance or Bhattacharyya distance, Information Radius (Jensen-Shannon divergence), Harmonic Mean, Skew divergence, Confusion Probability, Tau, Fellegi and Sunters (SFS) metric, TFIDF or TF/IDF, FastA, BlastP, Maximal matches, q-gram, Ukkonen Algorithms

“CouchDB for .NET Developers”, Hadi Hariri

Today’s brown bag lucnh box video was from NDC 2010, by Hadi Hariri

It covered the basics of CouchDB and how to consume this kind of database from .NET applications.

links:

Ottoman:  http://github.com/sinesignal/ottoman
Divan: http://github.com/foretagsplatsen/Divan
Hammock: http://code.google.com/p/relax-net/
SharpCouch: http://code.google.com/p/couchbrowse/source/browse/trunk/SharpCouch/SharpCouch.cs

good link of general how-tos, issues, faqs:
http://stackoverflow.com/questions/1050152/use-couchdb-with-net

“Rhino ETL”, Paul Barriere

Watched a useful video on Rhino ETL over lunch today.

notes:
23:40 strongly typed rows
25:01 testing
33:00 SQL writes
34:00 auto generate prep scheme
48:00 branching to two outputs
59.32 aggregating calls to web service
60:01 error handling, try/catch etc.

link to video:
http://ayende.com/Blog/archive/2010/03/18/rhino-etl-video.aspx

Simple Email Address validation control

found this here:

http://vaultofthoughts.net/CategoryView,category,ASP.NET.aspx
lifted and duplicated below – so I do not lose it 🙂
“…. the EmailValidator control which validates user input with a pattern of an email:

public class EmailValidator

    : RegularExpressionValidator

{

    public EmailValidator()

    {

        ValidationExpression =

        @”^[a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]@

        [a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.

        [a-zA-Z][a-zA-Z\.]*[a-zA-Z]$”;

        ErrorMessage = “Email is incorrect”;

    }

}

“Note that ValidationExpression is broken so it may be better displayed in the browser. As for how accurate, the expression is. It was taken from the Regular Expression Library page and I have had no problems with it.

Usage on a page after registration:”

<my:EmailValidator runat=”server” ID=”EmailCorrect”

    ControlToValidate=”Email” />

validation of drivers license numbers

// This is some old (and ugly) code from long ago
// some of these validation functions are difficult to do as Regex, so over years used these over and over to do pre-validation on client, saving user’s round trips from mis-keyed numbers.
// though ugly, maybe it might help someone now – but advise you to clean it up / rewrite it first.

function validateDriverLic(field_name, field_description) {
    var i, ch, cnt_chr, cnt_num;
    var all_zero = true;
    var checkOK = “0123456789”;
    cnt_chr = 0; 
    cnt_num = 0;
    var D_Licence = field_name.value;
    if (D_Licence.length>9) { //the length should be 9
        // alert(“The maximum length of a Drivers Licence number is 9 characters”);
        alert(“Invalid ” + field_description);
        field_name.focus(); 
        return false; 
    }

    for (i = 0; i < D_Licence.length; i++) {
        ch = D_Licence.charAt(i);
        if (((i==2) || (i==3)) && (checkOK.indexOf(ch)==-1))
        {
             //must third and fourth char be numbers
            // alert(“3rd and 4th characters of Drivers Licence must be numeric.”);
            alert(“Invalid ” + field_description);
            field_name.focus(); 
            return false; 
        } 
        if (ch!=’0′) all_zero = false;
        if (checkOK.indexOf(ch)!=-1) //(ch.isnumeric)
            cnt_num++;
        else
            cnt_chr++;
        }
        if (cnt_num < 4)
        {
            //must have at least 4 number
            // alert(“Drivers Licence should include at least 4 Numeric characters”);
            alert(“Invalid ” + field_description);
            field_name.focus(); 
            return false; 
        } 
        if (cnt_chr > 2)
        {
            //must not have more that 2 char 
            // alert(“Drivers Licence should not include more than 2 alpha characters”);
            alert(“Invalid ” + field_description);
            field_name.focus(); 
            return false; 
        }
        if (cnt_chr == 9)
        {
            //must not have all char
            // alert(“Drivers Licence can not have all alpha characters”);
            alert(“Invalid ” + field_description);
            field_name.focus(); 
            return false; 
        }
        if (all_zero == true)
        {
            //must not have all zero
            // alert(“Drivers Licence number can not be all zero\’s”);
            alert(“Invalid ” + field_description);
            field_name.focus(); 
            return false; 
        }
}

Client side javascript validation for ABN and ACN

// check for valid ABN
// if fmt_str true, add ABR formatting info.

function check_ABN(field_name, field_description, fmt_str) {

var abn_fmt_str = (fmt_str) ? ‘\nThe ABN is a unique 11 digit number formed from a 9 digit unique identifier and two prefix check digits.\nThe Australian Business Register structure an ABN as XX XXX XXX XXX to assist readability.’ : ”;

var abn = ”, chk = 0, k=11, checkOK = “0123456789”, ch=”;

var weighting = new Array(19,17,15,13,11, 9, 7, 5, 3, 1,10)

var checkStr = field_name.value, allValid = true;

for (i = 0; i < checkStr.length; i++) {

ch = checkStr.charAt(i);

if (checkOK.indexOf(ch) != -1) {

if (k == 11) {

k–;

chk += weighting[k]*(parseInt(ch)-1);

}

else

if (k– > 0) chk += weighting[k]*parseInt(ch);

abn += ch;

}

}

chk = chk/89 – Math.floor(chk/89)

if (abn.length != 11 || k < 0 || chk != 0) {

field_name.focus();

alert(‘Please check field \”‘ + field_description + ‘\” is a valid ABN.\n’+abn_fmt_str);

return (false);

}

return (true);

}

// check for valid ACN
// if fmt_str true, add ASIC formatting info.

function check_ACN(field_name, field_description, fmt_str) {

var acn_fmt_str = (fmt_str) ? ‘\nASIC has adopted a convention of always printing and displaying the ACN in the format nnn nnn nnn ie three blocks of three characters separated by a blank. This is to assist readability and the inserted blanks do not form part of the ACN.’ : ”;

var acn = ”, chk = 0, k=9, checkOK = “0123456789”, ch=”;

var checkStr = field_name.value, allValid = true;

for (i = 0; i < checkStr.length; i++) {

ch = checkStr.charAt(i);

if (checkOK.indexOf(ch) != -1) {

if (k– > 0) chk += k*parseInt(ch);

acn += ch;

}

}

ch = chk.toString(10);

chk = 10-(parseInt(ch.charAt(ch.length-1)));

if (chk == 10) chk = 0;

if (acn.length != 9 || k < 0 || chk != acn.charAt(8)) {

field_name.focus();

alert(‘Please check field \”‘ + field_description + ‘\” is a valid ACN.\n’+acn_fmt_str);

return (false);

}

return (true);

}

Client side javascript validation for ABN and ACN

// check for valid ABN
// if fmt_str true, add ABR formatting info.

function check_ABN(field_name, field_description, fmt_str) {

var abn_fmt_str = (fmt_str) ? ‘\nThe ABN is a unique 11 digit number formed from a 9 digit unique identifier and two prefix check digits.\nThe Australian Business Register structure an ABN as XX XXX XXX XXX to assist readability.’ : ”;

var abn = ”, chk = 0, k=11, checkOK = “0123456789”, ch=”;

var weighting = new Array(19,17,15,13,11, 9, 7, 5, 3, 1,10)

var checkStr = field_name.value, allValid = true;

for (i = 0; i < checkStr.length; i++) {

ch = checkStr.charAt(i);

if (checkOK.indexOf(ch) != -1) {

if (k == 11) {

k–;

chk += weighting[k]*(parseInt(ch)-1);

}

else

if (k– > 0) chk += weighting[k]*parseInt(ch);

abn += ch;

}

}

chk = chk/89 – Math.floor(chk/89)

if (abn.length != 11 || k < 0 || chk != 0) {

field_name.focus();

alert(‘Please check field \”‘ + field_description + ‘\” is a valid ABN.\n’+abn_fmt_str);

return (false);

}

return (true);

}

// check for valid ACN
// if fmt_str true, add ASIC formatting info.

function check_ACN(field_name, field_description, fmt_str) {

var acn_fmt_str = (fmt_str) ? ‘\nASIC has adopted a convention of always printing and displaying the ACN in the format nnn nnn nnn ie three blocks of three characters separated by a blank. This is to assist readability and the inserted blanks do not form part of the ACN.’ : ”;

var acn = ”, chk = 0, k=9, checkOK = “0123456789”, ch=”;

var checkStr = field_name.value, allValid = true;

for (i = 0; i < checkStr.length; i++) {

ch = checkStr.charAt(i);

if (checkOK.indexOf(ch) != -1) {

if (k– > 0) chk += k*parseInt(ch);

acn += ch;

}

}

ch = chk.toString(10);

chk = 10-(parseInt(ch.charAt(ch.length-1)));

if (chk == 10) chk = 0;

if (acn.length != 9 || k < 0 || chk != acn.charAt(8)) {

field_name.focus();

alert(‘Please check field \”‘ + field_description + ‘\” is a valid ACN.\n’+acn_fmt_str);

return (false);

}

return (true);

}