StrCmpLogicalA

04/28/2008

Because of a policy of supporting and encouraging internationalizable software, certain system APIs exist only in a Unicode version (functions that typically carry the W suffix). One example that lacks a plain old ANSI version is the StrCmpLogicalW function. Unfortunately for me, I needed an A version. "Fixing" my code to use the W version was impractical given the data I had to work with, but I really liked the idea of having my strings sorted with numbers placed in an order that makes sense to a human being. So I did the only logical thing left to do...

Just to get my code up and running, I did the easy thing first: I wrote an A version that takes the ANSI strings as inputs, allocates two new buffers, converts the inputs to Unicode, and then calls the W version. Needless to say, this incurs a lot of overhead, since all of it has to happen on every string compare, which isn't pretty.
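For illustration, that convert-then-delegate approach might look roughly like the sketch below. It is not the code I actually used: the names to_wide, compare_via_wide and WideCompareFn are made up for this example, and it converts with the standard mbstowcs so that it stays portable. On Windows the conversion would more likely go through something like MultiByteToWideChar, with StrCmpLogicalW passed as the comparison callback.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical signature for whichever wide-character comparison is used. */
typedef int (*WideCompareFn)(const wchar_t *, const wchar_t *);

/* Converts a narrow string into a freshly allocated wide string.
   Returns NULL on allocation or conversion failure; the caller frees it. */
static wchar_t *to_wide(const char *s)
{
    size_t cap = strlen(s) + 1;   /* wide length never exceeds byte length */
    wchar_t *w = malloc(cap * sizeof *w);
    if (!w) return NULL;
    if (mbstowcs(w, s, cap) == (size_t)-1) {
        free(w);
        return NULL;
    }
    w[cap - 1] = L'\0';           /* defensive terminator */
    return w;
}

/* Converts both inputs, delegates to the wide comparison, then frees.
   Note the two heap allocations on every single comparison. */
static int compare_via_wide(const char *a, const char *b, WideCompareFn cmp)
{
    wchar_t *wa = to_wide(a);
    wchar_t *wb = to_wide(b);
    /* Assumption: fall back to a byte-wise compare if conversion fails. */
    int result = (wa && wb) ? cmp(wa, wb) : strcmp(a, b);
    free(wa);
    free(wb);
    return result;
}
```

Passing wcscmp as the callback is enough to try it out. A sort over n strings calls the comparison on the order of n log n times, and each call pays for two conversions and two allocations, which is where the overhead comes from.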

My next cut (shown below) actually walks the ANSI strings and does some nitty-gritty comparisons.

#include <stddef.h> // size_t

int StrCmpLogicalA(const char *psz1, const char *psz2)
{
    // handle NULL inputs
    if(!psz1 && !psz2) return 0;
    if(!psz1) return -1;
    if(!psz2) return 1;

    while(*psz1 && *psz2)
    {
        if(*psz1 >= '0' && *psz1 <= '9' &&
           *psz2 >= '0' && *psz2 <= '9') // numerical
        {
            // keep track of where we are starting
            const char* digit1 = psz1;
            const char* digit2 = psz2;

            // strip off any leading zeros
            size_t leading1 = 0;
            size_t leading2 = 0;
            while(*digit1 == '0')
            {
                ++leading1;
                ++digit1;
            }
            while(*digit2 == '0')
            {
                ++leading2;
                ++digit2;
            }

            // scan to the end of the digits
            while(*psz1 >= '0' && *psz1 <= '9')
                ++psz1;
            while(*psz2 >= '0' && *psz2 <= '9')
                ++psz2;

            // calc the number of significant digits;
            // the number with more of them is the larger one
            size_t len1 = psz1 - digit1;
            size_t len2 = psz2 - digit2;
            if(len1 < len2) return -1;
            if(len1 > len2) return 1;

            // now start walking over the digits
            while(digit1 < psz1 && digit2 < psz2)
            {
                // test the number
                if(*digit1 < *digit2) return -1;
                if(*digit1 > *digit2) return 1;
                ++digit1;
                ++digit2;
            }

            // if we reach here, the numbers are the same, and
            // psz1 and psz2 already point at the next character
            // to test

            // since we kept track of leading zeros, we can add
            // precedence based off that.
            if(leading1 < leading2) return -1;
            if(leading1 > leading2) return 1;
        }
        else // mixed and non numerical
        {
            unsigned char c1 = *psz1;
            unsigned char c2 = *psz2;

            // strip off the lower case bit (0x20) to fold a-z to A-Z
            if(c1 >= 'a' && c1 <= 'z') c1 &= 0xDF;
            if(c2 >= 'a' && c2 <= 'z') c2 &= 0xDF;

            // test the characters
            if(c1 < c2) return -1;
            if(c1 > c2) return 1;

            // else they are the same, keep walking
            ++psz1;
            ++psz2;
        }
    }

    // check for unprocessed characters
    if(!*psz1 && *psz2) return -1;
    if(*psz1 && !*psz2) return 1;

    // strings are equivalent
    return 0;
}
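The numeric branch never converts the digits to an integer. Once leading zeros are skipped, the run with more digits is the larger number, and runs of equal length can be compared digit by digit. Pulled out on its own, the idea could be sketched as follows. The helper names is_digit and compare_digit_runs are hypothetical, and this version deliberately ignores the leading-zero tie-break.

```c
#include <stddef.h>

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Hypothetical helper: compares the digit runs starting at a and b
   by numeric value only. Returns -1, 0, or 1. */
static int compare_digit_runs(const char *a, const char *b)
{
    /* Leading zeros do not change the value, so skip them. */
    while (*a == '0') ++a;
    while (*b == '0') ++b;

    /* Measure the remaining significant digits. */
    size_t len_a = 0, len_b = 0;
    while (is_digit(a[len_a])) ++len_a;
    while (is_digit(b[len_b])) ++len_b;

    /* More significant digits means a larger number. */
    if (len_a != len_b) return len_a < len_b ? -1 : 1;

    /* Same length: the first differing digit decides. */
    for (size_t i = 0; i < len_a; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```

Because nothing is ever accumulated into an int, a digit run of any length compares correctly, including ones that would overflow a 64-bit integer.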

As you can see, this version does a case-insensitive compare. If you need a case-sensitive compare, just comment out the two lines of code that strip out the 6th bit (the c1 &= 0xDF and c2 &= 0xDF lines). If you want "abc03def" to be treated the same as "abc003def", comment out the two lines that test the leading-zero count.
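If commenting lines in and out feels too blunt, the same two choices could be made at run time instead. The sketch below is only one way to do that, and everything in it is hypothetical: the LogicalCompareOptions struct and the fold_char and zero_tiebreak helpers do not exist in the function above, which takes no options at all.

```c
#include <stddef.h>

/* Hypothetical options; the function above has no such parameter. */
typedef struct {
    int ignore_case;          /* nonzero: fold 'a'..'z' to upper case */
    int leading_zeros_matter; /* nonzero: "03" sorts before "003"     */
} LogicalCompareOptions;

/* Folds ASCII lower case to upper case by clearing bit 0x20,
   but only when case-insensitive comparison is requested. */
static unsigned char fold_char(unsigned char c, const LogicalCompareOptions *opt)
{
    if (opt->ignore_case && c >= 'a' && c <= 'z') c &= 0xDF;
    return c;
}

/* Tie-break for numerically equal digit runs: the run with fewer
   leading zeros sorts first, unless the option is switched off. */
static int zero_tiebreak(size_t zeros1, size_t zeros2, const LogicalCompareOptions *opt)
{
    if (!opt->leading_zeros_matter || zeros1 == zeros2) return 0;
    return zeros1 < zeros2 ? -1 : 1;
}
```

In the function itself, fold_char would stand in for the two masking lines and zero_tiebreak for the two leading-zero tests, with the options passed in as an extra parameter.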

Disclaimers

1. This has only gone through minimal testing; use at your own risk, etc. (Though if you do find bugs, let me know!)

2. Even though I named it StrCmpLogicalA in this blog, I make no implied statement about behavioral conformance with StrCmpLogicalW. I've never seen the algorithm or code behind the official API, and the above sample undoubtedly does things differently. In fact, I know it does some things differently.

3. Developers are encouraged to use Unicode, and my apologies go out to those trying to promote its wider use.
