What's the correct way to calculate buffer/character length when UTF-8 is enabled?

zerowalker 21 Reputation points
2022-01-23T13:17:23.033+00:00

When UTF-8 is enabled (https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page), you can use the "A" (ANSI) functions and they will treat the text as UTF-8.
This is really great and seems to work very well in the cases where I have used it.

However, I don't always know how to supply the correct lengths. For example, https://learn.microsoft.com/en-us/windows/win32/api/shobjidl_core/nf-shobjidl_core-ishelllinka-getpath requests:

The size, in characters, of the buffer pointed to by the pszFile parameter, including the terminating null character

which is not that uncommon.

The issue is that we don't know the size beforehand, since UTF-8 is a variable-length encoding: one character could be 1 byte and another 3 bytes.
I have played it safe by going with 4 bytes per character, as that is (as far as I know) the maximum length of a UTF-8 sequence.
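
For example, a minimal sketch of the "play it safe" sizing for IShellLinkA::GetPath (assuming an already-initialized IShellLinkA* named psl, and assuming the cch parameter counts CHAR elements, i.e. bytes, for the A interface):

```cpp
#include <windows.h>
#include <shobjidl_core.h>
#include <cstdio>

// Hypothetical helper: `psl` is assumed to be an IShellLinkA* that has
// already been loaded from a .lnk file elsewhere.
HRESULT PrintLinkTargetUtf8(IShellLinkA* psl)
{
    // MAX_PATH characters, worst case 4 UTF-8 bytes each, plus the null.
    char path[MAX_PATH * 4 + 1] = {};

    // Assumption: "size, in characters" for the A interface means the
    // number of CHAR elements (bytes) in the buffer.
    HRESULT hr = psl->GetPath(path, static_cast<int>(ARRAYSIZE(path)), nullptr, 0);
    if (SUCCEEDED(hr))
        std::printf("target: %s\n", path);
    return hr;
}
```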


Accepted answer
  1. Matthew Schmidt 76 Reputation points
    2022-01-27T07:41:41.283+00:00

    In UTF-8, the first byte of a sequence determines the total number of bytes that encode the character; a byte with the bit pattern 10XX_XXXX is a continuation byte and is therefore invalid as the first byte of a sequence.

    Unicode code points are written in hexadecimal notation with a U+ prefix (U+XXXX), indicating that the hexadecimal value that follows is a Unicode code point rather than an ordinary ASCII character. In the list below I put the binary representation of each range in parentheses, with X standing for any bit value (1 or 0); these just show what the encoded character looks like in binary.

    U+0000 - U+007F (0XXX_XXXX): the standard 128-character ASCII set.
    U+0080 - U+07FF (110X_XXXX 10XX_XXXX): the second code space in UTF-8.
    U+0800 - U+FFFF (1110_XXXX 10XX_XXXX 10XX_XXXX): the third code space in UTF-8.
    U+10000 - U+10FFFF (1111_0XXX 10XX_XXXX 10XX_XXXX 10XX_XXXX): the fourth code space in UTF-8.
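
    A minimal sketch of classifying a lead byte according to that table (the name Utf8SeqLen is just for illustration; overlong and out-of-range checks are omitted):

```cpp
#include <cstdint>

// Returns the total number of bytes in the UTF-8 sequence whose first byte
// is `lead`, or 0 if `lead` is a continuation byte (10XX_XXXX) or otherwise
// not a valid lead byte.
int Utf8SeqLen(std::uint8_t lead)
{
    if (lead < 0x80)           return 1;  // 0XXX_XXXX -> U+0000..U+007F
    if ((lead & 0xE0) == 0xC0) return 2;  // 110X_XXXX -> U+0080..U+07FF
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110_XXXX -> U+0800..U+FFFF
    if ((lead & 0xF8) == 0xF0) return 4;  // 1111_0XXX -> U+10000..U+10FFFF
    return 0;                             // continuation or invalid byte
}
```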

    Therefore, to count the number of actual characters in a UTF-8 string, it could, theoretically, be as simple as this: initialize an accumulator to zero and a pointer to the start of the string, then check whether the most significant bit of each byte is zero. If it is, increment both the accumulator and the pointer by one. If it is not, find where the first zero bit occurs, counting from the most significant bit of that byte; that gives the expected length of the sequence. Then verify that the number of continuation bytes following the current byte, plus the current byte itself, actually matches the length stated by that initial byte. If the count matches, advance the pointer by that number of bytes but increment the accumulator by only one. If it doesn't match, the entire string is invalid; report an error, and do not try to programmatically repair an invalid string, it will never work... When the terminating null byte is reached, the number of actual UTF-8 characters in the string is in the accumulator.
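
    A rough sketch of that counting loop in C++, returning -1 for an invalid string rather than throwing (overlong encodings and surrogate ranges are not checked):

```cpp
#include <cstdint>

// Counts Unicode code points in a null-terminated UTF-8 string.
// Returns -1 if an invalid lead byte or a missing continuation byte is found.
long long CountUtf8CodePoints(const char* s)
{
    long long count = 0;
    const auto* p = reinterpret_cast<const std::uint8_t*>(s);
    while (*p != 0) {
        int len;
        if (*p < 0x80)                len = 1;  // 0XXX_XXXX
        else if ((*p & 0xE0) == 0xC0) len = 2;  // 110X_XXXX
        else if ((*p & 0xF0) == 0xE0) len = 3;  // 1110_XXXX
        else if ((*p & 0xF8) == 0xF0) len = 4;  // 1111_0XXX
        else return -1;                         // continuation byte as lead -> invalid

        // Verify that the expected continuation bytes (10XX_XXXX) follow.
        for (int i = 1; i < len; ++i) {
            if ((p[i] & 0xC0) != 0x80) return -1;
        }
        p += len;   // advance by the whole sequence...
        ++count;    // ...but count only one character
    }
    return count;
}
```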

    Edit 1: Keep in mind that UTF-8 is made of sequences of single-byte code units, hence the 8 in UTF-8; in contrast, UTF-16, used by variables of type wchar_t on Windows, is made of sequences of word-sized (16-bit) code units, or shorts, hence the 16 in UTF-16.

    Edit 2: Also, a buffer is, well, a buffer. It doesn't hold an exact, 100% accurate size; it holds some arbitrary size at least as large as the actual data. So technically, your question is really asking, "What size did I set my buffer to?"

    Edit 3: UTF-8 carries the suffix 8 because the code unit that starts each character, the initial byte, is 8 bits wide: for example, the bit patterns 0XXX_XXXX and 110X_XXXX are both 8-bit. For the second pattern, we know it must be followed by exactly one continuation byte, 10XX_XXXX, in order to point to a valid Unicode code point, and we only know this because we first scanned those initial 8 bits of the character. It's not possible to reliably scan a UTF-8 document using a pointer to any unit larger than 8 bits, because characters occupy an arbitrary number of bytes: the first character might be one byte long, the next two bytes, and the one after that three. The key idea is that each of these characters starts with an 8-bit byte that tells us which Unicode code space the current character belongs to, and those initial bytes are always 8 bits in size.

    That implies that the UTF-16 encoding carries the suffix 16 because the code unit that tells us which Unicode code space the current character belongs to is 16 bits long. With UTF-16 you can jump into a string at an arbitrary code-unit offset and, at least for characters in the Basic Multilingual Plane, know exactly which character you are looking at (characters above U+FFFF still need a two-unit surrogate pair). That isn't possible with a single UTF-8 byte, since 8 bits only have a cardinality of 256 values, while 16 bits have a cardinality of 65,536.
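
    For example, U+1F600 takes four 8-bit code units in UTF-8 but still needs two 16-bit code units (a surrogate pair) in UTF-16; a small sketch to verify the counts:

```cpp
#include <cstring>
#include <cstdio>

int main()
{
    // U+1F600 encoded as UTF-8 (explicit bytes) and as a UTF-16 literal.
    const char     utf8[]  = "\xF0\x9F\x98\x80";  // F0 9F 98 80
    const char16_t utf16[] = u"\U0001F600";       // D83D DE00 (surrogate pair)

    std::printf("UTF-8 code units : %zu\n", std::strlen(utf8));                        // 4
    std::printf("UTF-16 code units: %zu\n", sizeof(utf16) / sizeof(utf16[0]) - 1);     // 2
}
```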

    Hope this all helped; I absolutely love all things encoding. I would definitely check out the Unicode website to find out exactly what a Unicode code space is, which characters belong to which code space, and which code points are actually invalid, since Unicode doesn't run continuously by assigning every code point in order starting from 0; it instead jumps around based on category. UTF-8 and UTF-16 both point to the same Unicode code points but are just encoded differently, one as sequences of 8-bit units and the other as sequences of 16-bit units.


2 additional answers

Sort by: Most helpful
  1. Xiaopo Yang - MSFT 11,496 Reputation points Microsoft Vendor
    2022-01-24T01:46:19.91+00:00

    Hello,

    Welcome to Microsoft Q&A!

    Perhaps what you want is the WideCharToMultiByte function.
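
    For example, a minimal sketch that asks WideCharToMultiByte for the exact number of UTF-8 bytes (by passing 0 for cbMultiByte) before allocating, assuming you start from a wide string:

```cpp
#include <windows.h>
#include <string>

// Converts a null-terminated wide string to UTF-8, asking WideCharToMultiByte
// for the required byte count first instead of guessing a buffer size.
std::string WideToUtf8(const wchar_t* wide)
{
    // cbMultiByte == 0: the call returns the required size in bytes,
    // including the terminating null because cchWideChar is -1.
    int bytes = ::WideCharToMultiByte(CP_UTF8, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    if (bytes <= 0)
        return {};

    std::string utf8(static_cast<size_t>(bytes), '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], bytes, nullptr, nullptr);
    utf8.resize(static_cast<size_t>(bytes) - 1);  // drop the embedded null terminator
    return utf8;
}
```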

    Thank you.




  2. zerowalker 21 Reputation points
    2022-01-25T06:46:29.677+00:00

    Some functions want the number of characters rather than the size in bytes.
    So, for example, in some functions the W and A variants will require the same length value (even though the W buffer will be twice the size because of wchar_t).

    And it specifically says "number of characters" in these cases, hence the problematic case with UTF-8.
    Just going by trial and error doesn't seem like a good solution; I would rather understand how Windows actually handles it :)
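
    A small sketch of that difference using GetModuleFileName as an example (assuming, as is the usual Win32 convention, that the size parameter counts elements of the buffer you pass, so CHARs for the A variant and WCHARs for the W variant):

```cpp
#include <windows.h>

// The "size in characters" parameter counts elements of the buffer passed in:
// CHARs (bytes, which hold UTF-8 once the UTF-8 code page is active) for the
// A variant, WCHARs for the W variant.
void SizeParameterExample()
{
    char  bufA[MAX_PATH * 4];   // worst case: 4 UTF-8 bytes per character
    WCHAR bufW[MAX_PATH];

    ::GetModuleFileNameA(nullptr, bufA, static_cast<DWORD>(ARRAYSIZE(bufA)));  // element count == byte count
    ::GetModuleFileNameW(nullptr, bufW, static_cast<DWORD>(ARRAYSIZE(bufW)));  // element count == WCHAR count
}
```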

    WideCharToMultiByte, that's just for converting back and forth, isn't it?