
utf8.h's Introduction

📚 utf8.h


A simple one header solution to supporting utf8 strings in C and C++.

Functions provided from the C header string.h but with a utf8* prefix instead of the str* prefix:

API function docs

string.h | utf8.h | complete | C++14 constexpr
strcat | utf8cat | ✔ |
strchr | utf8chr | ✔ | ✔
strcmp | utf8cmp | ✔ | ✔
strcoll | utf8coll | |
strcpy | utf8cpy | ✔ |
strcspn | utf8cspn | ✔ | ✔
strdup | utf8dup | ✔ |
strfry | utf8fry | |
strlen | utf8len | ✔ | ✔
strnlen | utf8nlen | ✔ | ✔
strncat | utf8ncat | ✔ |
strncmp | utf8ncmp | ✔ | ✔
strncpy | utf8ncpy | ✔ |
strndup | utf8ndup | ✔ |
strpbrk | utf8pbrk | ✔ | ✔
strrchr | utf8rchr | ✔ | ✔
strsep | utf8sep | |
strspn | utf8spn | ✔ | ✔
strstr | utf8str | ✔ | ✔
strtok | utf8tok | |
strxfrm | utf8xfrm | |

Functions provided from the C header strings.h but with a utf8* prefix instead of the str* prefix:

strings.h | utf8.h | complete | C++14 constexpr
strcasecmp | utf8casecmp | ✔ | ✔
strncasecmp | utf8ncasecmp | ✔ | ✔
strcasestr | utf8casestr | ✔ | ✔

Functions provided that are unique to utf8.h:

utf8.h | complete | C++14 constexpr
utf8codepoint | ✔ | ✔
utf8rcodepoint | ✔ | ✔
utf8size | ✔ | ✔
utf8size_lazy | ✔ | ✔
utf8nsize_lazy | ✔ | ✔
utf8valid | ✔ | ✔
utf8nvalid | ✔ | ✔
utf8makevalid | ✔ |
utf8codepointsize | ✔ | ✔
utf8catcodepoint | ✔ |
utf8isupper | ✔ | ✔
utf8islower | ✔ | ✔
utf8lwr | ✔ |
utf8upr | ✔ |
utf8lwrcodepoint | ✔ | ✔
utf8uprcodepoint | ✔ | ✔

Usage

Just #include "utf8.h" in your code!

The current supported platforms are Linux, macOS and Windows.

The current supported compilers are gcc, clang, MSVC's cl.exe, and clang-cl.exe.
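
For example, a minimal program (a sketch; it assumes utf8.h sits on the include path and that the source file is saved as UTF-8):

#include <stdio.h>
#include "utf8.h"

int main(void) {
  char buffer[32] = {0};

  utf8cpy(buffer, "grüß ");
  utf8cat(buffer, "dich");

  /* utf8len counts codepoints, so this prints 9 even though the
   * string occupies more than 9 bytes */
  printf("%s has %zu codepoints\n", buffer, utf8len(buffer));
  return 0;
}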

Design

The utf8.h API matches the string.h API as much as possible by design. There are a few major differences though.

utf8.h uses char8_t* in C++ 20 instead of char*

Anywhere in the string.h or strings.h documentation where it refers to 'bytes' I have changed that to utf8 codepoints. For instance, utf8len will return the number of utf8 codepoints in a utf8 string - which does not necessarily equate to the number of bytes.
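
For example (a sketch; "héllo" is five codepoints but six bytes plus the terminator):

#include <assert.h>
#include "utf8.h"

void length_demo(void) {
  const char *s = "héllo";
  assert(utf8len(s) == 5);       /* codepoints, excluding the terminator */
  assert(utf8size(s) == 7);      /* bytes, including the terminator */
  assert(utf8size_lazy(s) == 6); /* bytes, excluding the terminator */
}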

API function docs

int utf8casecmp(const void *src1, const void *src2);

Return less than 0, 0, greater than 0 if src1 < src2, src1 == src2, src1 > src2 respectively, case insensitive.

void *utf8cat(void *dst, const void *src);

Append the utf8 string src onto the utf8 string dst.

void *utf8chr(const void *src, utf8_int32_t chr);

Find the first match of the utf8 codepoint chr in the utf8 string src.
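
For example (a sketch; the codepoint is given by its Unicode value, here U+00E9 for 'é'):

#include <stdio.h>
#include "utf8.h"

void find_accent(const char *s) {
  /* utf8chr returns a pointer to the start of the matching codepoint, or null */
  char *match = (char *)utf8chr(s, 0x00E9);
  if (match) {
    printf("'é' starts at byte offset %td\n", match - s);
  }
}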

int utf8cmp(const void *src1, const void *src2);

Return less than 0, 0, greater than 0 if src1 < src2,
src1 == src2, src1 > src2 respectively.

void *utf8cpy(void *dst, const void *src);

Copy the utf8 string src onto the memory allocated in dst.

size_t utf8cspn(const void *src, const void *reject);

Number of utf8 codepoints in the initial segment of the utf8 string src that
consists entirely of utf8 codepoints not from the utf8 string reject.

void *utf8dup(const void *src);

Duplicate the utf8 string src by getting its size, mallocing a new buffer,
copying over the data, and returning that, or 0 if malloc failed.
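
For example (a sketch; the returned buffer comes from malloc, so the caller frees it):

#include <stdio.h>
#include <stdlib.h>
#include "utf8.h"

void dup_demo(const char *s) {
  char *copy = (char *)utf8dup(s);
  if (copy) {
    printf("%s\n", copy);
    free(copy);
  }
}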

size_t utf8len(const void *str);

Number of utf8 codepoints in the utf8 string str,
excluding the null terminating byte.

size_t utf8nlen(const void *str, size_t n);

Similar to utf8len, except that only at most n bytes of str are examined.

int utf8ncasecmp(const void *src1, const void *src2, size_t n);

Return less than 0, 0, greater than 0 if src1 < src2, src1 == src2,
src1 > src2 respectively, case insensitive. Checking at most n
bytes of each utf8 string.

void *utf8ncat(void *dst, const void *src, size_t n);

Append the utf8 string src onto the utf8 string dst,
writing at most n+1 bytes. Can produce an invalid utf8
string if n falls partway through a utf8 codepoint.

int utf8ncmp(const void *src1, const void *src2, size_t n);

Return less than 0, 0, greater than 0 if src1 < src2,
src1 == src2, src1 > src2 respectively. Checking at most n
bytes of each utf8 string.

void *utf8ncpy(void *dst, const void *src, size_t n);

Copy the utf8 string src onto the memory allocated in dst.
Copies at most n bytes. If n falls partway through a utf8 codepoint, or if dst doesn't have enough room for a null terminator, the final string will be cut short to preserve utf8 validity.
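
For example (a sketch, assuming dst points to at least dst_size bytes):

#include "utf8.h"

void copy_bounded(char *dst, size_t dst_size, const char *src) {
  /* copies at most dst_size bytes; if that limit lands mid-codepoint,
   * the copy is cut back to the previous codepoint boundary so the
   * result stays valid, null-terminated utf8 */
  utf8ncpy(dst, src, dst_size);
}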

void *utf8pbrk(const void *str, const void *accept);

Locates the first occurrence in the utf8 string str of any utf8 codepoint from
the utf8 string accept, or 0 if no match was found.

void *utf8rchr(const void *src, utf8_int32_t chr);

Find the last match of the utf8 codepoint chr in the utf8 string src.

size_t utf8size(const void *str);

Number of bytes in the utf8 string str,
including the null terminating byte.

size_t utf8size_lazy(const void *str);

Similar to utf8size, except that the null terminating byte is excluded.

size_t utf8nsize_lazy(const void *str, size_t n);

Similar to utf8size, except that only at most n bytes of str are examined and the null terminating byte is excluded.

size_t utf8spn(const void *src, const void *accept);

Number of utf8 codepoints in the initial segment of the utf8 string src that
consists entirely of utf8 codepoints from the utf8 string accept.
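
For example (a sketch; the counts are in codepoints, not bytes):

#include <assert.h>
#include "utf8.h"

void span_demo(void) {
  /* leading run of codepoints drawn from the accept set */
  assert(2 == utf8spn("ääbc", "ä"));
  /* leading run of codepoints not present in the reject set */
  assert(2 == utf8cspn("bcää", "ä"));
}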

void *utf8str(const void *haystack, const void *needle);

The position of the utf8 string needle in the utf8 string haystack.

void *utf8casestr(const void *haystack, const void *needle);

The position of the utf8 string needle in the utf8 string haystack, case insensitive.

void *utf8valid(const void *str);

Return 0 on success, or the position of the invalid utf8 codepoint on failure.

void *utf8nvalid(const void *str, size_t n);

Similar to utf8valid, except that only at most n bytes of str are examined.

int utf8makevalid(void *str, utf8_int32_t replacement);

Return 0 on success. Makes the str valid by replacing invalid sequences with the 1-byte replacement codepoint.
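
For example, validating a string and repairing it in place if needed (a sketch; the replacement must be a codepoint that encodes as a single byte, such as '?'):

#include <stdio.h>
#include "utf8.h"

void sanitize(char *s) {
  /* utf8valid returns null when the whole string is valid utf8,
   * otherwise a pointer to the first invalid sequence */
  if (utf8valid(s)) {
    /* overwrite invalid sequences with '?' in place */
    if (0 == utf8makevalid(s, '?')) {
      printf("repaired: %s\n", s);
    }
  }
}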

void *utf8codepoint(const void *str, utf8_int32_t *out_codepoint);

Sets out_codepoint to the current utf8 codepoint in str, and returns the address of the next utf8 codepoint after the current one in str.
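
For example, iterating over every codepoint in a string (a sketch; the loop stops when the decoded codepoint is the null terminator):

#include <stdio.h>
#include "utf8.h"

void print_codepoints(const char *s) {
  utf8_int32_t cp;
  const char *next = (const char *)utf8codepoint(s, &cp);

  while (0 != cp) {
    printf("U+%04X\n", (unsigned)cp);
    next = (const char *)utf8codepoint(next, &cp);
  }
}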

void *utf8rcodepoint(const void *str, utf8_int32_t *out_codepoint);

Sets out_codepoint to the current utf8 codepoint in str, and returns the address of the previous utf8 codepoint before the current one in str.

size_t utf8codepointsize(utf8_int32_t chr);

Returns the size of the given codepoint in bytes.

void *utf8catcodepoint(void *utf8_restrict str, utf8_int32_t chr, size_t n);

Write a codepoint to the given string, and return the address to the next place after the written codepoint. Pass how many bytes left in the buffer to n. If there is not enough space for the codepoint, this function returns null.
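
For example, appending codepoints into a fixed buffer (a sketch; n is the space remaining, and a null return means the codepoint did not fit):

#include <stdio.h>
#include "utf8.h"

void build_string(void) {
  char buf[16] = {0};
  char *p = buf;
  size_t left = sizeof(buf) - 1; /* keep one byte spare for the terminator */
  utf8_int32_t cps[] = {0x48, 0xE9, 0x2192}; /* 'H', 'é', '→' */

  for (size_t i = 0; i < 3; i++) {
    char *next = (char *)utf8catcodepoint(p, cps[i], left);
    if (!next) {
      break; /* the codepoint would not fit in the remaining space */
    }
    left -= (size_t)(next - p);
    p = next;
  }
  *p = '\0';
  printf("%s\n", buf); /* prints: Hé→ */
}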

int utf8islower(utf8_int32_t chr);

Returns 1 if the given character is lowercase, or 0 if it is not.

int utf8isupper(utf8_int32_t chr);

Returns 1 if the given character is uppercase, or 0 if it is not.

void utf8lwr(void *utf8_restrict str);

Transform the given string into all lowercase codepoints.

void utf8upr(void *utf8_restrict str);

Transform the given string into all uppercase codepoints.

utf8_int32_t utf8lwrcodepoint(utf8_int32_t cp);

Make a codepoint lower case if possible.

utf8_int32_t utf8uprcodepoint(utf8_int32_t cp);

Make a codepoint upper case if possible.
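
For example (a sketch, sticking to ASCII since case conversion only covers the categories listed under Codepoint Case below):

#include <assert.h>
#include "utf8.h"

void case_demo(void) {
  assert('A' == utf8uprcodepoint('a'));
  assert('z' == utf8lwrcodepoint('Z'));
  assert(1 == utf8isupper('A'));
  assert(0 == utf8islower('A'));
  assert(0 == utf8isupper('1')); /* digits have no case */
}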

Codepoint Case

Various functions provided will do case insensitive compares, or transform utf8 strings from one case to another. Given the vastness of unicode, and the author's lack of understanding beyond latin codepoints on whether case means anything, the following categories are the only ones that will be checked in case insensitive code:

Todo

License

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to http://unlicense.org/

utf8.h's People

Contributors

alunegov, bitonic, boretrk, chloridite, codecat, curoles, etodd, f2404, falsycat, fluks, guekka, gumichan01, lrpereira, manish364824, nairou, rouault, roxas232, ruby0x1, scriptexec, sheredom, timgates42, usbac, warmwaffles


utf8.h's Issues

Possibility of dual-licensing?

utf8.h is currently published under the Unlicense, putting its work in the public domain. This is great, but there are open questions as to whether this is valid in all jurisdictions (Germany being the most famous example).

As such, would you be at all willing to consider dual-licensing this software under the Unlicense and another "fallback" license? The CC0 license is another public domain license with a clause for what should happen when the terms of the license are deemed invalid under local law. Alternatively, there exist other minimal OSI approved licenses (such as the MIT license, the ISC license and the BSD licenses) which are permissive. These typically require attribution from the user, but if the software were dual-licensed, it would be entirely their choice which license they want to use.

Absolutely no worries if this is too big an ask, just really want to be able to use this software in a more legally-watertight way.

How to get codepoint of first character in iteration?

This code follows the example from PR #21 to iterate. However, utf8codepoint is for getting the pointer and the codepoint of the next character. How can I get the codepoint of the first character?

utf8_char = utf8codepoint(utf8_string, &codepoint);
while (codepoint != '\0') {
	this_char = malloc(utf8codepointsize(codepoint) + 1);
	memset(this_char, 0, utf8codepointsize(codepoint) + 1);
	memcpy(this_char, utf8_char, utf8codepointsize(codepoint));
	printf("This char: %s\n", this_char);
	utf8_char = utf8codepoint(utf8_char, &codepoint);
}

Character iterating?

What do you suggest is the best way of iterating over codepoints using this library?

int might be too small for a code point

utf8chr uses an int type, which can be as narrow as 16 bits, for the chr code point argument. The maximum code point is 0x10FFFF, which needs more than 16 bits. To be more portable, the chr argument's type should be something that is always large enough, maybe long if you don't want to include stdint.h.

utf8makevalid : test to identify sequence length and possible values not sufficient

Hello,

In utf8makevalid, you use the following test to identify a 4-byte sequence:

"if (0xf0 == (0xf8 & *read))"

This is not correct if you suppose that you can have any invalid string as an input parameter, since only a few values in the f0-ff range are valid.

Moreover, for valid values in the f0-ff range, the possible values for the second byte are not all the same. For example, with f0, the valid range for the second byte is 90..bf, not 80..bf.

Regards

Copy string to limited buffer, without risking invalid result?

It looks like utf8cpy will copy the entire string, but makes an assumption about the destination being big enough, whereas utf8ncpy allows you to specify a destination buffer size limit, but risks creating an invalid result if the source string is longer.

I'm curious when this second result is ever desirable? If I'm working with utf8 strings, and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?

utf8upr/lwr size issues?

Hi, I was looking at the docs for utf8upr/lwr, and they don't seem to indicate what happens if the string passed to them doesn't have enough space for the new codepoints. I understand that letters may have different byte sizes in their upper/lowercase variants, so I was wondering whether utf8upr/lwr will allocate extra memory as required.

Looking at the code, though, it seems like they just call utf8catcodepoint, which AFAIK doesn't allocate additional memory. In fact, the size argument in that call is set to the size of the new codepoint, rather than the size of the buffer as it should be. Is this correct?

utf8makevalid read out of bounds (+ other functions)

Hello,
It seems to me that utf8makevalid can read the string to modify out of bounds:

while ('\0' != *read) {
  if (0xf0 == (0xf8 & *read)) {
    /* ensure each of the 3 following bytes in this 4-byte
     * utf8 codepoint began with 0b10xxxxxx */
    if ((0x80 != (0xc0 & read[1])) || (0x80 != (0xc0 & read[2])) ||
        (0x80 != (0xc0 & read[3]))) {

=> it seems to me that we cannot be sure that read[1], [2] and [3] are not out of bounds.

Regards,

PS: same problem in utf8codepoint and maybe other functions, but this is particularly important for utf8makevalid, because I can have any invalid string as an input

Not an issue

How to loop through a utf8 str like:

for (int i = 0; i < strlen(str); i++) do_something_with(str[i]);

Request for utf8makevalid() function in addition to utf8valid()

A common use case is that an application has to somehow work with the string provided even if it may have invalid sequences. This function would replace invalid utf8 sequences in a string with the specified valid utf8 character byte, ensuring that the output is valid utf8 and has the same total byte length as the input.

tolower and toupper?

How does this work with utf8?

I can use tolower() and toupper() to get lowercase or uppercase characters of ascii chars, but there's nothing in this library for converting codepoints, from what I see.

utf8rchr issue

Hello, I think I have found a bug in the utf8rchr code. On some occasions it skips past the null terminator of the string and continues reading until it finds the specified character. Running the following code under gcc produced the problem:

#include <stdio.h>
#include "utf8.h"

int main()
{
    char *s1 = "Hello";
    char *s2 = "Hello  ";
    char *result = utf8rchr(s1, 'o');

    printf("String pointer: %llx\n", s1);
    printf("Char pointer  : %llx\n", result);
    printf("Index         : %d\n", result-s1);

    return 0;
}

This code produced the following output, indicating that the last occurrence of 'o' was at index 10 (it should be at 4), well past the end of the string.

Output from gcc:

String pointer: 7ff659d368d4
Char pointer  : 7ff659d368de
Index         : 10

Looking at the code for utf8rchr, I believe the problem code is where offset is being incremented by 2 (for a single byte ascii character) instead of 1, and skipping the Null terminator:

    while (src[offset] == c[offset]) {
      offset++;
    }

This doesn't occur on all occasions. I suspect that if the code encounters multiple Null characters before it finds another occurrence of the search character, it works okay.

strn*/utf8n* functions

The utf8len function returns codepoints instead of bytes, as expected, but it seems things like utf8ncmp continue to use bytes, which wasn't what I expected. Perhaps utf8ncmp could use n in codepoints too, and another utf8bcmp could use b bytes?

Not at all critical, since a work-around is easy, and I have no idea if others would want codepoint counting instead of bytes for the n functions. I needed an n codepoint compare, so I noticed this.

utf8valid with size

Is there a way to call utf8valid with string which is not null-terminated?

grapheme support

Are there any functions like utf8codepoint and utf8rcodepoint, but for graphemes?

utf8ncpy incorrectly loops when it does not hit null-terminator

Following up on my issue from #50, utf8ncpy doesn't (now) correctly stop at n bytes unless it hits the null-terminator in src. The following code demonstrates the problem:

#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[10];
  utf8ncpy(buffer, "foo", 2);
}

Running this program results in a segmentation fault for me, due to n looping round past 0.

Changing the 2 to 3 works, because the null-terminator is hit at the end of the string "foo".

Bug with utf8casecmp?

It seems utf8casecmp is not working correctly. I was trying to use it with std::set as a custom comparator. I compared it to strcasecmp and found it is not giving the same results for basic ASCII strings. Note I wouldn't expect the same values, but I would expect the sign (negative or positive) to match.

    printf("%d\n", strcasecmp(".gdoc", ".GSHeeT")); // -15
    printf("%d\n", utf8casecmp(".gdoc", ".GSHeeT")); // 1
    printf("%d\n", strcasecmp(".gsheet", ".gSLiDe")); // -4
    printf("%d\n", utf8casecmp(".gsheet", ".gSLiDe")); // 1

Support constexpr?

Hey, thanks for this library. I'd like to be able to use it in a constexpr context.
Would you be willing to add a utf8_constexpr macro that would be used in contexts that allow it?
I might be able to work on a PR, but I don't have a time estimate.

Invalid pointer returned when calling utf8codepoint function for an empty string

Sample code to reproduce the issue

utf8_int32_t c;
const char *emptystr = u8"";
void *ret = utf8codepoint((void *)emptystr, &c);

It is expected to return (void *) emptystr, but returns (void *) (emptystr+1).
The ret is now a bad index. It points to the address after the null terminator!

I suggest adding a null check at the beginning of the function; see below.


void *utf8codepoint(const void *utf8_restrict str,  utf8_int32_t *utf8_restrict out_codepoint) {
  const char *s = (const char *)str;

  // make sure a null string will always return a fixed result, the pointer to str itself
  // without the check it could return an invalid position (s+x) which can result in a memory issue
  if ('\0' == *s) {
    return (void *)s;
  }

...

  return (void *)s;
}

utf8ncpy writes n+1 bytes (buffer overflow)

Here is an example test case where I tell utf8ncpy to write at most 10 bytes, but it results in all 11 bytes of the buffer being written. I first noticed this in a larger program when it triggered a stack check exception due to buffer overflow.

#include <string.h>
#include <stdio.h>
#include "utf8.h"

int main(int argc, char* argv[]) {
  char buffer[11];
  memset(buffer, 0xdd, 11);
  printf("%02x\n", buffer[10] & 0xff);

  utf8ncpy(buffer, "foo", 10);

  printf("%02x\n", buffer[10] & 0xff);
}

which I have compiled simply with clang main.c with clang version: Apple LLVM version 9.1.0 (clang-902.0.39.2)

I get the result of

dd
00

when I would expect

dd
dd

utf8ncat - size wraparound bug

Hello 👋! I think I found a small bug in utf8ncat when the function is executed with size_t n being 0.
The function will still write all remaining bytes to the dst buffer.

for example:

utf8_int8_t dst[12] = { 'h', 'e', 'l', 'l', 'o', '\0' };
const utf8_int8_t* src = "world";
utf8ncat(dst, src, 0);

// dst will be { 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', '\0', '\0' };

If I am not mistaken, it is because size_t is unsigned, which causes the following --n to wrap around:

utf8.h/utf8.h

Line 631 in 89f6a43

} while (('\0' != *src) && (0 != --n));

I presume this is not defined behavior and that this is a bug.

A question on casting

Well, this is not an issue, it is an elementary question/request.

You have said, "..... Having it as a void* forces a user to explicitly cast the utf8 string to char* such that the onus is on them not to break the code anymore!...."

I am not very strong on this.
Can you please give an example of how to do this casting in the code, or point me to any page where such example code is given.

Thanks

Way of removing malloc completely

At the moment, we can pass alloc_func_ptr into functions that need to allocate memory, but the malloc path is still live, and for bare metal platforms that don't have an actual malloc in the C lib it won't compile. It would be nice if a define could remove the malloc call completely... Happy to fail if no malloc, I'll always be passing alloc_func_ptr.

For now I've just defined it out like so
#if UTF8_NO_STD_MALLOC
//No malloc, you must pass in alloc_func_ptr
assert(false);
#else
n = (utf8_int8_t *)malloc(bytes);
#endif

conflicting int32_t definition

utf8.h detects _MSC_VER and #defines int32_t. The problem is that I've got another header that uses int32_t in a typedef, which ends up creating this:

typedef __int32 __int32;

which gives me a compiler error. Since stdint.h is included in Visual Studio 2010 and up, I propose changing the check for _MSC_VER to something like this:

#if (_MSC_VER < 1600)
#define int32_t __int32
#define uint32_t __uint32
#else
#include <stdint.h>
#endif

That way if we have stdint.h with visual studio we don't have to resort to #defines.

Allow programmer specified allocator

Would be nice to maybe provide a way for the person including the utf8 code to be able to do something like

#define utf8malloc my_alloc
#include "utf8.h"

I would make a PR to do this, but I don't know if it would be anything anyone is interested in.

More readable/maintainable tests

Hi

I was just looking through tests/main.c and I noticed that all the error codes are hard coded into the test functions themselves.

I'm not sure what the reason is for this, but my initial thought was that it would be better to have a header file define an enum that could contain descriptive error codes for all the functions.

Tedious and boring work, I know, but would make it easier to add new error codes and to reason about each test - assuming this wouldn't break something that relies on knowing the codes without being able to read an enum...

[feature] Add utf8ndup

This is my current implementation. I am using it to replace all of my strndups

#include <utf8.h>

void*
utf8ndup(const void* src, size_t n)
{
    const char* s = (const char*)src;
    char* c       = 0;

    size_t bytes;

    // allocate enough for at most n copied bytes plus the null terminator
    c = (char*)malloc(n + 1);

    if (0 == c) {
        // out of memory so we bail
        return 0;
    }

    bytes = 0;
    size_t i = 0;

    // copy src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes] && i < n) {
        c[bytes] = s[bytes];
        bytes++;
        i++;
    }

    // append null terminating byte
    c[bytes] = '\0';
    return c;
}

I don't know if this is desirable. I am almost half tempted to just calloc and memcpy the results.

utf8tok and utf8tok_r

I've been playing with adding utf8tok but the problem with the original implementation is that it is not re-entrant.

I've been looking at how musl implemented strtok_r and it's relatively simple (see here).

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s += utf8spn(s, sep);
  if (!*s) {
    return *p = 0;
  }

  *p = s + utf8cspn(s, sep);
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}

The following is the implemented test (it fails at the assert for föőf).

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäñé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäñé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));

    free(string);
}

After playing with this for a bit, I am kind of at a loss for what to do.

Anyways, leaving this here in case someone else wants to pick it up and go on.

provide a function to get the previous codepoint

As far as I can tell the library currently only provides a way to iterate over a byte sequence in one direction using utf8codepoint. It would be super useful to have a function to go the other way too, and possibly rename them to utf8next and utf8prev or something similar. I'd be down to add this if it's something that you'd merge!

Thanks for the library, it's very clean, lightweight and useful!

clang-format?

While working on #92, I noticed there was no clang-format file provided with utf8.h.
I tried to respect the formatting, but it's hard to be consistent by hand.
Providing a clang-format file would allow contributors to follow the repo style guide easily.

Bug in utf8len on malformed UTF-8 string

A malformed UTF-8 string will cause utf8len to read memory past str's null character. A buffer like this for example:

char str[] = { -16, '\0' };

The function needs additional checks for a null character somewhere unexpected, and should then report an error for the str. Maybe it needs an error argument, to set errno, or to return something else.

utf8lwr() - handle accented upper case vowels

Currently the utf8.h lowercase conversion only handles ASCII chars. It is not possible to handle accented uppercase and lowercase vowels like Á, for example. For my specific purpose I would like to cover at least ÀÈÌÒÙ, since they cover most Latin languages. For the time being I would also be OK monkey-patching utf8.h myself. I suppose the change has to take place here:

utf8.h/utf8.h

Line 1016 in 1ca34ec

if (('A' <= cp) && ('Z' >= cp)) {

Any advice on how to do it? (I'm not a great C coder unfortunately :-)

Why not use memcpy()?

    // copy src byte-by-byte into our new utf8 string
    while ('\0' != s[bytes]) {
      n[bytes] = s[bytes];
      bytes++;
    }

Some minor overflow bugs

Hi maintainers,

I did some minor analysis on your library using the KLEE symbolic execution engine, which at its core tries to explore all the different execution paths in the software under analysis to find bugs. It's an academic tool and you can see it here: https://klee.github.io/
During this process I found several overflows in the code and I wanted to report them collectively, so that's the purpose of this issue.

The small example program I used is the following:

#include "utf8.h"


int
main(int argc, char **argv)
{

        char arr[10];
        klee_make_symbolic(arr, 10, "arr");
        klee_assume(arr[9] == '\0');

        char arr1[10];
        klee_make_symbolic(arr1, 10, "arr1");
        klee_assume(arr1[9] =='\0');

        void *arr_check = utf8valid(arr);
        void *arr1_check = utf8valid(arr1);
        if (arr_check != 0 && arr1_check != 0)
        {
                if (utf8ncasecmp(arr_check, arr1_check, 9) == 0)
                        return 1;
                return 0;
        }
        return 1;
}

The calls to klee_make_symbolic trigger KLEE to consider the values in the arr and arr1 buffers to be unknowns in an equation system. It's not really necessary to understand the details of this to understand the bugs that I am reporting; the main point is that the bugs I found essentially set arr and arr1 to certain byte values, and then the execution of utf8ncasecmp results in some memory-out-of-bounds access. Some of the bugs that I show here I have also analysed with Address Sanitizer to confirm them.

If my code snippet above uses your library in an erroneous manner then please disregard the bugs.

Bugs:

Bug 1

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xf0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 utf8codepoint at ./utf8.h:987
#1 utf8ncasecmp at ./utf8.h:507

Bug 2

arr value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xe0\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x01\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:992
#1 in utf8ncasecmp at ./utf8.h:507

Bug 3

arr value: "\xf0\x00\x00\x01\xe0\x00\x01\xf0\xff\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:987
#1 in utf8ncasecmp at ./utf8.h:507

Bug 4

arr value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
arr1 value: "\xf0\x00\x00\x01\xf0\x00\x00\x01\xc1\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8codepoint at ./utf8.h:984
#1 in utf8ncasecmp at ./utf8.h:507

Bug 5

arr value: "\xf1\x00\x00\x00Y\xf0\x00\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xc19\xf0\x00\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 6

arr value: "\xe1\x80\x00\xe0\x01\x1b\xf0\x00\x02\x00"
arr1 value: "\xf0\x01\x00\x00\xf0\x00\x01\x1b\xc2\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:494

Bug 7

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:468

Bug 8

arr value: "\xf1\x00\x00\x00\xe0\x03\x17\xe0\x01\x00"
arr1 value: "\xf1\x80\x00\x00\xc3\x17\xc1\x00\xff\x00"
Type: memory-out-of-bound:
Stack trace:
#000002175 in utf8ncasecmp at ./utf8.h:481

Bug 9

arr value: "\xf1\x00\x00\x00\xc3 \xc2\x00\xc0\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03\x00\xe0\x02\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:470

Bug 10

arr value: "\xf1\x00\x00\x00\xc3\x00\xe0\x01\x00\x00"
arr1 value: "\xf1\x00\x00\x00\xe0\x03 \xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:481

Bug 11

arr value: "\xf1\x00\x00\x00\xc1\x10\xc1\x00\xf0\x00"
arr1 value: "\xf1\x80\x00\x00\xe0\x010\xe0\x01\x00"
Type: memory-out-of-bound:
Stack trace:
#0 in utf8ncasecmp at ./utf8.h:496

utf8ncpy incorrectly removes last valid codepoint

In the code:

utf8_int8_t *utf8ncpy(utf8_int8_t *utf8_restrict dst,
                      const utf8_int8_t *utf8_restrict src, size_t n) {
 ...
  for (check_index = index - 1;
       check_index > 0 && 0x80 == (0xc0 & d[check_index]); check_index--) {
    /* just moving the index */
  }

For code points > 0x7F, 0xc0 is the valid mask for the 1st byte; for the rest it is 0x80 (https://en.wikipedia.org/wiki/UTF-8).

Consider the string °¯\_(ツ)_/¯°

let's add a printout:

     /* just moving the index */
      printf("index:%lu byte:%x cond:%u\n", check_index, (unsigned)(unsigned char)d[check_index],
              0x80 == (0xc0 & d[check_index]));
copy index:0 byte:c2
copy index:1 byte:b0
copy index:2 byte:c2
copy index:3 byte:af
copy index:4 byte:5c
copy index:5 byte:5f
copy index:6 byte:28
copy index:7 byte:e3
copy index:8 byte:83
copy index:9 byte:84
copy index:10 byte:29
copy index:11 byte:5f
copy index:12 byte:2f
copy index:13 byte:c2
copy index:14 byte:af
copy index:15 byte:c2
copy index:16 byte:b0
copy index:17 byte:0
found null
index:16 byte:b0 cond:1
index:15 byte:c2 cond:0

The code following that will chop the last valid code point:

  if (check_index < index &&
      (index - check_index) < utf8codepointsize(d[check_index])) { //(17-15)=2 < utf8codepointsize=4
    index = check_index;
  }

FIX

Fix that worked for me: (index - check_index) < utf8codepointcalcsize(&d[check_index]))

The problem with using utf8codepointsize:

utf8_constexpr14_impl size_t utf8codepointsize(utf8_int32_t chr) {
  if (0 == ((utf8_int32_t)0xffffff80 & chr)) {
    return 1;
  } else if (0 == ((utf8_int32_t)0xfffff800 & chr)) {
    return 2;
  } else if (0 == ((utf8_int32_t)0xffff0000 & chr)) {
    return 3;
  } else { /* if (0 == ((int)0xffe00000 & chr)) { */
    return 4;
  }
}

is that c2 becomes ffffffc2 (sign-extended), so none of the 0xffffff80 / 0xfffff800 / 0xffff0000 & chr checks equal 0.

`utf8nvalid` reads out of bounds

The utf8nvalid procedure fails to respect the n parameter when the string ends in a multibyte codepoint. In those cases, it will read past it when ensuring the codepoint is terminated; the bounds check does not include the later str[2]:

      /* ensure that there's 2 bytes or more remained */
      if (remained < 2) {
        return (utf8_int8_t *)str;
      }

      /* ensure the 1 following byte in this 2-byte
       * utf8 codepoint began with 0b10xxxxxx */
      if (0x80 != (0xc0 & str[1])) {
        return (utf8_int8_t *)str;
      }

      /* ensure that our utf8 codepoint ended after 2 bytes */
      if (0x80 == (0xc0 & str[2])) {
        return (utf8_int8_t *)str;
      }

This fails in cases such as the following, where a string is unterminated:

#include <assert.h>
#include <string.h>
#include "utf8.h"

int main(int argc, char** argv) {
    const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
    size_t terminated_length = strlen(terminated);

    const char memory[] = "\xff\xff\xff\xff"
                          "\xc2\xa3"
                          "\x80\xff\xff\xff";

    const char* unterminated_begin = &memory[4];
    const char* unterminated_end = &memory[strlen(memory) - 4];
    size_t unterminated_length = unterminated_end - unterminated_begin;

    assert(terminated_length == unterminated_length);
    assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
    // The two strings are identical within the bounds that are passed to
    // utf8nvalid, so we would expect these two tests to pass.
    assert(utf8nvalid(terminated, terminated_length) == NULL);
    assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}

Couple of thoughts

Hey, nice library! I am looking for utf-8 C string parsing and this fits the bill. I had a couple of thoughts after reading the code.

  • While you have dutifully re-implemented the string.h functions, some of them are considered harmful. Ex: utf8ncpy may not append a null terminating character. If you don't consider it too sanctimonious, supplying non-harmful functions and making people opt in to the riskier ones (with an ifdef?) might be helpful. I suspect it would be, since safety is clearly a concern of yours given that you return void*.

For instance, a safer utf8ncpy function that guarantees a null terminator (possibly truncating the last char) and returns a boolean for whether the string was truncated or not can be helpful, though certainly not conformant with anything in string.h. Also, only fill in one NULL character at the end, because zeroing after termination is a waste of cycles. I have been using such a workhorse function for years.

  • Consider using restrict where available. This will shrink code size and reduce waits for memory accesses. Again, a non-conforming change in some applications since applying it will require parameters to not overlap, but that's going to be the case in real world situations anyway.
  • I would be happy to trap strange input parameters by defining an assert macro before including your header. For instance (and sorry to pick on utf8cpy), if the max number of elements is zero, you could check that in an assert which is a NO-OP unless defined by the caller prior to header inclusion.

Here is a snippet that lets you portably apply the restrict keyword if you're interested:

#if defined(__GNUC__) || defined(__clang__)
    #ifdef __cplusplus    
        #define utf8_restrict __restrict
    #else
        #if __STDC_VERSION__ >= 199901L
            #define utf8_restrict restrict
        #endif
    #endif
#elif defined(_MSC_VER) && (_MSC_VER >= 1400) /* vs2005 */
    #define utf8_restrict __restrict
#else
   #define utf8_restrict
#endif

Bug in utf8valid

utf8valid will fail on a utf8 file with 2 or more line breaks in a row.
