jdev - 2025-11-12


  1. alexkurisu

    Weird questions time: does anybody know of any software that internally uses packed UTF-8 bytes for Unicode tables instead of UTF-32? I'm currently experimenting with my custom Unicode library, so i wonder if that's even reasonable. This will let me get rid of decoder/encoder…

  2. moparisthebest

    alexkurisu: what even uses utf-32 ? Java did originally but even it hasn't for years when it can avoid it... https://doc.rust-lang.org/std/string/struct.String.html is utf8 only

  3. Link Mauve

    moparisthebest, no, Java originally used UCS-2, and moved to UTF-16 when that started existing.

  4. Link Mauve

    alexkurisu, I believe ICU internally uses UTF-16, not UTF-32, so maybe have a look at that suite of libraries.

  6. Link Mauve

    I have no experience hacking on ICU, but its API requires UTF-16 input and output.

  7. alexkurisu

    > alexkurisu: what even uses utf-32 ? Java did originally but even it hasn't for years when it can avoid it...
    >
    > https://doc.rust-lang.org/std/string/struct.String.html is utf8 only

    Well, UTF-32 is equivalent to a sequence of Unicode codepoint numbers, so with my approach it will be impossible to get the Unicode codepoint number of an arbitrary symbol. But i wonder if that's even something useful

  8. Link Mauve

    alexkurisu, what do you want to use the Unicode table for here?

  9. alexkurisu

    > alexkurisu, what do you want to use the Unicode table for here?

    I'm making a custom Unicode library, so properties and stuff. But those can be pre-encoded to UTF-8

  10. moparisthebest

    > moparisthebest, no, Java originally used UCS-2, and moved to UTF-16 when that started existing.

    yep utf-16 you are right, it's finally happening, I'm starting to forget Java!!!! 🎉🥳🎊

  11. Link Mauve

    alexkurisu, I think that works, just you won’t be able to use SIMD since the alignment will be off.

  13. singpolyma

    UTF16 for internal is the most popular afaict but there are some that use utf8 for internal as well

  14. alexkurisu

    > alexkurisu, I think that works, just you won’t be able to use SIMD since the alignment will be off.

    Would it be though? AFAIK, `char`'s alignment is 1 and i don't plan on using packed UTF-8 for strings, just use them instead of codepoints to avoid decoding

  15. Link Mauve

    Not packed UTF-8, so you mean you’ll be padding all characters to four bytes?

  16. Link Mauve

    That’s an encoding I have never seen.

  17. moparisthebest

    ah here's the other thing I was talking about https://openjdk.org/jeps/254 looks like as of Java 9 it'll store Latin-1 (ISO-8859-1) or utf-16 based on the contents of the string

  18. alexkurisu

    > Not packed UTF-8, so you mean you’ll be padding all characters to four bytes?

    Just pack the bytes as is into something like `uint32_t`

  19. Link Mauve

    moparisthebest, ah, Python does that too.

  20. alexkurisu

    > Just pack the bytes as is into something like `uint32_t`

    Strings themselves are normal UTF-8
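
A minimal sketch of the packing idea discussed here, not anyone's actual implementation: the bytes of one UTF-8 sequence are copied as-is into a `uint32_t`, right-aligned with zero padding on the left. The names `utf8_seq_len`, `utf8_pack`, and `UTF8_PACK_INVALID` are hypothetical, and the code assumes the caller passes a complete sequence (continuation bytes are not validated).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for an invalid lead byte; 0xFF never occurs in UTF-8. */
#define UTF8_PACK_INVALID 0xFFFFFFFFu

/* Sequence length judged from the lead byte alone (0 if it cannot start a sequence). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx, overlong C0/C1 excluded */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx, up to U+10FFFF */
    return 0;                                   /* continuation byte or invalid */
}

/* Pack one UTF-8 sequence into a uint32_t without decoding it.
 * Assumes s points at a complete sequence. */
static uint32_t utf8_pack(const unsigned char *s)
{
    size_t n = utf8_seq_len(s[0]);
    if (n == 0)
        return UTF8_PACK_INVALID;

    uint32_t key = 0;
    for (size_t i = 0; i < n; i++)
        key = (key << 8) | s[i];                /* bytes kept in stream order */
    return key;
}
```

With this layout, comparing two packed keys as integers gives the same order as comparing the code points, so a table sorted by code point stays sorted after its keys are pre-encoded.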

  21. Link Mauve

    alexkurisu, so padded UTF-8, interesting, I have never seen anyone do that.

  22. Link Mauve

    Note that this isn’t UTF-8 then, as null bytes are encoded as null bytes in UTF-8.

  23. alexkurisu

    > Note that this isn’t UTF-8 then, as null bytes are encoded as null bytes in UTF-8.

    Well, i don't know of any decoder that is actually able to correctly encode NULL bytes anyways

  24. Link Mauve

    All of them, it’s standard UTF-8.

  25. alexkurisu

    Technically, one can use `0xFF` as a sequence end

  26. alexkurisu

    So, `0xFF`-terminated padded UTF-8 :)
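
A rough sketch of what the `0xFF`-filled variant could look like, assuming left-aligned packing into a `uint32_t`. It relies on the fact that the byte `0xFF` never appears in well-formed UTF-8, so it cannot be confused with data, whereas a zero filler is indistinguishable from an encoded U+0000. The function names `pack_ff` and `packed_len_ff` are hypothetical, and, as noted in the chat, the result is no longer UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack up to four UTF-8 bytes left-aligned; unused low bytes are filled with 0xFF. */
static uint32_t pack_ff(const unsigned char *s, size_t n)
{
    uint32_t key = 0;
    for (size_t i = 0; i < 4; i++)
        key = (key << 8) | (i < n ? s[i] : 0xFFu);
    return key;
}

/* Recover the byte count: number of bytes before the first 0xFF filler. */
static size_t packed_len_ff(uint32_t key)
{
    size_t n = 0;
    while (n < 4 && ((key >> (24 - 8 * n)) & 0xFFu) != 0xFFu)
        n++;
    return n;
}
```

Here an encoded U+0000 packs to `0x00FFFFFF` and still reports a length of 1, which a zero-filled, left-aligned layout could not express.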

  27. alexkurisu

    Not like it would be used this way though

  28. Link Mauve

    Uh, you’re inventing a non-UTF-8 encoding at this point.

  29. singpolyma

    why not just use char and regular utf8 if that's what you want?

  31. alexkurisu

    That's what i intend to do, the padded UTF-8 thing is needed only for property lookup tables, so i was interested if someone also did that

  32. alexkurisu

    The idea is to avoid dealing with UTF-8 encoding/decoding completely
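
To make the idea concrete, here is a small sketch of a property table whose keys are pre-encoded UTF-8 bytes packed into `uint32_t`, so a lookup never decodes to a code point. Everything in it is an assumption for illustration: the `prop` values, the seven-entry `table`, and the `lookup` helper are hypothetical, and a real library would more likely use ranges or a trie than one entry per character.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical property values, for illustration only. */
enum prop { PROP_NONE = 0, PROP_UPPER, PROP_LOWER, PROP_DIGIT };

struct entry {
    uint32_t key;   /* UTF-8 bytes of one character, right-aligned */
    enum prop prop;
};

/* Tiny sample table, sorted by key (which is also code point order). */
static const struct entry table[] = {
    { 0x00000031u, PROP_DIGIT }, /* "1"  U+0031 */
    { 0x00000041u, PROP_UPPER }, /* "A"  U+0041 */
    { 0x00000061u, PROP_LOWER }, /* "a"  U+0061 */
    { 0x0000C389u, PROP_UPPER }, /* "É"  U+00C9, bytes C3 89 */
    { 0x0000C3A9u, PROP_LOWER }, /* "é"  U+00E9, bytes C3 A9 */
    { 0x0000CEA9u, PROP_UPPER }, /* "Ω"  U+03A9, bytes CE A9 */
    { 0x0000CF89u, PROP_LOWER }, /* "ω"  U+03C9, bytes CF 89 */
};

static int cmp_key(const void *k, const void *e)
{
    uint32_t key = *(const uint32_t *)k;
    uint32_t ek = ((const struct entry *)e)->key;
    return (key > ek) - (key < ek);
}

/* Look up the property of the n-byte UTF-8 sequence at s (n is 1 to 4). */
static enum prop lookup(const unsigned char *s, size_t n)
{
    uint32_t key = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        key = (key << 8) | s[i];   /* pack the raw bytes, no decoding */

    const struct entry *e = bsearch(&key, table,
                                    sizeof table / sizeof table[0],
                                    sizeof table[0], cmp_key);
    return e ? e->prop : PROP_NONE;
}
```

The caller still has to know where each sequence ends, which it can read off the lead byte, but it never has to reassemble the code point's bits.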

  33. singpolyma

    you'll still need to "deal with" UTF8 for some operations, if you do them. any transformation or split or length etc. but yeah
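
As an illustration of the kind of operation meant here, the sketch below counts characters in a UTF-8 buffer. It does not fully decode anything, but it still has to understand the encoding enough to skip continuation bytes (those of the form `10xxxxxx`). The function name `utf8_count` is hypothetical, and the code assumes well-formed input.

```c
#include <stddef.h>

/* Count code points in a well-formed UTF-8 buffer by counting every byte
 * that is not a continuation byte (10xxxxxx). */
static size_t utf8_count(const char *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, the 11-byte buffer holding "héllo 🎉" contains 7 code points, so byte length and character length diverge as soon as the text leaves ASCII.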