-
alexkurisu
Weird questions time: does anybody know of any software that internally uses packed UTF-8 bytes for Unicode tables instead of UTF-32? I'm currently experimenting with my custom Unicode library, so i wonder if that's even reasonable. This will let me get rid of decoder/encoder…
-
moparisthebest
alexkurisu: what even uses utf-32 ? Java did originally but even it hasn't for years when it can avoid it... https://doc.rust-lang.org/std/string/struct.String.html is utf8 only
-
Link Mauve
moparisthebest, no, Java originally used UCS-2, and moved to UTF-16 when that started existing.
-
Link Mauve
alexkurisu, I believe ICU internally uses UTF-16, not UTF-32, so maybe have a look at that suite of libraries.
-
Link Mauve
I have no experience hacking on ICU, but its API requires UTF-16 input and output.
-
alexkurisu
> alexkurisu: what even uses utf-32 ? Java did originally but even it hasn't for years when it can avoid it...
>
> https://doc.rust-lang.org/std/string/struct.String.html is utf8 only
Well, UTF-32 is equivalent to a sequence of Unicode codepoint numbers, so with my approach it will be impossible to get the Unicode codepoint number of an arbitrary symbol. But i wonder if that's even something useful
-
Link Mauve
alexkurisu, what do you want to use the Unicode table for here?
-
alexkurisu
> alexkurisu, what do you want to use the Unicode table for here?
I'm making a custom Unicode library, so properties and stuff. But those can be pre-encoded to UTF-8
-
moparisthebest
> moparisthebest, no, Java originally used UCS-2, and moved to UTF-16 when that started existing.
yep utf-16 you are right, it's finally happening, I'm starting to forget Java!!!! 🎉🥳🎊
-
Link Mauve
alexkurisu, I think that works, just you won’t be able to use SIMD since the alignment will be off.
-
singpolyma
UTF16 for internal is the most popular afaict but there are some that use utf8 for internal as well
-
alexkurisu
> alexkurisu, I think that works, just you won’t be able to use SIMD since the alignment will be off.
Would it be though? AFAIK, `char`'s alignment is 1 and i don't plan on using packed UTF-8 for strings, just use them instead of codepoints to avoid decoding
-
Link Mauve
Not packed UTF-8, so you mean you’ll be padding all characters to four bytes?
-
Link Mauve
That’s an encoding I have never seen.
-
moparisthebest
ah here's the other thing I was talking about https://openjdk.org/jeps/254 looks like as of Java 9 it'll store Latin-1 (ISO-8859-1) or utf-16 based on the contents of the string
-
alexkurisu
> Not packed UTF-8, so you mean you’ll be padding all characters to four bytes?
Just pack the bytes as is into something like `uint32_t`
-
Link Mauve
moparisthebest, ah, Python does that too.
-
alexkurisu
> Just pack the bytes as is into something like `uint32_t`
Strings themselves are normal UTF-8
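A minimal sketch of what "pack the bytes as is" could look like, assuming the first byte goes into the most significant used position and the unused high bytes stay zero. The names `utf8_seq_len` and `utf8_pack` are hypothetical and the byte order is an assumption, not something settled in the discussion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the UTF-8 sequence that starts with lead byte b, or 0 if b
 * cannot start a sequence (continuation byte or invalid lead byte). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                 /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;   /* 110xxxxx, minus overlong C0/C1 */
    if (b >= 0xE0 && b <= 0xEF) return 3;   /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4;   /* 11110xxx, up to U+10FFFF */
    return 0;
}

/* Pack the 1..4 bytes of one UTF-8 sequence into a uint32_t, first byte
 * most significant. The bytes are copied as they are; no code point is
 * ever computed. Continuation bytes are not validated here. */
uint32_t utf8_pack(const unsigned char *s, size_t len)
{
    assert(len >= 1 && len <= 4);
    uint32_t key = 0;
    for (size_t i = 0; i < len; i++)
        key = (key << 8) | s[i];
    return key;
}
```

With this layout the bytes `C3 A9` ("é") become `0xC3A9` and U+0000 becomes `0`. Because the lead byte still determines the length, a single packed value stays unambiguous even with zero padding.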
-
Link Mauve
alexkurisu, so padded UTF-8, interesting, I have never seen anyone do that.
-
Link Mauve
Note that this isn’t UTF-8 then, as null bytes are encoded as null bytes in UTF-8.
-
alexkurisu
> Note that this isn’t UTF-8 then, as null bytes are encoded as null bytes in UTF-8.
Well, i don't know of any decoder that is actually able to correctly encode NULL bytes anyways
-
Link Mauve
All of them, it’s standard UTF-8.
-
alexkurisu
Technically, one can use `0xFF` as a sequence end
-
alexkurisu
So, `0xFF`-terminated padded UTF-8 :)
-
alexkurisu
Not like it would be used this way though
-
Link Mauve
Uh, you’re inventing a non-UTF-8 encoding at this point.
-
singpolyma
why not just use char and regular utf8 if that's what you want?
-
alexkurisu
That's what i intend to do, the padded UTF-8 thing is needed only for property lookup tables, so i was interested if someone also did that
-
alexkurisu
The idea is to avoid dealing with UTF-8 encoding/decoding completely
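A rough sketch of how such a property table might be searched without a decoder, assuming keys are packed with the first byte most significant. UTF-8 byte order matches code point order, so keys packed that way sort like code points and plain range tables keep working. The `prop` values, the two rows, and the function names are hypothetical, for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property values. */
enum prop { PROP_NONE = 0, PROP_LATIN_UPPER, PROP_GREEK_UPPER };

/* One row: an inclusive range of packed UTF-8 keys and its property. */
struct range { uint32_t first, last; enum prop prop; };

/* Rows sorted by key; the keys are the raw UTF-8 bytes, not code points. */
static const struct range table[] = {
    { 0x41u,   0x5Au,   PROP_LATIN_UPPER },  /* U+0041..U+005A, "A".."Z" */
    { 0xCE91u, 0xCEA1u, PROP_GREEK_UPPER },  /* U+0391..U+03A1, Alpha..Rho */
};
#define TABLE_LEN (sizeof table / sizeof table[0])

/* Debug check, meant to run once at startup: the binary search below
 * needs the rows sorted and non-overlapping. */
void check_table(void)
{
    for (size_t i = 0; i < TABLE_LEN; i++) {
        assert(table[i].first <= table[i].last);
        if (i > 0)
            assert(table[i - 1].last < table[i].first);
    }
}

/* Binary search for the row containing key; PROP_NONE if there is none. */
enum prop lookup(uint32_t key)
{
    size_t lo = 0, hi = TABLE_LEN;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (key < table[mid].first)
            hi = mid;
        else if (key > table[mid].last)
            lo = mid + 1;
        else
            return table[mid].prop;
    }
    return PROP_NONE;
}
```

A range that crosses a lead-byte boundary contains some key values that are not valid UTF-8, but those never show up as input, so the comparison logic is unaffected.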
-
singpolyma
you'll still need to "deal with" UTF8 for some operations, if you do them. any transformation or split or length etc. but yeah
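As an illustration of the "length" case, here is a small sketch that counts code points by looking at byte patterns only, assuming the input is valid, NUL-terminated UTF-8. The name `utf8_count` is hypothetical. No decoding to code points is needed, but the code still has to know how UTF-8 is laid out:

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
 * Every code point has exactly one byte that is not a continuation byte
 * (10xxxxxx), so counting those bytes gives the length. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            n++;
    return n;
}
```

The same bit test tells whether a byte offset is a safe place to split: an offset is on a code point boundary when the byte there is not a continuation byte.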