This is kind of nifty if true: https://twitter.com/Midar3/status/839059229289943041
jubalh has left
kaboom has left
SamWhited
(TL;DR — libstrophe is listed in the Nintendo Switch's open source license credits)
jonasw
libstrophe. nice :)
vurpo has left
vurpo has joined
Ge0rG
and somebody didn't bother enough to rotate that picture.
Tobias
heh
jonasw
Ge0rG: you can rotate your screen, can’t you :P
Ge0rG
jonasw: I tried, but it turned out to be attached to the laptop body.
jonasw
xrandr --output $OUTPUT --rotate left
Ge0rG
alternative version: I did rotate it, but then the desktop manager autorotated it back.
SamWhited
Tangentially related: I didn't realize Jack Moffitt worked for Mozilla or was in charge of Servo these days; that's fantastic. I wish he'd revamp libstrophe in Rust.
SamWhited
or maybe I did realize that, since I apparently follow him on a bunch of Rust stuff, but didn't realize he was the same person who wrote libstrophe.
kaboom has left
Tobias
SamWhited, similar thing with some of Cisco's original jabber devs :)
sezuan has left
Zash
I wouldn't really expect Mozilla to do anything with XMPP.
Tobias
they probably went there for non-XMPP related things
SamWhited
yah, I once got asked at a conference by someone on the Hello team (or whatever that short lived firefox messenger was) what the point of XMPP or using standards was; I dunno about the rest of Mozilla, but I more or less gave up on the Firefox team then and there.
Tobias
Zash, although they should do more with it
jonasw
SamWhited: wtf
SamWhited
I think his exact words were "why would anyone bother using standards?"
jonasw
wtf
jonasw
have they even SEEN internet explorer?
SamWhited
Granted, I doubt he's representative of the rest of the Firefox people given their involvement with all things web-standards related; maybe that was just the Hello team.
Tobias
jonasw, come on...with web assembler you can finally render your IE6 pages the way they are supposed to look on every platform :)
intosi has left
jonasw
Tobias: you make me sad.
nicolas.verite has joined
nicolas.verite has left
kaboom has left
Zash
I wonder if they still remember that "The Internet" does not equal "The Web"
SamWhited, oh you have some of those new flipping desks
jonasw
SamWhited: relevant for tableflip: https://www.youtube.com/watch?v=eob7V_WtAVg
arc
its a method to constrain acceptable string values within xml
SamWhited
Tobias: Nah, they went for the sit-stand ones, but wouldn't spring for the flipping ones
jonasw
arc: seems like a regex variant used by XML?
arc
im still reading into it, but yes. thankfully its a simplified variant
arc
EXI uses it for strings.
Ge0rG has joined
Zash
Why and where?
arc
all strings.
arc
CH notably
arc
without it, an unconstrained string in the schema is transmitted as an unsigned int per char representing the unicode codepoint. no UTF-8
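For a sense of what "an unsigned int per char" costs, here is a minimal sketch. It assumes the EXI Unsigned Integer layout of 7 value bits per octet, least significant group first, with the high bit set when another octet follows. The function names are made up for illustration, and the string's length prefix is not counted.

```python
def exi_unsigned_int(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, least significant first.

    The high bit of each octet is set when another octet follows.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # last group
            return bytes(out)


def codepoints_size(text: str) -> int:
    """Octets needed for the characters alone, one unsigned int per code point."""
    return sum(len(exi_unsigned_int(ord(ch))) for ch in text)


if __name__ == "__main__":
    for sample in ("hello", "h\u00e9llo", "\ud55c\uae00"):
        print(repr(sample), codepoints_size(sample), len(sample.encode("utf-8")))
```

Under that assumption ASCII still costs one octet per character, so the result is closer to a variable-length code than to fixed-width UTF-32.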
Zash
Shouldn't most strings fit in either enums or user-provided strings with no restrictions?
arc
oh yes. but you'd be insane to do so
Zash
What
Zash
No UTF-8?
arc
no UTF-8
Zash
KILL IT WITH FIRE
arc
not as far as ive found. admittedly ive just started
arc
that was my gut reaction too. however the more i read into this, the more i understand why.
jonasw
why would one want to do that?
SamWhited
> representing the unicode codepoint
SamWhited
Yup, now my desk and things are all over the floor. Saw it coming.
Zash
So, UTF-32
Guus has left
jonasw
SamWhited: did you do it with the excellent stare of Alan Rickman?
arc
... yes. but as i said, you'd be insane to not use the regex to restrict the character map
SamWhited
jonasw: No, I am nowhere near that fantastic; that was amazing.
Guus has left
jonasw
arc: so assuming this is used for standard desktop clients, I either have to restrict what codepoints users can use or the text is blown up to factor two to four of the bytes needed with UTF-8?
arc
unfortunately the EXI spec doesn't go into deep detail on this, it refers to other documentation on xml regex, but it appears with bitpacked encoding you can compress it down a lot better than UTF-8
arc
jonasw: nope. you can craft a method to support the entire breadth of unicode in a much tighter format than UTF-8 because you're no longer constrained to byte boundaries.
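A rough sketch of the size argument, assuming a restricted set of N characters is coded with ceil(log2(N + 1)) bits per character, where the one extra value is an escape for anything outside the set. The alphabet, the escape cost, and the helper name are illustrative assumptions, not taken from the spec text discussed here.

```python
def packed_bits(text: str, alphabet: str) -> int:
    """Bits needed to bit-pack `text` against a restricted alphabet.

    Characters in the alphabet cost `width` bits. Anything else costs the
    escape code plus the code point as 7-bit groups with a flag bit each
    (assumed escape format).
    """
    allowed = set(alphabet)
    # For n >= 1, n.bit_length() equals ceil(log2(n + 1)).
    width = len(allowed).bit_length()
    bits = 0
    for ch in text:
        if ch in allowed:
            bits += width
        else:
            groups = max(1, (ord(ch).bit_length() + 6) // 7)
            bits += width + 8 * groups
    return bits


if __name__ == "__main__":
    alphabet = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols, so 5 bits each
    text = "hello world"
    print(packed_bits(text, alphabet), "bits packed")
    print(8 * len(text.encode("utf-8")), "bits as UTF-8")
```

With a 27-symbol alphabet every in-set character fits in 5 bits instead of 8, and characters outside the set remain representable at a higher cost.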
jonasw
mhm
Zash
Like, huffman code
arc
i wouldn't go that far with it.
arc
im trying to track down whether a codepoint can represent multi-character sequences now.
arc
i would not be surprised.
arc
unlike using DEFLATE tho, this would not be dynamic, but encoded as part of the schema.
SamWhited
Define "multi-character sequences?"
arc
i mean, you could allocate the values 128-255 to represent the 128 most common words in the english language
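The idea can be tried out independently of EXI. Below is a toy sketch in which byte values 0-127 stand for ASCII characters and values from 128 up index a word table. The table here is a hypothetical five-entry stand-in (the range 128-255 has room for 128 entries), and matching is greedy at any position rather than on word boundaries.

```python
# Hypothetical word table; entry i is stored as byte value 128 + i.
WORDS = ["the", "of", "and", "to", "in"]


def encode(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        for index, word in enumerate(WORDS):
            if text.startswith(word, i):
                out.append(128 + index)
                i += len(word)
                break
        else:
            code = ord(text[i])
            if code > 127:
                raise ValueError("this toy format only covers ASCII input")
            out.append(code)
            i += 1
    return bytes(out)


def decode(data: bytes) -> str:
    return "".join(WORDS[b - 128] if b >= 128 else chr(b) for b in data)


if __name__ == "__main__":
    sample = "the end of the road"
    packed = encode(sample)
    print(len(sample), "chars ->", len(packed), "bytes")
    print(decode(packed) == sample)
```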
arc
i do not know if this is true yet or not.
Zash
Well, Hangul?
SamWhited
You could probably do that, you won't be able to do that for all languages though
arc
ive only been reading into this for the past 2 hours.
SamWhited
Not without canonicalizing inputs first
jonasw
SamWhited: but as far as I understand it, your client could choose a schema specialised for the locale you’re using
jubalh has joined
arc
SamWhited: since the client dictates the schema, the client could adjust this per selected language for the user
jonasw
wee, I understood EXI \o/
arc
also, there's nothing stopping you from using 9 bits
arc
im just commenting that we have a shortcoming in the XEP schemas. strings can and should be validated
arc
also this could be extremely useful for client-side data forms validation
Are there really strings that are not logically enums, while being user controlled?
arc
before 8:30am this morning i wasnt aware that XML regex even existed, so my understanding is still very crude, and further what subset of this applies to EXI
SamWhited
Unicode provides ways to do canonicalization of things like that, you'd just have to make sure you were doing it before building the string table and to any words you compare against the string table
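A small illustration of the canonicalization point, using Python's standard `unicodedata` module. The helper name `canonical_key` is made up for this example.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

if __name__ == "__main__":
    print(composed == decomposed)
    print(canonical_key(composed) == canonical_key(decomposed))
    # A precomposed Hangul syllable splits into its jamo under NFD.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\ud55c")])
```

Any table of words or segments would need its entries and the text being matched normalized to the same form, otherwise visually identical input can miss the table.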
arc
SamWhited: *IF* this supports multi-character sequences, and not simply constraining which unicode codepoints are acceptable in regex format, then its whatever is defined in the schema. but this is entirely separate from the string table provided by EXI.
Zash
That way lies madness
Zash points in the general direction of Unicode
vurpo has left
vurpo has joined
bjc has joined
SamWhited
Oh, I thought this had something to do with the string table. Either way, if you're searching for things, you'll need to do canonicalization too if you want to actually find things (since the same thing may be encoded differently on different machines, but be the same as far as the user is concerned)
arc
SamWhited: yes, on the machine side this is evaluated to a unicode codepoint per character in any case.
having the alert bubble show a new message while on a google search, and see people arguing implied consent for sexual penetration by the TSA by their choice to fly... break time.
it shows up in google searches for unrelated things.
Tobias
i'm sure that's not standard behavior
arc
anyway. yes im starting to suspect that the way this works, multi-character sequences can be implied, but it might be even more devious. more like a smartphone dictionary predictor
arc
if you have both the letter "c" and a number of whole words that "c" could be grouped properly, you could resolve whole words in a minimum number of bits. and that can be optimized by the client in the chosen schema
jonasw
arc: so basically the string is encoded by the states of a regex automaton which gets the string fed as input?
arc
I think so.
arc
actually i should go back to what i did in the early days with this work, grab the reference implementation and try some things on it, then read the bits
jonasw
clever and devious at the same time
goffi has left
arc
you might even be able to, if you are very clever, recreate UTF-8 using an XML regex.
arc
that's not even work, that'd be pure joy for some weekend.
Valerian has left
Zash
Wat
uc has left
uc has joined
Steve Kille has left
arc
well remember that the top bit of UTF-8 determines whether its a 1-byte or multi-byte sequence. and if the first byte has bit 128 set, then each following byte will have its top two bits set to 10 to show a continuation, etc
arc
if you are very very clever, and if this works the way im starting to understand, then you could build a regex that recreates UTF-8 precisely such that the string value encoded by EXI would be precisely UTF-8
arc
such that if you encoded EXI byte-aligned, and you read the raw stream, you would find the UTF-8 encoded strings within
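For reference, the bit layout a regex like that would have to mirror can be written out by hand. This sketch only shows the UTF-8 structure itself; whether an XML regex in an EXI schema can reproduce it is arc's conjecture and is not tested here. The function name is illustrative, and surrogate code points are not rejected.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand to show the lead/continuation bit patterns."""
    if cp < 0x80:
        return bytes([cp])                         # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),            # 110xxxxx
                      0x80 | (cp & 0x3F)])         # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),           # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
                      0x80 | (cp & 0x3F)])         # 10xxxxxx
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),           # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx
                      0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
                      0x80 | (cp & 0x3F)])         # 10xxxxxx
    raise ValueError("code point out of range")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = utf8_encode(cp)
        print(hex(cp), " ".join(format(b, "08b") for b in encoded))
```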
Steve Kille has left
blipp has left
blipp has joined
Guus has left
Ge0rG has left
Ge0rG has left
vurpo has left
kaboom has left
jubalh has joined
kaboom has left
arc
it might not be possible but im fairly certain it is, because the bits in UTF-8 are always meaningful, you would just have to nest your atoms appropriately.
waqas has joined
Martin has left
Martin has joined
arc
but UTF-8 encoding is a hack for ascii backwards compatibility, i believe in almost every case you could craft a better one. which is kind of cool if you think about it, even with a limited dictionary, Zipf's Law will ensure extremely tight compression, and without the encryption concerns
arc
https://www.youtube.com/watch?v=fCn8zs912OE
Zash
Out of all Unicode related things, UTF-8 is the last thing I'd complain about
Valerian has joined
mhterres has left
arc
oh im not complaining about UTF-8. i love UTF-8. but I can see now why UTF-32 was acceptable.
Zash
Why not UTF-64? Surely it'll be more efficient on modern machines ;)
Guus has left
arc
heh
arc
i think actually most uses of this would be about as fast as UTF-8 decoding
SamWhited has left
waqas has left
daniel has left
daniel has left
Kev has left
daniel has left
Guus has left
nicolas.verite has joined
uc has left
uc has joined
Martin has left
uc has left
uc has joined
daniel has left
kalkin has left
kalkin has joined
bjc has left
blipp has left
blipp has joined
mimi89999 has left
uc has left
Guus has left
uc has joined
jonasw has left
bjc has joined
Guus has left
intosi has joined
daniel has left
kaboom has left
efrit has joined
kaboom has joined
intosi has left
Ge0rG has left
arc
a very simple regex could be something like """(\p{IsBasicLatin}|(he)|(se)|(re)|(hat)|.)*"""
waqas has joined
arc
EXI usually follows schema semantics literally, so i would assume 3 bits would be used to determine whether its a chr(0:127), one of the four provided common word segments, or a full unicode character
arc
"The" would then be encoded as "000 1010100 001"
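arc's guess can be played through with a toy encoder. This is a sketch of the scheme as described in the chat, not of what an EXI processor actually does. The tag numbering follows the order of the alternatives (000 for Basic Latin, 001 to 100 for the four segments), and using tag 101 with a fixed 21-bit code point for "any other character" is an assumption added here.

```python
SEGMENTS = ["he", "se", "re", "hat"]   # tags 1..4, in the order given
BASIC_LATIN_TAG = 0                    # followed by a 7-bit code point
ANY_CHAR_TAG = 5                       # followed by a 21-bit code point (assumed)


def encode(text: str) -> str:
    """Return the bit groups as a space-separated string for readability."""
    parts = []
    i = 0
    while i < len(text):
        for tag, segment in enumerate(SEGMENTS, start=1):
            if text.startswith(segment, i):
                parts.append(format(tag, "03b"))
                i += len(segment)
                break
        else:
            code = ord(text[i])
            if code < 128:
                parts.append(format(BASIC_LATIN_TAG, "03b"))
                parts.append(format(code, "07b"))
            else:
                parts.append(format(ANY_CHAR_TAG, "03b"))
                parts.append(format(code, "021b"))
            i += 1
    return " ".join(parts)


if __name__ == "__main__":
    print(encode("The"))
    print(encode("That"))
```

Under these assumptions "The" takes 13 bits, compared with 24 bits as UTF-8.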