SamWhitedThis is kind of nifty if true: https://twitter.com/Midar3/status/839059229289943041
jubalhhas left
kaboomhas left
SamWhited(TL;DR — libstrophe is listed in the Nintendo Switch's open source license credits)
jonaswlibstrophe. nice :)
vurpohas left
vurpohas joined
Ge0rGand somebody didn't bother enough to rotate that picture.
Tobiasheh
jonaswGe0rG: you can rotate your screen, can’t you :P
Ge0rGjonasw: I tried, but it turned out to be attached to the laptop body.
jonaswxrandr --output $OUTPUT --rotate left
Ge0rGalternative version: I did rotate it, but then the desktop manager autorotated it back.
SamWhitedTangentially related: I didn't realize Jack Moffitt worked for Mozilla or was in charge of Servo these days; that's fantastic. I wish he'd revamp libstrophe in Rust.
SamWhitedor maybe I did realize that, since I apparently follow him on a bunch of Rust stuff, but didn't realize he was the same person who wrote libstrophe.
kaboomhas left
TobiasSamWhited, similar thing with some of Cisco's original jabber devs :)
sezuanhas left
ZashI wouldn't really expect Mozilla to do anything with XMPP.
Tobiasthey probably went there for non-XMPP related things
SamWhitedyah, I once got asked at a conference by someone on the Hello team (or whatever that short-lived Firefox messenger was) what the point of XMPP or using standards was; I dunno about the rest of Mozilla, but I more or less gave up on the Firefox team then and there.
TobiasZash, although they should do more with it
jonaswSamWhited: wtf
SamWhitedI think his exact words were "why would anyone bother using standards?"
jonaswwtf
jonaswhave they even SEEN internet explorer?
SamWhitedGranted, I doubt he's representative of the rest of the Firefox people given their involvement with all things web-standards related; maybe that was just the Hello team.
Tobiasjonasw, come on... with WebAssembly you can finally render your IE6 pages the way they are supposed to look on every platform :)
intosihas left
jonaswTobias: you make me sad.
nicolas.veritehas joined
nicolas.veritehas left
kaboomhas left
ZashI wonder if they still remember that "The Internet" does not equal "The Web"
TobiasSamWhited, oh you have some of those new flipping desks
jonaswSamWhited: relevant for tableflip: https://www.youtube.com/watch?v=eob7V_WtAVg
arcit's a method to constrain acceptable string values within XML
SamWhitedTobias: Nah, they went for the sit-stand ones, but wouldn't spring for the flipping ones
jonaswarc: seems like a regex variant used by XML?
arcI'm still reading into it, but yes. thankfully it's a simplified variant
arcEXI uses it for strings.
Ge0rGhas joined
ZashWhy and where?
arcall strings.
arcCH notably
arcwithout it, an unconstrained string in the schema is transmitted as one unsigned int per char, representing the Unicode codepoint. no UTF-8
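A minimal sketch of what arc describes here, assuming the EXI Unsigned Integer representation (7 value bits per octet, least significant group first, high bit used as a continuation flag) and a plain length prefix that ignores string-table handling. The function names are illustrative and do not come from any EXI library.

```python
def encode_uint(n: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, least significant first."""
    if n < 0:
        raise ValueError("unsigned integers only")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # continuation bit: more octets follow
        else:
            out.append(group)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Length prefix, then one unsigned integer per code point (no UTF-8)."""
    out = bytearray(encode_uint(len(text)))
    for ch in text:
        out += encode_uint(ord(ch))
    return bytes(out)
```

Under these assumptions an ASCII character still costs one octet, and a character such as U+20AC costs two octets rather than the three UTF-8 needs.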
ZashShouldn't most strings fit in either enums or user-provided strings with no restrictions?
arcoh yes. but you'd be insane to do so
ZashWhat
ZashNo UTF-8?
arcno UTF-8
ZashKILL IT WITH FIRE
arcnot as far as I've found. admittedly I've just started
arcthat was my gut reaction too. however the more i read into this, the more i understand why.
jonaswwhy would one want to do that?
SamWhited> representing the unicode codepoint
SamWhitedYup, now my desk and things are all over the floor. Saw it coming.
ZashSo, UTF-32
Guushas left
jonaswSamWhited: did you do it with the excellent stare of Alan Rickman?
arc... yes. but as i said, you'd be insane to not use the regex to restrict the character map
SamWhitedjonasw: No, I am nowhere near that fantastic; that was amazing.
Guushas left
jonaswarc: so assuming this is used for standard desktop clients, I either have to restrict what codepoints users can use or the text is blown up by a factor of two to four over the bytes needed with UTF-8?
arcunfortunately the EXI spec doesn't go into deep detail on this, it refers to other documentation on XML regex, but it appears that with bit-packed encoding you can compress it down a lot better than UTF-8
arcjonasw: nope. you can craft a method to support the entire breadth of Unicode in a much tighter format than UTF-8 because you're no longer constrained to byte boundaries.
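A rough sketch of the bit-packing idea, assuming the schema pattern narrows a string to a small character set so each character can be sent as an n-bit index instead of whole bytes. The escape format for characters outside the set (one spare index followed by a fixed 21-bit code point) and the function name are hypothetical choices made for illustration.

```python
def encode_restricted(text: str, charset: str) -> str:
    """Return a bit string in which each in-set character costs n bits."""
    table = {ch: i for i, ch in enumerate(sorted(set(charset)))}
    escape = len(table)                  # one spare value marks "outside the set"
    n_bits = max(1, escape.bit_length())  # enough bits for indices 0..escape
    chunks = []
    for ch in text:
        if ch in table:
            chunks.append(format(table[ch], f"0{n_bits}b"))
        else:
            # Hypothetical escape: spare index, then a 21-bit code point.
            chunks.append(format(escape, f"0{n_bits}b") + format(ord(ch), "021b"))
    return "".join(chunks)
```

With a digits-only set, `encode_restricted("2017", "0123456789")` uses 4 bits per character, so 16 bits in total against 32 bits for the same text in UTF-8.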
jonaswmhm
ZashLike, huffman code
arci wouldn't go that far with it.
arcI'm trying to track down whether a codepoint can represent multi-character sequences now.
arci would not be surprised.
arcunlike using DEFLATE tho, this would not be dynamic, but encoded as part of the schema.
SamWhitedDefine "multi-character sequences?"
arci mean, you could allocate the values 128-255 to represent the 128 most common words in the English language
arci do not know if this is true yet or not.
ZashWell, Hangul?
SamWhitedYou could probably do that, you won't be able to do that for all languages though
arcI've only been reading into this for the past 2 hours.
SamWhitedNot without canonicalizing inputs first
jonaswSamWhited: but as far as I understand it, your client could choose a schema specialised for the locale you’re using
jubalhhas joined
arcSamWhited: since the client dictates the schema, the client could adjust this per selected language for the user
jonaswwee, I understood EXI \o/
arcalso, there's nothing stopping you from using 9 bits
arcI'm just commenting that we have a shortcoming in the XEP schemas. strings can and should be validated
arcalso this could be extremely useful for client-side data forms validation
SamWhitedBut let's say one of those words is "café"; is that caf + Unicode character LATIN SMALL LETTER E WITH ACUTE (U+00E9) or cafe + Unicode character COMBINING ACUTE ACCENT (U+0301)?
ZashAre there really strings that are not logically enums, while being user controlled?
arcbefore 8:30am this morning i wasn't aware that XML regex even existed, so my understanding is still very crude, and even more so of what subset of it applies to EXI
SamWhitedUnicode provides ways to do canonicalization of things like that, you'd just have to make sure you were doing it before building the string table and to any words you compare against the string table
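The "café" case can be checked with Python's standard `unicodedata` module. This is only a sketch of the canonicalization step SamWhited mentions; the choice of NFC rather than another normalization form, and the helper name `canonical`, are assumptions.

```python
import unicodedata

precomposed = "caf\u00e9"  # caf + LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # cafe + COMBINING ACUTE ACCENT


def canonical(text: str) -> str:
    """Normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ, but their canonical forms are identical.
same_raw = precomposed == decomposed
same_canonical = canonical(precomposed) == canonical(decomposed)
```

Both the dictionary entries and any string compared against them would need to pass through the same normalization for lookups to match.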
arcSamWhited: *IF* this supports multi-character sequences, and not simply constraining which Unicode codepoints are acceptable in regex format, then it's whatever is defined in the schema. but this is entirely separate from the string table provided by EXI.
ZashThat way lies madness
Zashpoints in the general direction of Unicode
vurpohas left
vurpohas joined
bjchas joined
SamWhitedOh, I thought this had something to do with the string table. Either way, if you're searching for things, you'll need to do canonicalization too if you want to actually find things (since the same thing may be encoded differently on different machines, but be the same as far as the user is concerned)
arcSamWhited: yes, on the machine side this is evaluated to a Unicode codepoint per character in any case.
arcgoodbye char*
arcgoodbye stdlib
arcgoodnight #import "string.h"
jonaswhey, where will I get my memset from, arc? ;)
archaving the alert bubble show a new message while on a google search, and see people arguing implied consent for sexual penetration by the TSA by their choice to fly... break time.
arcit shows up in google searches for unrelated things.
Tobiasi'm sure that's not standard behavior
arcanyway. yes I'm starting to suspect that the way this works, multi-character sequences can be implied, but it might be even more devious. more like a smartphone dictionary predictor
arcif you have both the letter "c" and a number of whole words beginning with "c", grouped properly, you could resolve whole words in a minimum number of bits. and that can be optimized by the client in the chosen schema
jonaswarc: so basically the string is encoded by the states of a regex automaton which gets the string fed as input?
arcI think so.
arcactually i should go back to what i did in the early days with this work, grab the reference implementation and try some things on it, then read the bits
jonaswclever and devious at the same time
goffihas left
arcyou might even be able to, if you are very clever, recreate UTF-8 using an XML regex.
arcthat's not even work, that'd be pure joy for some weekend.
Valerianhas left
ZashWat
uchas left
uchas joined
Steve Killehas left
arcwell remember that the top bit of a UTF-8 byte determines whether it's a 1-byte or multi-byte sequence. and if the first byte has the top bit (0x80) set, its leading bits give the sequence length and each following byte starts with the bits 10 to mark a continuation, etc
arcif you are very very clever, and if this works the way I'm starting to understand, then you could build a regex that recreates UTF-8 precisely such that the string value encoded by EXI would be precisely UTF-8
arcsuch that if you encoded EXI byte-aligned, and you read the raw stream, you would find the UTF-8 encoded strings within
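For reference, the UTF-8 bit layout arc is talking about can be written out by hand. This sketch only shows the byte structure a schema pattern would have to reproduce; it does not demonstrate that an XML regex can actually do so, and it skips validation of surrogates and out-of-range values.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 lead/continuation bit patterns."""
    if cp < 0x80:
        return bytes([cp])                       # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),          # 110xxxxx
                      0x80 | (cp & 0x3F)])       # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),         # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),             # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Every byte starts with a fixed prefix (0, 10, 110, 1110 or 11110), which is why the bits are "always meaningful" in the sense arc describes.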
Steve Killehas left
blipphas left
blipphas joined
Guushas left
Ge0rGhas left
Ge0rGhas left
vurpohas left
kaboomhas left
jubalhhas joined
kaboomhas left
arcit might not be possible but I'm fairly certain it is, because the bits in UTF-8 are always meaningful, you would just have to nest your atoms appropriately.
waqashas joined
Martinhas left
Martinhas joined
arcbut UTF-8 encoding is a hack for ASCII backwards compatibility, i believe in almost every case you could craft a better one. which is kind of cool if you think about it: even with a limited dictionary, Zipf's Law will ensure extremely tight compression, and without the encryption concerns
archttps://www.youtube.com/watch?v=fCn8zs912OE
ZashOut of all Unicode related things, UTF-8 is the last thing I'd complain about
Valerianhas joined
mhterreshas left
arcoh I'm not complaining about UTF-8. i love UTF-8. but I can see now why UTF-32 was acceptable.
ZashWhy not UTF-64? Surely it'll be more efficient on modern machines ;)
Guushas left
archeh
arci think actually most uses of this would be about as fast as UTF-8 decoding
SamWhitedhas left
waqashas left
danielhas left
danielhas left
Kevhas left
danielhas left
Guushas left
nicolas.veritehas joined
uchas left
uchas joined
Martinhas left
uchas left
uchas joined
danielhas left
kalkinhas left
kalkinhas joined
bjchas left
blipphas left
blipphas joined
mimi89999has left
uchas left
Guushas left
uchas joined
jonaswhas left
bjchas joined
Guushas left
intosihas joined
danielhas left
kaboomhas left
efrithas joined
kaboomhas joined
intosihas left
Ge0rGhas left
arca very simple regex could be something like """(\p{IsBasicLatin}|he|se|re|hat|.)*"""
waqashas joined
arcEXI usually follows schema semantics literally, so i would assume 3 bits would be used to determine whether it's a chr(0:127), one of the four provided common word segments, or a full Unicode character
arc"The" would then be encoded as "000 1010100 001"
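arc's guess can be tried out with a small sketch. It assumes a 3-bit tag per item: 000 for a Basic Latin character followed by 7 bits, 001 to 100 for the segments "he", "se", "re" and "hat" in pattern order, and 101 for any other character. The 21-bit width after tag 101 is a hypothetical choice, and none of this is confirmed EXI behaviour.

```python
SEGMENTS = ["he", "se", "re", "hat"]  # tags 001..100, in pattern order


def encode(text: str) -> str:
    """Encode text as space-separated bit groups under the assumed tag scheme."""
    groups = []
    i = 0
    while i < len(text):
        for tag, seg in enumerate(SEGMENTS, start=1):
            if text.startswith(seg, i):
                groups.append(format(tag, "03b"))  # whole segment in 3 bits
                i += len(seg)
                break
        else:
            cp = ord(text[i])
            if cp < 0x80:
                groups.append("000 " + format(cp, "07b"))   # Basic Latin
            else:
                groups.append("101 " + format(cp, "021b"))  # hypothetical width
            i += 1
    return " ".join(groups)
```

`encode("The")` gives "000 1010100 001", matching the hand-worked value: 13 bits instead of the 24 that three UTF-8 bytes would take.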