XSF Discussion - 2017-03-07

SamWhited 15:58:03
This is kind of nifty if true: https://twitter.com/Midar3/status/839059229289943041
SamWhited 15:58:29
(TL;DR — libstrophe is listed in the Nintendo Switch's open source license credits)
jonasw 15:58:58
libstrophe. nice :)
Ge0rG 15:59:00
and somebody didn't bother enough to rotate that picture.
Tobias 15:59:00
heh
jonasw 15:59:27
Ge0rG: you can rotate your screen, can’t you :P
Ge0rG 15:59:55
jonasw: I tried, but it turned out to be attached to the laptop body.
jonasw 16:00:06
xrandr --output $OUTPUT --rotate left
Ge0rG 16:00:14
alternative version: I did rotate it, but then the desktop manager autorotated it back.
SamWhited 16:00:44
Tangentially related: I didn't realize Jack Moffitt worked for Mozilla or was in charge of Servo these days; that's fantastic. I wish he'd revamp libstrophe in Rust.
SamWhited 16:01:28
or maybe I did realize that, since I apparently follow him on a bunch of Rust stuff, but didn't realize he was the same person who wrote libstrophe.
Tobias 16:03:34
SamWhited, similar thing with some of Cisco's original jabber devs :)
Zash 16:04:47
I wouldn't really expect Mozilla to do anything with XMPP.
Tobias 16:05:02
they probably went there for non-XMPP related things
SamWhited 16:05:39
yah, I once got asked at a conference by someone on the Hello team (or whatever that short lived firefox messenger was) what the point of XMPP or using standards was; I dunno about the rest of Mozilla, but I more or less gave up on the Firefox team then and there.
Tobias 16:05:48
Zash, although they should do more with it
jonasw 16:05:52
SamWhited: wtf
SamWhited 16:06:04
I think his exact words were "why would anyone bother using standards?"
jonasw 16:06:08
wtf
jonasw 16:06:14
have they even SEEN internet explorer?
SamWhited 16:06:43
Granted, I doubt he's representative of the rest of the Firefox people given their involvement with all things web-standards related; maybe that was just the Hello team.
Tobias 16:07:04
jonasw, come on...with web assembler you can finally render your IE6 pages the way they are supposed to look on every platform :)
jonasw 16:07:30
Tobias: you make me sad.
Zash 16:09:11
I wonder if they still remember that "The Internet" does not equal "The Web"
Zash 16:09:46
SamWhited: https://www.mozilla.org/en-US/about/manifesto/#principle-06
SamWhited 16:10:01
Zash: Hah, I should have just pointed him at that; thanks.
jonasw 16:10:34
gahaha
jonasw 16:10:42
slap ’em in the face with that manifesto
Tobias 16:10:46
Zash, nah...just use Hello
jonasw 16:10:53
those are the same people who didn’t fight (enough) against WebDRM
SamWhited 16:11:18
They fought a lot, they just didn't win.
Tobias 16:11:28
Zash, or allo https://twitter.com/burnflare/status/838966485011685376 :)
SamWhited 16:11:37
I don't think it's fair to say they didn't fight enough; they were against it all the way through.
jonasw 16:12:02
SamWhited: okay
jonasw 16:12:14
I admit I haven’t followed it in detail, but from what I saw in news coverage it didn’t seem too great.
arc 16:12:53
wow, i am just now realizing how much work there is to be done
arc 16:13:02
anyone here touched xml regex?
Tobias 16:13:13
XML regex?
arc 16:13:17
yes
Zash 16:13:19
arc: Wait for it
jonasw 16:13:24
what the heck is XML regex
Zash prepares for the obligatory Zalgo reference 16:13:29
SamWhited 16:13:33
I'm about to have to flip my table, aren't I?
arc 16:13:47
http://www.xmlschemareference.com/regularExpression.html
jonasw 16:13:48
https://stackoverflow.com/a/1732454/1248008
Tobias 16:13:56
SamWhited, oh you have some of those new flipping desks
jonasw 16:14:10
SamWhited: relevant for tableflip: https://www.youtube.com/watch?v=eob7V_WtAVg
arc 16:14:12
its a method to constrain acceptable string values within xml
SamWhited 16:14:16
Tobias: Nah, they went for the sit-stand ones, but wouldn't spring for the flipping ones
jonasw 16:14:41
arc: seems like a regex variant used by XML?
arc 16:15:22
im still reading into it, but yes. thankfully its a simplified variant
arc 16:15:31
EXI uses it for strings.
Zash 16:15:56
Why and where?
arc 16:16:12
all strings.
arc 16:16:25
CH notably
arc 16:16:57
without it, an unconstrained string in the schema is transmitted as an unsigned int per char representing the unicode codepoint. no UTF-8
Zash 16:17:05
Shouldn't most strings fit in either enums or user-provided strings with no restrictions?
arc 16:17:17
oh yes. but you'd be insane to do so
Zash 16:17:26
What
Zash 16:17:28
No UTF-8?
arc 16:17:32
no UTF-8
Zash 16:17:36
KILL IT WITH FIRE
arc 16:17:43
not as far as ive found. admittedly ive just started
arc 16:18:10
that was my gut reaction too. however the more i read into this, the more i understand why.
jonasw 16:18:11
why would one want to do that?
SamWhited 16:18:18
> representing the unicode codepoint
SamWhited 16:18:26
Yup, now my desk and things are all over the floor. Saw it coming.
Zash 16:18:33
So, UTF-32
jonasw 16:18:45
SamWhited: did you do it with the excellent stare of Alan Rickman?
arc 16:18:52
... yes. but as i said, you'd be insane to not use the regex to restrict the character map
SamWhited 16:19:11
jonasw: No, I am nowhere near that fantastic; that was amazing.
jonasw 16:19:36
arc: so assuming this is used for standard desktop clients, I either have to restrict what codepoints users can use or the text is blown up to factor two to four of the bytes needed with UTF-8?
arc 16:19:50
unfortunately the EXI spec doesn't go into deep detail on this, it refers to other documentation on xml regex, but it appears with bitpacked encoding you can compress it down a lot better than UTF-8
arc 16:20:59
jonasw: nope. you can craft a method to support the entire breadth of unicode in a much tighter format than UTF-8 because you're no longer constrained to byte boundaries.
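A rough sketch of the size trade-off under discussion. It compares UTF-8, UTF-32, and a fixed-width bit-packed index into a small alphabet. The alphabet, the reserved escape value, and the 21-bit escape payload are assumptions made for illustration, not EXI's actual wire format.

```python
def packed_bits(text, alphabet):
    """Bits needed if each character is sent as an index into `alphabet`."""
    # Indices 0..len(alphabet)-1 name characters; the value len(alphabet) is
    # reserved as an escape (an assumption of this sketch).
    width = len(alphabet).bit_length()
    bits = 0
    for ch in text:
        bits += width
        if ch not in alphabet:
            bits += 21  # escaped characters carry a full 21-bit code point
    return bits

text = "hello world"
alphabet = "abcdefghijklmnopqrstuvwxyz "  # hypothetical schema-defined alphabet

utf8_bits = len(text.encode("utf-8")) * 8
utf32_bits = len(text.encode("utf-32-le")) * 8
print(utf8_bits, utf32_bits, packed_bits(text, alphabet))  # 88 352 55
```

With 27 allowed characters plus an escape, each character fits in 5 bits, so the saving depends entirely on how well the alphabet matches the text. Every character outside it costs 26 bits here.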
jonasw 16:21:24
mhm
Zash 16:21:25
Like, huffman code
arc 16:21:43
i wouldn't go that far with it.
arc 16:22:03
im trying to track down whether a codepoint can represent multi-character sequences now.
arc 16:22:12
i would not be surprised.
arc 16:22:28
unlike using DEFLATE tho, this would not be dynamic, but encoded as part of the schema.
SamWhited 16:22:29
Define "multi-character sequences?"
arc 16:22:52
i mean, you could allocate the values 128-255 to represent the 128 most common words in the english language
arc 16:23:15
i do not know if this is true yet or not.
Zash 16:23:50
Well, Hangul?
SamWhited 16:23:56
You could probably do that, you won't be able to do that for all languages though
arc 16:23:59
ive only been reading into this for the past 2 hours.
SamWhited 16:24:07
Not without canonicalizing inputs first
jonasw 16:24:20
SamWhited: but as far as I understand it, your client could choose a schema specialised for the locale you’re using
arc 16:24:23
SamWhited: since the client dictates the schema, the client could adjust this per selected language for the user
jonasw 16:24:33
wee, I understood EXI \o/
arc 16:24:36
also, there's nothing stopping you from using 9 bits
arc 16:25:11
im just commenting that we have a shortcoming in the XEP schemas. strings can and should be validated
arc 16:25:21
also this could be extremely useful for client-side data forms validation
SamWhited 16:25:36
But let's say one of those words is "café"; is that caf + Unicode character LATIN SMALL LETTER E WITH ACUTE (U+00E9) or cafe + Unicode character COMBINING ACUTE ACCENT (U+0301)?
Zash 16:25:58
Are there really strings that are not logically enums, while being user controlled?
arc 16:26:07
before 8:30am this morning i wasnt aware that XML regex even existed, so my understanding is still very crude, and i dont yet know what subset of this applies to EXI
SamWhited 16:26:53
Unicode provides ways to do canonicalization of things like that, you'd just have to make sure you were doing it before building the string table and to any words you compare against the string table
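The "café" case can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch. The `word_table` set and the `canonical` helper are hypothetical names, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata

precomposed = "caf\u00e9"  # "caf" + LATIN SMALL LETTER E WITH ACUTE (U+00E9)
combining = "cafe\u0301"   # "cafe" + COMBINING ACUTE ACCENT (U+0301)

# The two strings render the same but are different code point sequences.
print(precomposed == combining)  # False

def canonical(s):
    """Normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", s)

# Normalize both when building the table and when looking words up.
word_table = {canonical(w) for w in [precomposed, "the"]}
print(canonical(combining) in word_table)  # True
```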
arc 16:27:16
SamWhited: *IF* this supports multi-character sequences, and not simply constraining which unicode codepoints are acceptable in regex format, then its whatever is defined in the schema. but this is entirely separate from the string table provided by EXI.
Zash 16:27:16
That way lies madness
Zash points in the general direction of Unicode 16:27:22
SamWhited 16:28:54
Oh, I thought this had something to do with the string table. Either way, if you're searching for things, you'll need to do canonicalization too if you want to actually find things (since the same thing may be encoded differently on different machines, but be the same as far as the user is concerned)
arc 16:29:33
SamWhited: yes, on the machine side this is evaluated to a unicode codepoint per character in any case.
arc 16:29:37
goodbye char*
arc 16:29:50
goodbye stdlib
arc 16:30:00
goodnight #import "string.h"
jonasw 16:30:10
hey, where will I get my memset from, arc? ;)
arc 16:30:19
jonasw: heh
Zash 16:30:49
jonasw: dd in=/dev/mem of=/dev/mem start=x count=y
Zash 16:31:02
or was it seek=
jonasw 16:31:03
*blink*
jonasw 16:31:06
I’m done for today.
arc 16:31:11
LOL
jonasw 16:31:44
Zash: also, that’s memcpy, not memset.
arc 16:50:34
Ok I'm now officially over G+
arc 16:51:27
having the alert bubble show a new message while on a google search, and see people arguing implied consent for sexual penetration by the TSA by their choice to fly... break time.
jonasw 16:52:44
wat
arc 16:57:17
https://plus.google.com/+JohnWarthog9Hawley/posts/LEvErfQnajc
Zash 16:57:31
off topic much?
arc 16:57:38
that's the problem with G+
arc 16:57:51
it shows up in google searches for unrelated things.
Tobias 16:58:29
i'm sure that's not standard behavior
arc 17:01:23
anyway. yes im starting to suspect that the way this works, multi-character sequences can be implied, but it might be even more devious. more like a smartphone dictionary predictor
arc 17:02:36
if you have both the letter "c" and a number of whole words that "c" could be grouped properly, you could resolve whole words in a minimum number of bits. and that can be optimized by the client in the chosen schema
jonasw 17:03:22
arc: so basically the string is encoded by the states of a regex automaton which gets the string fed as input?
arc 17:03:29
I think so.
arc 17:03:55
actually i should go back to what i did in the early days with this work, grab the reference implementation and try some things on it, then read the bits
jonasw 17:03:57
clever and devious at the same time
arc 17:05:19
you might even be able to, if you are very clever, recreate UTF-8 using an XML regex.
arc 17:05:53
that's not even work, that'd be pure joy for some weekend.
Zash 17:06:53
Wat
arc 17:07:55
well remember that the top bit of UTF-8 determines whether its a 1-byte or multi-byte sequence. and if the first byte has bit 128 set, then the next byte will have the top two bits set appropriately to show a continuation, etc
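The byte layout described here is easy to inspect. In UTF-8 a lead byte starting with `110`, `1110` or `11110` announces a 2-, 3- or 4-byte sequence, and every continuation byte starts with `10`. A small sketch (the helper name `utf8_bits` is made up for this example):

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_bits(ch))

# U+0041 ['01000001']
# U+00E9 ['11000011', '10101001']
# U+20AC ['11100010', '10000010', '10101100']
# U+1F600 ['11110000', '10011111', '10011000', '10000000']
```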
arc 17:08:34
if you are very very clever, and if this works the way im starting to understand, then you could build a regex that recreates UTF-8 precisely such that the string value encoded by EXI would be precisely UTF-8
arc 17:08:56
such that if you encoded EXI byte-aligned, and you read the raw stream, you would find the UTF-8 encoded strings within
arc 17:22:16
it might not be possible but im fairly certain it is, because the bits in UTF-8 are always meaningful, you would just have to nest your atoms appropriately.
arc 17:27:14
but UTF-8 encoding is a hack for ascii backwards compatibility, i believe in almost every case you could craft a better one. which is kind of cool if you think about it, even with a limited dictionary, Zipf's Law will ensure extremely tight compression, and without the encryption concerns
arc 17:27:43
https://www.youtube.com/watch?v=fCn8zs912OE
Zash 17:30:15
Out of all Unicode related things, UTF-8 is the last thing I'd complain about
arc 17:34:48
oh im not complaining about UTF-8. i love UTF-8. but I can see now why UTF-32 was acceptable.
Zash 17:35:30
Why not UTF-64? Surely it'll be more efficient on modern machines ;)
arc 17:36:46
heh
arc 17:37:20
i think actually most uses of this would be about as fast as UTF-8 decoding
arc 18:52:56
a very simple regex could be something like """[\p{BasicLatin}|(he)|(se)|(re)|(hat)|.]*"""
arc 18:55:49
EXI usually follows schema semantics literally, so i would assume 3 bits would be used to determine whether its a chr(0:127), one of the four provided common word segments, or a full unicode character
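To make that guess concrete, here is a sketch that takes it literally. It uses a 3-bit selector, followed by 7 bits for a Basic Latin character, nothing for one of the four word segments, or a full code point otherwise. The selector numbering, the greedy longest-match strategy, and the 21-bit fallback width are all assumptions of this sketch. It is not taken from the EXI specification, and whether EXI supports this at all is still an open question in the discussion.

```python
SEGMENTS = ["he", "se", "re", "hat"]  # selectors 1..4, in pattern order
BY_LENGTH = sorted(SEGMENTS, key=len, reverse=True)

def encode(text):
    """Encode text as space-separated bit groups under the assumed scheme."""
    groups = []
    i = 0
    while i < len(text):
        # Prefer a multi-character segment, longest first.
        for seg in BY_LENGTH:
            if text.startswith(seg, i):
                groups.append(format(SEGMENTS.index(seg) + 1, "03b"))
                i += len(seg)
                break
        else:
            cp = ord(text[i])
            if cp < 128:
                groups.append("000")              # selector 0: Basic Latin
                groups.append(format(cp, "07b"))  # 7-bit character value
            else:
                groups.append("101")              # selector 5: any other character
                groups.append(format(cp, "021b")) # assumed 21-bit code point
            i += 1
    return " ".join(groups)

print(encode("The"))  # 000 1010100 001
```

Under these assumptions "The" takes 13 bits instead of the 24 bits of its UTF-8 form.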
arc 18:57:21
"The" would then be encoded as "000 1010100 001"