XSF Discussion - 2017-03-07


  1. SamWhited

    This is kind of nifty if true: https://twitter.com/Midar3/status/839059229289943041

  2. SamWhited

    (TL;DR — libstrophe is listed in the Nintendo Switch's open source license credits)

  3. jonasw

    libstrophe. nice :)

  4. Ge0rG

    and somebody didn't bother enough to rotate that picture.

  5. Tobias

    heh

  6. jonasw

    Ge0rG: you can rotate your screen, can’t you :P

  7. Ge0rG

    jonasw: I tried, but it turned out to be attached to the laptop body.

  8. jonasw

    xrandr --output $OUTPUT --rotate left

  9. Ge0rG

    alternative version: I did rotate it, but then the desktop manager autorotated it back.

  10. SamWhited

    Tangentially related: I didn't realize Jack Moffitt worked for Mozilla or was in charge of Servo these days; that's fantastic. I wish he'd revamp libstrophe in Rust.

  11. SamWhited

    or maybe I did realize that, since I apparently follow him on a bunch of Rust stuff, but didn't realize he was the same person who wrote libstrophe.

  12. Tobias

    SamWhited, similar thing with some of Cisco's original jabber devs :)

  13. Zash

    I wouldn't really expect Mozilla to do anything with XMPP.

  14. Tobias

    they probably went there for non-XMPP related things

  15. SamWhited

    yah, I once got asked at a conference by someone on the Hello team (or whatever that short-lived Firefox messenger was) what the point of XMPP or using standards was; I dunno about the rest of Mozilla, but I more or less gave up on the Firefox team then and there.

  16. Tobias

    Zash, although they should do more with it

  17. jonasw

    SamWhited: wtf

  18. SamWhited

    I think his exact words were "why would anyone bother using standards?"

  19. jonasw

    wtf

  20. jonasw

    have they even SEEN internet explorer?

  21. SamWhited

    Granted, I doubt he's representative of the rest of the Firefox people given their involvement with all things web-standards related; maybe that was just the Hello team.

  22. Tobias

    jonasw, come on...with web assembler you can finally render your IE6 pages the way they are supposed to look on every platform :)

  23. jonasw

    Tobias: you make me sad.

  24. Zash

    I wonder if they still remember that "The Internet" does not equal "The Web"

  25. Zash

    SamWhited: https://www.mozilla.org/en-US/about/manifesto/#principle-06

  26. SamWhited

    Zash: Hah, I should have just pointed him at that; thanks.

  27. jonasw

    gahaha

  28. jonasw

    slap ’em in the face with that manifesto

  29. Tobias

    Zash, nah...just use Hello

  30. jonasw

    those are the same people who didn’t fight (enough) against WebDRM

  31. SamWhited

    They fought a lot, they just didn't win.

  32. Tobias

    Zash, or allo https://twitter.com/burnflare/status/838966485011685376 :)

  33. SamWhited

    I don't think it's fair to say they didn't fight enough; they were against it all the way through.

  34. jonasw

    SamWhited: okay

  35. jonasw

    I admit I haven’t followed it in detail, but from what I saw in news coverage it didn’t seem too great.

  36. arc

    wow, i am just now realizing how much work there is to be done

  37. arc

    anyone here touched xml regex?

  38. Tobias

    XML regex?

  39. arc

    yes

  40. Zash

    arc: Wait for it

  41. jonasw

    what the heck is XML regex

  42. Zash prepares for the obligatory Zalgo reference

  43. SamWhited

    I'm about to have to flip my table, aren't I?

  44. arc

    http://www.xmlschemareference.com/regularExpression.html

  45. jonasw

    https://stackoverflow.com/a/1732454/1248008

  46. Tobias

    SamWhited, oh you have some of those new flipping desks

  47. jonasw

    SamWhited: relevant for tableflip: https://www.youtube.com/watch?v=eob7V_WtAVg

  48. arc

    its a method to constrain acceptable string values within xml

  49. SamWhited

    Tobias: Nah, they went for the sit-stand ones, but wouldn't spring for the flipping ones

  50. jonasw

    arc: seems like a regex variant used by XML?

  51. arc

    im still reading into it, but yes. thankfully its a simplified variant

  52. arc

    EXI uses it for strings.

  53. Zash

    Why and where?

  54. arc

    all strings.

  55. arc

    CH notably

  56. arc

    without it, an unconstrained string in the schema is transmitted as an unsigned int per char representing the unicode codepoint. no UTF-8
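
For reference, a minimal sketch of what "an unsigned int per char" looks like on the wire. It assumes the EXI 1.0 Unsigned Integer layout as I read the spec (7 value bits per octet, least significant group first, high bit set while more octets follow); the function names are made up for illustration and the string's length prefix is left out.

```python
def exi_unsigned_int(value):
    """Encode a non-negative integer as 7-bit groups, least significant first.

    The high bit of each octet is set when another octet follows.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # last group, high bit clear
            return bytes(out)


def exi_string_chars(text):
    """Encode each code point of text as an unsigned integer (no length prefix)."""
    return b"".join(exi_unsigned_int(ord(ch)) for ch in text)


# Compare octet counts against UTF-8 for a few sample characters.
for sample in ("A", "é", "€", "😀"):
    print(sample, len(exi_string_chars(sample)), len(sample.encode("utf-8")))
```

Under these assumptions the per-character cost is one octet up to U+007F, two up to U+3FFF and three for everything else in Unicode, so it is not a fixed four bytes per character.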

  57. Zash

    Shouldn't most strings fit in either enums or user-provided strings with no restrictions?

  58. arc

    oh yes. but you'd be insane to do so

  59. Zash

    What

  60. Zash

    No UTF-8?

  61. arc

    no UTF-8

  62. Zash

    KILL IT WITH FIRE

  63. arc

    not as far as ive found. admittedly ive just started

  64. arc

    that was my gut reaction too. however the more i read into this, the more i understand why.

  65. jonasw

    why would one want to do that?

  66. SamWhited

    > representing the unicode codepoint

  67. SamWhited

    Yup, now my desk and things are all over the floor. Saw it coming.

  68. Zash

    So, UTF-32

  69. jonasw

    SamWhited: did you do it with the excellent stare of Alan Rickman?

  70. arc

    ... yes. but as i said, you'd be insane to not use the regex to restrict the character map

  71. SamWhited

    jonasw: No, I am nowhere near that fantastic; that was amazing.

  72. jonasw

    arc: so assuming this is used for standard desktop clients, I either have to restrict what codepoints users can use or the text is blown up to factor two to four of the bytes needed with UTF-8?

  73. arc

    unfortunately the EXI spec doesn't go into deep detail on this, it refers to other documentation on xml regex, but it appears with bitpacked encoding you can compress it down a lot better than UTF-8

  74. arc

    jonasw: nope. you can craft a method to support the entire breadth of unicode in a much tighter format than UTF-8 because you're no longer constrained to byte boundaries.

  75. jonasw

    mhm

  76. Zash

    Like, huffman code

  77. arc

    i wouldn't go that far with it.

  78. arc

    im trying to track down whether a codepoint can represent multi-character sequences now.

  79. arc

    i would not be surprised.

  80. arc

    unlike using DEFLATE tho, this would not be dynamic, but encoded as part of the schema.

  81. SamWhited

    Define "multi-character sequences?"

  82. arc

    i mean, you could allocate the values 128-255 to represent the 128 most common words in the english language

  83. arc

    i do not know if this is true yet or not.

  84. Zash

    Well, Hangul?

  85. SamWhited

    You could probably do that, you won't be able to do that for all languages though

  86. arc

    ive only been reading into this for the past 2 hours.

  87. SamWhited

    Not without canonicalizing inputs first

  88. jonasw

    SamWhited: but as far as I understand it, your client could choose a schema specialised for the locale you’re using

  89. arc

    SamWhited: since the client dictates the schema, the client could adjust this per selected language for the user

  90. jonasw

    wee, I understood EXI \o/

  91. arc

    also, there's nothing stopping you from using 9 bits

  92. arc

    im just commenting that we have a shortcoming in the XEP schemas. strings can and should be validated

  93. arc

    also this could be extremely useful for client-side data forms validation

  94. SamWhited

    But let's say one of those words is "café"; is that caf + Unicode character LATIN SMALL LETTER E WITH ACUTE (U+00E9) or cafe + Unicode character COMBINING ACUTE ACCENT (U+0301)?

  95. Zash

    Are there really strings that are not logically enums, while being user controlled?

  96. arc

    before 8:30am this morning i wasnt aware that XML regex even existed, so my understanding is still very crude, and further what subset of this applies to EXI

  97. SamWhited

    Unicode provides ways to do canonicalization of things like that, you'd just have to make sure you were doing it before building the string table and to any words you compare against the string table
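
The "café" problem and the fix SamWhited describes can be sketched with Python's standard `unicodedata` module. The word table here is a hypothetical stand-in for whatever dictionary a schema or client would build; the choice of NFC rather than NFD is also just an assumption, since either works as long as both sides agree.

```python
import unicodedata

composed = "caf\u00e9"     # "caf" + LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # "cafe" + COMBINING ACUTE ACCENT


def canonical(text):
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


# Hypothetical word table: normalize every entry when building it...
word_table = {canonical(word): index
              for index, word in enumerate(["the", decomposed])}


def lookup(word):
    """...and normalize every word before comparing against the table."""
    return word_table.get(canonical(word))


print(composed == decomposed)                        # the raw strings differ
print(canonical(composed) == canonical(decomposed))  # the normalized forms match
print(lookup(composed), lookup(decomposed))          # both spellings hit the same entry
```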

  98. arc

    SamWhited: *IF* this supports multi-character sequences, and not simply constraining which unicode codepoints are acceptable in regex format, then its whatever is defined in the schema. but this is entirely separate from the string table provided by EXI.

  99. Zash

    That way lies madness

  100. Zash points in the general direction of Unicode

  101. SamWhited

    Oh, I thought this had something to do with the string table. Either way, if you're searching for things, you'll need to do canonicalization too if you want to actually find things (since the same thing may be encoded differently on different machines, but be the same as far as the user is concerned)

  102. arc

    SamWhited: yes, on the machine side this is evaluated to a unicode codepoint per character in any case.

  103. arc

    goodbye char*

  104. arc

    goodbye stdlib

  105. arc

    goodnight #import "string.h"

  106. jonasw

    hey, where will I get my memset from, arc? ;)

  107. arc

    jonasw: heh

  108. Zash

    jonasw: dd in=/dev/mem of=/dev/mem start=x count=y

  109. Zash

    or was it seek=

  110. jonasw

    *blink*

  111. jonasw

    I’m done for today.

  112. arc

    LOL

  113. jonasw

    Zash: also, that’s memcpy, not memset.

  114. arc

    Ok I'm now officially over G+

  115. arc

    having the alert bubble show a new message while on a google search, and see people arguing implied consent for sexual penetration by the TSA by their choice to fly... break time.

  116. jonasw

    wat

  117. arc

    https://plus.google.com/+JohnWarthog9Hawley/posts/LEvErfQnajc

  118. Zash

    off topic much?

  119. arc

    that's the problem with G+

  120. arc

    it shows up in google searches for unrelated things.

  121. Tobias

    i'm sure that's not standard behavior

  122. arc

    anyway. yes im starting to suspect that the way this works, multi-character sequences can be implied, but it might be even more devious. more like a smartphone dictionary predictor

  123. arc

    if you have both the letter "c" and a number of whole words that "c" could be grouped properly, you could resolve whole words in a minimum number of bits. and that can be optimized by the client in the chosen schema

  124. jonasw

    arc: so basically the string is encoded by the states of a regex automaton which gets the string fed as input?

  125. arc

    I think so.

  126. arc

    actually i should go back to what i did in the early days with this work, grab the reference implementation and try some things on it, then read the bits

  127. jonasw

    clever and devious at the same time

  128. arc

    you might even be able to, if you are very clever, recreate UTF-8 using an XML regex.

  129. arc

    that's not even work, that'd be pure joy for some weekend.

  130. Zash

    Wat

  131. arc

    well remember that the top bit of UTF-8 determines whether its a 1-byte or multi-byte sequence. and if the first byte has bit 128 set, then the next byte will have the top two bits set appropriately to show a continuation, etc
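
A small sketch of the UTF-8 bit layout being described, for anyone who wants to see the lead and continuation bits. The helper names are illustrative only; the bit patterns themselves are standard UTF-8.

```python
def utf8_bits(ch):
    """Return the UTF-8 octets of a character as 8-bit binary strings."""
    return [format(octet, "08b") for octet in ch.encode("utf-8")]


def sequence_length(lead):
    """Work out how long a UTF-8 sequence is from its first octet."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx: single octet (ASCII)
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx: one continuation octet follows
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx: two continuation octets follow
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx: three continuation octets follow
        return 4
    raise ValueError("not a valid UTF-8 lead octet")


def is_continuation(octet):
    """Continuation octets always look like 10xxxxxx."""
    return octet & 0xC0 == 0x80


for ch in ("A", "é", "€"):
    print(ch, utf8_bits(ch))
```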

  132. arc

    if you are very very clever, and if this works the way im starting to understand, then you could build a regex that recreates UTF-8 precisely such that the string value encoded by EXI would be precisely UTF-8

  133. arc

    such that if you encoded EXI byte-aligned, and you read the raw stream, you would find the UTF-8 encoded strings within

  134. arc

    it might not be possible but im fairly certain it is, because the bits in UTF-8 are always meaningful, you would just have to nest your atoms appropriately.

  135. arc

    but UTF-8 encoding is a hack for ascii backwards compatibility, i believe in almost every case you could craft a better one. which is kind of cool if you think about it, even with a limited dictionary, Zipf's Law will ensure extremely tight compression, and without the encryption concerns

  136. arc

    https://www.youtube.com/watch?v=fCn8zs912OE

  137. Zash

    Out of all Unicode related things, UTF-8 is the last thing I'd complain about

  138. arc

    oh im not complaining about UTF-8. i love UTF-8. but I can see now why UTF-32 was acceptable.

  139. Zash

    Why not UTF-64? Surely it'll be more efficient on modern machines ;)

  140. arc

    heh

  141. arc

    i think actually most uses of this would be about as fast as UTF-8 decoding

  142. arc

    a very simple regex could be something like """[\p{BasicLatin}|(he)|(se)|(re)|(hat)|.]*"""

  143. arc

    EXI usually follows schema semantics literally, so i would assume 3 bits would be used to determine whether its a chr(0:127), one of the four provided common word segments, or a full unicode character
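
A rough sketch of the 3-bit selector scheme arc is guessing at here. This is not confirmed EXI behavior: the selector values, the 7-bit and 21-bit payload widths, and the greedy matching are all assumptions made for illustration, and the pattern is read as an alternation between Basic Latin characters, the four word segments, and any other character.

```python
# Hypothetical selector assignment: 000 = Basic Latin character,
# 001..100 = one of the four word segments, 101 = any other character.
SEGMENTS = ["he", "se", "re", "hat"]
BASIC_LATIN = 0b000
OTHER = 0b101


def encode(text):
    """Encode text as space-separated bit groups under the assumed scheme."""
    groups = []
    i = 0
    while i < len(text):
        # Prefer a whole segment when one matches at this position.
        for selector, segment in enumerate(SEGMENTS, start=1):
            if text.startswith(segment, i):
                groups.append(format(selector, "03b"))
                i += len(segment)
                break
        else:
            codepoint = ord(text[i])
            if codepoint < 128:
                groups.append(format(BASIC_LATIN, "03b"))
                groups.append(format(codepoint, "07b"))   # 7 bits cover chr(0:127)
            else:
                groups.append(format(OTHER, "03b"))
                groups.append(format(codepoint, "021b"))  # 21 bits cover all of Unicode
            i += 1
    return " ".join(groups)


print(encode("The"))
print(encode("there"))
```

Under these assumptions "The" costs 13 bits, against 24 bits for the same three characters in UTF-8.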

  144. arc

    "The" would then be encoded as "000 1010100 001"