XSF Discussion - 2020-09-14


  1. MattJ

    Sigh

  2. MattJ

    I think for XEP-0335 I received conflicting feedback :)

  3. MattJ

    One request was to drop all mention of character encoding and anything below the "stream of XML characters" layer, and the other was a request to add details of characters that need to be XML-escaped

  4. flow

    the former seems sensible

  5. flow

    converting an arbitrary unicode string into its xml representation is not something you should cover in an arbitrary xep

  6. flow

    it's something the xmpp-core RFC covers by saying "we put XML encoded in UTF-8" on the wire

  7. Zash

    What about arbitrary Unicode? JSON is weird and requires that to be escaped to \uXXXX IIRC

  8. flow

    it would be different if we talk about codepoints that are not allowed in XML 1.0

  9. MattJ

    and XML forbids some characters (even escaped) that JSON allows

  10. MattJ

    so I don't think it's as simple as "don't mention it"

  11. Zash

    And probably in UTF-16 surrogate pair mode

  12. flow

    Zash, for those the escaping happens on the JSON layer, before the XMPP library sees it and transforms the unicode string to XML, no?

  13. Zash

    Sure, yeah

  14. flow

    MattJ, are those characters kept in their natural representation in JSON, or in the escaped representation?

  15. Zash

    MattJ: hm?

  16. flow

    if the latter, then we are fine, if the former, then you have to either define your custom escaping scheme, or convert the json string to base64

  17. Zash

    Bring out the Venn diagrams!

  18. MattJ

    https://mail.jabber.org/pipermail/standards/2019-February/035796.html

  19. flow

    Zash, how does the chosen unicode encoding (UTF-8, UTF-16) matter?

  20. Zash

    Does it say somewhere that an encoded JSON thing is ASCII?

  21. flow

    isn't json, just like xml, at first only a sequence of unicode code points?

  22. Zash

    Aren't JS (and hence JSON) strings by definition UTF-16?

  23. flow

    but then they are also a sequence of unicode codepoints

  24. Zash

    But the encoded JSON is either clean ASCII or ???

  26. flow

    I guess the encoded JSON is whatever encoding you agreed on to exchange the sequence of codepoints

  27. Zash

    > JSON syntax describes a sequence of Unicode code points. http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

  28. flow

    it appears to me that json strings must contain e.g. control characters in their escaped form only

  29. flow

    All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

  30. flow

    RFC 7159 § 7

  31. flow

    so if you feed a json string to an xml library, the xml library should never see e.g. U+0010, but only the sequence of codepoints that composes the escape sequence of U+0010

  32. Zash

    Seems so

  33. Zash

    But waqas refers to https://tools.ietf.org/html/rfc7159

  34. Zash

    Why have one definition of JSON when there can be three?

  35. MattJ

    The XEP refers to https://tools.ietf.org/html/rfc4627 which was obsoleted by 7159, I guess I'll update that

  36. Zash

    Or .. twentyeleven.. probably about as many as there are JSON implementations

  37. Zash

    https://tools.ietf.org/html/rfc7159#section-7 > unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

  38. Zash

    Huh, where did I see the thing about surrogate pairs then?

  39. MattJ

    If someone wants to tell me what characters are unrepresentable in XML, please add to the list thread I just revived and I can include it in the next revision

  40. Ge0rG

    there are characters that can't be represented in XML?

  41. MattJ

    According to Waqas. I don't see them at a glance

  42. MattJ

    JSON characters, that is

  43. MattJ

    Some control characters are unrepresentable in XML 1.0

  44. MattJ

    It looks to me like they are also forbidden in JSON

  45. Link Mauve

    Seems like they have a Wikipedia article: https://en.wikipedia.org/wiki/Valid_characters_in_XML

  46. flow

    MattJ, forbidden as in, they have to be escaped in JSON, right?

  47. MattJ

    Yes

  48. MattJ

    or

  49. MattJ looks at the RFC again

  50. flow

    I guess you have to check if any of those characters

  51. flow

    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

  52. flow

    is invalid/forbidden in XML 1.0

  53. MattJ

    Yes, I don't see anything forbidding escaped things in JSON

  54. flow

    Any character may be escaped.

  56. flow

    says RFC 7159 (the current JSON RFC it appears)

  57. flow

    so a naive but safe json to xml converter simply escapes every character ;)

  58. flow

    this erratum, although rejected, may be of relevance: https://www.rfc-editor.org/errata/eid3984

  59. jonas’

    MattJ, note that XML forbids control characters, even in escaped form

  60. jonas’

    not that it matters for the JSON usecase

  61. jonas’

    (here, "escaped form" means hex-entity-encoding)

  62. MattJ

    jonas’: I know, that's why I chose "unrepresentable"

  63. stpeter

    FYI, we just received our GSoC payment. Thanks to everyone who contributed to this year's summer of code!
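
The check flow proposes in messages 50 to 52 (which code points matched by the RFC 7159 `unescaped` rule are invalid in XML 1.0) can be sketched in a few lines of Python. This is only an illustration, not text from XEP-0335: the function names are made up for this sketch, and the XML side assumes the `Char` production of XML 1.0 (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF).

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 `Char` production (assumed as stated above)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_json_unescaped(cp: int) -> bool:
    """True if the code point matches RFC 7159 `unescaped = %x20-21 / %x23-5B / %x5D-10FFFF`."""
    return cp in (0x20, 0x21) or 0x23 <= cp <= 0x5B or 0x5D <= cp <= 0x10FFFF


def problem_ranges() -> list:
    """Ranges of code points that JSON allows unescaped but XML 1.0 cannot carry."""
    ranges = []
    start = None
    for cp in range(0x110000):
        bad = is_json_unescaped(cp) and not is_xml10_char(cp)
        if bad and start is None:
            start = cp
        elif not bad and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, 0x10FFFF))
    return ranges


def make_xml_safe(json_text: str) -> str:
    """Hypothetical helper: rewrite XML-invalid code points as JSON \\uXXXX escapes.

    In valid JSON such code points can only occur inside strings, so the
    replacement keeps the JSON meaning. XML escaping of &, < and > is still
    left to the XML library.
    """
    return "".join(
        ch if is_xml10_char(ord(ch)) else "\\u%04x" % ord(ch)
        for ch in json_text
    )


if __name__ == "__main__":
    for lo, hi in problem_ranges():
        print(f"U+{lo:04X}..U+{hi:04X}")
    # prints:
    # U+D800..U+DFFF
    # U+FFFE..U+FFFF
```

Under these assumptions the only overlap is the surrogate block plus U+FFFE and U+FFFF. The control characters below U+0020 do not show up because JSON already requires them to be escaped, so the XML layer only ever sees the ASCII escape sequence.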