jdev - 2023-03-01

  1. pep.

    Is there any testing harness (big word) for JID validation?

  2. MattJ

    Nothing universal. Prosody's tests are at https://hg.prosody.im/trunk/file/tip/spec/util_jid_spec.lua and it's been quite a while since anyone found bugs in it...

  3. edhelas

    Wondering how we can have test suites across languages

  4. MattJ

    Defined input and output formats, not unheard of

  5. edhelas

    I'm currently looking to have a proper RFC 6122 support in PHP

  6. Zash

    pep., MattJ: nothing universal and every time it is brought up someone NIHs a new format

  7. MattJ

    Might as well just publish some CSV or whatever with inputs and expected results, something boringly simple and easy to parse - then people can figure out how to get that into whatever framework they use
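
A minimal sketch of the "boringly simple CSV" idea. It assumes a hypothetical two-column layout (input JID, expected result) that nobody in the room actually agreed on; the column names and sample rows are made up. Python's `csv` module handles the quoting, so a comma inside a JID part does not break the format:

```python
import csv
import io

# Hypothetical test-vector file: column names and rows are illustrative only.
SAMPLE = (
    "input,expected\n"
    '"user@example.com/a,b",valid\n'
    "@example.com,invalid\n"
)


def load_vectors(text):
    """Return a list of (input, expected) pairs parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [(row["input"], row["expected"]) for row in reader]


vectors = load_vectors(SAMPLE)
for jid, expected in vectors:
    print(repr(jid), expected)
```

Each project could then feed the pairs into whatever test framework it already uses.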

  8. Zash

    MattJ: I might even have started but got bored when yet another custom format was proposed instead

  9. pep.

    "," isn't valid in any JID part?

  10. pep.

    I'd have two files, so that isn't an issue

  11. Kev

    I thought it was valid in resources, but haven't checked.

  12. Link,Mauve

    It is valid in resources.

  13. pep.

    You demonstrated at best it's valid in Prosody MUCs :P

  14. edhelas

    Movim accepts it as well, as there is no proper validation at all :p

  15. nicoco

    what would be super great for me is a "random UTF8 junk to valid resource part" converter. some legacy services allow control chars it seems. Or maybe it's slixmpp that is not permissive enough? Is "×͜× " a valid resource?

  16. Zash

    nicoco, resourceprep says "maybe"

  17. Link Mauve

    nicoco, it isn’t.

  18. Link Mauve

    (I just tried, and poezio showed no error… Will fix.)

  19. Link Mauve

    nicoco, you might get some issue with multiple “random UTF-8 junk” mapping to the same resourcepart.

  20. nicoco

    my "nickname cleaner" right now is `"".join(x for x in nickname if x in string.printable) + " [renamed by slidge]"`, but for this nickname for instance, it turns it to " [renamed by slidge]", which is not very satisfying
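
The behaviour nicoco describes can be reproduced with a small sketch. The function name `clean_nickname` and the second sample nickname are assumptions for illustration; only the join expression comes from the message above. Because `string.printable` is ASCII-only, any all-non-ASCII nickname collapses to the bare suffix, which is also where the collision risk comes from:

```python
import string

SUFFIX = " [renamed by slidge]"


def clean_nickname(nickname):
    """Drop every character not in string.printable, then append the suffix."""
    return "".join(x for x in nickname if x in string.printable) + SUFFIX


junk_a = "\u00d7\u035c\u00d7"  # the nickname from the chat, without its trailing space
junk_b = "\u2603\u2603"        # an unrelated, hypothetical all-non-ASCII nickname

print(repr(clean_nickname(junk_a)))
print(clean_nickname(junk_a) == clean_nickname(junk_b))
```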

  21. Link Mauve

    For instance, Link Mauve and Link Mauve do map to the same resource after resourceprep.
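
One way two different-looking inputs can end up identical is Unicode normalization. The sketch below shows only the NFKC step that resourceprep includes, not the full profile, and the pair of nicknames (ASCII space versus no-break space) is an assumed example, since the log does not preserve which characters actually differed:

```python
import unicodedata


def nfkc(s):
    """Apply NFKC normalization only; this is not a full resourceprep."""
    return unicodedata.normalize("NFKC", s)


a = "Link Mauve"        # ASCII space (U+0020)
b = "Link\u00a0Mauve"   # no-break space (U+00A0), assumed for illustration

print(a == b)              # the raw strings differ
print(nfkc(a) == nfkc(b))  # but they normalize to the same string
```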

  22. Beherit

    (In the xsf muc I suggested to discuss whether we should consider Google Season of Docs for writing XEPs. https://developers.google.com/season-of-docs/docs/get-started?hl=en xmpp:xsf@muc.xmpp.org?join )

  23. nicoco

    and also, yes Link Mauve, you're right, I'm risking collisions

  24. Link Mauve

    If the legacy protocol you map to considers those two different users, here be dragons.

  25. nicoco

    well, they are 2 different users, but in MUCs, even non anonymous, the nickname is supposedly also a unique identifier

  26. nicoco

    actually I was dropping by to ask whether it made any sense to use "XEP-0421: Anonymous unique occupant identifiers for MUCs" in *non*anonymous mucs.

  27. Zash

    nicoco, where everyone sees real JIDs? not much value in it, but I suppose it doesn't hurt for the server to do it anyway

  28. singpolyma

    > "," isn't valid in any JID part? Why wouldn't it be? Almost everything is allowed in localpart

  29. nicoco

    in fact, AFAIU, none of the legacy services I map right now even have the concept of anonymous groups. but I suspect some clients are not going to allow retractions without it anyway. I'd rather avoid adding it if it doesn't make sense. more code = more trouble. ^^

  30. Zash

    reference to the C in CSV? it has quoting, so not a problem

  31. MattJ

    I almost wrote TSV (which I prefer), but it's less of a standard

  32. MattJ

    CSV is a standard because there are so many variants to choose from

  33. nicoco

    is that the definition of a standard? something that you have many variants to choose from?

  34. Zash

    I think I'd just cry a tear and suggest JSON

  35. MattJ

    I almost suggested JSON, but that's not ideal if you want to test various unicode things

  36. Zash

    Is that how ... was it flow? ended up with some newline-based custom format?

  37. singpolyma

    JSON is required to be utf8, no? So Unicode should be no issue

  38. Zash

    singpolyma, no, it's UTF-16

  39. MattJ

    No, and the Prosody tests include invalid UTF-8

  40. singpolyma

    Zash: I think you're talking about the escape sequences?

  41. MattJ

    So anything required to be valid unicode is not suitable, unless you add additional encoding

  42. singpolyma

    UTF16 encoded json is not a thing

  43. Zash

    Sure, escapes that use surrogate pairs and stuff. Depending on your JSON library.

  44. Zash

    Lua is obviously the best format :)

  45. Zash

    Binary safe, descendant of an actual data description format :)

  46. singpolyma

    MattJ: ah, so the tests assume utf8 decoder but contain raw binary?

  47. singpolyma

    What we did for the dhall tests was folders with input in one file and output in another

  48. singpolyma

    No format, just use the filesystem

  49. MattJ

    Prosody's tests don't assume anything, as Lua isn't unicode-aware so you can put most things literally in strings. That's not always sensible due to editors and stuff, though.

  50. MattJ

    So for some things we apply hex or base64
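
The "additional encoding" point can be sketched like this: a JSON document cannot carry invalid UTF-8 directly, but it can carry the bytes base64-encoded. The field names `input_b64` and `expected` are hypothetical and not taken from Prosody's tests:

```python
import base64
import json

# 0xFF can never appear in well-formed UTF-8, so this input is not valid text.
raw = b"user\xff@example.com"

# Hypothetical vector layout: field names are made up for this sketch.
doc = json.dumps({
    "input_b64": base64.b64encode(raw).decode("ascii"),
    "expected": "invalid",
})


def is_valid_utf8(data):
    """Return True if the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


decoded = base64.b64decode(json.loads(doc)["input_b64"])
print(decoded == raw, is_valid_utf8(decoded))
```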

  51. Zash

    Soooooooooooooo we're doing another round of format bikeshedding?

  53. singpolyma

    Zash: no, we're just talking :)

  54. singpolyma

    I'm just here because someone said commas might not be allowed in jid and I freaked out ;)

  55. pep.

    "Zash> Is that how ... was it flow? ended up with some newline-based custom format?" yeah fwiw I would do one line per entry, on two separate files. Not much parsing required, very little chance for confusion..

  56. pep.

    singpolyma, not what I said, I asked indeed because of CSV

  57. Zash

    pep., but how do you test jids with newlines???

  58. pep.

    Is that a thing?

  59. Zash

    no, but how do you verify that your library rejects it? :)

  61. Zash throws CBOR in the ring

  62. Zash

    or why not netstrings?

  63. singpolyma

    I mean, people can use whatever works for them :)

  64. Zash

    Hah! The thing I had started used XML

  65. Zash

    <?xml version="1.0" encoding="UTF-8"?>
    <tests type="invalid">
      <test>
        <jid>node@/server</jid>
      </test>
      <test>
        <jid>@server</jid>
      </test>
      ...

  66. Zash

    and XSLT for turning into unit tests
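
As an alternative to XSLT, a file in that shape could also be read directly by a test runner. This is a sketch using Python's standard library; the sample document is a closed-off version of the two entries quoted above, not the real file Zash started:

```python
import xml.etree.ElementTree as ET

# Closed-off sample built from the two entries quoted in the chat.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tests type="invalid">
  <test><jid>node@/server</jid></test>
  <test><jid>@server</jid></test>
</tests>"""

root = ET.fromstring(SAMPLE)
kind = root.get("type")
jids = [test.findtext("jid") for test in root.iter("test")]

for jid in jids:
    print(kind, repr(jid))
```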

  67. wurstsalat

    seems like the obvious choice, given the standard this room is about

  68. singpolyma

    Means inventing a format and can't do invalid utf8 in the general case, but obviously it's an option

  69. moparisthebest

    Why not just a jid per line and a valid.txt and an invalid.txt

  71. pep.

    moparisthebest, I proposed that. And that means you can't test newlines in your jids

  72. moparisthebest

    JIDs shouldn't have newlines ;)

  73. moparisthebest

    But fine, separate lines with \0

  74. singpolyma

    Or just use one file per input one per output and you don't need a format at all. So many options!
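
The NUL-separated variant might look like the sketch below. Working on bytes rather than text means newlines, and even invalid UTF-8, can appear inside an entry; the sample entries are made up:

```python
def split_nul(data):
    """Split a bytes blob into entries separated (or terminated) by NUL."""
    entries = data.split(b"\0")
    # A trailing NUL terminates the last entry rather than starting an empty one.
    if entries and entries[-1] == b"":
        entries.pop()
    return entries


# Hypothetical file content: the second entry contains a newline.
blob = b"user@example.com\0line\nbreak@example.com\0"

for entry in split_nul(blob):
    print(repr(entry))
```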

  75. MattJ

    It's not just "valid" or "invalid" though - Prosody's tests extensively test correct splitting, which many clients/libraries have got wrong in the past

  76. pep.

    splitting? the 3 parts?

  77. singpolyma

    Rather than standard tests I'd rather see well tested libraries, ideally.

  78. singpolyma

    Most libraries right now accept almost any random crap in their jid "parser"

  79. Zash

    Isn't the goal here to have common test data and test all the libraries at the same time?

  80. MattJ

    Exactly. That's part of the reason libraries aren't sufficiently tested... because without shared test cases, every project just writes their own and inevitably misses some

  81. MattJ

    If they write any at all

  82. Zash

    Something like what exists for JSON and Markdown libraries ... but of course I couldn't find those sites now

  83. flow

    MattJ, fwiw, the valid jids are tested for proper splitting

  84. MattJ

    in Smack?

  85. flow

    no in jxmpp (which is used by Smack)

  86. MattJ

    Right, sure

  87. MattJ

    I'm not commenting on any individual implementation

  88. pep.

    I guess flow meant there is no need to test splitting separately?

  89. MattJ

    Just saying I agree that a common set of test cases would be beneficial

  91. MattJ

    I brought up splitting because having a list of "invalid JIDs" and a list of "valid JIDs" is not sufficient for testing a JID parser
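
To illustrate why splitting needs its own expected outputs, here is a sketch of a splitter that cuts at the first "/" before looking for the first "@", which is the order RFC 7622 describes. It does no validation or normalization at all, and `split_jid` is a name invented for this example:

```python
def split_jid(jid):
    """Split a JID into (localpart, domainpart, resourcepart).

    Missing parts are returned as None. This only splits; it does not
    validate or normalize anything.
    """
    # The resourcepart is everything after the FIRST "/", so it may itself
    # contain "/" and "@".
    bare, slash, resource = jid.partition("/")
    local, at, domain = bare.partition("@")
    if not at:
        local, domain = None, bare
    return (local, domain, resource if slash else None)


for sample in ["user@example.com/res", "example.com/foo@bar", "user@example.com/a/b@c"]:
    print(sample, "->", split_jid(sample))
```

A parser that looks for "@" first would wrongly treat `example.com/foo` as the localpart in the second sample, which a plain valid/invalid list would never reveal.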

  92. pep.

    Anyway I'm happy with whatever people have. Maybe label tests so that they can be run separately?

  93. MattJ

    and that's one of the solutions that was proposed

  94. flow

    ahh ok, the list of valid JIDs in jxmpp corpus also consists of the expected split parts

  95. MattJ

    Exactly. In what format? :)

  96. flow

    the grammar is defined in https://github.com/igniterealtime/jxmpp/blob/master/jxmpp-strings-testframework/src/main/resources/xmpp-strings/jids/valid/main#L13-L20

  97. flow

    basically using control chars to separate the parts

  98. pep.

    I guess that's easily convertible to another format anyway?

  99. flow

    which yields the nice property that it's still a simple text file that can hold the corpus, while you do not need to escape anything
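
The general idea of control-character separators can be sketched as follows. This is not jxmpp's actual grammar (see the linked file for that); the choice of the ASCII unit separator and record separator, the field order, and the sample entries are all assumptions for illustration:

```python
US = "\x1f"  # ASCII unit separator, assumed here to sit between fields
RS = "\x1e"  # ASCII record separator, assumed here to end each entry


def parse_corpus(text):
    """Parse entries of the form jid US local US domain US resource RS."""
    entries = []
    for record in text.split(RS):
        if not record:
            continue
        jid, local, domain, resource = record.split(US)
        # In this sketch an empty field means the part is absent.
        entries.append((jid, local or None, domain, resource or None))
    return entries


corpus = (
    US.join(["user@example.com/a,b", "user", "example.com", "a,b"]) + RS
    + US.join(["example.com", "", "example.com", ""]) + RS
)
entries = parse_corpus(corpus)
for entry in entries:
    print(entry)
```

Since neither separator is likely to be wanted inside a JID, commas, quotes and newlines need no escaping.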

  100. flow

    sure, transformations are possible, but I wonder if there is a better format. but I am happy to hear the ideas

  101. flow

    the format jxmpp's jid corpus uses is trivially parsable

  102. pep.

    Is that the corpus?

  103. pep.

    Or just an example

  104. flow

    the two files are the currently existing corpus

  105. pep.

    I see

  106. flow

    I have some invalid JIDs scraped from openfire (courtesy of Guus) that I need to add

  107. flow

    but since every JID is checked with 4 different "stringprep" implementations, it is a bit of work to add them to the corpus. because you first have to play protocol lawyer and decide if it's a valid jid or not, and then mask the non-conforming implementations

  108. pep.

    That doesn't seem too hard to use in xmpp-rs/jid

  109. flow

    it shouldn't be, I am sure there is a decent PEG parser for rust

  110. pep.

    yeah yeah

  112. pep.

    it's twice faster than nom.

  113. pep.

    (Sorry, private joke on #rust-fr)

  114. flow

    pff, inside jokes :)

  115. pep.

    It's because pest, a peg parser in Rust, had a graph on their web page a while back showing how much better it was than other Rust parsers, and it was totally bonkers. #Benchmarks

  116. pep.

    And the nom dev is a regular in #rust-fr so that was the joke

  117. flow

    I would be happy if the jid corpus had a size where parsing speed would be of consideration :)

  118. pep.

    flow, is this correct? Corpus → Entry*; Entry → Jid* | CommentLine*

  119. pep.

    Shouldn't Entry be Jid | CommentLine ? (without the *)

  121. pep.

    So that there is something to parse

  122. pep.

    Since Corpus is already Entry*

  123. flow

    is it currently simply saying that an entry consists of either potentially multiple Jid or CommentLine entries?

  124. flow

    so the * suffix for Jid and CommentLine is not strictly required, but also technically not wrong

  125. flow

    or am I missing something?

  126. flow

    (it's been a while since I wrote this…)

  127. pep.

    Ah I thought * was 0+ not 1+

  128. pep.

    Yeah it's not wrong

  130. pep.

    hmm, wait, an entry consists of multiple jids or comments?

  131. pep.

    I'd say at most one jid or at most one comment?

  132. pep.

    I mean, I think this is what it should be

  133. moparisthebest

    I'm using afl to generate stanza/nonza test cases, could do the same for JIDs

  134. pep.

    https://gitlab.com/xmpp-rs/xmpp-rs/-/commit/7fece526332dfe4d32c1f1989349fbc17e6018c3 That's only the valid parser. Untested.

  135. pep.

    afk nao