-
pep.
Is there any testing harness (big word) for JID validation?
-
MattJ
Nothing universal. Prosody's tests are at https://hg.prosody.im/trunk/file/tip/spec/util_jid_spec.lua and it's been quite a while since anyone found bugs in it...
-
edhelas
Wondering how we can have test suites across languages
-
MattJ
Defined input and output formats, not unheard of
-
edhelas
I'm currently looking to have proper RFC 6122 support in PHP
-
Zash
pep., MattJ: nothing universal and every time it is brought up someone NIHs a new format
-
MattJ
Might as well just publish some CSV or whatever with inputs and expected results, something boringly simple and easy to parse - then people can figure out how to get that into whatever framework they use
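A minimal sketch of what such a file could look like and how a Python harness might consume it. The column names (`input`, `valid`, `local`, `domain`, `resource`) and the sample rows are hypothetical, not an existing corpus, and the verdicts in the rows are illustrative only. The `csv` module's quoting is what lets a comma appear inside a JID.

```python
import csv
import io

# Hypothetical test vectors: column names and rows are illustrative only.
SAMPLE = '''input,valid,local,domain,resource
juliet@example.com/balcony,yes,juliet,example.com,balcony
"juliet@example.com/a,b",yes,juliet,example.com,"a,b"
@example.com,no,,,
'''

def load_vectors(text):
    """Parse CSV test vectors into a list of dicts, one per JID."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["valid"] = row["valid"] == "yes"  # turn the yes/no column into a bool
        rows.append(row)
    return rows

vectors = load_vectors(SAMPLE)
for v in vectors:
    print(repr(v["input"]), v["valid"])
```

Each project would then feed `input` to its own parser and compare the result with the remaining columns.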
-
Zash
MattJ: I might even have started but got bored when yet another custom format was proposed instead
-
pep.
"," isn't valid in any JID part?
-
pep.
I'd have two files, so that isn't an issue
-
Kev
I thought it was valid in resources, but haven't checked.
-
Link,Mauve
It is valid in resources.
-
pep.
You demonstrated at best it's valid in Prosody MUCs :P
-
edhelas
Movim accepts it as well, as there is no proper validation at all :p
-
nicoco
what would be super great for me is a "random UTF8 junk to valid resource part" converter. some legacy services allow control chars it seems. Or maybe it's slixmpp that is not permissive enough? Is "×͜× " a valid resource?
-
Zash
nicoco, resourceprep says "maybe"
-
Link Mauve
nicoco, it isn’t.
-
Link Mauve
(I just tried, and poezio showed no error… Will fix.)
-
Link Mauve
nicoco, you might get some issue with multiple “random UTF-8 junk” mapping to the same resourcepart.
-
nicoco
my "nickname cleaner" right now is `"".join(x for x in nickname if x in string.printable) + " [renamed by slidge]"`, but for this nickname for instance, it turns it to " [renamed by slidge]", which is not very satisfying
-
Link Mauve
For instance, Link Mauve and Link Mauve do map to the same resource after resourceprep.
-
Beherit
(In the xsf muc I suggested to discuss whether we should consider Google Season of Docs for writing XEPs. https://developers.google.com/season-of-docs/docs/get-started?hl=en xmpp:xsf@muc.xmpp.org?join )
-
nicoco
and also, yes Link Mauve, you're right, I'm risking collisions
-
Link Mauve
If the legacy protocol you map to considers those two different users, here be dragons.
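One way to reduce the collision risk is to append a short digest of the original nickname, so that two different inputs that clean to the same text still end up different. This is a sketch only: `clean_nickname` is a hypothetical helper, it does not run resourceprep or PRECIS (so the result is not guaranteed to be a valid resourcepart), and the 6-hex-digit suffix and the `anonymous` fallback are arbitrary assumptions.

```python
import hashlib
import unicodedata

def clean_nickname(nickname: str) -> str:
    """Drop characters in Unicode "C*" categories and add a digest suffix."""
    # Keep everything whose general category is not Cc, Cf, Cs, Co or Cn.
    kept = "".join(
        ch for ch in nickname if not unicodedata.category(ch).startswith("C")
    ).strip()
    # Digest of the *original* nickname keeps distinct inputs distinct
    # (up to hash collisions) even when the cleaned text is identical.
    digest = hashlib.sha256(nickname.encode("utf-8", "surrogatepass")).hexdigest()[:6]
    return f"{kept or 'anonymous'} [{digest}]"

a = clean_nickname("Link\x00Mauve")
b = clean_nickname("Link\x01Mauve")
print(a)
print(b)
```

The two inputs clean to the same visible text but carry different suffixes, at the cost of less readable nicknames.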
-
nicoco
well, they are 2 different users, but in MUCs, even non anonymous, the nickname is supposedly also a unique identifier
-
nicoco
actually I was dropping by to ask whether it made any sense to use "XEP-0421: Anonymous unique occupant identifiers for MUCs" in *non*anonymous mucs.
-
Zash
nicoco, where everyone sees real JIDs? not much value in it, but I suppose it doesn't hurt for the server to do it anyway
-
singpolyma
> "," isn't valid in any JID part? Why wouldn't it be? Almost everything is allowed in localpart
-
nicoco
in fact, AFAIU, none of the legacy services I map right now even have the concept of anonymous groups. but I suspect some clients are not going to allow retractions without it anyway. I'd rather avoid adding it if it doesn't make sense. more code = more trouble. ^^
-
Zash
reference to the C in CSV? it has quoting, so not a problem
-
MattJ
I almost wrote TSV (which I prefer), but it's less of a standard
-
MattJ
CSV is a standard because there are so many variants to choose from
-
nicoco
is that the definition of a standard? something that you have many variants to choose from?
-
Zash
I think I'd just cry a tear and suggest JSON
-
MattJ
I almost suggested JSON, but that's not ideal if you want to test various unicode things
-
Zash
Is that how ... was it flow? ended up with some newline-based custom format?
-
singpolyma
JSON is required to be utf8, no? So Unicode should be no issue
-
Zash
singpolyma, no, it's UTF-16
-
MattJ
No, and the Prosody tests include invalid UTF-8
-
singpolyma
Zash: I think you're talking about the escape sequences?
-
MattJ
So anything required to be valid unicode is not suitable, unless you add additional encoding
-
singpolyma
UTF16 encoded json is not a thing
-
Zash
Sure, escapes that use surrogate pairs and stuff. Depending on your JSON library.
-
Zash
Lua is obviously the best format :)
-
Zash
Binary safe, descendant of an actual data description format :)
-
singpolyma
MattJ: ah, so the tests assume utf8 decoder but contain raw binary?
-
singpolyma
What we did for the dhall tests was folders with input in one file and output in another
-
singpolyma
No format, just use the filesystem
-
MattJ
Prosody's tests don't assume anything, as Lua isn't unicode-aware so you can put most things literally in strings. That's not always sensible due to editors and stuff, though.
-
MattJ
So for some things we apply hex or base64
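To illustrate the "additional encoding" point: a JSON string has to be valid Unicode, so a test input containing invalid UTF-8 needs to be wrapped, for example in base64. A minimal sketch; the field names `input_b64` and `valid` are hypothetical and not taken from Prosody's tests.

```python
import base64
import json

raw = b"juliet@example.com/\xff\xfe"  # bytes that are not valid UTF-8

# Encoding: wrap the raw bytes so they survive a JSON round trip.
record = {"input_b64": base64.b64encode(raw).decode("ascii"), "valid": False}
text = json.dumps(record)

# Decoding: the harness gets the exact original bytes back.
decoded = base64.b64decode(json.loads(text)["input_b64"])

# Confirm the sample really is undecodable as UTF-8.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

print(text, is_utf8)
```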
-
Zash
Soooooooooooooo we're doing another round of format bikeshedding?
-
singpolyma
Zash: no, we're just talking :)
-
singpolyma
I'm just here because someone said commas might not be allowed in jid and I freaked out ;)
-
pep.
"Zash> Is that how ... was it flow? ended up with some newline-based custom format?" yeah fwiw I would do one line per entry, on two separate files. Not much parsing required, very little chance for confusion..
-
pep.
singpolyma, not what I said, I asked indeed because of CSV
-
Zash
pep., but how do you test jids with newlines???
-
pep.
Is that a thing?
-
Zash
no, but how do you verify that your library rejects it? :)
-
pep.
Indeed
- Zash throws CBOR in the ring
-
Zash
or why not netstrings?
-
singpolyma
I mean, people can use whatever works for them :)
-
Zash
Hah! The thing I had started used XML
-
Zash
<?xml version="1.0" encoding="UTF-8"?> <tests type="invalid"> <test> <jid>node@/server</jid> </test> <test> <jid>@server</jid> </test> ...
-
Zash
and XSLT for turning into unit tests
-
wurstsalat
seems like the obvious choice, given the standard this room is about
-
singpolyma
Means inventing a format and can't do invalid utf8 in the general case, but obviously it's an option
-
moparisthebest
Why not just a jid per line and a valid.txt and an invalid.txt
-
Zash
https://github.com/igniterealtime/jxmpp/tree/master/jxmpp-strings-testframework/src/main/resources/xmpp-strings/jids
-
pep.
moparisthebest, I proposed that. And that means you can't test newlines in your jids
-
moparisthebest
JIDs shouldn't have newlines ;)
-
moparisthebest
But fine, separate lines with \0
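A sketch of the NUL-separated variant, assuming a hypothetical corpus file whose entries are delimited by a `\0` byte. Working on bytes means an entry may contain newlines or invalid UTF-8, though of course not a NUL itself.

```python
def split_entries(data: bytes) -> list[bytes]:
    """Split a NUL-delimited corpus into entries, ignoring a trailing delimiter."""
    entries = data.split(b"\0")
    if entries and entries[-1] == b"":
        entries.pop()
    return entries

# Hypothetical file content: the second entry contains a newline.
corpus = b"@example.com\0juliet@exam\nple.com\0"
entries = split_entries(corpus)
print(entries)
```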
-
singpolyma
Or just use one file per input one per output and you don't need a format at all. So many options!
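The one-file-per-case layout can be sketched like this. The `.in`/`.out` suffixes and the content of the expected-output file are assumptions for illustration, not the dhall test suite's actual naming.

```python
import tempfile
from pathlib import Path

def load_cases(root: Path):
    """Yield (name, input_bytes, expected_bytes) for each *.in / *.out pair."""
    for in_file in sorted(root.glob("*.in")):
        out_file = in_file.with_suffix(".out")
        if out_file.exists():
            yield in_file.stem, in_file.read_bytes(), out_file.read_bytes()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical case: raw input in one file, expected result in another.
    (root / "basic.in").write_bytes(b"juliet@example.com/balcony")
    (root / "basic.out").write_bytes(b"juliet\nexample.com\nbalcony")
    cases = list(load_cases(root))

print(cases)
```

Because both files are read as raw bytes, the input can hold anything, including newlines and invalid UTF-8.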
-
MattJ
It's not just "valid" or "invalid" though - Prosody's tests extensively test correct splitting, which many clients/libraries have got wrong in the past
-
pep.
splitting? the 3 parts?
-
singpolyma
Rather than standard tests I'd rather see well tested libraries, ideally.
-
singpolyma
Most libraries right now accept almost any random crap in their jid "parser"
-
Zash
Isn't the goal here to have common test data and test all the libraries at the same time?
-
MattJ
Exactly. That's part of the reason libraries aren't sufficiently tested... because without shared test cases, every project just writes their own and inevitably misses some
-
MattJ
If they write any at all
-
Zash
Something like what exists for JSON and Markdown libraries ... but of course I couldn't find those sites now
-
flow
MattJ, fwiw, the valid jids are tested for proper splitting
-
MattJ
in Smack?
-
flow
no in jxmpp (which is used by Smack)
-
MattJ
Right, sure
-
MattJ
I'm not commenting on any individual implementation
-
pep.
I guess flow meant there is no need to test splitting separately?
-
MattJ
Just saying I agree that a common set of test cases would be beneficial
-
flow
ack
-
MattJ
I brought up splitting because having a list of "invalid JIDs" and a list of "valid JIDs" is not sufficient for testing a JID parser
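To make the splitting point concrete, here is a naive splitter that follows the order described in RFC 7622: the resourcepart is everything after the first `/`, then the localpart is everything before the first `@` of what remains. It is a sketch: `split_jid` is a hypothetical helper that does no PRECIS/stringprep processing, length checks, or domainpart validation.

```python
from typing import Optional, Tuple

Parts = Tuple[Optional[str], str, Optional[str]]

def split_jid(jid: str) -> Optional[Parts]:
    """Return (localpart, domainpart, resourcepart), or None if a part is empty."""
    bare, slash, resource = jid.partition("/")  # resource: after the FIRST "/"
    local, at, domain = bare.partition("@")     # localpart: before the first "@"
    if not at:
        local, domain = None, bare              # no "@": the bare JID is a domain
    if (at and not local) or not domain or (slash and not resource):
        return None                             # a present-but-empty part is invalid
    return local, domain, (resource if slash else None)

print(split_jid("juliet@example.com/foo@bar"))
print(split_jid("example.com/a@b/c"))
print(split_jid("node@/server"))
print(split_jid("@server"))
```

The second input is the kind of case that a plain valid/invalid list cannot check: a parser that looks for `@` before `/` would wrongly report a localpart.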
-
pep.
Anyway I'm happy with whatever people have. Maybe label tests so that they can be run separately?
-
MattJ
and that's one of the solutions that was proposed
-
flow
ahh ok, the list of valid JIDs in jxmpp corpus also consists of the expected split parts
-
MattJ
Exactly. In what format? :)
-
flow
the grammar is defined in https://github.com/igniterealtime/jxmpp/blob/master/jxmpp-strings-testframework/src/main/resources/xmpp-strings/jids/valid/main#L13-L20
-
flow
basically using control chars to separate the parts
-
pep.
I guess that's easily convertible to another format anyway?
-
flow
which yields the nice property that it's still a simple text file that can hold the corpus, while you do not need to escape anything
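The idea of using control characters as separators can be sketched as follows. The separator choice here (ASCII unit separator `\x1f` between fields, record separator `\x1e` between entries) and the field order are assumptions for illustration. They are not claimed to match the jxmpp corpus grammar, which is defined in the file linked above.

```python
US = "\x1f"  # ASCII unit separator: between fields (assumed for this sketch)
RS = "\x1e"  # ASCII record separator: between entries (assumed for this sketch)

def parse_corpus(text: str):
    """Parse entries of the form: jid US local US domain US resource RS."""
    entries = []
    for record in text.split(RS):
        if not record:
            continue  # skip the empty chunk after the final RS
        jid, local, domain, resource = record.split(US)
        entries.append(
            {"jid": jid, "local": local, "domain": domain, "resource": resource}
        )
    return entries

# Hypothetical corpus with one entry; "/" and "@" need no escaping.
sample = US.join(["juliet@example.com/a@b/c", "juliet", "example.com", "a@b/c"]) + RS
parsed = parse_corpus(sample)
print(parsed)
```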
-
flow
sure, transformations are possible, but I wonder if there is a better format. but I am happy to hear the ideas
-
flow
the format jxmpp's jid corpus uses is trivially parsable
-
pep.
Is that the corpus?
-
pep.
Or just an example
-
flow
the two files are the currently existing corpus
-
pep.
I see
-
flow
I have some invalid JIDs scraped from openfire (courtesy of Guus) that I need to add
-
flow
but since every JID is checked with 4 different "stringprep" implementations, it is a bit of work to add them to the corpus. because you first have to play protocol lawyer and decide if it's a valid jid or not, and then mask the non-conforming implementations
-
pep.
That doesn't seem too hard to use in xmpp-rs/jid
-
flow
it shouldn't be, I am sure there is a decent PEG parser for rust
-
pep.
yeah yeah
-
pep.
it's twice faster than nom.
-
pep.
(Sorry, private joke on #rust-fr)
-
flow
pff, inside jokes :)
-
pep.
It's because pest, a peg parser in Rust, had a graph on their web page a while back showing how much better it was than other Rust parsers, and it was totally bonkers. #Benchmarks
-
pep.
And the nom dev is a regular in #rust-fr so that was the joke
-
flow
I would be happy if the jid corpus had a size where parsing speed would be of consideration :)
-
pep.
flow, is this correct? Corpus → Entry* Entry → Jid* | CommentLine*
-
pep.
Shouldn't Entry be Jid | CommentLine ? (without the *)
-
pep.
So that there is something to parse
-
pep.
Since Corpus is already Entry*
-
flow
is it currently simply saying that an entry consists of either potentially multiple Jid or CommentLine entries?
-
flow
so the * suffix for Jid and CommentLine is not strictly required, but also technically not wrong
-
flow
or am I missing something?
-
flow
(it's been a while since I wrote this…)
-
pep.
Ah I thought * was 0+ not 1+
-
pep.
Yeah it's not wrong
-
pep.
hmm, wait, an entry consists of multiple jids or comments?
-
pep.
I'd say at most one jid or at most one comment?
-
pep.
I mean, I think this is what it should be
-
moparisthebest
I'm using afl to generate stanza/nonza test cases, could do the same for JIDs
-
pep.
https://gitlab.com/xmpp-rs/xmpp-rs/-/commit/7fece526332dfe4d32c1f1989349fbc17e6018c3 That's only the valid parser. Untested.
-
pep.
afk nao