but i think it will keep the whole xml document in memory
lovetox
i think this is a problem .. i dont know how much data is exchanged between client and server, but i guess if a client runs for some days, that could be problematic
flow
lovetox, smack uses an xml pull parser but splits the top level stream elements, so that at most one is kept in memory
flow
also pull parser tend to not keep the whole document in memory
flow
since you typically can't rewind, the unreachable parts of the document can be dropped
jonas’
lovetox, xml.sax! :)
spectrumhas left
flow
IIRC pull/push parsers can operate on streams, unlike DOM based parsers
flow
of course, this may vary from parser implementation to parser implemention
flow
having said that some xml parsers can operate on streams, I believe that in XMPP it's best to not parse on the stream but to split the top-level stream elements, as it, amongst other things, allows you to enforce a top-level stream element size limit
jonas’
you can also enforce that with a parser which can tell you where in the stream it currently is
jonas’
splitting sounds like implementing half on an XML parser which I definitely cannot recommend you doing
jonas’
(I know that moparisthebest loves that kind of pain though :))
flow
My personal experience is that the developing the splitter implmentation was not painful
flow
but yes, it's basically an xml parser, or, i'd rather say, an xml state machine
flow
and yes, if the parser is able to tell you the current position in the stream, then you probably do not need to do that
flow
unfortunately, that was not true for in my case
Wojtekhas joined
Kevhas left
Kevhas joined
phikoenhas left
phikoenhas joined
spectrumhas joined
Kev
We more or less use a streaming dom parser, by using a stream parser and then constructing a pseudo-dom on individual stanzas. I imagine we're not alone in that.
PapaTutuWawahas joined
Sam
That's more or less what I do as well if I understood you correctly. It's all streaming until we need to do something with a stanza or an element, then the user has an option of parsing the whole element into a DOM or similar and operating on that instead of the streamed tokens.
lovetox
jonas’, of course i can use sax, but then i have to write code to build the xml elements myself
jonas’
or port to aioxmpp :-X
jonas’
lovetox, actually, no, lxml.etree has a SAX -> etree thing
jonas’
I use that in aioxmpp to capture unschematized elements
then you just need a shim thing which handles the stream header and then delegates to that
marmistrzhas left
marmistrzhas joined
goffihas joined
antranigvhas joined
kikuchiyohas left
pasdesushihas left
pasdesushihas joined
marmistrzhas left
goffihas left
lovetox
https://lxml.de/parsing.html#modifying-the-tree
jonas’
hurts to look at :)
lovetox
yeah lxml build the same pullparser and added a method to delete elements when not needed anymore
jonas’
I'd rather go the sax way where I can be sure of what's what
debaclehas left
atomicwatchhas left
debaclehas joined
moparisthebest
yea I tend to agree that sax parsers are painful, and it's better to split the stream into stanzas without parsing, and only parse individual stanzas as a whole/with a DOM parser if needed
antranigvhas left
antranigvhas joined
Zashhas left
zashhas joined
lovetox
i want a parser, that simply dispatches stanzas to my handlers. with that pullparser, i can just look for the "end" event, and at depth 1, i displatch the element that parser gives me.
lovetox
thats a xml parser in probably 50 lines of code
flow
lovetox, that sounds similar to what smack does
jonas’
lovetox, and hope that the delete actually works as you wish :)
flow
although what you really want to is something like jaxb, which matches the XML elements to types of your programming languages, with additional verification
lovetox
jonas’, i will test this now
flow
so much possibility to optimize the process, so little free time :)✎
lovetox
and also how vulnerable this parser is against attacks
flow
so much possibilities to optimize the process, so little free time :) ✏
lovetox
because there is zero settings here that i could configure to make it more secure
Zashhas joined
moparisthebest
lovetox, this is a stand-alone file that splits a stream on stanza boundaries without an xml parser, so you'd get notified of individual stanzas https://github.com/moparisthebest/xmpp-proxy/blob/master/src/stanzafilter.rs#L255
antranigvhas left
lovetox
yeah im not sure i should do this with python
lovetox
its probably slow
lovetox
but yeah i think this approach is cleaner
flow
given that its a client application, I doubt that this particular overhead will be an issue
pasdesushihas left
pasdesushihas joined
moparisthebest
I wouldn't think it'd be slow in python, there aren't any string copies or anything, it's just running a switch on 1 character at a time, surely faster than an XML parser
kikuchiyohas joined
stpeterhas joined
lovetox
but its additional, its not like i spare myself the xml parsing
kikuchiyohas left
moparisthebest
you spare yourself the streaming sax parsing
lovetox
Oo, maybe i missing something, all your filter is doing is cutting the stream into pieces and feeding it to the parser
lovetox
oh you mean i dont need to care then about sax events
lovetox
just give me the finished element after you parsed everything
atomicwatchhas joined
lovetox
that lxml lib is amazing, you can even register custom element classes, that will be used when it encounters a certain tag / namespace or event attribute combination
lovetox
this thing is made for xmpp
edhelas
lovetox funny, it's roughtly what I did for my XML library in Movim :D
This happens on each element['sub-element'] call, and is quite slow.
Link Mauve
I once started to work on a JIT for that, but never managed to make poezio start.
9lakeshas left
9lakeshas joined
antranigvhas left
goffihas joined
marmistrzhas left
dezanthas joined
lovetox
what is the opinion on validating stanzas against a schema?
lovetox
is this even possible?
lovetox
can xml schemas be "open" in they allow elements not defined in the schema
lovetox
and only validating MUST haves?
Link Mauve
In my experience, XML Schema was too poor to be used for both validation and extraction of the data to be more machine-usable.
Link Mauve
lovetox, yes, you can have <xs:any namespace='##other'/> in a schema.
Link Mauve
I’ve been trying to fix our schemas every time I encounter something invalid or not restricted enough, but they clearly aren’t used for validation (or their users never contributed back their fixes).
Link Mauve
In xmpp-parsers, I have drafted a DSL to represent the most common constructs we have in XMPP, but I translate manually from the XML Schema as well as from the XEP’s examples and text.
Link Mauve
And I still can’t represent everything.
me9has left
Link Mauve
It made me discover a whole bunch of invalid things people do in the wild.
Zash
DSL eh?
Link Mauve
And led to fixes in clients and servers.
Zash
I wonder if it maps to the OpenAPI JSON Schema XML stuff
marmistrzhas joined
pulkomandyhas joined
marmistrzhas left
nephelehas left
pulkomandyhas left
pulkomandyhas joined
jgarthas left
lovetox
Link Mauve, what i would like to do is just basic stuff like, checking that a presence type attribute is one of the allowed strings etc
right now i have to do this all in python, manually ✏
lovetox
that would allow me later to have much cleaner code
Zash
and if not, what would you do?
lovetox
i would only validate MUST
lovetox
so invalid type on a presence, ignore stanza
lovetox
or am i again to strict, and have to later add not failing on that because someone sends wrong values all the time :D
cyrilhas left
cyrilhas joined
Zash
And if you already have that in Python, do you really gain much by having it in XML instead?
lovetox
yes much cleaner code
Kevhas left
Kevhas joined
lovetox
now everywhere i have to be ready to catch exceptions, every attribute i query i have to be prepared for it to be None etc
lovetox
parsing 4 attributes into enums, is then 30 lines of code
lovetox
but true much is probably not possible to validate
lovetox
many things have implicit defaults, so i need to check afterwards anyway if its there or not
Link Mauve
lovetox, have a look at slixmpp or aioxmpp, they both already do so and in Python.
Link Mauve
They validate, and also make a more ergonomic API.
larma
I don't really see cleaner code. When you do pre-validation, you only safe the one "else" case for all invalid values which needs to throw an exception. In proper languages that's about one line of code, in python maybe three.
stpeterhas joined
jonas’
larma, catching obviously invalid stanzas early (and returning an error to the sender early) makes for cleaner code than doing it from random places all over the code
larma
jonas’, I was thinking we parse as early as possible to not carry around heavy lifting strings when an enum value would've been enough...
jonas’
yes, enumification is also a nice thing of doing things early✎
jonas’
yes, enumification is also a nice opportunity you get when doing things early ✏
Zash
Cleaner code... /me looks at https://hg.prosody.im/prosody-modules/file/58a20d5ac6e9/mod_rest/res/schema-xmpp.json
Zash
I made this monstrosity. Fear me!
malthehas joined
sonnyhas left
sonnyhas joined
dezanthas left
dezanthas joined
dezanthas left
dezanthas joined
sonnyhas left
sonnyhas joined
malthehas left
antranigvhas joined
nephelehas joined
antranigvhas left
stpeterhas left
flow
lovetox, you can validate schema and should, if it's not obviously broken, with the one exception that XML schemas IIRC do not allow unspecified elements and attributes. but in XMPP we allow those everywhere (I believe some may disagree with me about that, but IMHO it's the only sane thing to do)
Kevhas left
Kevhas joined
lovetox
flow so how do i validate then? if someone adds a element, my validation fails ..
lovetox
but its not invalid per xmpp definition
flow
so basically you can validate XML for which an schema exists, but not XML for which no schema exists. those should be simply ignored, as you obviously know nothing about them and did not negogiate it either
lovetox
but thats very extension unfriendly
flow
well if a schema specifies the 'priority' attribute to be unsignedByte, you can validate that
lovetox
i mean often XEPs start with someone just putting something additionally in there
Kevhas left
Kevhas joined
lovetox
flow i can only validate everything or nothing
lovetox
not only one attribute
flow
that was just an example ;)
flow
> lovetox> flow i can only validate everything or nothing
flow
I don't that that's true
flow
In fact I know that this is not true :)
flow
but maybe we are talking past each other
malthehas joined
Zash
Does XML schema validators not ignore unknown things?
flowgoes looking for a simple schema example from the XEP
flow
Zash, I don't think so no
flow
that's the one thing where we in XMPP vary from how XML schemas are validated
flow
IIRC you have to add some magic schema thingy in the elements schema to allow for arbitrary further child elements
Kevhas left
flow
that are not part of the schema
Kevhas joined
flow
and IIRC we mostly don't do that in our XMPP schemas
flow
lovetox, look for example at https://xmpp.org/extensions/xep-0203.html#schema
flow
surely you can validate stamp
flow
you can validate that it's value is in the correct format and that the required attribute actually exists
but let's assume I add a child element into <delay/>
flow
and you don't have a schema about it, it's purely optional and does not need to be negotiated
flow
then you cann still verify 'stamp', but not the child element✎
flow
then you can still verify 'stamp', but not the child element ✏
flow
so, it's not a all or nothing validation situation✎
flow
so, it's not an all or nothing validation situation ✏
flow
lovetox, does this help?
Link Mauve
flow, it would help in that case to add an <xs:any namespace='##other'/> in the delay element.
flow
sure, but let's just pretend that this is everywhere in our schemas
Zash
sprincle those everywhere 😕
Link Mauve
Please no. ^^'
Link Mauve
It makes validation (of the whole stanza) much harder.
flow
what makes validation much harder?
Link Mauve
In xmpp-parsers I add Option types for each known sub-element defined elsewhere, slixmpp doesn’t do that and instead keeps the extensibility everywhere, at the expense of performances.
Link Mauve
flow, let’s say I want to assert that delay won’t be extended, or it is an error.
Link Mauve
(For instance because I don’t support it to be extended.)
Link Mauve
Leaving an extension point in it will take more memory and void the type safety of it.
stpeterhas joined
Link Mauve
While with the current schema, we can take it as OOP languages use the final keyword.
flow
I think my general point is that in XMPP, unknown things should typically be simply ignored, as otherwise, the eXtensability in XMPP becomes unnecessary hard
flow
servers can still filter unknown elements/attributes if they determined that those have not be negoiated between sender and recipient (but I am not sure if this is feasible)
flow
unnecessary hard and even impossible in some situations
Link Mauve
I’m ok with that in the generic public network case, but for specific implementations it does make sense to reject extensions you don’t know about.
Link Mauve
Even if just for validation of other implementations.
flow
Link Mauve> flow, let’s say I want to assert that delay won’t be extended, or it is an error.
I may be misunderstanding the meaning of 'error' here, but I believe that mindest is harmful
flow
it shouldn't be an error, it should simply be ignored
flow
and it's fine if your native types don't allow the user to access those ignored elements/attributes
flow
Link Mauve, define 'reject' here
rafasaurushas left
Link Mauve
Either ignore the whole delay element and only account for the other payloads, or ignore the whole stanza and reply with an error, stuff like that.
flow
why would you igniore the whole delay element if it's perfectly fine otherwise?✎
flow
why would you ignore the whole delay element if it's perfectly fine otherwise? ✏
Kev
I would recommend more or less completely ignoring the schemas, personally.
Zash
when we say "validate", do we mean `validate(schema, stanza) : boolean` ?
flow
if someone send you additional data within the delay element, without obviously negoiating it, then the sender has to assume that you may ignore it (because he doesn't know if you understand it)
flow
Kev, in my personal experience, schemas in XEPs haven proven to be very helpful when implementing said XEP to clarify things that aren't clear from the text
malthehas left
flow
Zash, I think so, yes, the stanza is either valid or not
Zash
What would a client do with this boolean?
flow
the question is: what consitutes a valid stanza in the presence of unknown elements/attributes?
Kev
Fair enough. Having gone through some experience of people trying to use schemas for validating traffic, I feel confident in saying using them normatively can also lead to pain.
Zash
Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash
flow
Zash, configurable, could be ignoring the stanza completely
flow
Zash, define "invalid delay element" here?
Zash
Isn't what you actually want something that takes a schema and extracts the bits you care about, and ignores any undefined extras?
Zash
A data mapper, rather than a schema
Link Mauve
Zash, what I do currently is `validate(schema, RFC-parts-of-the-stanza; foreach payload in stanza: validate(schema, payload)`.
flow
is it like stamp='invalid'? or with an unknown child extension element?
Link Mauve
Although that’s not completely correct as it is a transformative operation, not just a validation.
Zash
and I don't see how XML schema is particularly helpful for making such a thing
Link Mauve
In the current example, I end up with a struct Delay { from: Jid, stamp: DateTime } for instance.
Kev
(And we *do* have a product that does validation of XMPP traffic across security boundaries, so I'm not saying "Don't do this", merely "Here there be dragons")
Link Mauve
And I also agree with Kev, ignore schemas (or extract information useful to you) and write your own DSL.
flow
Zash> Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash
I think it is safer to throw the whole message in the trash, as ignoring parts of the elements of an stanza could modify the stanzas semantics
flow
even though I am not aware of a concrete example
flow
but I rather be defensive per default
sonnyhas left
sonnyhas joined
Link Mauve
I’d really like to have a server plugin to validate every stream, and report issues it finds to the developers of the relevant violating implementations.
Zash
flow: So a new XEP comes along and now you can't read any messages anymore because they have a new tag?
flow
Zash, that's not what I had in mind
flow
It's not even what I tried to express
aramhas left
rafasaurushas joined
Zash
I may be too tired today for properly reading anything, sorry if I misunderstand
flow
trying to come up with a slightly related example
Sam
I read it that way too, fwiw *shrug*
flow
I do this on the fly so it is maybe not good
flow
if someone sends you an *invalid* stanza, e.g. an attribute which value 'foo' but it's specified to be an integer
flow
I think then you can only ignore it
flow
because the sender may wants to tell you something
Kev
Although I think it's been lost to the mists of time, we did have a DSL processor for Swiften to create the various parsers/serialisers/elements, but it wasn't XSD-based.
flow
but you have no idea what, hence you can only guess what it is, and that leads to issues, because you may have guessed wrong
Zash
not treat it as if that child tag or extension was simply gone?
flow
presense status with an invalid priority are probably a good example
Link Mauve
flow, then you have a choice: either you work around it and still try to interpret it somehow, carry this tribal knowledge forever, and laugh at other implementations which didn’t make the same choice or carry it there too; or you report it to them so that they can stop sending the invalid value.
flow
Zash, that's the tricky question
Link Mauve
I’m firmly in the latter camp.
flow
as I said, I could imagine that the semantic of a stanza is made up of multiple of its child elements
Zash
priority is a core thing, so uh
Kev
There's definitely, to my mind, a difference between unexpected, and expected-but-invalid content.
flow
and then, if you ignore the existence of one of the child elements, the semantic may be different
Link Mauve
Zash, yet I’ve received invalid ones, IIRC it was Gajim which didn’t validate what the user could input there.
stpeterhas left
Link Mauve
It allowed numbers fewer than -128 or bigger than 127.
flow
assume a client has set the highest priority 99999
flow
he may then assumes that messages are routing according to this pirority
flow
but servers may cap it?
Link Mauve
Someone sent me a priority 500, my program rightfully rejected the stanza, I reported it and it got fixed.
flow
or servers may treat it as 0?
Link Mauve
So no one has to handle that case.
Link Mauve
(flow, highest priority is 127.)
flow
Link Mauve, I know :)
flow
but the client wo set it to 99999 didn't
Link Mauve
Right, so exactly my example. ^^
flow
in the end, I believe it's the best for the open source community ecosystem of XMPP if implementations validate and reject invalid stanzas as a whole, and emit a visible error message in such a case, ideally pointing to the sending entity, with an request to report an issue to the vendor
Link Mauve
+1
jubalhhas left
pulkomandyhas left
pulkomandyhas joined
flow
Link Mauve, :)
flow
Link Mauve, btw, is there an equivalent of <xs:any namespace='##other'/> for attributes?
Link Mauve
Yes, <xs:anyAttribute namespace="##other"/>.
Link Mauve
We already use both in our XEPs.
Link Mauve
But AFAIK we still disallow namespaced attributes.
Link Mauve
There is still no XEP for an entity to advertise it properly supports XML namespaces.
flow
not sure if we officially disallow them, it just seems that some seem to be afraid of them
Link Mauve
Or that yeah.
stpeterhas joined
qrpnxzhas left
qrpnxzhas joined
Kev
We definitely have guidance not to use them, I'm just not sure where.