-
lovetox
on the topic of xml parsers
-
lovetox
python has a xmlpullparser
-
lovetox
but i think it will keep the whole xml document in memory
-
lovetox
i think this is a problem .. i dont know how much data is exchanged between client and server, but i guess if a client runs for some days, that could be problematic
-
flow
lovetox, smack uses an xml pull parser but splits the top level stream elements, so that at most one is kept in memory
-
flow
also pull parser tend to not keep the whole document in memory
-
flow
since you typically can't rewind, the unreachable parts of the document can be dropped
-
jonas’
lovetox, xml.sax! :)
-
flow
IIRC pull/push parsers can operate on streams, unlike DOM based parsers
-
flow
of course, this may vary from parser implementation to parser implemention
-
flow
having said that some xml parsers can operate on streams, I believe that in XMPP it's best to not parse on the stream but to split the top-level stream elements, as it, amongst other things, allows you to enforce a top-level stream element size limit
-
jonas’
you can also enforce that with a parser which can tell you where in the stream it currently is
-
jonas’
splitting sounds like implementing half on an XML parser which I definitely cannot recommend you doing
-
jonas’
(I know that moparisthebest loves that kind of pain though :))
-
flow
My personal experience is that the developing the splitter implmentation was not painful
-
flow
but yes, it's basically an xml parser, or, i'd rather say, an xml state machine
-
flow
and yes, if the parser is able to tell you the current position in the stream, then you probably do not need to do that
-
flow
unfortunately, that was not true for in my case
-
Kev
We more or less use a streaming dom parser, by using a stream parser and then constructing a pseudo-dom on individual stanzas. I imagine we're not alone in that.
-
Sam
That's more or less what I do as well if I understood you correctly. It's all streaming until we need to do something with a stanza or an element, then the user has an option of parsing the whole element into a DOM or similar and operating on that instead of the streamed tokens.
-
lovetox
jonas’, of course i can use sax, but then i have to write code to build the xml elements myself
-
jonas’
or port to aioxmpp :-X
-
jonas’
lovetox, actually, no, lxml.etree has a SAX -> etree thing
-
jonas’
I use that in aioxmpp to capture unschematized elements
-
jonas’
or used it anyway, I don't think I still do.
-
lovetox
thats exactly what i need
-
jonas’
lovetox, https://lxml.de/sax.html#building-a-tree-from-sax-events
-
jonas’
then you just need a shim thing which handles the stream header and then delegates to that
-
lovetox
https://lxml.de/parsing.html#modifying-the-tree
-
jonas’
hurts to look at :)
-
lovetox
yeah lxml build the same pullparser and added a method to delete elements when not needed anymore
-
jonas’
I'd rather go the sax way where I can be sure of what's what
-
moparisthebest
yea I tend to agree that sax parsers are painful, and it's better to split the stream into stanzas without parsing, and only parse individual stanzas as a whole/with a DOM parser if needed
-
lovetox
i want a parser, that simply dispatches stanzas to my handlers. with that pullparser, i can just look for the "end" event, and at depth 1, i displatch the element that parser gives me.
-
lovetox
thats a xml parser in probably 50 lines of code
-
flow
lovetox, that sounds similar to what smack does
-
jonas’
lovetox, and hope that the delete actually works as you wish :)
-
flow
although what you really want to is something like jaxb, which matches the XML elements to types of your programming languages, with additional verification
-
lovetox
jonas’, i will test this now
-
flow
so much possibility to optimize the process, so little free time :)✎ -
lovetox
and also how vulnerable this parser is against attacks
-
flow
so much possibilities to optimize the process, so little free time :) ✏
-
lovetox
because there is zero settings here that i could configure to make it more secure
-
moparisthebest
lovetox, this is a stand-alone file that splits a stream on stanza boundaries without an xml parser, so you'd get notified of individual stanzas https://github.com/moparisthebest/xmpp-proxy/blob/master/src/stanzafilter.rs#L255
-
lovetox
yeah im not sure i should do this with python
-
lovetox
its probably slow
-
lovetox
but yeah i think this approach is cleaner
-
flow
given that its a client application, I doubt that this particular overhead will be an issue
-
moparisthebest
I wouldn't think it'd be slow in python, there aren't any string copies or anything, it's just running a switch on 1 character at a time, surely faster than an XML parser
-
lovetox
but its additional, its not like i spare myself the xml parsing
-
moparisthebest
you spare yourself the streaming sax parsing
-
lovetox
Oo, maybe i missing something, all your filter is doing is cutting the stream into pieces and feeding it to the parser
-
lovetox
oh you mean i dont need to care then about sax events
-
lovetox
just give me the finished element after you parsed everything
-
lovetox
that lxml lib is amazing, you can even register custom element classes, that will be used when it encounters a certain tag / namespace or event attribute combination
-
lovetox
this thing is made for xmpp
-
edhelas
lovetox funny, it's roughtly what I did for my XML library in Movim :D
-
edhelas
https://github.com/movim/movim/tree/master/lib/moxl#payload
-
lovetox
hm parsing 1 million xml elements and creating custom python classes for it, in 2 seconds
-
lovetox
xml parsing is indeed not the bottleneck in a gui app
-
Link Mauve
Want me to test on my server? o:)
-
Link Mauve
In poezio it is very much the bottleneck.
-
Link Mauve
Sadly.
-
lovetox
why that?
-
Link Mauve
Due to the way slixmpp converts DOM-like structs into XMPP-specific structs.
-
flow
Link Mauve, please tell use more :)
-
flow
maybe with an example code
-
Link Mauve
https://lab.louiz.org/poezio/slixmpp/-/blob/master/slixmpp/xmlstream/stanzabase.py#L672 mostly.
-
Link Mauve
This happens on each element['sub-element'] call, and is quite slow.
-
Link Mauve
I once started to work on a JIT for that, but never managed to make poezio start.
-
lovetox
what is the opinion on validating stanzas against a schema?
-
lovetox
is this even possible?
-
lovetox
can xml schemas be "open" in they allow elements not defined in the schema
-
lovetox
and only validating MUST haves?
-
Link Mauve
In my experience, XML Schema was too poor to be used for both validation and extraction of the data to be more machine-usable.
-
Link Mauve
lovetox, yes, you can have <xs:any namespace='##other'/> in a schema.
-
Link Mauve
I’ve been trying to fix our schemas every time I encounter something invalid or not restricted enough, but they clearly aren’t used for validation (or their users never contributed back their fixes).
-
Link Mauve
In xmpp-parsers, I have drafted a DSL to represent the most common constructs we have in XMPP, but I translate manually from the XML Schema as well as from the XEP’s examples and text.
-
Link Mauve
And I still can’t represent everything.
-
Link Mauve
It made me discover a whole bunch of invalid things people do in the wild.
-
Zash
DSL eh?
-
Link Mauve
And led to fixes in clients and servers.
-
Zash
I wonder if it maps to the OpenAPI JSON Schema XML stuff
-
lovetox
Link Mauve, what i would like to do is just basic stuff like, checking that a presence type attribute is one of the allowed strings etc
-
lovetox
right now i have to do this all in python✎ -
lovetox
right now i have to do this all in python, manually ✏
-
lovetox
that would allow me later to have much cleaner code
-
Zash
and if not, what would you do?
-
lovetox
i would only validate MUST
-
lovetox
so invalid type on a presence, ignore stanza
-
lovetox
or am i again to strict, and have to later add not failing on that because someone sends wrong values all the time :D
-
Zash
And if you already have that in Python, do you really gain much by having it in XML instead?
-
lovetox
yes much cleaner code
-
lovetox
now everywhere i have to be ready to catch exceptions, every attribute i query i have to be prepared for it to be None etc
-
lovetox
parsing 4 attributes into enums, is then 30 lines of code
-
lovetox
but true much is probably not possible to validate
-
lovetox
many things have implicit defaults, so i need to check afterwards anyway if its there or not
-
Link Mauve
lovetox, have a look at slixmpp or aioxmpp, they both already do so and in Python.
-
Link Mauve
They validate, and also make a more ergonomic API.
-
larma
I don't really see cleaner code. When you do pre-validation, you only safe the one "else" case for all invalid values which needs to throw an exception. In proper languages that's about one line of code, in python maybe three.
-
jonas’
larma, catching obviously invalid stanzas early (and returning an error to the sender early) makes for cleaner code than doing it from random places all over the code
-
larma
jonas’, I was thinking we parse as early as possible to not carry around heavy lifting strings when an enum value would've been enough...
-
jonas’
yes, enumification is also a nice thing of doing things early✎ -
jonas’
yes, enumification is also a nice opportunity you get when doing things early ✏
-
Zash
Cleaner code... /me looks at https://hg.prosody.im/prosody-modules/file/58a20d5ac6e9/mod_rest/res/schema-xmpp.json
-
Zash
I made this monstrosity. Fear me!
-
flow
lovetox, you can validate schema and should, if it's not obviously broken, with the one exception that XML schemas IIRC do not allow unspecified elements and attributes. but in XMPP we allow those everywhere (I believe some may disagree with me about that, but IMHO it's the only sane thing to do)
-
lovetox
flow so how do i validate then? if someone adds a element, my validation fails ..
-
lovetox
but its not invalid per xmpp definition
-
flow
so basically you can validate XML for which an schema exists, but not XML for which no schema exists. those should be simply ignored, as you obviously know nothing about them and did not negogiate it either
-
lovetox
but thats very extension unfriendly
-
flow
well if a schema specifies the 'priority' attribute to be unsignedByte, you can validate that
-
lovetox
i mean often XEPs start with someone just putting something additionally in there
-
lovetox
flow i can only validate everything or nothing
-
lovetox
not only one attribute
-
flow
that was just an example ;)
-
flow
> lovetox> flow i can only validate everything or nothing
-
flow
I don't that that's true
-
flow
In fact I know that this is not true :)
-
flow
but maybe we are talking past each other
-
Zash
Does XML schema validators not ignore unknown things?
- flow goes looking for a simple schema example from the XEP
-
flow
Zash, I don't think so no
-
flow
that's the one thing where we in XMPP vary from how XML schemas are validated
-
flow
IIRC you have to add some magic schema thingy in the elements schema to allow for arbitrary further child elements
-
flow
that are not part of the schema
-
flow
and IIRC we mostly don't do that in our XMPP schemas
-
flow
lovetox, look for example at https://xmpp.org/extensions/xep-0203.html#schema
-
flow
surely you can validate stamp
-
flow
you can validate that it's value is in the correct format and that the required attribute actually exists
-
flow
<delay xmlns='urn:xmpp:delay' from='juliet@capulet.com/balcony' stamp='2002-09-10T23:41:07Z'/>
-
flow
but let's assume I add a child element into <delay/>
-
flow
and you don't have a schema about it, it's purely optional and does not need to be negotiated
-
flow
then you cann still verify 'stamp', but not the child element✎ -
flow
then you can still verify 'stamp', but not the child element ✏
-
flow
so, it's not a all or nothing validation situation✎ -
flow
so, it's not an all or nothing validation situation ✏
-
flow
lovetox, does this help?
-
Link Mauve
flow, it would help in that case to add an <xs:any namespace='##other'/> in the delay element.
-
flow
sure, but let's just pretend that this is everywhere in our schemas
-
Zash
sprincle those everywhere 😕
-
Link Mauve
Please no. ^^'
-
Link Mauve
It makes validation (of the whole stanza) much harder.
-
flow
what makes validation much harder?
-
Link Mauve
In xmpp-parsers I add Option types for each known sub-element defined elsewhere, slixmpp doesn’t do that and instead keeps the extensibility everywhere, at the expense of performances.
-
Link Mauve
flow, let’s say I want to assert that delay won’t be extended, or it is an error.
-
Link Mauve
(For instance because I don’t support it to be extended.)
-
Link Mauve
Leaving an extension point in it will take more memory and void the type safety of it.
-
Link Mauve
While with the current schema, we can take it as OOP languages use the final keyword.
-
flow
I think my general point is that in XMPP, unknown things should typically be simply ignored, as otherwise, the eXtensability in XMPP becomes unnecessary hard
-
flow
servers can still filter unknown elements/attributes if they determined that those have not be negoiated between sender and recipient (but I am not sure if this is feasible)
-
flow
unnecessary hard and even impossible in some situations
-
Link Mauve
I’m ok with that in the generic public network case, but for specific implementations it does make sense to reject extensions you don’t know about.
-
Link Mauve
Even if just for validation of other implementations.
-
flow
Link Mauve> flow, let’s say I want to assert that delay won’t be extended, or it is an error. I may be misunderstanding the meaning of 'error' here, but I believe that mindest is harmful
-
flow
it shouldn't be an error, it should simply be ignored
-
flow
and it's fine if your native types don't allow the user to access those ignored elements/attributes
-
flow
Link Mauve, define 'reject' here
-
Link Mauve
Either ignore the whole delay element and only account for the other payloads, or ignore the whole stanza and reply with an error, stuff like that.
-
flow
why would you igniore the whole delay element if it's perfectly fine otherwise?✎ -
flow
why would you ignore the whole delay element if it's perfectly fine otherwise? ✏
-
Kev
I would recommend more or less completely ignoring the schemas, personally.
-
Zash
when we say "validate", do we mean `validate(schema, stanza) : boolean` ?
-
flow
if someone send you additional data within the delay element, without obviously negoiating it, then the sender has to assume that you may ignore it (because he doesn't know if you understand it)
-
flow
Kev, in my personal experience, schemas in XEPs haven proven to be very helpful when implementing said XEP to clarify things that aren't clear from the text
-
flow
Zash, I think so, yes, the stanza is either valid or not
-
Zash
What would a client do with this boolean?
-
flow
the question is: what consitutes a valid stanza in the presence of unknown elements/attributes?
-
Kev
Fair enough. Having gone through some experience of people trying to use schemas for validating traffic, I feel confident in saying using them normatively can also lead to pain.
-
Zash
Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash
-
flow
Zash, configurable, could be ignoring the stanza completely
-
flow
Zash, define "invalid delay element" here?
-
Zash
Isn't what you actually want something that takes a schema and extracts the bits you care about, and ignores any undefined extras?
-
Zash
A data mapper, rather than a schema
-
Link Mauve
Zash, what I do currently is `validate(schema, RFC-parts-of-the-stanza; foreach payload in stanza: validate(schema, payload)`.
-
flow
is it like stamp='invalid'? or with an unknown child extension element?
-
Link Mauve
Although that’s not completely correct as it is a transformative operation, not just a validation.
-
Zash
and I don't see how XML schema is particularly helpful for making such a thing
-
Link Mauve
In the current example, I end up with a struct Delay { from: Jid, stamp: DateTime } for instance.
-
Kev
(And we *do* have a product that does validation of XMPP traffic across security boundaries, so I'm not saying "Don't do this", merely "Here there be dragons")
-
Link Mauve
And I also agree with Kev, ignore schemas (or extract information useful to you) and write your own DSL.
-
flow
Zash> Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash I think it is safer to throw the whole message in the trash, as ignoring parts of the elements of an stanza could modify the stanzas semantics
-
flow
even though I am not aware of a concrete example
-
flow
but I rather be defensive per default
-
Link Mauve
I’d really like to have a server plugin to validate every stream, and report issues it finds to the developers of the relevant violating implementations.
-
Zash
flow: So a new XEP comes along and now you can't read any messages anymore because they have a new tag?
-
flow
Zash, that's not what I had in mind
-
flow
It's not even what I tried to express
-
Zash
I may be too tired today for properly reading anything, sorry if I misunderstand
-
flow
trying to come up with a slightly related example
-
Sam
I read it that way too, fwiw *shrug*
-
flow
I do this on the fly so it is maybe not good
-
flow
if someone sends you an *invalid* stanza, e.g. an attribute which value 'foo' but it's specified to be an integer
-
flow
I think then you can only ignore it
-
flow
because the sender may wants to tell you something
-
Kev
Although I think it's been lost to the mists of time, we did have a DSL processor for Swiften to create the various parsers/serialisers/elements, but it wasn't XSD-based.
-
flow
but you have no idea what, hence you can only guess what it is, and that leads to issues, because you may have guessed wrong
-
Zash
not treat it as if that child tag or extension was simply gone?
-
flow
presense status with an invalid priority are probably a good example
-
Link Mauve
flow, then you have a choice: either you work around it and still try to interpret it somehow, carry this tribal knowledge forever, and laugh at other implementations which didn’t make the same choice or carry it there too; or you report it to them so that they can stop sending the invalid value.
-
flow
Zash, that's the tricky question
-
Link Mauve
I’m firmly in the latter camp.
-
flow
as I said, I could imagine that the semantic of a stanza is made up of multiple of its child elements
-
Zash
priority is a core thing, so uh
-
Kev
There's definitely, to my mind, a difference between unexpected, and expected-but-invalid content.
-
flow
and then, if you ignore the existence of one of the child elements, the semantic may be different
-
Link Mauve
Zash, yet I’ve received invalid ones, IIRC it was Gajim which didn’t validate what the user could input there.
-
Link Mauve
It allowed numbers fewer than -128 or bigger than 127.
-
flow
assume a client has set the highest priority 99999
-
flow
he may then assumes that messages are routing according to this pirority
-
flow
but servers may cap it?
-
Link Mauve
Someone sent me a priority 500, my program rightfully rejected the stanza, I reported it and it got fixed.
-
flow
or servers may treat it as 0?
-
Link Mauve
So no one has to handle that case.
-
Link Mauve
(flow, highest priority is 127.)
-
flow
Link Mauve, I know :)
-
flow
but the client wo set it to 99999 didn't
-
Link Mauve
Right, so exactly my example. ^^
-
flow
in the end, I believe it's the best for the open source community ecosystem of XMPP if implementations validate and reject invalid stanzas as a whole, and emit a visible error message in such a case, ideally pointing to the sending entity, with an request to report an issue to the vendor
-
Link Mauve
+1
-
flow
Link Mauve, :)
-
flow
Link Mauve, btw, is there an equivalent of <xs:any namespace='##other'/> for attributes?
-
Link Mauve
Yes, <xs:anyAttribute namespace="##other"/>.
-
Link Mauve
We already use both in our XEPs.
-
Link Mauve
But AFAIK we still disallow namespaced attributes.
-
Link Mauve
There is still no XEP for an entity to advertise it properly supports XML namespaces.
-
flow
not sure if we officially disallow them, it just seems that some seem to be afraid of them
-
Link Mauve
Or that yeah.
-
Kev
We definitely have guidance not to use them, I'm just not sure where.