jdev - 2021-11-03


  1. lovetox

    on the topic of xml parsers

  2. lovetox

    python has a xmlpullparser

  3. lovetox

    but i think it will keep the whole xml document in memory

  4. lovetox

    i think this is a problem .. i dont know how much data is exchanged between client and server, but i guess if a client runs for some days, that could be problematic

  5. flow

    lovetox, smack uses an xml pull parser but splits the top level stream elements, so that at most one is kept in memory

  6. flow

    also pull parsers tend to not keep the whole document in memory

  7. flow

    since you typically can't rewind, the unreachable parts of the document can be dropped

  8. jonas’

    lovetox, xml.sax! :)

  9. flow

    IIRC pull/push parsers can operate on streams, unlike DOM based parsers

  10. flow

    of course, this may vary from parser implementation to parser implementation

  11. flow

    having said that some xml parsers can operate on streams, I believe that in XMPP it's best to not parse on the stream but to split the top-level stream elements, as it, amongst other things, allows you to enforce a top-level stream element size limit

  12. jonas’

    you can also enforce that with a parser which can tell you where in the stream it currently is

  13. jonas’

    splitting sounds like implementing half of an XML parser, which I definitely cannot recommend doing

  14. jonas’

    (I know that moparisthebest loves that kind of pain though :))

  15. flow

    My personal experience is that developing the splitter implementation was not painful

  16. flow

    but yes, it's basically an xml parser, or, i'd rather say, an xml state machine

  17. flow

    and yes, if the parser is able to tell you the current position in the stream, then you probably do not need to do that

  18. flow

    unfortunately, that was not true in my case

  19. Kev

    We more or less use a streaming dom parser, by using a stream parser and then constructing a pseudo-dom on individual stanzas. I imagine we're not alone in that.

  20. Sam

    That's more or less what I do as well if I understood you correctly. It's all streaming until we need to do something with a stanza or an element, then the user has an option of parsing the whole element into a DOM or similar and operating on that instead of the streamed tokens.

  21. lovetox

    jonas’, of course i can use sax, but then i have to write code to build the xml elements myself

  22. jonas’

    or port to aioxmpp :-X

  23. jonas’

    lovetox, actually, no, lxml.etree has a SAX -> etree thing

  24. jonas’

    I use that in aioxmpp to capture unschematized elements

  25. jonas’

    or used it anyway, I don't think I still do.

  26. lovetox

    thats exactly what i need

  27. jonas’

    lovetox, https://lxml.de/sax.html#building-a-tree-from-sax-events

  28. jonas’

    then you just need a shim thing which handles the stream header and then delegates to that
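
A minimal sketch of such a shim, assuming the standard library's `xml.sax` and `xml.etree.ElementTree.TreeBuilder` instead of the `lxml.sax` handler linked above. `StreamShim`, `make_stream_parser` and `on_stanza` are hypothetical names used for illustration only. The shim records the stream header itself and delegates everything below it to a fresh tree builder per stanza:

```python
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.etree.ElementTree import TreeBuilder


def _clark(name):
    """Turn a SAX (uri, localname) pair into ElementTree's '{uri}local' form."""
    uri, local = name
    return f"{{{uri}}}{local}" if uri else local


class StreamShim(ContentHandler):
    """Hypothetical shim: keeps the stream header, builds one tree per stanza."""

    def __init__(self, on_stanza):
        super().__init__()
        self.on_stanza = on_stanza
        self.header = None       # (tag, attributes) of the stream root element
        self._depth = 0
        self._builder = None     # TreeBuilder for the stanza currently being read

    def startElementNS(self, name, qname, attrs):
        tag = _clark(name)
        attrib = {_clark(key): value for key, value in attrs.items()}
        if self._depth == 0:
            self.header = (tag, attrib)
        else:
            if self._depth == 1:
                self._builder = TreeBuilder()
            self._builder.start(tag, attrib)
        self._depth += 1

    def endElementNS(self, name, qname):
        self._depth -= 1
        if self._depth == 0:
            return               # closing tag of the stream itself
        self._builder.end(_clark(name))
        if self._depth == 1:
            # A complete top-level element: hand it over and forget it.
            self.on_stanza(self._builder.close())
            self._builder = None

    def characters(self, content):
        if self._builder is not None:
            self._builder.data(content)


def make_stream_parser(on_stanza):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(StreamShim(on_stanza))
    return parser
```

Data is pushed in with `parser.feed(chunk)` as it arrives from the socket. Because every stanza gets its own builder, nothing from earlier stanzas stays reachable once the callback returns.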

  29. lovetox

    https://lxml.de/parsing.html#modifying-the-tree

  30. jonas’

    hurts to look at :)

  31. lovetox

    yeah lxml built the same pullparser and added a method to delete elements when not needed anymore

  32. jonas’

    I'd rather go the sax way where I can be sure of what's what

  33. moparisthebest

    yea I tend to agree that sax parsers are painful, and it's better to split the stream into stanzas without parsing, and only parse individual stanzas as a whole/with a DOM parser if needed

  34. lovetox

    i want a parser, that simply dispatches stanzas to my handlers. with that pullparser, i can just look for the "end" event, and at depth 1, i dispatch the element that parser gives me.

  35. lovetox

    thats a xml parser in probably 50 lines of code
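
The idea can be sketched roughly as follows, assuming the standard library's `xml.etree.ElementTree.XMLPullParser` rather than lxml's pull parser. `StanzaDispatcher` and `handler` are hypothetical names. Each finished depth-1 element is dispatched and then removed from the stream root so the tree does not grow over the lifetime of the connection:

```python
from xml.etree.ElementTree import XMLPullParser


class StanzaDispatcher:
    """Hypothetical dispatcher: calls handler(element) for every finished stanza."""

    def __init__(self, handler):
        self._parser = XMLPullParser(events=("start", "end"))
        self._handler = handler
        self._depth = 0
        self.root = None    # the stream root element, once its start tag was seen

    def feed(self, data):
        self._parser.feed(data)
        for event, elem in self._parser.read_events():
            if event == "start":
                if self._depth == 0:
                    self.root = elem
                self._depth += 1
            else:  # "end"
                self._depth -= 1
                if self._depth == 1:
                    # elem is a complete direct child of the stream root.
                    self._handler(elem)
                    # Drop it from the root so memory use stays bounded.
                    self.root.remove(elem)
```

Whether the removal really frees the memory depends on nobody else holding a reference to the element, so this is worth measuring on a long-running connection.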

  36. flow

    lovetox, that sounds similar to what smack does

  37. jonas’

    lovetox, and hope that the delete actually works as you wish :)

  38. flow

    although what you really want is something like jaxb, which matches the XML elements to types of your programming languages, with additional verification

  39. lovetox

    jonas’, i will test this now

  41. lovetox

    and also how vulnerable this parser is against attacks

  42. flow

    so many possibilities to optimize the process, so little free time :)

  43. lovetox

    because there is zero settings here that i could configure to make it more secure

  44. moparisthebest

    lovetox, this is a stand-alone file that splits a stream on stanza boundaries without an xml parser, so you'd get notified of individual stanzas https://github.com/moparisthebest/xmpp-proxy/blob/master/src/stanzafilter.rs#L255

  45. lovetox

    yeah im not sure i should do this with python

  46. lovetox

    its probably slow

  47. lovetox

    but yeah i think this approach is cleaner

  48. flow

    given that its a client application, I doubt that this particular overhead will be an issue

  49. moparisthebest

    I wouldn't think it'd be slow in python, there aren't any string copies or anything, it's just running a switch on 1 character at a time, surely faster than an XML parser
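
A rough Python sketch of such a character-by-character splitter, combined with the size limit mentioned earlier in the discussion. This is not a port of the linked Rust file. `StanzaSplitter` and `max_size` are hypothetical names, and the sketch assumes the stream contains no comments or CDATA sections:

```python
class StanzaSplitter:
    """Hypothetical splitter: cuts a stream into top-level elements, no tree built.

    Simplifying assumption: no comments or CDATA sections occur in the stream
    (a '<' or '>' inside them would confuse the state machine).
    """

    def __init__(self, max_size=65536):
        self.max_size = max_size
        self._buf = []
        self._depth = 0       # 0: outside the stream, 1: between stanzas
        self._in_tag = False
        self._quote = None    # quote character of the attribute value we are in
        self._first = None    # first character after '<' of the current tag
        self._prev = ""       # previous character inside the current tag

    def _keep(self, ch):
        self._buf.append(ch)
        if len(self._buf) > self.max_size:
            raise ValueError("top-level stream element exceeds size limit")

    def feed(self, data):
        stanzas = []
        for ch in data:
            if not self._in_tag:
                if ch == "<":
                    self._in_tag, self._first, self._prev = True, None, ""
                    self._keep(ch)
                elif self._depth >= 2:
                    self._keep(ch)          # text inside a stanza
                continue                    # text between stanzas is dropped
            self._keep(ch)
            if self._first is None:
                self._first = ch
            if self._quote is not None:
                if ch == self._quote:
                    self._quote = None      # attribute value ends
            elif ch in "\"'":
                self._quote = ch            # attribute value starts
            elif ch == ">":
                self._in_tag = False
                opening = False
                if self._first == "/":
                    self._depth -= 1        # closing tag
                elif self._first != "?" and self._prev != "/":
                    self._depth += 1        # opening tag (not '<?...?>' or '<x/>')
                    opening = True
                if self._depth <= 1:
                    text = "".join(self._buf)
                    self._buf.clear()
                    if self._depth == 1 and not opening:
                        stanzas.append(text)   # a complete top-level element
            self._prev = ch
        return stanzas
```

Each returned string can then be handed to any ordinary whole-document parser, so the SAX events never have to be dealt with directly.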

  50. lovetox

    but its additional, its not like i spare myself the xml parsing

  51. moparisthebest

    you spare yourself the streaming sax parsing

  52. lovetox

    Oo, maybe i'm missing something, all your filter is doing is cutting the stream into pieces and feeding it to the parser

  53. lovetox

    oh you mean i dont need to care then about sax events

  54. lovetox

    just give me the finished element after you parsed everything

  55. lovetox

    that lxml lib is amazing, you can even register custom element classes, that will be used when it encounters a certain tag / namespace or even attribute combination

  56. lovetox

    this thing is made for xmpp

  57. edhelas

    lovetox funny, it's roughly what I did for my XML library in Movim :D

  58. edhelas

    https://github.com/movim/movim/tree/master/lib/moxl#payload

  59. lovetox

    hm parsing 1 million xml elements and creating custom python classes for it, in 2 seconds

  60. lovetox

    xml parsing is indeed not the bottleneck in a gui app

  61. Link Mauve

    Want me to test on my server? o:)

  62. Link Mauve

    In poezio it is very much the bottleneck.

  63. Link Mauve

    Sadly.

  64. lovetox

    why that?

  65. Link Mauve

    Due to the way slixmpp converts DOM-like structs into XMPP-specific structs.

  66. flow

    Link Mauve, please tell us more :)

  67. flow

    maybe with an example code

  68. Link Mauve

    https://lab.louiz.org/poezio/slixmpp/-/blob/master/slixmpp/xmlstream/stanzabase.py#L672 mostly.

  69. Link Mauve

    This happens on each element['sub-element'] call, and is quite slow.

  70. Link Mauve

    I once started to work on a JIT for that, but never managed to make poezio start.

  71. lovetox

    what is the opinion on validating stanzas against a schema?

  72. lovetox

    is this even possible?

  73. lovetox

    can xml schemas be "open" in that they allow elements not defined in the schema

  74. lovetox

    and only validating MUST haves?

  75. Link Mauve

    In my experience, XML Schema was too poor to be used for both validation and extraction of the data to be more machine-usable.

  76. Link Mauve

    lovetox, yes, you can have <xs:any namespace='##other'/> in a schema.

  77. Link Mauve

    I’ve been trying to fix our schemas every time I encounter something invalid or not restricted enough, but they clearly aren’t used for validation (or their users never contributed back their fixes).

  78. Link Mauve

    In xmpp-parsers, I have drafted a DSL to represent the most common constructs we have in XMPP, but I translate manually from the XML Schema as well as from the XEP’s examples and text.

  79. Link Mauve

    And I still can’t represent everything.

  80. Link Mauve

    It made me discover a whole bunch of invalid things people do in the wild.

  81. Zash

    DSL eh?

  82. Link Mauve

    And led to fixes in clients and servers.

  83. Zash

    I wonder if it maps to the OpenAPI JSON Schema XML stuff

  84. lovetox

    Link Mauve, what i would like to do is just basic stuff like, checking that a presence type attribute is one of the allowed strings etc

  86. lovetox

    right now i have to do this all in python, manually

  87. lovetox

    that would allow me later to have much cleaner code

  88. Zash

    and if not, what would you do?

  89. lovetox

    i would only validate MUST

  90. lovetox

    so invalid type on a presence, ignore stanza

  91. lovetox

    or am i again too strict, and have to later add not failing on that because someone sends wrong values all the time :D

  92. Zash

    And if you already have that in Python, do you really gain much by having it in XML instead?

  93. lovetox

    yes much cleaner code

  94. lovetox

    now everywhere i have to be ready to catch exceptions, every attribute i query i have to be prepared for it to be None etc

  95. lovetox

    parsing 4 attributes into enums, is then 30 lines of code

  96. lovetox

    but true much is probably not possible to validate

  97. lovetox

    many things have implicit defaults, so i need to check afterwards anyway if its there or not
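
As a small illustration of checking an attribute against its allowed strings while handling an implicit default, here is a sketch for the presence `type` attribute. The allowed values are those listed in RFC 6121. `PresenceType` and `presence_type` are hypothetical names, and the trick of using `None` as the enum value for "attribute absent" is just one possible design:

```python
from enum import Enum
from xml.etree.ElementTree import fromstring


class PresenceType(Enum):
    AVAILABLE = None            # implicit default: no 'type' attribute present
    UNAVAILABLE = "unavailable"
    SUBSCRIBE = "subscribe"
    SUBSCRIBED = "subscribed"
    UNSUBSCRIBE = "unsubscribe"
    UNSUBSCRIBED = "unsubscribed"
    PROBE = "probe"
    ERROR = "error"


def presence_type(stanza):
    """Map the 'type' attribute to an enum member; ValueError if not allowed."""
    # Element.get() returns None for a missing attribute, which is exactly
    # the value of AVAILABLE, so the default needs no separate branch.
    return PresenceType(stanza.get("type"))


print(presence_type(fromstring("<presence type='probe'/>")))
```

With this, the caller only has to be prepared for a single `ValueError` per stanza rather than a `None` check on every attribute access.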

  98. Link Mauve

    lovetox, have a look at slixmpp or aioxmpp, they both already do so and in Python.

  99. Link Mauve

    They validate, and also make a more ergonomic API.

  100. larma

    I don't really see cleaner code. When you do pre-validation, you only save the one "else" case for all invalid values which needs to throw an exception. In proper languages that's about one line of code, in python maybe three.

  101. jonas’

    larma, catching obviously invalid stanzas early (and returning an error to the sender early) makes for cleaner code than doing it from random places all over the code

  102. larma

    jonas’, I was thinking we parse as early as possible to not carry around heavy lifting strings when an enum value would've been enough...

  104. jonas’

    yes, enumification is also a nice opportunity you get when doing things early

  105. Zash

    Cleaner code... /me looks at https://hg.prosody.im/prosody-modules/file/58a20d5ac6e9/mod_rest/res/schema-xmpp.json

  106. Zash

    I made this monstrosity. Fear me!

  107. flow

    lovetox, you can validate schema and should, if it's not obviously broken, with the one exception that XML schemas IIRC do not allow unspecified elements and attributes. but in XMPP we allow those everywhere (I believe some may disagree with me about that, but IMHO it's the only sane thing to do)

  108. lovetox

    flow so how do i validate then? if someone adds a element, my validation fails ..

  109. lovetox

    but its not invalid per xmpp definition

  110. flow

    so basically you can validate XML for which a schema exists, but not XML for which no schema exists. those should be simply ignored, as you obviously know nothing about them and did not negotiate it either

  111. lovetox

    but thats very extension unfriendly

  112. flow

    well if a schema specifies the 'priority' attribute to be unsignedByte, you can validate that

  113. lovetox

    i mean often XEPs start with someone just putting something additionally in there

  114. lovetox

    flow i can only validate everything or nothing

  115. lovetox

    not only one attribute

  116. flow

    that was just an example ;)

  117. flow

    > lovetox> flow i can only validate everything or nothing

  118. flow

    I don't think that's true

  119. flow

    In fact I know that this is not true :)

  120. flow

    but maybe we are talking past each other

  121. Zash

    Do XML schema validators not ignore unknown things?

  122. flow goes looking for a simple schema example from the XEP

  123. flow

    Zash, I don't think so no

  124. flow

    that's the one thing where we in XMPP vary from how XML schemas are validated

  125. flow

    IIRC you have to add some magic schema thingy in the elements schema to allow for arbitrary further child elements

  126. flow

    that are not part of the schema

  127. flow

    and IIRC we mostly don't do that in our XMPP schemas

  128. flow

    lovetox, look for example at https://xmpp.org/extensions/xep-0203.html#schema

  129. flow

    surely you can validate stamp

  130. flow

    you can validate that its value is in the correct format and that the required attribute actually exists

  131. flow

    <delay xmlns='urn:xmpp:delay' from='juliet@capulet.com/balcony' stamp='2002-09-10T23:41:07Z'/>

  132. flow

    but let's assume I add a child element into <delay/>

  133. flow

    and you don't have a schema about it, it's purely optional and does not need to be negotiated

  135. flow

    then you can still verify 'stamp', but not the child element

  137. flow

    so, it's not an all or nothing validation situation
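
The `<delay/>` case can be sketched like this: the required `stamp` attribute is checked and converted, while child elements the code knows nothing about are simply never looked at. `Delay` and `parse_delay` are hypothetical names, and for brevity only the UTC `Z` form without fractional seconds is accepted, which is narrower than what XEP-0082 allows:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from xml.etree.ElementTree import fromstring

DELAY_TAG = "{urn:xmpp:delay}delay"


@dataclass
class Delay:
    stamp: datetime
    from_: Optional[str]    # 'from' is optional and kept as a plain string here


def parse_delay(elem):
    """Validate the parts of <delay/> we know about; ignore everything else."""
    if elem.tag != DELAY_TAG:
        raise ValueError("not a delay element")
    stamp = elem.get("stamp")
    if stamp is None:
        raise ValueError("missing required 'stamp' attribute")
    # strptime raises ValueError itself if the format does not match.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
    return Delay(stamp=parsed, from_=elem.get("from"))


delay = parse_delay(fromstring(
    "<delay xmlns='urn:xmpp:delay' from='juliet@capulet.com/balcony'"
    " stamp='2002-09-10T23:41:07Z'><x xmlns='urn:example:unknown'/></delay>"
))
print(delay.stamp.isoformat())
```

The unknown `<x/>` child (a made-up namespace) does not affect the result, whereas `stamp='invalid'` or a missing `stamp` raises `ValueError`.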

  138. flow

    lovetox, does this help?

  139. Link Mauve

    flow, it would help in that case to add an <xs:any namespace='##other'/> in the delay element.

  140. flow

    sure, but let's just pretend that this is everywhere in our schemas

  141. Zash

    sprinkle those everywhere 😕

  142. Link Mauve

    Please no. ^^'

  143. Link Mauve

    It makes validation (of the whole stanza) much harder.

  144. flow

    what makes validation much harder?

  145. Link Mauve

    In xmpp-parsers I add Option types for each known sub-element defined elsewhere, slixmpp doesn’t do that and instead keeps the extensibility everywhere, at the expense of performance.

  146. Link Mauve

    flow, let’s say I want to assert that delay won’t be extended, or it is an error.

  147. Link Mauve

    (For instance because I don’t support it to be extended.)

  148. Link Mauve

    Leaving an extension point in it will take more memory and void the type safety of it.

  149. Link Mauve

    While with the current schema, we can take it as OOP languages use the final keyword.

  150. flow

    I think my general point is that in XMPP, unknown things should typically be simply ignored, as otherwise, the eXtensibility in XMPP becomes unnecessarily hard

  151. flow

    servers can still filter unknown elements/attributes if they determined that those have not been negotiated between sender and recipient (but I am not sure if this is feasible)

  152. flow

    unnecessarily hard and even impossible in some situations

  153. Link Mauve

    I’m ok with that in the generic public network case, but for specific implementations it does make sense to reject extensions you don’t know about.

  154. Link Mauve

    Even if just for validation of other implementations.

  155. flow

    > Link Mauve> flow, let’s say I want to assert that delay won’t be extended, or it is an error.
    I may be misunderstanding the meaning of 'error' here, but I believe that mindset is harmful

  156. flow

    it shouldn't be an error, it should simply be ignored

  157. flow

    and it's fine if your native types don't allow the user to access those ignored elements/attributes

  158. flow

    Link Mauve, define 'reject' here

  159. Link Mauve

    Either ignore the whole delay element and only account for the other payloads, or ignore the whole stanza and reply with an error, stuff like that.

  161. flow

    why would you ignore the whole delay element if it's perfectly fine otherwise?

  162. Kev

    I would recommend more or less completely ignoring the schemas, personally.

  163. Zash

    when we say "validate", do we mean `validate(schema, stanza) : boolean` ?

  164. flow

    if someone sends you additional data within the delay element, without obviously negotiating it, then the sender has to assume that you may ignore it (because he doesn't know if you understand it)

  165. flow

    Kev, in my personal experience, schemas in XEPs have proven to be very helpful when implementing said XEP to clarify things that aren't clear from the text

  166. flow

    Zash, I think so, yes, the stanza is either valid or not

  167. Zash

    What would a client do with this boolean?

  168. flow

    the question is: what constitutes a valid stanza in the presence of unknown elements/attributes?

  169. Kev

    Fair enough. Having gone through some experience of people trying to use schemas for validating traffic, I feel confident in saying using them normatively can also lead to pain.

  170. Zash

    Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash

  171. flow

    Zash, configurable, could be ignoring the stanza completely

  172. flow

    Zash, define "invalid delay element" here?

  173. Zash

    Isn't what you actually want something that takes a schema and extracts the bits you care about, and ignores any undefined extras?

  174. Zash

    A data mapper, rather than a schema

  175. Link Mauve

    Zash, what I do currently is `validate(schema, RFC-parts-of-the-stanza); foreach payload in stanza: validate(schema, payload)`.

  176. flow

    is it like stamp='invalid'? or with an unknown child extension element?

  177. Link Mauve

    Although that’s not completely correct as it is a transformative operation, not just a validation.

  178. Zash

    and I don't see how XML schema is particularly helpful for making such a thing

  179. Link Mauve

    In the current example, I end up with a struct Delay { from: Jid, stamp: DateTime } for instance.

  180. Kev

    (And we *do* have a product that does validation of XMPP traffic across security boundaries, so I'm not saying "Don't do this", merely "Here there be dragons")

  181. Link Mauve

    And I also agree with Kev, ignore schemas (or extract information useful to you) and write your own DSL.

  182. flow

    > Zash> Sorry, something put an invalid delay tag in your message, so we threw the whole message into the trash
    I think it is safer to throw the whole message in the trash, as ignoring parts of the elements of a stanza could modify the stanza's semantics

  183. flow

    even though I am not aware of a concrete example

  184. flow

    but I'd rather be defensive by default

  185. Link Mauve

    I’d really like to have a server plugin to validate every stream, and report issues it finds to the developers of the relevant violating implementations.

  186. Zash

    flow: So a new XEP comes along and now you can't read any messages anymore because they have a new tag?

  187. flow

    Zash, that's not what I had in mind

  188. flow

    It's not even what I tried to express

  189. Zash

    I may be too tired today for properly reading anything, sorry if I misunderstand

  190. flow

    trying to come up with a slightly related example

  191. Sam

    I read it that way too, fwiw *shrug*

  192. flow

    I do this on the fly so it is maybe not good

  193. flow

    if someone sends you an *invalid* stanza, e.g. an attribute with value 'foo' but it's specified to be an integer

  194. flow

    I think then you can only ignore it

  195. flow

    because the sender may want to tell you something

  196. Kev

    Although I think it's been lost to the mists of time, we did have a DSL processor for Swiften to create the various parsers/serialisers/elements, but it wasn't XSD-based.

  197. flow

    but you have no idea what, hence you can only guess what it is, and that leads to issues, because you may have guessed wrong

  198. Zash

    not treat it as if that child tag or extension was simply gone?

  199. flow

    presence stanzas with an invalid priority are probably a good example

  200. Link Mauve

    flow, then you have a choice: either you work around it and still try to interpret it somehow, carry this tribal knowledge forever, and laugh at other implementations which didn’t make the same choice or carry it there too; or you report it to them so that they can stop sending the invalid value.

  201. flow

    Zash, that's the tricky question

  202. Link Mauve

    I’m firmly in the latter camp.

  203. flow

    as I said, I could imagine that the semantic of a stanza is made up of multiple of its child elements

  204. Zash

    priority is a core thing, so uh

  205. Kev

    There's definitely, to my mind, a difference between unexpected, and expected-but-invalid content.

  206. flow

    and then, if you ignore the existence of one of the child elements, the semantic may be different

  207. Link Mauve

    Zash, yet I’ve received invalid ones, IIRC it was Gajim which didn’t validate what the user could input there.

  208. Link Mauve

    It allowed numbers lower than -128 or bigger than 127.

  209. flow

    assume a client has set the highest priority 99999

  210. flow

    he may then assume that messages are routed according to this priority

  211. flow

    but servers may cap it?

  212. Link Mauve

    Someone sent me a priority 500, my program rightfully rejected the stanza, I reported it and it got fixed.

  213. flow

    or servers may treat it as 0?

  214. Link Mauve

    So no one has to handle that case.

  215. Link Mauve

    (flow, highest priority is 127.)

  216. flow

    Link Mauve, I know :)

  217. flow

    but the client who set it to 99999 didn't

  218. Link Mauve

    Right, so exactly my example. ^^

  219. flow

    in the end, I believe it's the best for the open source community ecosystem of XMPP if implementations validate and reject invalid stanzas as a whole, and emit a visible error message in such a case, ideally pointing to the sending entity, with a request to report an issue to the vendor
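
One way to read this policy, together with the distinction between unexpected and expected-but-invalid content made above, is sketched below. Payloads with no registered validator are ignored, while a known payload that fails its check causes the whole stanza to be rejected with a message naming the sender. `VALIDATORS`, `check_priority`, `validate` and `InvalidStanza` are hypothetical names, and the priority range of -128 to 127 is the one mentioned in the discussion:

```python
from xml.etree.ElementTree import fromstring


class InvalidStanza(ValueError):
    """Raised when a known payload is present but invalid."""


def check_priority(elem):
    value = int(elem.text or "")    # int() raises ValueError for non-integers
    if not -128 <= value <= 127:
        raise ValueError(f"priority {value} is outside -128..127")


# Validators keyed by qualified tag name; anything not listed is unknown.
VALIDATORS = {
    "{jabber:client}priority": check_priority,
}


def validate(stanza):
    """Reject the stanza as a whole if any *known* payload is invalid."""
    for payload in stanza:
        check = VALIDATORS.get(payload.tag)
        if check is None:
            continue                # unknown extension: ignored, not an error
        try:
            check(payload)
        except ValueError as exc:
            sender = stanza.get("from", "unknown sender")
            raise InvalidStanza(
                f"rejecting stanza from {sender}: {exc}; "
                "please report this to the sender's vendor"
            ) from exc


try:
    validate(fromstring(
        "<presence xmlns='jabber:client' from='a@example.org/x'>"
        "<priority>500</priority></presence>"
    ))
except InvalidStanza as exc:
    print(exc)
```

A presence carrying `<priority>5</priority>` plus an element from an unknown namespace passes unchanged, so new extensions do not break existing receivers.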

  220. Link Mauve

    +1

  221. flow

    Link Mauve, :)

  222. flow

    Link Mauve, btw, is there an equivalent of <xs:any namespace='##other'/> for attributes?

  223. Link Mauve

    Yes, <xs:anyAttribute namespace="##other"/>.

  224. Link Mauve

    We already use both in our XEPs.

  225. Link Mauve

    But AFAIK we still disallow namespaced attributes.

  226. Link Mauve

    There is still no XEP for an entity to advertise it properly supports XML namespaces.

  227. flow

    not sure if we officially disallow them, it just seems that some seem to be afraid of them

  228. Link Mauve

    Or that yeah.

  229. Kev

    We definitely have guidance not to use them, I'm just not sure where.