jdev - 2021-04-05

  1. lovetox

    about xep393

  2. lovetox

    it shows an example, which should not be styled, *not \n strong*

  3. lovetox

    what rule does this violate

  4. lovetox

    im not able to find it in the document

  5. Sam

    Spans cannot cross block level things and linea are blocks

  6. Sam

    lines, even

  7. lovetox

    where can i read that in the document?

  8. lovetox

    i read the text about blocks and spans now 4 times, it does not mention anything about new lines or lines

  9. Sam

    > Spans are always children of blocks and may not escape from their containing block.
    It's in the definition of spans. That's not a good place for it though, I should move it into the description

  10. lovetox

    a block is a chunk of text, so including new lines

  11. lovetox

    and a span can not escape this block does not mean for me it cant have newlines

  12. Sam

    > Individual lines of text that are not inside of a preformatted text block are considered a "plain" block. Plain blocks are line by line

  13. Sam

    See also Example 2

  14. Sam

    The entire concept of a plain block was just to make writing a parser more logical, but it's just confusing. Maybe that was a mistake.

  15. lovetox

    i dont know im reading such a styling doc the first time, maybe its just me

  16. lovetox

    actually i want to write a parser, so im looking exactly for those rules :)

  17. Sam

    For the first one about spans not crossing blocks you're definitely right, I'll make a PR for that right now. It should be in the section describing spans, not the glossary. That's just confusing.

  18. lovetox

    and for writing a parser, it should first identify blocks, and then if the block allows it go line for line inside the block and identify spans

  19. Sam

    Depends on the block, but in general yes. You can recurse down into blocks, and then do spans.

  20. Sam

    I sort of handle it differently in my implementation and don't consider "plain" spans or blocks to be a thing at all, but the rules I use come out to be the same I think.

  21. Sam

    Also pre-formatted blocks are special, those you just consume everything until the end because they can't have children.
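
A minimal sketch of the "identify blocks first, then spans" order discussed above. The function name `split_blocks` and the block labels are hypothetical, and the rules are simplified: a line starting with three backticks opens a preformatted block that swallows everything up to a line that is exactly three backticks (or the end of the input), consecutive lines starting with `>` are grouped as one quote, and every other line is its own "plain" block. Recursing into quotes is left out.

```python
def split_blocks(body: str) -> list[tuple[str, str]]:
    """Split a message body into (kind, text) blocks, top to bottom."""
    blocks = []
    lines = body.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("```"):
            # Preformatted: consume everything up to the closing fence,
            # or to the end of the input if it is never closed.
            j = i + 1
            while j < len(lines) and lines[j] != "```":
                j += 1
            end = min(j, len(lines) - 1)
            blocks.append(("pre", "\n".join(lines[i:end + 1])))
            i = end + 1
        elif line.startswith(">"):
            # Quote: group consecutive quoted lines (children not parsed here).
            j = i
            while j < len(lines) and lines[j].startswith(">"):
                j += 1
            blocks.append(("quote", "\n".join(lines[i:j])))
            i = j
        else:
            # Plain blocks are line by line, so a span can never cross "\n".
            blocks.append(("plain", line))
            i += 1
    return blocks
```

With this split, the text `*not`, newline, `strong*` becomes two plain blocks, so neither `*` can find a partner inside its own block.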

  22. Sam

    lovetox: if you want some tests by the way I have a ton I can export as JSON or something.

  23. Sam

    I keep meaning to figure out somewhere to put them.

  24. lovetox

    yeah would be great

  25. flow

    Sam, how "compatible" would a CommonMark parser be with 393?

  26. Sam

    flow: not at all.

  27. Sam

    Like 3 styles would be the same and the others would be different.

  28. Sam

    And the parsing is different IIRC, so some text would look different even if you only used styles they both share.

  29. Sam

    lovetox: https://github.com/xsf/xeps/pull/1049 in case you want to review and see if that makes things easier to comprehend, sorry about the glossary/business rules mixup. I immediately looked for it in the business rules too.

  30. Sam

    lovetox: here are some tests. They may take some work to adapt depending on how your parser works, but basically there's a test name, some input, and a list of different styles that should be output: https://gist.github.com/SamWhited/55e35347a7eb6df60c0b1df67db76f05

  31. lovetox

    ok thanks Sam, text looks good

  32. Sam

    It's that time of the week again! This week for the XMPP Office Hours we have "Cryptographic Identity: Conquering the Fingerprint Chaos" by Paul Schaub (vanitasvitae) on Tuesday, 6 April 16:00 UTC

  33. Sam

    See you tomorrow

  34. lovetox

    Sam, *asd _sad* asd_

  35. lovetox

    this should only match a bold span

  36. Zash


  37. lovetox

    as i understand it spans can have childs, but they cannot overlap in that way

  38. lovetox

    and because we are parsing lazy left to right, bold gets precedence

  39. lovetox

    is that correct ?

  40. Sam

    Yes, that's correct
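
One way to picture the lazy left-to-right matching confirmed here, as a sketch rather than a reference implementation. All names (`STYLES`, `can_open`, `can_close`, `find_spans`) are made up for illustration. It assumes an opening directive must sit at the start of the line, after whitespace, or after a different directive character, and must not be followed by whitespace; a closing directive must not be preceded by whitespace. Preformatted spans are ignored.

```python
STYLES = {"*": "strong", "_": "emphasis", "~": "strike"}

def can_open(line, i):
    # Approximation: "after a different opening directive" is checked by
    # looking at the previous character only.
    prev_ok = i == 0 or line[i - 1].isspace() or (
        line[i - 1] in STYLES and line[i - 1] != line[i])
    next_ok = i + 1 < len(line) and not line[i + 1].isspace()
    return prev_ok and next_ok

def can_close(line, i):
    return i > 0 and not line[i - 1].isspace()

def find_spans(line, start=0, end=None):
    """Return (style, open_index, close_index) tuples for line[start:end]."""
    end = len(line) if end is None else end
    spans = []
    i = start
    while i < end:
        ch = line[i]
        if ch in STYLES and can_open(line, i):
            # Lazy: take the first character that can close this span.
            close = next((j for j in range(i + 1, end)
                          if line[j] == ch and can_close(line, j)), None)
            if close is not None:
                if close > i + 1:
                    spans.append((STYLES[ch], i, close))
                    # Children may only live strictly inside the parent.
                    spans.extend(find_spans(line, i + 1, close))
                # An empty match skips both directives.
                i = close + 1
                continue
        i += 1
    return spans
```

For `*asd _sad* asd_` this returns only the strong span: the `_` inside it finds no close before the parent ends, and the trailing `_` cannot open anything.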

  41. Zash

    Someone™ needs to build an equivalent to https://babelmark.github.io/?text=**foo+*bar**+baz*

  42. Sam

    Zash: this was kind of a joke, but if you just want to see how something will be formatted it works well enough: https://fosstodon.org/web/statuses/105990119687587415

  43. lovetox

    Sam, i know you really like 393, but all we do here with impl parsers for that, is to get the information that in 394 is already provided by the sending client

  44. Sam

    I have an HTML version in the examples in the docs too, though neither is complete.

  45. lovetox

    meaning start end positions of styling directives.

  46. lovetox

    so this feels like an extra step

  47. Sam

    I don't especially like 393, but I think it avoids all the impossible problems of 394 while being "good enough".

  48. Sam

    I don't like references for various reasons that have already been discussed way too much, or having multiple bodies that in theory say the same thing but maybe don't. I do like Zash's idea of having an alternative body that just has a simple XML language, but as always I don't know how the fallback works except also having a plain body, and that feels bad (in the same way that most HTML/Plain emails end up being borken for me)

  49. Zash

    Sam, I don't want to see how mellium behaves. I want to have side-by-side comparisons of 𝐞𝐯𝐞𝐫𝐲 implementation!

  50. Sam

    Zash: oh, gotcha, yah, that would be helpful

  51. Zash

    And that was probably the 12th time I typed out an xep 457 implementation in a Lua REPL

  52. Sam

    I do wish we could have come up with something regular. I thought this parser would be simple enough, but I was wrong on that I think. A number of people have told me they have trouble with it.

  53. Zash


  54. Zash

    xep71bis when?

  55. Sam

    71 had the opposite problem. It was too easy so everyone just imported a library and called it a day and then acted shocked when the giant security model of the web was broken in the default case.

  56. Zash

    The mistake we made was to let the web ruin everything.

  57. Zash

    71 wasn't broken, the web is

  58. Sam

    Yes, and 71 uses the web, therefore by the transitive property 71 was broken IMO

  59. Sam

    I mean, you're right, technically 71 was safe, but we have to think about how people will implement things and it should have been obvious that this would be broken (just like every other thing that uses HTML this way)

  60. Sam

    Not to mention that 71 let you do stupid things like change font color. If I ever receive another email/IM with yellow text because the sender's background is dark and mine isn't it will be too soon.

  61. Zash

    393 getting confused with Markdown, a HTML superset, doesn't make me think we fixed anything

  62. Sam

    As far as I know there was only 1 time that's ever been confused and it got caught and fixed.

  63. Sam

    So it seems to be doing its job. Someone noticed it just wasn't working and wasn't the same thing, and changed it.

  64. Zash

    No, *I* noticed and yelled at them.

  65. Sam

    Exactly, it worked.

  66. Sam

    (I didn't actually remember that it was you, but thanks for that)

  67. Zash

    At least I was there to panic about it.

  68. Sam

    But still, I'm not saying it's perfect, if we can distinguish it more in some way I'm glad to do so. It just doesn't seem to be a widespread problem like it was with literally every web client I tried that implemented XHTML-IM.

  69. Zash

    I'm still waiting for someone to explain or exploit this in everything else that's using HTML embedded in JSON for formatting. (ie Matrix, MastodonPub, more I forget)

  70. Sam

    I'm also not against it just being a front end and having a better backend, I just don't think 0394 is it.

  71. Zash

    *explain how it's not broken in ...

  72. Sam

    I have found these sorts of things in multiple other non-XMPP products too, FWIW. It is possible to hire a good web dev that knows what they're doing and come out with a roughly safe product too, so I'm sure not everything is. I'm just saying that all the XMPP clients I tried were vulnerable, and that seems like a problem. Pretending it's not just seems dangerous.

  73. lovetox

    Sam, im not understanding the result of one testcase

  74. lovetox

    3 unmatched directives

  75. lovetox


  76. lovetox

    why is that not allowed, its a non-zero-width span

  77. lovetox

    and has valid start and end

  78. lovetox

    does this sentence forbid it

  79. lovetox

    Matches of spans between two styling directives MUST contain some text between the two styling directives, otherwise neither directive is valid

  80. lovetox

    means if i match ** and see no text i have to ignore both?

  81. Sam

    Sorry, I tried to clarify this recently but the rules are still confusing, but you've got it right.

  82. Sam

    We lazily match the first one, then there's nothing between them so neither directive is valid.

  83. Sam

    That's arguably still wrong, I get confused every time I read the rules too, but I never could find a way to write them that was clear and unambiguous.

  84. Sam

    That's my own fault for being a bad technical writer.

  85. lovetox

    but i wonder why this rule exists that way

  86. lovetox

    this means i can make * bold

  87. lovetox

    why not, on finding an empty span, just ignore the end directive

  88. lovetox

    and search further

  89. lovetox


  90. lovetox

    *cant make bold

  91. lovetox

    *cant make * bold

  92. lovetox

    omg i should go to bad, letters missing, words missing

  93. lovetox

    i hope you can decipher that

  94. Sam

    No, there still have to be two styling directives, otherwise there are too many false positives

  95. Sam

    We don't want someone writing "just multipley x*y and then…" to make the rest of the line bold, for example.

  96. Sam

    It's not perfect, but it does a pretty good job avoiding them in my experience.

  97. Sam

    Or doing a correction, some people use * for that.

  98. Sam

    *multiply, for example, since I Just typoed that :)

  99. lovetox

    Sam, im not sayin we need only one directive

  100. lovetox

    im saying ignore the second of the 3

  101. lovetox

    its still a start and an end

  102. Sam

    But we lazily match, so we hit the second first

  103. lovetox

    yeah, then we determine its invalid because zero width span, and ignore it

  104. lovetox

    the same as when we hit an invalid end with whitespace in front

  105. lovetox

    we dont dismiss the start

  106. lovetox

    or do we?

  107. Sam

    Yes, we do, but in this case it specifically says that if it's empty both are invalid

  108. lovetox


  109. lovetox

    ok then its at least consistent

  110. Sam

    I thought, now I can't find it.

  111. lovetox

    yeah it says it

  112. Sam

    oh, first sentence, yah

  113. lovetox

    thats why i thought this is special

  114. lovetox

    and normally we dont dismiss both

  115. Sam

    This is still confusingly worded though, sorry about that. I've tried to clean it up several times and it always still ends up being confusing. I should rewrite this whole section as a bulleted list of rules instead.

  116. Sam

    Sort of. In the other case the thing just isn't a directive at all. In this case it is but it's specifically invalid because there's nothing between them.

  117. lovetox

    oh wait

  118. lovetox

    so i was right first but for the wrong reason

  119. Sam

    Eg. in the message "*not strong *" the second thing just isn't a directive at all because close directives can't have a space.

  120. Sam

    In "**" they *would* be a directive except for that first rule which says to disregard both of them.
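
The two cases Sam separates here can be sketched as follows, for the strong directive only. The helper names are hypothetical and the opening rule is simplified to "start of line or after whitespace, and not followed by whitespace". In the first case the character never counts as a closing directive; in the second it does, but the empty match invalidates both directives.

```python
def can_open(text, i):
    prev_ok = i == 0 or text[i - 1].isspace()
    next_ok = i + 1 < len(text) and not text[i + 1].isspace()
    return prev_ok and next_ok

def can_close(text, i):
    # A close directive cannot have whitespace in front of it.
    return i > 0 and not text[i - 1].isspace()

def first_strong_span(text):
    """Return (open_index, close_index) of the first strong span, or None."""
    i = 0
    while i < len(text):
        if text[i] == "*" and can_open(text, i):
            # Lazy match: the first character that is able to close wins.
            close = next((j for j in range(i + 1, len(text))
                          if text[j] == "*" and can_close(text, j)), None)
            if close == i + 1:
                # Nothing between the two directives: neither is valid,
                # so skip both and keep scanning.
                i = close + 1
                continue
            if close is not None:
                return (i, close)
        i += 1
    return None
```

Under these assumptions `*not strong *`, `**`, `***`, and `x*y and then` all come back as `None`, while `*strong*` yields `(0, 7)`.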

  121. lovetox

    *strong *strong*

  122. lovetox

    so here everything should be strong

  123. lovetox

    because the middle is ignored because not a directive

  124. Sam

    Yes, I think so.

  125. lovetox

    hm but its a valid start directive

  126. lovetox

    just not a valid end directive

  127. Sam

    Hmm, my library doesn't do that how I would expect, unsure if bug there or in the text. I'll have to look.

  128. lovetox

    i guess if we declare it useless to nest bold inside bold

  129. lovetox

    hm wow, i only know if this is start or end once i have the whole string

  130. lovetox


  131. Sam

    Yes, you have to pull the entire styling directive into memory, sadly. If I were designing something from scratch it would be a requirement not to do this, but we'll have max-message limits anyways so in practice it's not an issue.

  132. lovetox

    so what is it now, this example is inconclusive for me

  133. lovetox

    there are arguments for both stylings, the second strong can be bold

  134. lovetox

    but also the whole thing can be bold

  135. lovetox

    depends on how you interpret that middle *

  136. Sam

    I'm not actually sure, I'll have to look later. This could be a mistake in the text

  137. lovetox

    as start directive or end directive

  138. lovetox

    ok thanks

  139. lovetox

    ping me once you have an update :)

  140. Sam

    I'll try to remember

  141. Sam

    Okay, rereading the text I am sure that entire thing would be strong.

  142. Sam

    Specifically because of " Characters that would be styling directives but do not follow these rules are not considered when matching and thus may be present between two other styling directives."

  143. Sam

    My implementation treats the middle one as a close styling directive, which is just wrong, so I've got a bug somewhere. I'll add that to the tests.

  144. lovetox

    ok Sam, but the middle one follows the rules

  145. lovetox

    whitespace + sd = start directive

  146. Sam

    Oh right

  147. lovetox

    its kind of arbitrary that you decide its a close one

  148. Sam

    I think the intent was you parse a full directive, then parse children, so this would be a start without a closing directive and therefore invalid. But I'm not sure if this actually says that or not.

  149. Sam

    Yah, this looks like a bug, we mention how to parse blocks but not how to parse spans. Good find.

  150. Sam

    I could go either way on this, I guess we'll have to think about whether one way or the other has a benefit and write another update.

  151. lovetox

    hm i think this is one of the situations where any way is wrong for someone

  152. lovetox

    we could look to other messengers

  153. lovetox

    to be consistent with them

  154. Sam

    Yah, this spec is definitely always going to have problems with some messages, it was designed to be a "good enough" solution by just copying what WhatsApp and Slack were doing (more or less). I'll see what Slack is doing, that's the only other messenger I have

  155. Sam

    I suspect shorter is better and we should do that instead of the whole thing, but that's just an initial impression.

  156. Sam

    (Slack just makes the second part strong)

  157. lovetox

    yeah if i had to decide now i would also lean towards the shorter one

  158. lovetox

    but i dont have a good reason

  159. Sam

    Me neither.

  160. Sam

    I'm curious, let me naively fix the bug in my library where we're not checking for the space before * properly and see what it does afterwards :)

  161. Sam

    I vote we go with whatever it does so I don't have to change it more :)

  162. Sam

    If I fix my library to include the check for if the previous thing was a space it makes the whole thing strong.

  163. Sam

    I think this actually makes parsing easier (not just because I have to change less stuff), but I'm not sure yet.

  164. lovetox

    hm yes, its a bit nicer now

  165. lovetox

    i now have only 2 paths

  166. lovetox

    is_valid_start_span is_valid_end_span

  167. Sam

    Actually, no, I say that but I'm sort of wrong.

  168. Sam

    It's just do you keep a stack of span start tokens and match them to span end tokens, or do you scan for an end token, then do child spans. The first would be more efficient, but I'm not sure that it matters since this is for messages that even in the worst case are going to be relatively short.

  169. lovetox

    the order is important now, i have to first check if its a valid start, and if yes, then dont check anymore if its also valid end

  170. lovetox

    i use a stack

  171. Sam

    Oh right, I was about to say "wait, I use a stack, how is it getting this result?" but my order is backwards

  172. lovetox

    yes the order is now important

  173. Sam

    I do some weird stuff to scan for child spans though, I should rewrite this both ways and just see which one is simpler.

  174. lovetox

    i have only one scan

  175. lovetox

    this is if i encounter a pre

  176. lovetox

    then i scan ahead, so i dont have to parse all kind of childs which i afterwards have to ignore

  177. lovetox

    my parser can only detect spans right now

  178. lovetox

    so no blocks yet

  179. theTedd

    this old thing, again

  180. theTedd

    "*strong *strong*" would be "{*strong *strong*}"

  181. Sam

    theTedd: maybe. lovetox has a good point though: is the middle one a valid start styling directive or an invalid end styling directive? I don't think the current rules tell you what to do there.

  182. Sam

    Ie. do we start parsing outwards and move inwards, or parse in order.

  183. theTedd

    it's a valid start, but it doesn't match because you already have your start, so only a valid close can match the start you already have

  184. theTedd

    you parse left-to-right; whether you happen to implement that in a nested way is your problem

  185. Sam

    I think that's only true if we assume that a span can't be nested inside of another identical span. This makes sense, but I don't think the rules say that.

  186. theTedd

    spans can't be nested - they're spans

  187. Sam

    They can be nested. The document says so.

  188. theTedd

    sorry, yes

  189. Sam

    Eg. _*test*_ is a valid span.

  190. theTedd

    lazy matching means take the first and find a close, not look for another open so you can throw the first away

  191. Sam

    If we assume **span** is a nested strong span (not that that makes sense) then we could interpret this as starting from the left and finding a possible opening directive, moving right and finding another opening directive, moving right and finding a close. Which one was closed?

  192. theTedd

    except ** fails

  193. Sam

    I *think* I agree. But at best it's confusing in the document and at worst it's ambiguous. Needs some text changed either way.

  194. Sam

    ** doesn't matter, that's a separate special case.

  195. Sam

    oh yah, I mean, the **span** example I gave is bad, fair enough.

  196. Sam

    I'd be curious how whatsapp handles "*strong *strong*" if anyone has it and can test it.

  197. theTedd

    right, so this is essentially a "dangling-else" problem, you have to specify one way or the other

  198. Sam

    I think so, yes

  199. Sam

    Slack only makes the second one strong, FWIW.

  200. theTedd

    nearest wins, makes the most sense with being 'lazy'

  201. theTedd

    so it's {*strong {*strong*} and the first has no matching close, so it's invalid

  202. theTedd

    I mean to write the ABNF at some point (don't hold your breath too long)

  203. Sam

    I'm pretty sure ABNF is impossible or at least incredibly long for something like this. It's not really meant for this sort of thing, but that would be awesome if you can do it.

  204. Sam

    We could add text along the lines of "Once scanning for an end directive any opening directives identical to the previous opening directive are no longer valid" which would result in {*strong *strong*} or we could add something like "Opening directives are scanned from the beginning of the byte stream to the end and can be nested even inside identical opening directives" which would result in "{*strong {*strong*}}". I'm honestly not sure which is better or if it makes a difference.

  205. Sam

    The first means more scanning. The second means we have to do weird stuff and possibly throw away state after the fact when we decide that the initial strong isn't actually a directive.
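
To make the two candidate rules concrete, here is a sketch restricted to a single directive character. The names `outer_first` and `nearest_first` are invented for this illustration, the opening rule is simplified to "start of line or after whitespace", and the second function follows theTedd's "nearest wins" reading, where an opener left on the stack at the end is simply not a directive.

```python
def can_open(s, i):
    prev_ok = i == 0 or s[i - 1].isspace()
    next_ok = i + 1 < len(s) and not s[i + 1].isspace()
    return prev_ok and next_ok

def can_close(s, i):
    return i > 0 and not s[i - 1].isspace()

def outer_first(s, d="*"):
    """Rule 1: after an opener, scan ahead for its close; identical
    openers met on the way are not considered."""
    spans = []
    i = 0
    while i < len(s):
        if s[i] == d and can_open(s, i):
            close = next((j for j in range(i + 1, len(s))
                          if s[j] == d and can_close(s, j)), None)
            if close is not None and close > i + 1:
                spans.append((i, close))
                i = close + 1
                continue
        i += 1
    return spans

def nearest_first(s, d="*"):
    """Rule 2: keep a stack of openers; a close matches the nearest one.
    The "is it a start?" check has to come before "is it an end?"."""
    spans, stack = [], []
    for i, ch in enumerate(s):
        if ch != d:
            continue
        if can_open(s, i):
            stack.append(i)
        elif can_close(s, i) and stack:
            spans.append((stack.pop(), i))
    # Openers still on the stack never found a close and are discarded.
    return spans
```

On `*strong *strong*` the first rule gives one span over the whole text, `(0, 15)`, and the second gives only the inner one, `(8, 15)`. On `*this*, *message*` both agree on `(0, 5)` and `(8, 16)`, which is why the difference rarely shows up in practice.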

  206. theTedd

    the second is more consistent with nesting different kinds of directives; you have to be able to throw away the first directive in the case there isn't a matching close anyway

  207. moparisthebest

    I propose language like the following:
    > Do whatever you want, users really don't care about edge cases here.

  208. lovetox

    i also think the nesting thing is nicer

  209. lovetox

    the code must allow parsing nested directives anyway

  210. lovetox

    its more work to make an exception for identical directives

  211. lovetox

    then just treating them like all the others

  212. lovetox

    than just treating them like all the others

  213. Sam

    moparisthebest makes a good point. It's not likely to ever matter in practice.

  214. theTedd

    just erase the entire spec and replace it with "do whatever YOLO!"

  215. Sam

    I think he meant that as long as *this* works it's fine.

  216. Sam

    I still don't fully get how I'm getting "{*strong {*strong*}}". I need to re-learn how my own parser works I guess.

  217. theTedd

    for users, that's fine; for implementers, they need to know which way to take it, otherwise the same text looks different on different screens

  218. theTedd

    it's not a huge issue, but it's easily cleaned up by defining them as nesting and matching with the nearest

  219. Kev

    Matching with the *nearest*?

  220. theTedd

    in terms of a dangling-else

  221. Kev

    So *this*, *message* has just a comma bold in it?

  222. theTedd

    I meant in this specific case, not as a complete set of rules

  223. Kev

    I’ll read the summary on standards@ in the morning ;) GN.

  224. theTedd


  225. theTedd

    Sam, you may be matching the final * twice as a result of taking substrings

  226. Sam

    I hope not, but it's quite possible I have an off-by-one somewhere and am still matching it.

  227. Sam

    *whew* my implementation does handle Kev's example correctly at least (or what I think is correctly where the ", " is plain and everything else is strong)

  228. Sam

    But that one doesn't show this problem, so I guess I shouldn't have been scared that it wouldn't work

  229. moparisthebest

    > otherwise the same text looks different on different screens
    Yes, and users don't care