jdev - 2020-06-29


  1. jonas’

    lovetox, I don’t see any issue here, so unlikely: https://github.com/horazont/muchopper/issues

  2. Жокир

    Do any popular servers actually implement XEP-0368? If yes, could anyone point me to such servers?

  3. jonas’

    Жокир, https://compliance.conversations.im/ : any servers listed there as "compliant" should do, at least for c2s

  4. jonas’

    I don’t know of any s2s implementation except maybe https://github.com/surevine/Metre , which isn’t quite a server.

  5. Guus

    Looking at my server log, I'm noticing that I'm getting a lot of connection timeouts on s2s in bursts - presumably x minutes after the user that caused the federation to be set up sent their last presence update.

  6. Guus

    I wonder if it'd be good to introduce a small factor of randomness to the timeout interval, to avoid this kind of synchronized behavior.
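
    A minimal sketch of what such a jittered timeout could look like, in Java (chosen since Openfire is a Java server); the class name, base interval, and jitter fraction are illustrative assumptions, not anything either server actually ships:

        import java.time.Duration;
        import java.util.concurrent.ThreadLocalRandom;

        // Hypothetical helper: spread idle-timeout expirations out so that s2s
        // connections that were set up in the same burst are not all torn down
        // at the same instant.
        final class IdleTimeouts {
            // Assumed base idle timeout; not any server's actual default.
            private static final Duration BASE = Duration.ofMinutes(15);
            // Add up to +/-10% of random jitter around the base value.
            private static final double JITTER = 0.10;

            static Duration nextTimeout() {
                double factor = 1.0 + ThreadLocalRandom.current().nextDouble(-JITTER, JITTER);
                return Duration.ofMillis((long) (BASE.toMillis() * factor));
            }
        }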

  7. flow

    Guus, maybe a controversial counter-question: that does sound like a timeout enforced at the application layer. If so, why have an application-layer timeout for s2s connections at all, instead of simply letting the TCP connection time out?

  8. jonas’

    flow, save resources.

  9. jonas’

    the TCP connection will also never time out

  10. flow

    is it worth it?

  11. jonas’

    because both peers can see each other (in this scenario)

  12. jonas’

    file descriptors are limited and when you notice you’re running out of them, it’s too late

  13. jonas’

    being a bit proactive about preserving them is generally a good idea

  14. flow

    ok so kill idle connections based on the number of available file descriptors, but not based on time

  15. jonas’

    you don’t know the amount of available file descriptors

  16. flow

    (or, to be precise, use time only as a secondary criterion)

  17. jonas’

    you know the limit, but you don’t know how many are open in your process

  18. jonas’

    you can estimate, but you can be wrong in the bad direction.

  19. jonas’

    (or in both directions, depending on how you estimate)

  20. flow

    ls /proc/$pid/fd/ | wc -l

  21. flow

    ?
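
    A small sketch of the same idea from inside a Java process, by listing /proc/self/fd. This is Linux-specific, and each call costs a directory read's worth of syscalls, which is the overhead discussed next:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.stream.Stream;

        final class ProcFdCount {
            // Same idea as `ls /proc/$pid/fd/ | wc -l`, done by the process itself.
            static long countOpenFds() throws IOException {
                // Files.list opens /proc/self/fd, which itself uses one fd,
                // so the returned count is one higher than the steady state.
                try (Stream<Path> entries = Files.list(Path.of("/proc/self/fd"))) {
                    return entries.count();
                }
            }
        }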

  22. jonas’

    that’s at least a rather expensive way to do it

  23. jonas’

    but true, that works, on systems where procfs has that feature

  24. flow

    why is it expensive?

  25. Guus

    flow I don't mind much either way - I'm just noticing that I get a lot of disconnects. I'm thinking Prosody does this? Openfire probably does so as well, though.

  26. jonas’

    flow, that’s many syscalls

  27. Guus

    (at the very least, it's configurable)

  28. jonas’

    I can’t see immediately in man 5 procfs whether /proc/$pid/fd is a linux or a posix thing

  29. flow

    jonas’, I wouldn't be surprised if there is a more efficient way to get that number

  30. jonas’

    I would

  31. Guus

    I don't mind much closing idle connections (although it does feel a bit like premature optimization).

  32. flow

    especially on linux

  33. jonas’

    I think I looked into that already and found that it’s not possible

  34. jonas’

    there’s surely a reason why sudo does a for i in 0..MAXFD do close($i); done

  35. Guus

    as Openfire is a multi-platform solution, depending on any platform specific thingy is going to be a pain.

  36. Guus

    unless Java exposes things, which I doubt.

  37. flow

    Guus, UnixOperatingSystemMXBean.getOpenFileDescriptorCount()

  38. flow

    not sure if something like that also exists for other OSes

  39. Guus

    *Unix*OperatingSystemMXBean is likely going to fail on Windows? 🙂

  40. Guus

    but also: not worth the complexity, maybe?

  41. flow

    so you will probably have to implement a fallback strategy anyway (like disconnecting based on a timeout)
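
    A rough sketch of that combination: read the count from the JDK-specific com.sun.management bean where it is available, and treat its absence (e.g. on Windows) as the cue to fall back to the plain time-based idle timeout; the 80% high-water mark is an assumed policy, not any server's default:

        import java.lang.management.ManagementFactory;
        import java.lang.management.OperatingSystemMXBean;
        import com.sun.management.UnixOperatingSystemMXBean;

        final class FdPressure {
            // Assumed policy: start closing idle connections early once 80% of
            // the process fd limit is in use.
            private static final double HIGH_WATER = 0.8;

            // True if fd usage can be read (Unix-like JVMs only) and is high.
            // On Windows the Unix-specific bean is not implemented, so this
            // returns false and the caller should rely on the time-based
            // idle timeout alone.
            static boolean fdPressureHigh() {
                OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
                if (os instanceof UnixOperatingSystemMXBean) {
                    UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                    return unix.getOpenFileDescriptorCount()
                            > HIGH_WATER * unix.getMaxFileDescriptorCount();
                }
                return false;
            }
        }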

  42. jonas’

    I don’t see any problem with a timeout here, to be honest

  43. flow

    Guus, potentially, depends on your goals I'd say

  44. jonas’

    everything else seems slightly overengineered

  45. jonas’

    file descriptors may also just be one reason why you want to keep the number of open connections low

  46. jonas’

    other reasons may include running behind a stateful firewall and wanting to conserve resources there

  47. Guus

    I was just suggesting to add a small random factor in the timeout delay, nothing more 😉

  48. flow

    Guus, which is always a good idea

  49. Guus

    also, given how I see batches of s2s connections being torn down only to be brought up again, I suspect that the default timeout of (Prosody?) might be on the low end.

  50. jonas’

    prosody doesn’t have a default timeout

  51. Guus

    oh, that's interesting

  52. Guus

    note that I didn't actually check what server software is used on those. I just assumed.

  53. Guus

    having had a closer look: might be ejabberd 🙂

  54. jonas’

    https://sotecware.net/files/noindex/connections.png

  55. jonas’

    :-)

  56. flow

    looks like a 30 minute timeout

  57. flow

    combined with an hourly cron job maybe?

  58. jonas’

    I think there are two timeouts, one ~15min (the linear curve down) and one ~30min (which also looks randomized, because of the slight exp-y behaviour at the end)

  59. jonas’

    and yes, this is the connection stats of search.jabber.network, and the spikes you see is the hourly scan :)

  60. Guus

    (maybe randomize your scan!)

  61. flow

    now we only need to identify the implementations with the 15m and 30m (randomized) timeouts

  62. jonas’

    Guus, it’s already shuffled :)

  63. flow

    and what is keeping the baseline of 1.5k connections

  64. Guus

    moar shuffling!

  65. jonas’

    flow, compare the ratios with https://search.jabber.network/stats#software :)

  66. jonas’

    assuming that many "unknowns" are in fact prosody MUCs, because prosody doesn’t report version on MUC by default IIRC

  67. flow

    ahh, so it is probably prosody which keeps the connections

  68. jonas’

    flow, very likely

  69. flow

    but the numbers of connections with the 15m and 30m timeouts appear to be nearly equal

  70. jonas’

    I experimented with loading mod_s2s_idle_timeout or whatsitcalled on s.j.n, but then I disabled it to reduce the codebase to the minimum for some unrelated testing

  71. jonas’

    flow, I’ll have to dig deeper into it, it’s also possible that the different behaviours there are an artifact of how the scanner works

  72. jonas’

    flow, I’ll have to dig deeper into it, it’s also possible that the two different falloff behaviours there are an artifact of how the scanner works

  73. flow

    i see

  74. jonas’

    since there are two scanning components and one finishes much quicker than the other, it's possible that the quicker one is causing the additional tip of the initial spike, while the slower one causes the slow falloff at the end

  75. jonas’

    since the quicker one also tends to touch more domains

  76. jonas’

    oh yeah, that’s very plausible

  77. jonas’

    that may also explain the exp falloff due to shuffling

  78. jonas’

    if there’s really just a 15m or something timeout involved

  79. flow

    jonas’, are you aware that 'German' appears twice in the room languages table?

  80. jonas’

    yes

  81. jonas’

    de-de vs. de

  82. jonas’

    I need to normalize that

  83. jonas’

    https://sotecware.net/files/noindex/connections-1h.png

  84. jonas’

    https://sotecware.net/files/noindex/ingestion-1h.png

  85. jonas’

    that seems to fit very well

  86. jonas’

    (the "filled" part in the second graph is the fast component, the "line" part in the second graph is the slow component)

  87. jonas’

    the fast component ends at 07:24, which is exactly when the initial spike drops in the first graph