XMPP Service Operators - 2024-03-11


  1. nuegia.net

    Does anybody have a redundant Prosody server?

  2. nuegia.net

    high availability?

  3. jonas’

    define "high availability"

  4. nuegia.net

    service remains online if one of the servers is taken offline

  5. nuegia.net

    or suffers a failure

  6. jonas’

    "remains online" certainly not, because prosody does not support hot-standby

  7. jonas’

    "remains online" certainly not, because prosody does not support active/active configurations

  8. jonas’

    hot-standby may work actually if you use a replicated storage backend, but you still need custom logic to ensure the other node is truly down.

  9. nuegia.net

    has anyone already implemented that?

  10. jonas’

    no, because it is a hard problem.

  11. jonas’

    not that I know, because it is a hard problem.

  12. jonas’

    determining that another node is truly down is a bit tricky.

  13. jonas’

    you need some kind of quorum system

  14. jonas’

    with a database backend which uses that, and by configuring prosody to only use the local backend, and making prosody shut down when the backend claims that it lost quorum, it could be done.
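
A minimal sketch of the quorum rule mentioned above, assuming a simple strict-majority vote. The function names and the idea of counting reachable peers are illustrative assumptions, not part of Prosody or of any particular database.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    `reachable_peers` excludes the local node; `cluster_size` includes it.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    votes = reachable_peers + 1  # count the local node itself
    return 2 * votes > cluster_size


def should_keep_running(reachable_peers: int, cluster_size: int) -> bool:
    """Hypothetical policy hook: the XMPP server stays up only while the local backend has quorum."""
    return has_quorum(reachable_peers, cluster_size)
```

With three nodes, a node that can still reach one peer keeps quorum, while an isolated node loses it and would shut down. A two-node cluster can never survive a partition under this rule, which is why quorum setups usually have an odd number of voters.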

  15. nuegia.net

    let's say that we can determine a node's health

  16. nuegia.net

    there's a plugin for prosody's internal webserver that exposes timers, and a simple script could be made that checks that and prosody's health

  17. nuegia.net

    possibly even latency

  18. nuegia.net

    and if you're going for an ACTIVE/BACKUP model, a circuit-breaker-style trip that switches to the backup if the primary server stops working or latency gets too bad.
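
A rough sketch of the circuit-breaker trip described here, assuming a probe that reports success and latency. The class name, thresholds, and the `record` method are hypothetical. This covers only detection as seen from one vantage point; it says nothing about whether the primary is actually stopped.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Trips after `max_failures` consecutive bad probes (failed or too slow)."""

    max_failures: int = 3
    max_latency_s: float = 2.0
    failures: int = 0
    tripped: bool = False

    def record(self, ok: bool, latency_s: float) -> bool:
        """Record one probe result and return True once the breaker has tripped."""
        if self.tripped:
            return True  # stays tripped until an operator resets it
        if ok and latency_s <= self.max_latency_s:
            self.failures = 0  # a good probe resets the consecutive-failure count
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.tripped = True
        return self.tripped
```

Requiring several consecutive bad probes avoids switching over on a single slow response, and keeping the breaker tripped avoids flapping between nodes.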

  19. nuegia.net

    what else needs to be done?

  20. nuegia.net

    postgres already supports that model

  21. jonas’

    what you described isn't sufficient

  22. jonas’

    the latency may only look bad from your other node, e.g. because of a temporary internet routing issue

  23. jonas’

    you need to ensure that before starting the backup node, the active node is killed.

  24. nuegia.net

    the prosody daemon is killed or just not connectable?

  25. jonas’

    killed.

  26. nuegia.net

    why?

  27. nuegia.net

    if no clients or servers are able to connect to it, what's the harm?

  28. jonas’

    cronjobs inside prosody

  29. nuegia.net

    oh

  30. nuegia.net

    what else?

  31. jonas’

    I don't understand

  32. nuegia.net

    what do those cron jobs do? also you mentioned a replicated storage backend.

  33. jonas’

    for instance expiry of uploaded files/storage, but in general arbitrary code and you cannot rely on them (not) doing a specific thing

  34. nuegia.net

    what storage needs to be replicated? something in the filesystem or can everything be done in the database?

  35. jonas’

    the only safe way to do this is to ensure prosody is *stopped* on all nodes except one.

  36. jonas’

    I don't know of a way to put uploaded files into the database, so you'd likely need that (or an external service for that feature) + database for everything else.

  37. nuegia.net

    I already have prosody's http uploads managed by an external service as part of my webserver. a cronjob on the external webserver manages file expiration based on atimes

  38. nuegia.net

    is there anything else?

  39. nuegia.net

    all prosody has to do for files is generate tokens for the upload server
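
A small sketch of atime-based expiry as a cron-style script might do it, assuming uploads live under a single directory tree. The function name and parameters are hypothetical. Note that many filesystems are mounted with `relatime` or `noatime`, so access times may not be updated on every read.

```python
import os
import time
from pathlib import Path


def expire_by_atime(root, max_age_s, now=None):
    """Delete regular files under `root` whose last access is older than `max_age_s` seconds.

    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deleting does not disturb the directory walk.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and now - path.stat().st_atime > max_age_s:
            path.unlink()
            removed.append(path)
    return removed
```

Because this runs outside Prosody, it keeps working the same way no matter which XMPP node is currently active.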

  40. jonas’

    the safe way of doing this, as far as I know (the prosody people would know better):

    Prerequisites:
    - ensure prosody uses replicated storage for everything, e.g. a database
    - have reliable measurement of availability of all nodes
    - have a way to "fence" (turn off) a node remotely and be sure it's actually off, even when it is already broken/unreachable

    Failover:
    1. detect that currently active node is down
    2. make sure it stays down (i.e. turn it off)
    3. ensure data is replicated correctly
    4. start backup node
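
The failover steps above can be sketched as an ordered procedure. This is only an illustration under stated assumptions: every callable passed in (`is_active_down`, `fence_active`, `is_fenced`, `replication_ok`, `start_backup`) is a hypothetical placeholder for site-specific tooling, and none of them is a Prosody API.

```python
from typing import Callable


class FailoverAborted(RuntimeError):
    """Raised when a step cannot be confirmed and starting the backup would be unsafe."""


def fail_over(
    is_active_down: Callable[[], bool],
    fence_active: Callable[[], None],
    is_fenced: Callable[[], bool],
    replication_ok: Callable[[], bool],
    start_backup: Callable[[], None],
) -> bool:
    """Run the four failover steps in order; return True if the backup was started."""
    # 1. detect that the currently active node is down
    if not is_active_down():
        return False  # nothing to do
    # 2. make sure it stays down, and confirm that it really is off
    fence_active()
    if not is_fenced():
        raise FailoverAborted("could not confirm the active node is off")
    # 3. ensure data is replicated correctly
    if not replication_ok():
        raise FailoverAborted("replica is not confirmed consistent")
    # 4. start the backup node
    start_backup()
    return True
```

The point of the structure is that `start_backup` is unreachable unless fencing has been confirmed, so two nodes are never started against the same data by this code path.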

  41. jonas’

    anything other than exactly this failover procedure is gambling.

  42. nuegia.net

    » 3. ensure data is replicated correctly
    is it talking about the database backend?

  43. jonas’

    yes.

  44. nuegia.net

    also something that's not covered; restoration of the primary server.

  45. jonas’

    should not need extra action with proper databases, but you never know.

  46. jonas’

    well restoration of the primary server is just like another failover.

  47. nuegia.net

    doesn't seem impossible

  48. nuegia.net

    » - have a way to "fence" (turn off) a node remotely and be sure it's actually off, even when it is already broken/unreachable
    would iocage stop prosody1 suffice?

  49. jonas’

    I do not know what those words mean.

  50. nuegia.net

    » 3. ensure data is replicated correctly
    how is this done?

  51. nuegia.net

    » <jonas’> I do not know what those words mean.
    turning off the BSD jail that prosody belongs to

  52. jonas’

    seems like it would be sufficient.

  53. jonas’

    regarding "3. ensure data is replicated correctly" / "how is this done?" -> with a proper replicated database, that's going to be a given and doesn't need separate checking.

  54. jonas’

    if you use some homebrew hot-standby filesystem sync trickery, that's a different story.

  55. nuegia.net

    so you just mean configure a postgres cluster, not run an application-specific database table consistency check

  56. jonas’

    yes

  57. nuegia.net

    is there any benefit or cons to configuring prosody to use databases instead of files that it uses by default?

  58. jonas’

    yes.

  59. jonas’

    (but don't ask me for specifics)

  60. Polarian

    > is there any benefit or cons to configuring prosody to use databases instead of files that it uses by default?
    Normally speed

  61. Polarian

    and maintainability

  62. Polarian

    and scriptability

  63. jonas’

    (the discussion moved on to prosody@conference.prosody.im)

  64. Polarian

    oh... ok nevermind then :)

  65. jonas’

    thanks :)

  66. sch

    Greetings to one and all

  67. sch

    Would it be correct to assume that an agent/transport (server component) JID cannot return a ping?

  68. sch

    I ask because a function that pings its own JID appears to indicate that the JID does not return the ping when the JID is a component.

  69. Guus

    every XMPP entity MUST respond to an IQ request - even if it doesn't understand it.

  70. jonas’

    correct.

  71. MattJ

    (re-asked in jdev, a more appropriate venue I think)

  72. Guus

    So, even if the component does not understand the 'ping' request, it should still return an error.
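
A minimal sketch of the rule stated here: an IQ of type `get` or `set` always gets a reply, either a `result` or an `error`. The `urn:xmpp:ping` payload is from XEP-0199 and `service-unavailable` is the usual error for an unsupported request. The `respond_to_iq` helper is hypothetical, and the stream namespace on the `iq` element is left out for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Optional

PING_NS = "urn:xmpp:ping"
STANZAS_NS = "urn:ietf:params:xml:ns:xmpp-stanzas"


def respond_to_iq(iq: ET.Element) -> Optional[ET.Element]:
    """Build the reply for an incoming IQ; only 'result' and 'error' IQs get no reply."""
    if iq.get("type") not in ("get", "set"):
        return None
    reply = ET.Element("iq", {"id": iq.get("id", "")})
    # Swap the addresses so the reply goes back to the sender.
    if iq.get("from"):
        reply.set("to", iq.get("from"))
    if iq.get("to"):
        reply.set("from", iq.get("to"))
    child = iq[0] if len(iq) else None
    if iq.get("type") == "get" and child is not None and child.tag == f"{{{PING_NS}}}ping":
        reply.set("type", "result")  # pong: an empty result
    else:
        # Request not understood: still answer, with an error.
        reply.set("type", "error")
        error = ET.SubElement(reply, "error", {"type": "cancel"})
        ET.SubElement(error, f"{{{STANZAS_NS}}}service-unavailable")
    return reply
```

A sender that receives the `service-unavailable` error still learns that the entity is reachable, which is why a ping check can treat that error as a successful round trip.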

  73. jonas’

    correct, MattJ.

  74. sch

    Pardon for cross-posting

  75. sch

    Guus, I run the component and it appears that I fail to receive the ping that the component sends to itself.

  76. sch

    Both client and component have the same XEPs loaded.