jdev - 2020-06-29


  1. jonas’

    lovetox, I don’t see any issue here, so unlikely: https://github.com/horazont/muchopper/issues

  2. Жокир

    Do any popular servers actually implement XEP-0368? If yes, could anyone point me to such servers?

  3. jonas’

    Жокир, https://compliance.conversations.im/ : any servers listed there as "compliant" should do, at least for c2s

  4. jonas’

    I don’t know of any s2s implementation except maybe https://github.com/surevine/Metre , which isn’t quite a server.

  5. Guus

    Looking at my server log, I'm noticing that I'm getting a lot of connection timeouts on s2s in bursts - presumably x minutes after the user that caused the federation to be set up sent their last presence update.

  6. Guus

    I wonder if it'd be good to introduce a small factor of randomness to the timeout interval, to avoid this kind of synchronized behavior.
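
    A minimal sketch of what such a jittered timeout could look like, in Java (chosen since Openfire is a Java server); the class name, base interval, and jitter fraction are illustrative assumptions, not anything either server actually ships:

        import java.time.Duration;
        import java.util.concurrent.ThreadLocalRandom;

        // Hypothetical helper: spread idle-timeout expirations out so that s2s
        // connections that were set up in the same burst are not all torn down
        // at the same instant.
        final class IdleTimeouts {
            // Assumed base idle timeout; not any server's actual default.
            private static final Duration BASE = Duration.ofMinutes(15);
            // Add up to +/-10% of random jitter around the base value.
            private static final double JITTER = 0.10;

            static Duration nextTimeout() {
                double factor = 1.0 + ThreadLocalRandom.current().nextDouble(-JITTER, JITTER);
                return Duration.ofMillis((long) (BASE.toMillis() * factor));
            }
        }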

  7. flow

    Guus, maybe a controversial counter-question: that does sound like a timeout enforced at the application layer. If so, why have an application-layer timeout for s2s connections at all, instead of simply letting the TCP connection time out?

  8. jonas’

    flow, save resources.

  9. jonas’

    the TCP connection will also never time out

  10. flow

    is it worth it?

  11. jonas’

    because both peers can see each other (in this scenario)

  12. jonas’

    file descriptors are limited and when you notice you’re running out of them, it’s too late

  13. jonas’

    being a bit proactive about preserving them is generally a good idea

  14. flow

    ok so kill idle connections based on the number of available file descriptors, but not based on time

  15. jonas’

    you don’t know the amount of available file descriptors

  16. flow

    (or, to be precise, use time only as a secondary criterion)

  17. jonas’

    you know the limit, but you don’t know how many are open in your process

  18. jonas’

    you can estimate, but you can be wrong in the bad direction.

  19. jonas’

    (or in both directions, depending on how you estimate)

  20. flow

    ls /proc/$pid/fd/ | wc -l

  21. flow

    ?
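
    A small sketch of the same idea from inside a Java process, by listing /proc/self/fd. This is Linux-specific, and each call costs a directory read's worth of syscalls, which is the overhead discussed next:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.stream.Stream;

        final class ProcFdCount {
            // Same idea as `ls /proc/$pid/fd/ | wc -l`, done by the process itself.
            static long countOpenFds() throws IOException {
                // Files.list opens /proc/self/fd, which itself uses one fd,
                // so the returned count is one higher than the steady state.
                try (Stream<Path> entries = Files.list(Path.of("/proc/self/fd"))) {
                    return entries.count();
                }
            }
        }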

  22. jonas’

    that’s at least a rather expensive way to do it

  23. jonas’

    but true, that works, on systems where procfs has that feature

  24. flow

    why is it expensive?

  25. Guus

    flow I don't mind much either way - I'm just noticing that I get a lot of disconnects. I'm thinking Prosody does this? Openfire probably does so as well, though.

  26. jonas’

    flow, that’s many syscalls

  27. Guus

    (at the very least, it's configurable)

  28. jonas’

    I can’t see immediately in man 5 procfs whether /proc/$pid/fd is a linux or a posix thing

  29. flow

    jonas’, I wouldn't be surprised if there is a more efficient way to get that number

  30. jonas’

    I would

  31. Guus

    I don't mind much closing idle connections (although it does feel a bit like premature optimization).

  32. flow

    especially on linux

  33. jonas’

    I think I looked into that already and found that it’s not possible

  34. jonas’

    there’s surely a reason why sudo does a for i in 0..MAXFD do close($i); done

  35. Guus

    as Openfire is a multi-platform solution, depending on any platform specific thingy is going to be a pain.

  36. Guus

    unless Java exposes things, which I doubt.

  37. flow

    Guus, UnixOperatingSystemMXBean.getOpenFileDescriptorCount()

  38. flow

    not sure if something like that also exists for other OSes

  39. Guus

    *Unix*OperatingSystemMXBean is likely going to fail on Windows? 🙂

  40. Guus

    but also: not worth the complexity, maybe?

  41. flow

    so you will probably have to implement a fallback strategy anyway (like disconnecting based on a timeout)
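
    A rough sketch of that combination: read the count from the JDK-specific com.sun.management bean where it is available, and treat its absence (e.g. on Windows) as the cue to fall back to the plain time-based idle timeout; the 80% high-water mark is an assumed policy, not any server's default:

        import java.lang.management.ManagementFactory;
        import java.lang.management.OperatingSystemMXBean;
        import com.sun.management.UnixOperatingSystemMXBean;

        final class FdPressure {
            // Assumed policy: start closing idle connections early once 80% of
            // the process fd limit is in use.
            private static final double HIGH_WATER = 0.8;

            // True if fd usage can be read (Unix-like JVMs only) and is high.
            // On Windows the Unix-specific bean is not implemented, so this
            // returns false and the caller should rely on the time-based
            // idle timeout alone.
            static boolean fdPressureHigh() {
                OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
                if (os instanceof UnixOperatingSystemMXBean) {
                    UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                    return unix.getOpenFileDescriptorCount()
                            > HIGH_WATER * unix.getMaxFileDescriptorCount();
                }
                return false;
            }
        }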

  42. jonas’

    I don’t see any problem with a timeout here, to be honest

  43. flow

    Guus, potentially, depends on your goals I'd say

  44. jonas’

    everything else seems slightly overengineered

  45. jonas’

    file descriptors may also just be one reason why you want to keep the number of open connections low

  46. jonas’

    other reasons may include running behind a stateful firewall and wanting to conserve resources there

  47. Guus

    I was just suggesting to add a small random factor in the timeout delay, nothing more 😉

  48. flow

    Guus, which is always a good idea

  49. Guus

    also, given how I see batches of s2s connections being torn down only to be brought up again, I suspect that the default timeout of (Prosody?) might be on the low end.

  50. jonas’

    prosody doesn’t have a default timeout

  51. Guus

    oh, that's interesting

  52. Guus

    note that I didn't actually check what server software is used on those. I just assumed.

  53. Guus

    having had a closer look: might be ejabberd 🙂

  54. jonas’

    https://sotecware.net/files/noindex/connections.png

  55. jonas’

    :-)

  56. flow

    looks like a 30 minute timeout

  57. flow

    combined with an hourly cron job maybe?

  58. jonas’

    I think there are two timeouts, one ~15min (the linear curve down) and one ~30min (which also looks randomized, because of the slight exp-y behaviour at the end)

  59. jonas’

    and yes, this is the connection stats of search.jabber.network, and the spikes you see is the hourly scan :)

  60. Guus

    (maybe randomize your scan!)

  61. flow

    now we only need to identify the implementations with the 15m and 30m (randomized) timeouts

  62. jonas’

    Guus, it’s already shuffled :)

  63. flow

    and what is keeping the baseline of 1.5k connections

  64. Guus

    moar shuffling!

  65. jonas’

    flow, compare the ratios with https://search.jabber.network/stats#software :)

  66. jonas’

    assuming that many "unknowns" are in fact prosody MUCs, because prosody doesn’t report version on MUC by default IIRC

  67. flow

    ahh, so it is probably prosody which keeps the connections

  68. jonas’

    flow, very likely

  69. flow

    but the numbers of connections with the 15m and 30m timeouts appear to be nearly equal

  70. jonas’

    I experimented with loading mod_s2s_idle_timeout or whatsitcalled on s.j.n, but then I disabled it to reduce the codebase to the minimum for some unrelated testing

  71. jonas’

    flow, I’ll have to dig deeper into it, it’s also possible that the different behaviours there are an artifact of how the scanner works

  72. jonas’

    flow, I’ll have to dig deeper into it, it’s also possible that the two different falloff behaviours there are an artifact of how the scanner works

  73. flow

    i see

  74. jonas’

    since there are two scanning components and one finishes much quicker than the other, it's possible that the quicker one is causing the additional tip of the initial spike, while the slower one causes the slow falloff at the end

  75. jonas’

    since the quicker one also tends to touch more domains

  76. jonas’

    oh yeah, that’s very plausible

  77. jonas’

    that may also explain the exp falloff due to shuffling

  78. jonas’

    if there’s really just a 15m or something timeout involved

  79. flow

    jonas’, are you aware that 'German' appears twice in the room languages table?

  80. jonas’

    yes

  81. jonas’

    de-de vs. de

  82. jonas’

    I need to normalize that

  83. jonas’

    https://sotecware.net/files/noindex/connections-1h.png

  84. jonas’

    https://sotecware.net/files/noindex/ingestion-1h.png

  85. jonas’

    that seems to fit very well

  86. jonas’

    (the "filled" part in the second graph is the fast component, the "line" part in the second graph is the slow component)

  87. jonas’

    the fast component ends at 07:24, which is exactly when the initial spike drops in the first graph