-
tom
https://www.pyzor.org/en/latest/about.html
-
tom
» Discards all message headers.
» If the message is greater than 4 lines in length:
»   Discards the first 20% of the message.
»   Uses the next 3 lines.
»   Discards the next 40% of the message.
»   Uses the next 3 lines.
»   Discards the remainder of the message.
» Removes any ‘words’ (sequences of characters separated by whitespace) that are 10 or more characters long.
» Removes anything that looks like an email address (X@Y).
» Removes anything that looks like a URL.
» Removes anything that looks like HTML tags.
» Removes any whitespace.
» Discards any lines that are fewer than 8 characters in length.
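The steps quoted above can be sketched roughly as follows. This is a minimal illustration that follows the quoted list in the order it is written, not Pyzor's actual code: the function names, the regular expressions, and the choice of SHA-1 as the final hash are assumptions, and the input is assumed to be a message body with the headers already stripped.

```python
import hashlib
import re

# Hypothetical patterns approximating the quoted rules; not taken from Pyzor.
LONG_WORD_RE = re.compile(r"\S{10,}")       # 'words' of 10 or more characters
EMAIL_RE = re.compile(r"\S+@\S+")           # anything that looks like X@Y
URL_RE = re.compile(r"[a-z]+://\S+", re.I)  # anything that looks like a URL
TAG_RE = re.compile(r"<[^>]*>")             # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")                  # any whitespace


def select_lines(lines):
    """Pick the two 3-line windows described in the quote."""
    n = len(lines)
    if n <= 4:
        return list(lines)
    first = int(n * 0.2)               # discard the first 20%
    second = first + 3 + int(n * 0.4)  # use 3 lines, then discard the next 40%
    return lines[first:first + 3] + lines[second:second + 3]


def normalize(line):
    """Strip long words, addresses, URLs, tags and whitespace from one line."""
    for pattern in (LONG_WORD_RE, EMAIL_RE, URL_RE, TAG_RE, WS_RE):
        line = pattern.sub("", line)
    return line


def digest(body):
    """Return a hex digest of the normalized, selected lines of a body."""
    kept = [norm for norm in map(normalize, select_lines(body.splitlines()))
            if len(norm) >= 8]         # discard lines shorter than 8 characters
    return hashlib.sha1("".join(kept).encode("utf-8")).hexdigest()
```

With this sketch, two bodies that differ only in an injected random token or URL on one of the selected lines hash to the same digest, which is the property the Pyzor description is after. A one-line "hello" normalizes to fewer than 8 characters and is dropped entirely, so it carries no content signal.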
-
tom
Would this work for XMPP as well?
-
tom
This wouldn't trigger on 'hello' either
-
tom
» The central premise of Pyzor is that it converts an email message to a short digest that uniquely identifies the message. Simply hashing the entire message is an ineffective method of generating a digest, because message headers will differ when the content does not, and because spammers will often try to make a message unique by injecting random/unrelated text into their messages.
-
tom
Not whether this exact piece of software would work for XMPP as well, but would a similar implementation specifically designed for XMPP work for fighting XMPP spam?
-
MattJ
My feeling is that advanced content analysis (whether manually-coded heuristics or machine-learning/"AI") won't work with IM, because there simply isn't enough content per message
-
jonas’
I tend to agree
-
Holger
Well actual spam often does contain quite a bit of content.
-
Holger
But yes relying on just that won't do the trick for all kinds of spam.
-
tom
Of course not
-
tom
But as part of a larger weighted system, like how SpamAssassin works to score with a bunch of milters maybe
-
Holger
Yup.
-
Holger
IMO we should classify based on as much data as possible, and while content alone won't be enough, it can certainly contribute to a score. The score of a "hello" message obviously won't be based on content at all but IMO that's no reason to ignore content altogether. E2EE can be a problem though.
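A weighted score of the kind discussed here could look roughly like the sketch below. Every rule, weight, field name and the threshold are hypothetical placeholders chosen for illustration; nothing here is taken from SpamAssassin or from any XMPP server.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical fields; a real server would have different metadata.
    body: str
    sender_in_roster: bool
    sender_account_age_days: int


def score_unknown_sender(msg):
    return 2.0 if not msg.sender_in_roster else 0.0


def score_new_account(msg):
    return 1.5 if msg.sender_account_age_days < 7 else 0.0


def score_contains_url(msg):
    # Content-based rule: contributes nothing for a bare "hello".
    return 1.0 if "://" in msg.body else 0.0


RULES = [score_unknown_sender, score_new_account, score_contains_url]
THRESHOLD = 3.0  # made-up cut-off


def spam_score(msg):
    """Sum the contribution of every rule."""
    return sum(rule(msg) for rule in RULES)


def is_spam(msg):
    return spam_score(msg) >= THRESHOLD
```

In this sketch a "hello" from a brand-new, unknown account is scored purely on metadata, while a message with a URL from a long-standing roster contact picks up only the small content contribution and stays under the threshold.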
-
Holger
(But it might also be an additional data point if the spammer's E2EE somehow differs from the user's.)
-
moparisthebest
tom, my spam dropped 99% when I started dropping any messages containing Cyrillic
-
tom
lol
-
Licaon_Kter
moparisthebest: but what about your Russian buddies?
-
tom
that's not really practical though
-
Ge0rG
moparisthebest: drop US-ASCII and your spam will go down by 100%
-
moparisthebest
obviously whether you can do that or not depends on who your server's users talk to, but maybe it's a good heuristic
-
moparisthebest
none of my family can read them anyway so no loss :)
-
tom
drop all stanzas
-
tom
trust no one, not even yourself
-
Ge0rG
my "family" consists of ~2000 active users, of which a significant minority is Russian.
-
moparisthebest
right, I'm not proposing a XEP to forbid Cyrillic from XMPP network-wide, just a maybe-helpful heuristic depending on who your users are, useless in your case obviously :)
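As a rough sketch of that heuristic, a per-deployment check for Cyrillic characters could be written like this. The function name is hypothetical and the check relies only on Unicode character names from the standard library; whether to act on the result is exactly the per-server judgment call described above.

```python
import unicodedata


def contains_cyrillic(text):
    """Return True if any character's Unicode name marks it as Cyrillic."""
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in text)
```

On a server like the one Ge0rG describes, such a check would more plausibly feed a small contribution into a score than drop messages outright.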
-
tom
well
-
tom
One of the things I did when setting up SpamAssassin was assign a higher spam score to text that isn't English, Spanish, or German
-
Holger
That's where user-trained Bayesian filters (with per-user DB) become useful ...
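To illustrate the per-user idea, here is a minimal naive Bayes sketch with one token database per user. The class name, the whitespace tokenizer, and the add-one smoothing are assumptions made for the example, not a description of any existing filter.

```python
import math
from collections import Counter, defaultdict


class UserBayes:
    """Tiny naive Bayes classifier holding one user's training data."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record a message the user marked as 'spam' or 'ham'."""
        self.tokens[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) with add-one smoothing."""
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) or 1
        spam_total = sum(self.tokens["spam"].values())
        ham_total = sum(self.tokens["ham"].values())
        log_odds = math.log((self.messages["spam"] + 1) /
                            (self.messages["ham"] + 1))
        for token in text.lower().split():
            p_spam = (self.tokens["spam"][token] + 1) / (spam_total + vocab)
            p_ham = (self.tokens["ham"][token] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# One database per user, keyed by a (hypothetical) bare JID.
filters = defaultdict(UserBayes)
```

Because each user trains their own database, the same Russian-language message can score as ham for a user who marks such messages as wanted and as spam for a user who marks them as unwanted.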
-
frog
Ge0rG: a bit off-topic, but I remember talking with you about the DNS challenge for LE certs ages ago. What software did you use for this?
-
Licaon_Kter
frog: aren't they all doing it? E.g. acme.sh
-
frog
I don't do it yet, still on the HTTP challenge. I wanted to see how others do it before migrating
-
frog
acme.sh was the most promising from my search, need to try it
-
moparisthebest
acme.sh is excellent
-
moparisthebest
I do DNS challenges with it
-
tom
moparisthebest: quark.c is a great companion to acme.sh if you don't already have a webserve