ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
alpernebbi has quit [Ping timeout: 480 seconds]
Traneptora_ has quit [Ping timeout: 480 seconds]
alpernebbi has joined #freedesktop
scrumplex has joined #freedesktop
scrumplex_ has quit [Ping timeout: 480 seconds]
JanC is now known as Guest24195
JanC has joined #freedesktop
Guest24195 has quit [Ping timeout: 480 seconds]
JanC is now known as Guest24196
JanC has joined #freedesktop
Guest24196 has quit [Ping timeout: 480 seconds]
snetry has joined #freedesktop
sentry has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
Zathras_11 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
Traneptora has joined #freedesktop
Zathras has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
AbleBacon has quit [Remote host closed the connection]
Kayden has joined #freedesktop
JanC has quit [Ping timeout: 480 seconds]
tzimmermann has joined #freedesktop
sima has joined #freedesktop
haaninjo has joined #freedesktop
swatish2 has joined #freedesktop
alarumbe has quit []
todi1 has joined #freedesktop
todi has quit [Ping timeout: 480 seconds]
karolherbst has quit [Ping timeout: 480 seconds]
karolherbst has joined #freedesktop
ximion has quit [Remote host closed the connection]
swatish2 has quit [Ping timeout: 480 seconds]
JanC has joined #freedesktop
kasper93_ has joined #freedesktop
kasper93 is now known as Guest24211
kasper93_ is now known as kasper93
Guest24211 has quit [Ping timeout: 480 seconds]
kasper93 has quit [Ping timeout: 480 seconds]
kasper93 has joined #freedesktop
kasper93_ has joined #freedesktop
kasper93 is now known as Guest24213
kasper93_ is now known as kasper93
kasper93_ has joined #freedesktop
kasper93 is now known as Guest24214
kasper93_ is now known as kasper93
kasper93_ has joined #freedesktop
kasper93 is now known as Guest24215
kasper93_ is now known as kasper93
Guest24213 has quit [Ping timeout: 480 seconds]
Guest24214 has quit [Ping timeout: 480 seconds]
Guest24215 has quit [Ping timeout: 480 seconds]
<karolherbst>
daniels: I was thinking about the smtp credentials thing again and... couldn't we just do it via ssh port forwarding? Like we just create a tunnel to the server and then send the email through that, so it looks like it comes from our servers anyway
<karolherbst>
uhm..
<karolherbst>
or isn't that how it works 🙃
<Mithrandir>
you could do that, or make /usr/sbin/sendmail on the sending host just ssh to wherever and call sendmail on the other side.
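A minimal sketch of the sendmail-over-ssh wrapper Mithrandir describes, assuming passwordless ssh access to gabe.freedesktop.org and a working /usr/sbin/sendmail on the remote side; the wrapper path and hostname are illustrative, not an agreed setup:

    #!/bin/sh
    # hypothetical local "sendmail" replacement: relay the message that the
    # mail client pipes in to sendmail on the remote host over ssh, passing
    # the original arguments (e.g. -t, -oi, recipient addresses) through,
    # so the mail is injected on the server and only its IP appears
    exec ssh gabe.freedesktop.org /usr/sbin/sendmail "$@"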
<karolherbst>
yeah but sendmail is a terrible interface
<karolherbst>
and the point was to not use it in the first place
<pinchartl>
karolherbst: just for my information, what's the issue with leaking the sender's IP ?
<karolherbst>
pinchartl: protecting the CoC member who specifically sent out the email?
<karolherbst>
like if it's an IP in $country, the person might figure out who it was and just target them specifically
<karolherbst>
or if it's an IP to a company network
<karolherbst>
or...
<pinchartl>
so it's to be able to send mails that originate from a group, without identifying the person who pressed the button. got it
<karolherbst>
exactly
<karolherbst>
but sending emails via ssh on the cli on the server sucks for many reasons
<Mithrandir>
karolherbst: your email client is likely able to call sendmail on the host you're running it on, was my point.
<karolherbst>
uhhhhh I kinda want a reliable solution tbh
<karolherbst>
like I suspect "email client using sendmail" is de facto dead code and untested
jsa1 has joined #freedesktop
<Mithrandir>
I obviously can't speak for every client under the sun, but I'd assume _lots_ of people use that for most of the clients.
<karolherbst>
huh? why would they if they have imap and smtp?
<karolherbst>
like yeah, git send-email is used, but besides that?
<karolherbst>
and even that uses smtp
<karolherbst>
though I think that uses sendmail under the hood
<karolherbst>
heh.. actually it also does smtp natively
<karolherbst>
yeah.. so if you specify smtpserver with git send-email it does not use sendmail
<karolherbst>
but anyway, I don't understand why it's controversial wanting to use smtp, given that's what most people do and what clients actually use, whereas sendmail is not
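For illustration, the smtpserver bit mentioned above maps to git configuration like the following; the server name, port, and user are placeholders, not actual freedesktop.org settings:

    # sketch: make git send-email talk SMTP directly instead of invoking sendmail
    git config sendemail.smtpServer mail.example.org
    git config sendemail.smtpServerPort 587
    git config sendemail.smtpEncryption tls
    git config sendemail.smtpUser someuser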
<karolherbst>
like thunderbird doesn't seem to support sendmail either
<karolherbst>
Evolution seems to have terrible integration where you need to build the argument list yourself for anything beyond the basic "send to one recipient" situation...
<karolherbst>
anyway.. what needs to be done to make smtp work?
vkareh has joined #freedesktop
<daniels>
karolherbst: wanting to use SMTP isn't controversial, but that doesn't mean it's really easy to do or that people have the time on their hands to go figure out how to make it work
<karolherbst>
sure, I understand that, and I'd be up for helping out with that
<karolherbst>
daniels: anyway, what's like the main thing that needs to be done to wire it all up for smtp? Creating the account? Anything else?
<daniels>
karolherbst: right, you can use saslpasswd2 on gabe to create an account to be able to use SMTP from - what needs to be done is figuring out how it is that you don't immediately leak the Received header
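For reference, account creation with saslpasswd2 usually looks something like the sketch below; the realm and username are made-up placeholders and the exact invocation on gabe may differ:

    # sketch: create a SASL user for SMTP AUTH (run on the mail host)
    # -c creates the entry, -u sets the realm; both values are placeholders
    saslpasswd2 -c -u gabe.freedesktop.org conduct-smtp
    # list existing entries to verify the account was created
    sasldblistusers2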
<karolherbst>
daniels: yeah.. that's something I can play around with. The idea I had was to use ssh port forwarding and see if that helps
<karolherbst>
so from gabe it looks like it's coming from a local account
<daniels>
sure, go for it
<karolherbst>
daniels: I don't think I have enough permissions for the account creation part
<Mithrandir>
just do ssh -L 1234:localhost:25 gabe and then use localhost:1234 as the smtp server, no extra accounts needed, and nothing should leak.
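A quick way to check that, assuming port forwarding is permitted for the account (which, as it turns out below, may need enabling): open the tunnel and do a manual SMTP exchange through it; the EHLO hostname is just a placeholder.

    # terminal 1: forward local port 1234 to the MTA on gabe
    ssh -L 1234:localhost:25 gabe
    # terminal 2: the banner and EHLO reply should come from the server
    printf 'EHLO test.invalid\r\nQUIT\r\n' | nc localhost 1234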
<karolherbst>
Mithrandir: but I need to do it with the account I want to send from, right?
<karolherbst>
but yeah.. let me try that
<Mithrandir>
no, any account should be fine
<karolherbst>
ahh, okay
<karolherbst>
that makes it easier
guludo has joined #freedesktop
<karolherbst>
Mithrandir: I'm getting "channel 3: open failed: administratively prohibited: open failed"
<Mithrandir>
maybe we need to allow your user to do port forwarding
<karolherbst>
Mithrandir: maybe it would be easier to do it on the conduct account and then we can manage ssh keys there, so it's not tied to individuals? Though for testing it would be good enough if I can do it with my own account
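If the shared conduct account route were taken, the keys added there could be locked down so they are only usable for the relay; a sketch of an authorized_keys entry under that assumption (OpenSSH 7.2+, key material elided):

    # hypothetical ~conduct/.ssh/authorized_keys entry: the key may only run
    # the forced command (an SMTP session with the local MTA) and nothing else
    restrict,command="nc localhost 25" ssh-ed25519 AAAA... coc-relay-key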
<Mithrandir>
or socat tcp-listen:1234,reuseaddr,fork EXEC:'ssh gabe socat - TCP4-CONNECT:localhost:25' maybe
<karolherbst>
"E EXEC: wrong number of parameters (3 instead of 1): usage: EXEC:<command-line>"
<karolherbst>
uhm...
<karolherbst>
let me check first if it's not a local mess up :D
<karolherbst>
nah, seems fine
kasper93 has quit [Ping timeout: 480 seconds]
kasper93 has joined #freedesktop
<Mithrandir>
socat TCP-LISTEN:1234,reuseaddr,fork "EXEC:ssh gabe.freedesktop.org nc localhost 25" works for me
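Putting the pieces together, a hedged end-to-end sketch: run the relay locally, then point a client such as git send-email at it; the hostname is the one used above, the patch file is a placeholder.

    # terminal 1: local relay, forwarding port 1234 to the MTA on gabe
    socat TCP-LISTEN:1234,reuseaddr,fork "EXEC:ssh gabe.freedesktop.org nc localhost 25"
    # terminal 2: send through the relay; the Received chain then starts at gabe
    git send-email --smtp-server=localhost --smtp-server-port=1234 0001-example.patch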
<karolherbst>
Mithrandir: that worked, thanks!
<karolherbst>
"client-ip=131.252.210.177" sounds about right
jsa1 has quit [Ping timeout: 480 seconds]
jsa1 has joined #freedesktop
AbleBacon has joined #freedesktop
jsa2 has joined #freedesktop
jsa1 has quit [Ping timeout: 480 seconds]
alarumbe has joined #freedesktop
ximion has joined #freedesktop
noodlez1232 has quit [Remote host closed the connection]
noodlez1232 has joined #freedesktop
jsa1 has joined #freedesktop
jsa2 has quit [Ping timeout: 480 seconds]
tzimmermann has quit [Quit: Leaving]
jsa1 has quit [Ping timeout: 480 seconds]
xe has joined #freedesktop
sentry has joined #freedesktop
snetry has quit [Ping timeout: 480 seconds]
___nick___ has joined #freedesktop
guludo has quit [Ping timeout: 480 seconds]
guludo has joined #freedesktop
snetry has joined #freedesktop
sentry has quit [Ping timeout: 480 seconds]
scrumplex_ has joined #freedesktop
___nick___ has quit [Remote host closed the connection]
i-garrison has quit []
scrumplex has quit [Ping timeout: 480 seconds]
ybogdano has quit [Remote host closed the connection]
ybogdano has joined #freedesktop
___nick___ has joined #freedesktop
vkareh has quit [Quit: WeeChat 4.7.0]
Caterpillar has quit [Remote host closed the connection]
<Consolatis>
did the crawlers finally figure out how to modify their user agent?
<karolherbst>
nah, they just started solving the math problem apparently
<dwfreed>
^ exactly that
___nick___ has quit [Remote host closed the connection]
andy-turner has quit []
<Consolatis>
well, that was to be expected at some point. The fun starts once they bypass Anubis specifically, at which point it will only bother legit users while not interfering with any crawlers
<pinchartl>
time to make nvidia accountable for destroying the web ?
<Consolatis>
well, or in the case of gitlab.fdo to require a login to access CPU / database intensive pages which can't (or shouldn't) be cached
<pinchartl>
those are not mutually exclusive options
<karolherbst>
if there is one thing I'm sure of, then it's that logins do nothing against bots
<karolherbst>
most of the accounts on our gitlab are bots anyway
<Consolatis>
it defeats non-targeted attacks
<karolherbst>
not really
<karolherbst>
you think those sign up by hand?
<karolherbst>
since anubis got deployed we are seeing a lot less sign-ups as well
<Consolatis>
if a scraper registers an account it is (by my definition) targeted
<karolherbst>
well you can automate it
<karolherbst>
and detecting gitlab isn't even hard
<Consolatis>
if it really turns out to be an issue even with an account requirement then one could also start tracking requests per hour per account + IPs used within $timespan per account and react accordingly
<karolherbst>
well then they'll create 10000 accounts across 10000 IPs
alanc has quit [Remote host closed the connection]
<pinchartl>
10000 ? if only it was that little
<karolherbst>
they don't even use their own machines for the scraping
<pinchartl>
some scrapers use "vpn" networks that claim 100M residential IPs
<karolherbst>
all the mitigations you'd come up with, they've already found a way to get around
alanc has joined #freedesktop
<Consolatis>
doesn't gitlab.fdo require a capture for account creation?
<Consolatis>
captcha*
<karolherbst>
bots solve captchas better than humans
<pinchartl>
captchas have long been defeated
<karolherbst>
"all the mitigations you'd come up with, they'll already found a way to get around"
<karolherbst>
I meant it
<Consolatis>
if that were the case.. then why are we still using those captchas and similar annoyances? to make the human experience worse?
<karolherbst>
yep
guludo has quit [Ping timeout: 480 seconds]
<karolherbst>
or to train AI
<karolherbst>
it's even worse.. some also sell you captcha-solving services
<karolherbst>
super cheap
<karolherbst>
sometimes it's machines
<karolherbst>
sometimes it's people crammed in sweatshops
<karolherbst>
soo one might even say it's unethical to deploy captchas at all
jsa1 has joined #freedesktop
<pinchartl>
captchas are still useful to address the low-hanging fruits
<pinchartl>
but certainly not the professional bots
<karolherbst>
yeah but compared to the AI crawlers they don't really matter anymore
<Consolatis>
well, let's see how long it takes the crawlers to bypass Anubis rather than just solving it (be it via a custom implementation or a 2nd, more expensive tier of crawlers using actual headless browsers)
<karolherbst>
well the point of Anubis was that it forces you to deploy something capable of doing real JS
<karolherbst>
which most crawlers just didn't do
<Consolatis>
there are other ways to do that, no computation required
<karolherbst>
the entire point is just to make use of the fact that those crawlers are developed really badly
<karolherbst>
Consolatis: I mean you are free to come up with another alternative that works, I'm sure everybody would want to use it
<Consolatis>
whenever a larger group uses the same "alternative" it becomes a big player and thus a target to attack. site specific behavior seems way more scalable
<karolherbst>
not gonna patch gitlab
<karolherbst>
but sure, every website on earth could come up with their own mitigation and then nobody else would have time to do anything else anymore
<karolherbst>
only to get defeated anyway
ybogdano has joined #freedesktop
<karolherbst>
anyway, I assure you that all the basic stuff doesn't work these days
sima has quit [Ping timeout: 480 seconds]
<Consolatis>
it certainly helps if software is written in a way which takes crawlers into account and minimizes CPU and database queries; delivering static content is cheap. but I assume > "not gonna patch gitlab" is the main issue here
<Consolatis>
as an example, an MR only updates on certain changes like git pushes, comments, label changes and so on. Instead of a giant query (or a bunch of simpler ones via XHR), each change can update static content (redis / actual file / ..) which then gets delivered to any client until the next change
<Consolatis>
so it scales with the amount of MR changes rather than with the amount of users
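A tiny sketch of that render-on-change idea, assuming a hypothetical render_mr_page helper; the page is regenerated once per MR event and served as a static file until the next event:

    #!/bin/sh
    # hypothetical post-event hook: regenerate the MR page once per change
    # (push, comment, label edit, ...) instead of once per request
    MR_ID="$1"
    CACHE_DIR=/var/cache/mr-pages
    render_mr_page "$MR_ID" > "$CACHE_DIR/$MR_ID.html.tmp" \
        && mv "$CACHE_DIR/$MR_ID.html.tmp" "$CACHE_DIR/$MR_ID.html"
    # the web server serves $CACHE_DIR/$MR_ID.html directly to every client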
<Consolatis>
then there are things like git blame which personally I would simply disable completely, there is git itself for that
<karolherbst>
I think you kinda miss the point here
<karolherbst>
it doesn't matter if something is cheap or not
<karolherbst>
if every endpoint is cheap, they'll just crawl even more
jturney has quit [Ping timeout: 480 seconds]
<Ermine>
cheap * millions of them = not so cheap
<Consolatis>
why would a single AI crawler crawl the same page twice? it would be against its own goal to crawl as much of the web as possible
<karolherbst>
it's like with roads, or processing power. The more capacity you have, the more usage you'll see
<Consolatis>
I don't buy this argument
<karolherbst>
Consolatis: :D
<karolherbst>
they crawl the same page millions of times
<karolherbst>
they don't care
<karolherbst>
they crawl as often and as much as they want
<karolherbst>
don't try to apply your logic to them
<karolherbst>
you'll only lose the argument
<karolherbst>
if you think it's too stupid to do, they do it anyway
<Consolatis>
in that case it's still better to have more requests from the same crawler which in total still don't really impact the important resources (outside of bandwidth) than to have fewer requests which bring everything to a stop
<karolherbst>
they'll just crawl even more, seriously
<karolherbst>
they don't care if they put systems under full load, they'll just throw more crawlers at it anyway
<karolherbst>
it's not just gitlab that gets crawled to death
<karolherbst>
but also pages like lwn
<Consolatis>
well, let them in that case. I think this is extremely hypothetical though. AI crawler operators have a goal; requesting the same page over and over again doesn't make sense for them and if that happens it's a bug. They are not simply "evil" and want to annoy people but rather want to achieve something
<karolherbst>
"hyptothetical"? Because you don't believe it's already happening, which it is
<karolherbst>
with your logic any website with only cheap resources shouldn't be crawled to death
<karolherbst>
but they are
<karolherbst>
you should just throw all your assumptions away
<karolherbst>
"crawler operators have a goal", yes, crawl every website as often as possible, as much as possible. That's the goal
<karolherbst>
they might not be evil, but they don't care if they DoS your webpage
<Consolatis>
i think their goal is more like "crawl as much of the internet as possible"
<karolherbst>
no
<karolherbst>
"as much and as often"
<Consolatis>
which contradicts "crawl a single website as often as possible"
<karolherbst>
crawling the same page 1000000 times? yes, they do that
<karolherbst>
you can pretend they don't, but...
<Consolatis>
do you have any links for that claim?
<karolherbst>
again.. don't apply your logic, you'll be wrong
<karolherbst>
I've talked with admins about it
<karolherbst>
like, I'm not making any assumptions here
<karolherbst>
I'm just telling what people see
<Consolatis>
so admins told you a single crawler (same set of IPs?) requested the same URLs over and over again?
<Consolatis>
e.g. a million times
sentry has joined #freedesktop
<karolherbst>
they hide behind a residential IP botnet...
<Consolatis>
right. so how do you know its the same crawler then?
<karolherbst>
fingerprinting
<Consolatis>
that just detects the software stack / family of crawler (if at all)
<Consolatis>
but it's the people paying for their use that have a goal; that is the metric
<karolherbst>
not in the mood to simply continue this argument if all you want is to be right. Maybe just accept that your assumptions simply don't hold true and that "resource optimizing endpoints" won't help at all, because low-overhead pages die under the weight just as much as gitlab does
<karolherbst>
I don't even understand why it matters to you that much whether your assumptions are true or not, because the end result is that everybody gets crawled to death regardless of what they run
<karolherbst>
but if you don't want to listen, then I'm not going to waste my time any further explaining it
snetry has quit [Ping timeout: 480 seconds]
<DragoonAethis>
Consolatis: I don't have links, but I do have Patchwork access logs, and yes, the same pages were hit over and over again
<DragoonAethis>
It's not redownloading the same page in a loop, it's more like multiple companies independently scraping the same (expensive to generate server-side) content, and then scheduling a crawl of all visible links, then refreshing their scrapes once every few days
<Consolatis>
karolherbst: I am just a bit annoyed by pointing fingers at a known "bad guy" while ignoring the actual technical reasons for the massive resource usage and resulting slowdowns, and not even considering fixing those because it supposedly won't help anyway since it would just make the bad guys do more bad guy stuff.
<Consolatis>
DragoonAethis: that makes sense, thanks for the insights
<Consolatis>
so in that case, if it were static content rather than expensive to generate, the resource exhaustion issue would mostly be solved by trading it for some disk (or memory) space
<Consolatis>
e.g. it would not increase the amount of requests for the same URL whatsoever just because it's now slightly faster from the crawler side