ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
mrpops2ko has quit []
mrpops2ko has joined #freedesktop
scrumplex_ has joined #freedesktop
scrumplex has quit [Ping timeout: 480 seconds]
ity has joined #freedesktop
<elibrokeit>
Anubis by default assumes that all user agents which don't mention "Mozilla" aren't scrapers, because scrapers always try to masquerade as interactive users :p
<elibrokeit>
(and because many things need to work with wget/curl, like downloading raw/patch files in a distro integration, and git clone needs to work, etc etc)
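A minimal sketch of the check described above, in shell for illustration; Anubis itself is written in Go and its real rule set is configurable, and the HTTP_USER_AGENT variable here is a stand-in for however a front end exposes the request's User-Agent header:

    # Challenge only clients claiming to be browsers; curl, git, wget
    # etc. don't send "Mozilla" by default and pass straight through.
    case "$HTTP_USER_AGENT" in
      *Mozilla*) echo "serve proof-of-work challenge" ;;
      *)         echo "pass request through unchallenged" ;;
    esac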
georgc has joined #freedesktop
<pinchartl>
how long until scrapers masquerade as git or curl or something else? :(
gchini has quit [Ping timeout: 480 seconds]
guludo has quit [Ping timeout: 480 seconds]
snetry has joined #freedesktop
sentry has quit [Ping timeout: 480 seconds]
m5zs7k has quit [Ping timeout: 480 seconds]
m5zs7k has joined #freedesktop
i-garrison has quit []
sima has joined #freedesktop
i-garrison has joined #freedesktop
tzimmermann has joined #freedesktop
sima has quit [Ping timeout: 480 seconds]
sima has joined #freedesktop
sima is now known as Guest21194
sima has joined #freedesktop
swatish2 has joined #freedesktop
alarumbe has quit []
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
georgc has quit [Quit: Leaving]
gchini has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
jsa1 has joined #freedesktop
ximion has quit [Remote host closed the connection]
<immibis>
pinchartl: most sites block curl
<dwfreed>
many sites that scrapers would genuinely be interested in tend to block curl, so it's unlikely a scraper would use a curl user agent by default, but it may start using one for specific sites to bypass blocks
<dwfreed>
especially if anubis continues its meteoric rise in popularity as a defense against the scrapers
swatish2 has quit [Ping timeout: 480 seconds]
mripard has joined #freedesktop
mripard_ has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
vimproved has quit [Ping timeout: 480 seconds]
<Mithrandir>
the logical conclusion is a payment system where you pay to download a page, covering the resources involved in generating and serving it.
vkareh has joined #freedesktop
vimproved has joined #freedesktop
<martink>
heya. Getting a "gitlab.exceptions.GitlabHttpError: 403: insufficient_scope" while trying to run bin/ci/ci_run_n_monitor.sh --pipeline --target here. What would be a good token scope for that?
<daniels>
martink: api
<martink>
daniels: works like a charm, thanks!
<daniels>
martink: np :)
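For reference, a quick sanity check that a token has enough scope before feeding it to the script; the project ID and token value below are placeholders, and it is the write calls such a tool makes (retrying/cancelling jobs) that need full "api" rather than "read_api":

    TOKEN=glpat-XXXXXXXXXXXXXXXXXXXX   # placeholder personal access token
    # Listing pipelines works with "api" or "read_api" scope; a token
    # with neither gets the 403 insufficient_scope error quoted above.
    curl --header "PRIVATE-TOKEN: $TOKEN" \
      "https://gitlab.freedesktop.org/api/v4/projects/<project-id>/pipelines"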
haaninjo has joined #freedesktop
Harry_ has joined #freedesktop
Harry_ has left #freedesktop [#freedesktop]
sima has quit [Remote host closed the connection]
Guest21194 has quit [Remote host closed the connection]
guludo has joined #freedesktop
gnuiyl has quit [Ping timeout: 480 seconds]
sima has joined #freedesktop
sima is now known as Guest21212
sima has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
tonyg has left #freedesktop [#freedesktop]
ximion has joined #freedesktop
sentry has joined #freedesktop
snetry has quit [Ping timeout: 480 seconds]
noodlez1232_ has joined #freedesktop
noodlez1232 has quit [Ping timeout: 480 seconds]
noodlez1232_ is now known as noodlez1232
Guest21212 has quit []
tzimmermann has quit [Quit: Leaving]
jsa1 has quit [Ping timeout: 480 seconds]
kasper93 has quit []
kasper93 has joined #freedesktop
cascardo_ has quit [Ping timeout: 480 seconds]
AbleBacon has joined #freedesktop
<svuorela>
are 'we' DoS'ing Canonical's servers or is someone standing on the internet hose?
<svuorela>
connect (101: Network is unreachable) Cannot initiate the connection to archive.ubuntu.com:80 (2620:2d:4000:1::102). - connect (101: Network is unreachable) Cannot initiate the connection to archive.ubuntu.com:80 (2620:2d:4002:1::103). - connect (101: Network is unreachable) Cannot initiate the connection to archive.ubuntu.com:80 (2620:2d:4000:1::101). - connect (101: Network is unreachable) Cannot initiate the connection to archive.ubuntu.com:80 (2620:2d:4000:1::103). - connect (101: Network is unreachable) Cannot initiate the connection to archive.ubuntu.com:80 (2620:2d:4002:1::102). - connect (101: Network is unreachable) [IP: 185.125.190.81 80]
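Most of the unreachable addresses above are IPv6 (the last one is v4), so one quick way to separate broken local v6 routing from trouble on Canonical's side is to force IPv4; Acquire::ForceIPv4 is a standard apt option:

    # Skip the IPv6 addresses and try only the v4 mirrors
    apt-get -o Acquire::ForceIPv4=true update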
cascardo has joined #freedesktop
jsa1 has joined #freedesktop
guludo has quit [Ping timeout: 480 seconds]
guludo has joined #freedesktop
ximion has quit [Remote host closed the connection]
digetx has quit [Quit: No Ping reply in 180 seconds.]
digetx has joined #freedesktop
<pinchartl>
Mithrandir: food for thought on the subject. I read a recent study focused on how aggressive blocking of aggressive crawlers means that LLMs are increasingly trained on lower-quality and predatory data (as in biased or entirely false for political goals), leading to increased dissemination of incorrect information
<pinchartl>
this probably doesn't apply much to blocking access to gitlab.fdo (or at least not in a way that would hurt beyond wasting the time of LLM users), but is an issue with scientific papers
<elibrokeit>
I have no problem with this lol
<pinchartl>
with LLMs increasingly pushing conspiracy theories? again, that won't come from deploying anubis on gitlab.fdo, but it's an important side effect of blocking crawlers globally
<karolherbst>
cloudflare now sells access for crawlers 🙃
<karolherbst>
and blocks the ones not paying, which isn't a terrible idea as long as the actual websites get most of the money
<elibrokeit>
I think we're long past the point where LLMs will ever not be pushing conspiracy theories; at least making them more blatant might either cause people to smell something stinky or cause any resulting documents to be more easily dismissed
<karolherbst>
twitter's LLM is already actively tampered with to push nonsense, soooo..
<karolherbst>
a side-effect of LLMs being pushed by the biggest assholes of the tech industry is that it's unavoidable that they get tampered with
<elibrokeit>
cloudflare a while back was majorly bragging about how instead of deploying mazes of junk data (iocaine, nepenthes, etc), they were explicitly deploying "mazes of useful data to protect your site without running the risk of poisoning the AI"
<karolherbst>
and cloudflare is a good judge of what's "useful data" because?
<elibrokeit>
"give the scraper something healthy to munch on"
<elibrokeit>
well I think their theory was to just put random pages of generic facts
<karolherbst>
right... it's a neat idea if you are willing to trust them
<karolherbst>
the bigger issue here isn't what "wrong" data they could show, but what "correct" data they won't
<Consolatis>
<karolherbst> cloudflare now sells access for crawlers and blocks the ones not paying
<Consolatis>
sounds more like an extortion scheme to me
<karolherbst>
as long as it's extorting the AI crawlers I don't care 🙃
<karolherbst>
the big issue is that they were externalizing their costs by forcing everybody to pay more for servers
<karolherbst>
if they now have to pay up for their access that only sounds fair
alanc has quit [Remote host closed the connection]
<pinchartl>
karolherbst: "as long as the actual websites get most of the money" <- that would surprise me
<karolherbst>
instead of deploying anubis, we could also just get 5 times the hardware
alanc has joined #freedesktop
<karolherbst>
pinchartl: doesn't even matter much, because it's cloudflare providing the hardware anyway
<karolherbst>
maybe customers just pay less for traffic in the end
<karolherbst>
or they choose smaller plans
<karolherbst>
in the end doesn't really matter if it's a pay out or just spending less
<elibrokeit>
> This pre-generated content is seamlessly integrated as hidden links on existing pages via our custom HTML transformation process, without disrupting the original structure or content of the page. Each generated page includes appropriate meta directives to protect SEO by preventing search engine indexing. We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling. To further minimize the impact to regular visitors, we ensured that these links are presented only to suspected AI scrapers, while allowing legitimate users and verified crawlers to browse normally.
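The markup such a transformation injects would presumably look something like the snippet below; this is a guess at its shape based on the quote, not Cloudflare's actual output, and the URL, attributes, and styling are invented:

    # Emit the kind of hidden, noindexed decoy link the quote describes
    cat <<'EOF'
    <meta name="robots" content="noindex, nofollow">
    <a href="/ai-maze/generic-facts-0001.html"
       style="display:none" aria-hidden="true" tabindex="-1">more facts</a>
    EOF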
<karolherbst>
it does sound like something easy to get around tho
<elibrokeit>
yeah I'm full of questions about this
<karolherbst>
maybe they underestimate how much the companies are willing to do to get their crawlers to crawl
guludo has quit [Ping timeout: 480 seconds]
guludo has joined #freedesktop
<karolherbst>
maybe it's the new way of earning money instead of ads lol
JanC is now known as Guest21241
JanC has joined #freedesktop
Guest21241 has quit [Ping timeout: 480 seconds]
airlied_ is now known as airlied
jsa1 has quit [Ping timeout: 480 seconds]
JanC is now known as Guest21246
JanC has joined #freedesktop
Guest21246 has quit [Ping timeout: 480 seconds]
vkareh has quit [Quit: WeeChat 4.6.3]
ximion has joined #freedesktop
JanC is now known as Guest21250
JanC has joined #freedesktop
Guest21250 has quit [Ping timeout: 480 seconds]
JanC is now known as Guest21251
JanC has joined #freedesktop
Guest21251 has quit [Ping timeout: 480 seconds]
JanC is now known as Guest21252
JanC has joined #freedesktop
Guest21252 has quit [Ping timeout: 480 seconds]
sima has quit [Ping timeout: 480 seconds]
<DemiMarie>
The CI situation is really annoying, especially when PipeWire tests fail locally, so I have no idea if my changes broke something or not.
<DemiMarie>
that said, having the CI create an ephemeral VM for each job (which is what github.com and gitlab.com do) might take more resources than fd.o can afford
<DemiMarie>
also, anything involving HW accel would still need to be gated
<daniels>
DemiMarie: I’m not sure where you mean by ‘the CI situation’
<DemiMarie>
usually what I do in this situation is push and let upstream CI run the tests in a known-working environment, but that only works when CI runs happen automatically
cascardo has quit [Ping timeout: 480 seconds]
JanC is now known as Guest21256
JanC has joined #freedesktop
Guest21256 has quit [Ping timeout: 480 seconds]
JanC is now known as Guest21257
JanC has joined #freedesktop
<whot>
DemiMarie: you can pull down the container the CI runs in and run your tests in that
<DemiMarie>
whot: what is the easiest way to find it?
<whot>
DemiMarie: line 7 of the job output :)
Guest21257 has quit [Ping timeout: 480 seconds]
<whot>
Where it says "using docker image...", you can e.g. podman run -it registry.freedesktop.org/demimarie/pipewire/fedora/41:2025-05-10.0
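Put together, the local-reproduction workflow looks roughly like this; the image path comes from your own job log, and the meson commands are illustrative, use whatever the job in your project's .gitlab-ci.yml actually runs:

    # Pull the exact image the CI job used and get a shell in it
    podman run -it --rm \
      registry.freedesktop.org/demimarie/pipewire/fedora/41:2025-05-10.0 bash
    # then, inside the container, run the job's commands, e.g.:
    meson setup build && meson test -C build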
<DemiMarie>
whot: makes sense
<whot>
DemiMarie: if you look at the libinput .gitlab-ci.yml, search for the build-in-vng@template, you can run the containers as a vm too
snetry has joined #freedesktop
sentry has quit [Ping timeout: 480 seconds]
guludo has quit [Ping timeout: 480 seconds]
<DemiMarie>
whot: how many people use this workflow?
<whot>
virtme-ng is used elsewhere (gstreamer?) but I don't recall which projects specifically
<pinchartl>
libcamera and linux-media both use virtme-ng in CI
<DemiMarie>
Curious: is the reason that fd.o CI doesn't automatically spin up a VM for each run the need to use devices, resource constraints, or something else?
<DemiMarie>
Or just that nobody has implemented it yet?
<whot>
it takes a lot more resources and for the vast majority of jobs it's not needed
<DemiMarie>
is this because the vast majority of jobs are by trusted people?
<whot>
spinning up a container to run e.g. ruff on the code base is a matter of seconds, spinning up a vm a matter of minutes. add to that that log collection etc. from a vm is a lot more difficult because gitlab CI is designed for containers, and you have an immediate bias towards containers
<whot>
the *only* reason why libinput uses vms is because the test suite is {un}fortunately designed to use uinput devices so we need it. if I could get rid of that requirement I'd drop the vms instantly
<DemiMarie>
whot: Does GitLab.com use VMs? I presume it does because it is a multi-tenant service with mutually distrusting users. If your VMs take minutes to start you have a misconfiguration somewhere.
<whot>
tbh I don't know the exact setup gitlab.com uses internally but runners can be configured at various levels of distrust and ours is relatively permissive
haaninjo has quit [Quit: Ex-Chat]
<daniels>
gitlab.com charges you for use
<elibrokeit>
I expect that gitlab.com is running a fleet of VMs that you don't actually have insight into; they restart them as fast as they can, and all you see is when your job gets picked up by one
<elibrokeit>
it's not like they assign you a VM and then boot the VM just for you