tomba has quit [Quit: ZNC 1.9.0+deb2build3 - https://znc.in]
tomba has joined #dri-devel
hikiko has joined #dri-devel
hikiko_ has quit [Ping timeout: 480 seconds]
pcercuei has joined #dri-devel
coldfeet has quit [Quit: Lost terminal]
vliaskov_ has quit [Ping timeout: 480 seconds]
kts has quit [Ping timeout: 480 seconds]
gouchi has joined #dri-devel
warpme has joined #dri-devel
kts has joined #dri-devel
ammen99 has quit [Remote host closed the connection]
bolson has joined #dri-devel
tobiasjakobi has joined #dri-devel
tobiasjakobi has quit []
JRepinc has quit [Ping timeout: 480 seconds]
kts has quit [Ping timeout: 480 seconds]
FAQ_ has joined #dri-devel
JRepinc has joined #dri-devel
warpme has quit []
kts has joined #dri-devel
Jeremy_Rand_Talos has joined #dri-devel
nerdopolis has joined #dri-devel
Alisa[m] has joined #dri-devel
coldfeet has joined #dri-devel
JRepin has joined #dri-devel
JRepinc has quit [Read error: Connection reset by peer]
nerdopolis has quit [Ping timeout: 480 seconds]
Daanct12 has quit [Quit: WeeChat 4.6.3]
hikiko_ has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
hikiko has quit [Ping timeout: 480 seconds]
asrivats__ has joined #dri-devel
kts has joined #dri-devel
asrivats__ has quit [Ping timeout: 480 seconds]
nerdopolis has joined #dri-devel
gouchi has quit [Remote host closed the connection]
kts has quit [Ping timeout: 480 seconds]
nerdopolis has quit [Ping timeout: 480 seconds]
pcercuei has quit [Remote host closed the connection]
Alisa[m] has quit [autokilled: This host violated network policy. Contact support@oftc.net for further information and assistance. (2025-05-31 14:30:44)]
<karolherbst>
robclark: sure, but the entire profiling code needs to be reworked anyway
<robclark>
ok, as long as $new_thing dtrt
<karolherbst>
the thing is that start/end events are about when the GPU starts/ends processing commands
<karolherbst>
and I don't really want to read those values out on the CPU
<karolherbst>
should use `get_query_result_resource` instead of `get_query_result`
<karolherbst>
and have a slab-allocated heap that I can just throw the results into
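(For context, a minimal sketch of the Gallium call being proposed here, assuming the `get_query_result_resource` hook as in Mesa's p_context.h; `q`, `result_buf`, and `offset` are hypothetical names:)
```c
/* Ask the GPU to write the raw 64-bit result of `q` into result_buf at
 * a byte offset, with no CPU-side wait.  Per p_context.h: flags 0 means
 * don't wait, index 0 selects the query value itself (index -1 would
 * store availability instead). */
pipe->get_query_result_resource(pipe, q,
                                0,                    /* no PIPE_QUERY_WAIT */
                                PIPE_QUERY_TYPE_U64,  /* raw u64 result */
                                0,                    /* value, not availability */
                                result_buf, offset);
```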
<robclark>
well, app is going to _eventually_ read them out on CPU.. I guess you could use the get_query_result_resource() path, but that is a pita for me since I can't properly convert ticks to ms on the gpu without spinning up a compute shader ;-)
<robclark>
so for timestamp queries I have to kinda fake it for get_query_result_resource()
<karolherbst>
mhhh
<robclark>
if you just kept a list/vector/whatever of the queries and then did readback after flush it wouldn't be so bad
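(A sketch of that alternative, assuming standard Gallium `get_query_result()` semantics; the `deferred_query` struct and the readback helper are hypothetical:)
```c
/* Accumulate query objects as commands are recorded... */
struct deferred_query {
   struct pipe_query *start, *end;
};

/* ...then, after flushing, read them back in one pass; wait=true only
 * blocks until each result is actually available. */
static void read_back_deferred(struct pipe_context *pipe,
                               struct deferred_query *dq, unsigned n)
{
   for (unsigned i = 0; i < n; i++) {
      union pipe_query_result start, end;
      pipe->get_query_result(pipe, dq[i].start, true, &start);
      pipe->get_query_result(pipe, dq[i].end, true, &end);
      /* start.u64 / end.u64 now hold the timestamps */
   }
}
```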
<karolherbst>
the thing is.. I just submit commands, and I need to know when the GPU starts/ends processing a command, and a query object thrown into the command stream is generally how it should be (tm). Though I can see that on tilers things are different (tm)
<robclark>
for compute, fortunately tiling isn't a thing.. it's really just about not being able to convert to ms on the CP
<karolherbst>
it doesn't have to be ms or something afaik
nerdopolis has joined #dri-devel
<robclark>
or us or something.. I forget what units the result is defined in, but it is a unit of time, not ticks
<karolherbst>
oh actually.. the spec wants it in ns
<robclark>
ahh, right
dsimic is now known as Guest17075
dsimic has joined #dri-devel
<robclark>
hmm, although I wonder if I could get something added to fw.. we really just need to multiply by (approx) 52
Guest17075 has quit [Ping timeout: 480 seconds]
<robclark>
I guess 52 is close enough to 52.083333333
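(That 52.083333333 comes out exactly if the counter ticks at 19.2 MHz, which is an assumption here: 10^9 / (19.2 * 10^6) = 625/12. A sketch of the exact integer conversion:)
```c
#include <stdint.h>

/* ticks -> ns for an assumed 19.2 MHz always-on counter:
 * 1e9 / 19.2e6 = 625/12 = 52.08333..., so multiplying before dividing
 * keeps the division exact (overflows only past ~3e16 ticks, i.e.
 * decades of uptime). */
static inline uint64_t ticks_to_ns(uint64_t ticks)
{
   return ticks * 625 / 12;
}
```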
<robclark>
still, it would be easier to do on cpu
<karolherbst>
robclark: anyway.. the idea was to insert a timestamp query object, do a bunch of gallium commands, do the second timestamp query and just read out the results once the GPU is done
<robclark>
yeah, and as you pointed out one way to do that is get_query_result_resource().. the other is to track the query objects
<karolherbst>
how are queries implemented for you anyway? Is it like a command stream thing where the GPU writes a timestamp into a location at some point in time, or is it more like.. done on the cpu side?
<robclark>
it's on the GPU
<karolherbst>
okay
<karolherbst>
yeah, when it's writing the values asynchronously to a buffer I can map later, that's perfect. We _could_ do CPU-side fix-ups of the values
<karolherbst>
what's important is that I don't want to stall the pipeline with busy waits
<karolherbst>
as I'm doing atm
<robclark>
right
<robclark>
well, with get_query_result_resource() you eventually stall on the result resource
<karolherbst>
but I could apply a factor before reporting back the values
<karolherbst>
well sure
<robclark>
so it amounts to the same thing.. but I could see get_query_result_resource() being easier to implement
<karolherbst>
I can map without stalling
<robclark>
yeah, I guess we could add a pipe cap to adjust the result on the cpu
<karolherbst>
PIPE_MAP_UNSYNCHRONIZED or something
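(A sketch of what that map would look like with the u_inlines.h helper; this is only safe if a fence wait has already guaranteed the GPU writes landed, and the slot layout is hypothetical:)
```c
/* PIPE_MAP_UNSYNCHRONIZED skips the implicit wait-for-GPU, so this
 * never stalls; the caller takes responsibility for ordering. */
struct pipe_transfer *xfer;
const uint64_t *vals =
   pipe_buffer_map(pipe, result_buf,
                   PIPE_MAP_READ | PIPE_MAP_UNSYNCHRONIZED, &xfer);
if (vals) {
   uint64_t t0 = vals[0];   /* hypothetical slot 0 = first timestamp */
   pipe_buffer_unmap(pipe, xfer);
}
```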
<robclark>
sure, but I guess you want to actually _have_ the result on the CPU at some point ;-)
<karolherbst>
yeah... needs to flush the thing at some point
<karolherbst>
but.
<karolherbst>
you can also copy to a second resource
<karolherbst>
and map that one without stalling :D
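(That trick as a sketch, assuming `resource_copy_region` and a host-visible `staging` buffer; the copy rides along in the command stream, so only `staging` ever gets mapped:)
```c
/* GPU-side copy of the result range into a staging buffer; the main
 * result_buf stays with the real workload and is never mapped. */
struct pipe_box box;
u_box_1d(0, size_in_bytes, &box);            /* util/u_box.h helper */
pipe->resource_copy_region(pipe, staging, 0, /* dst, level */
                           0, 0, 0,          /* dstx/y/z */
                           result_buf, 0,    /* src, level */
                           &box);
```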
<robclark>
only for readback of result on the GPU
<karolherbst>
there are a few tricks for how the actual important main work can be left alone doing its stuff
<robclark>
(idk if cl can do that)
<karolherbst>
oh the CL API doesn't care about how it's implemented really
<robclark>
if you are reading back on the CPU, then you have to wait
<karolherbst>
it just gives you raw values
<robclark>
right, but on the _CPU_
<karolherbst>
sure
<robclark>
so unless you invent time travel, it needs to wait on GPU somewhere ;-)
<robclark>
(but moar fps via time travel would be a neat trick)
<karolherbst>
not really if you e.g. tell the GPU to write the results into a coherent/staging buffer
<robclark>
sure, but cpu needs to read it after gpu writes it
<karolherbst>
right, you can't prevent that one :)
<karolherbst>
but atm, we do the read after each event is processed, then stall the GPU, then execute the next event
<karolherbst>
that just stalls the CL queue all the time
<karolherbst>
an "event" here is like a cl queue command
<robclark>
right, either deferring the query object read to getProfilingInfo() or stalling and reading the result rsc in getProfilingInfo() would amount to the same thing
<robclark>
it would stall until result is avail if it isn't already
<karolherbst>
nah, it's done on different threads
<karolherbst>
there is a queue thread working through the commands
<robclark>
gpu doesn't care so much about cpu threads
<karolherbst>
and that's stalling the GPU side of things with the current implementation constantly
<robclark>
sure, ofc
<karolherbst>
with get_query_result_resource the GPU would just write the result into some buffer while working through the commands
<karolherbst>
and then at some point it gets read out once the queue is flushed/finished or so
<karolherbst>
but that happens on a different thread then and wouldn't bother the queue one
<robclark>
but if that thread just pushed the query objects to some data structure, and deferred get_query_result() until getProfilingInfo() is called.. then you don't stall any more than you would with the get_query_result_resource() approach
<karolherbst>
getProfilingInfo doesn't call into get_query_result
<karolherbst>
the get_query_result happens way earlier
<robclark>
right, that is the problem
<robclark>
oh, but I guess you might need extra locking with my approach to avoid calling into ctx on multiple threads
<karolherbst>
yeah, and instead of get_query_result, I want to use get_query_result_resource so it's not constantly waiting on the GPU. And getProfilingInfo simply reads from the buffer instead of the temporary values the results of get_query_result were written to
<karolherbst>
there is already a bit of indirection going on there, because things are already cursed enough
<karolherbst>
well.. that's why I want to map unsynchronized or something, so I can just map on a different context
<karolherbst>
need to figure out the details at some point
<karolherbst>
maybe I just do a resource_from_user_memory thing...
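(Rough shape of that, assuming the pipe_screen `resource_from_user_memory` hook; the bind flag and alignment are guesses:)
```c
#include <stdlib.h>

struct pipe_resource templ = {
   .target     = PIPE_BUFFER,
   .format     = PIPE_FORMAT_R8_UNORM,
   .width0     = 4096,
   .height0    = 1,
   .depth0     = 1,
   .array_size = 1,
   .bind       = PIPE_BIND_QUERY_BUFFER, /* assumption: QBO-style use */
};

/* Page-aligned host memory the CPU side already owns and can read at
 * any time... */
void *host_mem = aligned_alloc(4096, 4096);

/* ...wrapped as a pipe_resource the GPU can write query results into. */
struct pipe_resource *res =
   screen->resource_from_user_memory(screen, &templ, host_mem);
```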
<robclark>
yeah, for the result rsc approach, that would work, because you can wait on fence on any thread
<karolherbst>
yeah
<karolherbst>
just need to wait until the GPU is actually done, maybe make sure the results are flushed, but then it should work in principle
<robclark>
let me look into whether CP_TICKS_TO_NS is something that I can talk someone into.. I guess it at least has a non-zero chance now..
<karolherbst>
could also just collect all the query results whenever I had to wait on a fence anyway
<robclark>
and it would be useful for qbo
<karolherbst>
the factor could default to 1
<karolherbst>
it's a bit more problematic with GL, because you can hand out a GPU resource to applications with the raw data, no?
<karolherbst>
or well.. have it written to a memory object
<robclark>
right, right now I just fail the big gl qbo timestamp/elapsed tests
<robclark>
which CP_TICKS_TO_NS would help with
<karolherbst>
yeah..
<karolherbst>
anyway.. it's harder to get all the gl bits correct here. I can just apply a factor if needed, that's not really a big issue
<robclark>
yeah, I guess we could do that as the workaround if we had to (or at least do that when fw is too old)
<karolherbst>
mhh, I think I just understood what you were trying to explain earlier 🙃.. I guess I could move the `get_query_result` calls to a later place and only do that after waiting on related fences anyway...
gouchi has joined #dri-devel
<karolherbst>
anyway, that would also require rewriting most of the profiling code, for weird reasons
<robclark>
yeah.. but if you are calling into pctx on a different thread from app thread, the threading might be a bit awkward
<robclark>
but other than that detail the two approaches are the "same"
<karolherbst>
yeah... atm PipeQuery stores a reference to the Context, and for rust reasons it would be a bit painful to delay reading it out. So writing into a resource would get around that part
<karolherbst>
because then I won't have to keep the query object around
gouchi has quit []
<karolherbst>
And using host visible memory mapped into the GPU might just make everything trivial enough to handle
<robclark>
yeah, you'd have to wait on a fence but no threading constraints there
<karolherbst>
or it's a persistent mapping and the event object just gets a pointer into a slice allocated for it
<karolherbst>
and then it's just reading from a pointer (after a flush/wait) and nothing else matters
<karolherbst>
anyway...
<karolherbst>
I have ideas, and I'd just need to figure out what I like the most
<karolherbst>
I think I like the idea of using a coherent + persistent mapping thing, because then I don't have to bother on the CPU side with copying values around and the results just appear at the right location at some point
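(Putting that preferred variant together as a sketch, assuming standard Gallium persistent/coherent mapping semantics; the slot bookkeeping is hypothetical:)
```c
/* Map once, keep the pointer for the life of the buffer. COHERENT means
 * GPU writes become visible without explicit flushes; PERSISTENT keeps
 * the mapping valid while the GPU keeps using the buffer. */
struct pipe_transfer *xfer;
uint64_t *slots = pipe_buffer_map(pipe, result_buf,
                                  PIPE_MAP_READ | PIPE_MAP_PERSISTENT |
                                  PIPE_MAP_COHERENT, &xfer);

/* Per event: point the query result at this event's slot. */
unsigned slot = 0; /* hypothetical slab-allocated slot index */
pipe->get_query_result_resource(pipe, q, 0, PIPE_QUERY_TYPE_U64, 0,
                                result_buf, slot * sizeof(uint64_t));

/* After screen->fence_finish(...) confirms the GPU is done, the value
 * has simply appeared at the right location. */
uint64_t timestamp = slots[slot];
```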
coldfeet has joined #dri-devel
chamlis has quit [Remote host closed the connection]