alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
lynxeye has joined #etnaviv
<tomeu>
marex: lynxeye: I'm seeing MMU faults and GPU hangs after the automatic cmdstream flush after the 64k limit is reached.
<tomeu>
if I increase the limit so no cmdstream split happens, things seem solid in my testing so far
<tomeu>
do you have any ideas on what could be causing that?
<lynxeye>
tomeu: Do you re-emit all the buffer bindings after the flush happens? The 3D driver does this by dirtying all state when the flush callback is called.
<lynxeye>
If you don't emit all the necessary relocs after the flush, the kernel driver won't know that you are using the buffers in the commandstream after the flush.
<tomeu>
lynxeye: you mean etna_cmd_stream_ref_bo ?
<tomeu>
ah, I can test that
<tomeu>
looks promising, thanks!
<lynxeye>
You really need to re-emit all states that contain addresses. If the cmdstream is split, a job from another context/process/whatever may run or the GPU/NPU may even enter runtime PM in between your two submits. So after the flush you must not assume that the core has kept any of the previous state.
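[editor's note] A minimal, self-contained sketch of the pattern lynxeye describes, assuming a toy command stream rather than the real etnaviv one. Every name here (toy_stream, toy_flush, toy_emit_op, toy_run) and the 8-word limit are hypothetical and chosen only for illustration; none of this is the Mesa or kernel API. The model: an automatic flush starts a new submit with an empty buffer-reference list, so an op that relies on bindings emitted before the flush uses buffers the new submit never told the kernel about. Dirtying all state in the flush path forces the bindings to be emitted again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STREAM_LIMIT_WORDS 8 /* tiny stand-in for the real size limit */
#define MAX_BOS 4

/* Hypothetical toy model; not the etnaviv data structures. */
struct toy_stream {
    uint32_t words[STREAM_LIMIT_WORDS];
    size_t len;
    bool bo_referenced[MAX_BOS]; /* BOs known to the current submit */
    bool state_dirty;            /* bindings must be (re-)emitted */
    bool redirty_on_flush;       /* does the flush path dirty all state? */
    unsigned submits;
    unsigned unreferenced_ops;   /* ops using BOs the submit never referenced */
};

/* Submit what has been recorded and start a fresh stream. The new
 * submit starts with no buffer references at all. */
static void toy_flush(struct toy_stream *s)
{
    s->submits++;
    s->len = 0;
    memset(s->bo_referenced, 0, sizeof(s->bo_referenced));
    s->state_dirty = s->redirty_on_flush;
}

/* Automatic flush when the next command would exceed the limit. */
static void toy_reserve(struct toy_stream *s, size_t n)
{
    assert(n <= STREAM_LIMIT_WORDS);
    if (s->len + n > STREAM_LIMIT_WORDS)
        toy_flush(s);
}

/* Emit one op that reads the given BOs. Space for bindings plus the op
 * is reserved up front so they always land in the same submit. */
static void toy_emit_op(struct toy_stream *s, const unsigned *bos, size_t nbos)
{
    bool all_referenced = true;

    toy_reserve(s, nbos + 1);

    if (s->state_dirty) {
        for (size_t i = 0; i < nbos; i++) {
            s->bo_referenced[bos[i]] = true;      /* stands in for a reloc */
            s->words[s->len++] = 0x1000 + bos[i]; /* fake address state */
        }
        s->state_dirty = false;
    }

    s->words[s->len++] = 0xC0DE; /* fake op */

    for (size_t i = 0; i < nbos; i++)
        if (!s->bo_referenced[bos[i]])
            all_referenced = false;
    if (!all_referenced)
        s->unreferenced_ops++;
}

/* Record nops ops and return how many ran without valid references. */
static unsigned toy_run(bool redirty_on_flush, unsigned nops)
{
    struct toy_stream s = { .state_dirty = true,
                            .redirty_on_flush = redirty_on_flush };
    const unsigned bos[] = { 0, 1 };

    for (unsigned i = 0; i < nops; i++)
        toy_emit_op(&s, bos, 2);
    toy_flush(&s);
    return s.unreferenced_ops;
}
```

In this toy model, toy_run(true, 10) reports no unreferenced ops, while toy_run(false, 10) reports some as soon as the first automatic flush happens. The sketch only covers missing buffer references; it does not model the core losing register state between submits.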
<tomeu>
hmm, ok
<tomeu>
so I guess this automatic flush thing doesn't make much sense with the NPU command streams, because most commands are just addresses
<tomeu>
though I had been seeing this even with runtime PM disabled
<tomeu>
what is the rationale for that 128kb limit for the command stream?
<tomeu>
I'm seeing this problem with a really big model
<lynxeye>
tomeu: commands are allocated from a shared contig region, so we don't want individual jobs to hog the space
<tomeu>
when I split each operation (with ETNA_MESA_DEBUG=npu_no_batching) the problem doesn't reproduce
<tomeu>
wonder if I'm hitting some race in the kernel, which the splitting just makes more likely
<tomeu>
actually, scratch that, I have reproduced with ETNA_MESA_DEBUG=npu_no_batching
<tomeu>
something interesting is that I see these faults only when the 6 parallel processes start running, in their first submits
<tomeu>
after the first reset, the processes keep executing inferences without problems
pcercuei has joined #etnaviv
chewitt has quit [Quit: Zzz..]
szemzoa has joined #etnaviv
alarumbe has joined #etnaviv
SpieringsAE has joined #etnaviv
<lynxeye>
tomeu: Are any of the buffers only kept alive via the submit (resource already destroyed on the driver level)? In that case the kernel driver might drop the BO and MMU mapping after the job is done. If you have some data stuck in write caches, the writeback on cache replacement may cause MMU faults on the next job. That's why the kernel driver flushes all known write caches after each job, but I'm not sure if the flushing we do is sufficient for the NPU, as I only ever looked at the 2D/3D side of things.
SpieringsAE has quit []
lynxeye has quit [Quit: Leaving.]
tlwoerner has joined #etnaviv
mvlad has quit [Remote host closed the connection]