#etnaviv on 2025-04-25 — irc logs at oftc.catirclogs.org

2024-07-16 04:52 ChanServ changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://oftc.irclog.whitequark.org/etnaviv

00:50 shoragan_ has joined #etnaviv

00:50 shoragan has quit [Ping timeout: 480 seconds]

01:02 alarumbe has joined #etnaviv

06:10 pjakobsson has joined #etnaviv

06:15 szemzoa has quit []

06:58 chewitt has joined #etnaviv

07:19 mvlad has joined #etnaviv

07:29 alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]

07:40 lynxeye has joined #etnaviv

08:16 <tomeu> marex: lynxeye: I'm seeing MMU faults and GPU hangs after the automatic cmdstream flush after the 64k limit is reached.

08:16 <tomeu> if I increase the limit so no cmdstream split happens, things seem solid in my testing so far

08:17 <tomeu> do you have any ideas on what could be causing that?

08:18 <lynxeye> tomeu: Do you re-emit all the buffer bindings after the flush happens? The 3D driver does this by dirtying all state when the flush callback is called.

08:19 <lynxeye> If you don't emit all the necessary relocs after the flush, the kernel driver won't know that you are using the buffers in the commandstream after the flush.

08:19 <tomeu> lynxeye: you mean etna_cmd_stream_ref_bo ?

08:19 <tomeu> ah, I can test that

08:20 <tomeu> looks promising, thanks!

08:22 <lynxeye> You really need to re-emit all states that contain addresses. If the cmdstream is split, a job from another context/process/whatever may run or the GPU/NPU may even enter runtime PM in between your two submits. So after the flush you must not assume that the core has kept any of the previous state.
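
A toy sketch of the re-emit rule lynxeye describes: when the stream hits its size limit and is flushed, all address-bearing state is marked dirty and written again (with its relocs) at the start of the next submit. Everything here (`toy_stream`, `toy_flush`, `toy_emit_op`, the 8-word limit) is hypothetical and only models the idea; it is not the etnaviv or Mesa API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STREAM_LIMIT_WORDS 8  /* tiny stand-in for the real size limit */
#define NUM_ADDR_STATES 2     /* hypothetical number of address-bearing states */

struct toy_stream {
    uint32_t words[STREAM_LIMIT_WORDS];
    size_t used;               /* words written in the current submit */
    unsigned submits;          /* number of flushes performed so far */
    unsigned relocs_in_submit; /* relocs recorded in the current submit */
    int dirty;                 /* nonzero: address states must be re-emitted */
    uint32_t addr_state[NUM_ADDR_STATES];
};

void toy_init(struct toy_stream *s)
{
    memset(s, 0, sizeof(*s));
    s->dirty = 1; /* nothing has been emitted yet */
    for (size_t i = 0; i < NUM_ADDR_STATES; i++)
        s->addr_state[i] = 0x80000000u + (uint32_t)i * 0x1000u;
}

/* Models a submit: the stream is emptied and no hardware state is
 * assumed to survive, so everything is marked dirty. */
void toy_flush(struct toy_stream *s)
{
    s->submits++;
    s->used = 0;
    s->relocs_in_submit = 0;
    s->dirty = 1;
}

/* Writes every address-bearing state and records one reloc for each. */
static void toy_emit_addr_states(struct toy_stream *s)
{
    assert(s->used + NUM_ADDR_STATES <= STREAM_LIMIT_WORDS);
    for (size_t i = 0; i < NUM_ADDR_STATES; i++) {
        s->words[s->used++] = s->addr_state[i];
        s->relocs_in_submit++;
    }
    s->dirty = 0;
}

/* Emits one op word, flushing first if it would not fit. */
void toy_emit_op(struct toy_stream *s, uint32_t op)
{
    size_t needed = 1 + (s->dirty ? NUM_ADDR_STATES : 0);

    if (s->used + needed > STREAM_LIMIT_WORDS)
        toy_flush(s);
    if (s->dirty)
        toy_emit_addr_states(s);
    assert(s->used < STREAM_LIMIT_WORDS);
    s->words[s->used++] = op;
}
```

In this model, emitting ten ops triggers one automatic flush, and the submit that follows it again begins with both address states and their relocs.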

08:38 <tomeu> hmm, ok

08:39 <tomeu> so I guess this automatic flush thing doesn't make much sense with the NPU command streams, because most commands are just addresses

08:39 <tomeu> though I had been seeing this even with runtime PM disabled

08:40 <tomeu> what is the rationale for that 128kb limit for the command stream?

08:40 <tomeu> I'm seeing this problem with a really big model

08:59 <lynxeye> tomeu: commands are allocated from a shared contig region, so we don't want individual jobs to hog the space

09:07 <tomeu> lynxeye: I have tested splitting my batches along operation boundaries, which should be fine regarding state changes and relocs, but for some reason I still see the same faults: https://gitlab.freedesktop.org/tomeu/mesa/-/commit/612c54a737abda4db003b4fc6cdfa29ddf9a0f71
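
One way to picture splitting along operation boundaries is a greedy grouping that never cuts an operation in half. This is only a sketch under assumed names (`split_on_op_boundaries`, sizes counted in words); it is not the code from the linked commit.

```c
#include <assert.h>
#include <stddef.h>

/* Groups operations into batches of at most limit_words each, closing a
 * batch only between operations. batch_end[k] receives the index one past
 * the last operation of batch k. Returns the number of batches, or -1 if a
 * single operation exceeds the limit or batch_end is too small. */
int split_on_op_boundaries(const size_t *op_words, size_t n_ops,
                           size_t limit_words,
                           size_t *batch_end, size_t max_batches)
{
    size_t n_batches = 0;
    size_t used = 0;

    assert(op_words != NULL || n_ops == 0);
    assert(batch_end != NULL || max_batches == 0);

    for (size_t i = 0; i < n_ops; i++) {
        if (op_words[i] > limit_words)
            return -1; /* this operation can never fit in one batch */
        if (used + op_words[i] > limit_words) {
            /* Close the current batch just before operation i. */
            if (n_batches == max_batches)
                return -1;
            batch_end[n_batches++] = i;
            used = 0;
        }
        used += op_words[i];
    }
    if (n_ops > 0) {
        /* Close the final, possibly partial, batch. */
        if (n_batches == max_batches)
            return -1;
        batch_end[n_batches++] = n_ops;
    }
    return (int)n_batches;
}
```

Each batch produced this way would still need its address-bearing state and relocs emitted at its start, since nothing is assumed to carry over between submits.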

09:07 <tomeu> when I split each operation (with ETNA_MESA_DEBUG=npu_no_batching) the problem doesn't reproduce

09:08 <tomeu> wonder if I'm hitting some race in the kernel, which the splitting just makes more likely

09:11 <tomeu> actually, scratch that, I have reproduced with ETNA_MESA_DEBUG=npu_no_batching

09:14 <tomeu> something interesting is that I see these faults only when the 6 parallel processes start running, in their first submits

09:14 <tomeu> after the first reset, the processes keep executing inferences without problems

09:22 pcercuei has joined #etnaviv

10:39 chewitt has quit [Quit: Zzz..]

11:10 szemzoa has joined #etnaviv

12:59 alarumbe has joined #etnaviv

13:26 SpieringsAE has joined #etnaviv

13:58 <lynxeye> tomeu: Are any of the buffers only kept alive via the submit (resource already destroyed on the driver level)? In that case the kernel driver might drop the BO and MMU mapping after the job is done. If you have some data stuck in write caches, the writeback on cache replacement may cause MMU faults on the next job. That's why the kernel driver flushes all known write caches after each job, but I'm not sure if the flushing we do is sufficient for the NPU, as I only ever looked at the 2D/3D side of things.

14:46 SpieringsAE has quit []

17:03 lynxeye has quit [Quit: Leaving.]

17:17 tlwoerner has joined #etnaviv

20:23 mvlad has quit [Remote host closed the connection]

23:07 pcercuei has quit [Quit: dodo]