_whitelogger_ has quit [Remote host closed the connection]
_whitelogger_ has joined #etnaviv
_whitelogger has joined #etnaviv
alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
SpieringsAE has joined #etnaviv
lynxeye has joined #etnaviv
pjakobsson_ has joined #etnaviv
pjakobsson has quit [Remote host closed the connection]
<lynxeye>
tomeu: I don't see any way how such a race could happen. The submit ioctl takes a lot of care to copy all userspace data into kernel internal structure before even starting to validate the submission.
<lynxeye>
Do you have a devcoredump from one of the failing jobs? If so I would like to take a look at this.
<tomeu>
Yeah, and the user space data structures seem to be fine as well
<tomeu>
I don't have one, because it's very hard to reproduce now after the last fix
<tomeu>
If I add just some extra logging, it runs for hours without reproducing
<tomeu>
But I do see a batch faulting on an address that corresponds to a BO that is referenced in a submit ioctl that is currently being processed
<lynxeye>
Huh, I think I see a potential issue with missing barriers: until now I assumed that the mb() before making the cmds visible to the GPU also guaranteed that pagetable updates are visible, but that's not necessarily the case as the updates happen from the submit thread, but cmd submission happens from the scheduler thread. So there is a potential SMP issue there. Let me look into this.
mvlad has joined #etnaviv
pcercuei has joined #etnaviv
<tomeu>
after adding some more logging, I'm having trouble reproducing the issue in which a job accesses an address for a BO in the next batch
<tomeu>
but I'm seeing accesses to other values in the cmdstream, such as uniform values
<tomeu>
lynxeye: so I'm wondering if a submit ioctl can overwrite the cmd stream of a previously submitted job
<lynxeye>
tomeu: that should be impossible, the cmd buffer itself is also copied from userspace and put into a preallocated memory region in kernel memory.
<tomeu>
lynxeye: right, looks like we are overwriting that preallocated region:
<tomeu>
so submissions can overwrite a cmdstream that the HW is currently executing
<lynxeye>
tomeu: that ffff800083cad000 is cmdbuf.vaddr?
<tomeu>
yep, which is what gets executed
<tomeu>
I'm trying to reproduce with some logging on the reclaiming logic, but this is very sensitive to timing...

<lynxeye>
are you sure there is no irq in between signaling the job finish of the queued job?
<lynxeye>
maybe the NPU does actually signal job completion before it is really done
<lynxeye>
the PE->FE stall before the VIVS_GL_EVENT does make sure that we don't fire the irq before the GPU is really done with the job, but maybe that's insufficient for the NPU and it needs other means of synchronization there?
<tomeu>
I think there wasn't, but I have overwritten those logs and now it's taking a while to reproduce
<tomeu>
will give that a look
<tomeu>
grr, cannot reproduce at all now...
SpieringsAE has quit [Quit: SpieringsAE]
SpieringsAE has joined #etnaviv
SpieringsAE has quit []
<tomeu>
lynxeye: changing subjects: I see that Ahmad sent some changes regarding overdrive. Do I understand correctly that with mainline the NPU will run in overdrive on all/most imx8mp boards?
<lynxeye>
tomeu: Yes, most boards are using the overdrive SoC voltage anyway, as that's required for the DDR-4000 mode, so there is no point in running the peripherals at nominal clock rate.
<lynxeye>
What happens if there are more operations than what fits in this field? And why does this mutate to a fixed 0x1f when there is more than one TP operation per job?
<tomeu>
lynxeye: that is set only when one enables parallelism, which is disabled by default because it didn't work reliably, not even with the proprietary driver
<tomeu>
the proprietary driver also has that disabled by default
<tomeu>
some simpler models like mobilenet seemed to work, but more complex ones didn't
<tomeu>
maybe there is a HW bug that prevents enabling that by default
<lynxeye>
tomeu: Ah, okay. It just piqued my interest when looking for potential synchronization issues. First time looking at the ML code in any detail really.
mvlad has quit [Remote host closed the connection]