_whitelogger_ has quit [Remote host closed the connection]
_whitelogger_ has joined #etnaviv
_whitelogger has joined #etnaviv
alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
SpieringsAE has joined #etnaviv
lynxeye has joined #etnaviv
pjakobsson_ has joined #etnaviv
pjakobsson has quit [Remote host closed the connection]
<lynxeye>
tomeu: I don't see any way how such a race could happen. The submit ioctl takes a lot of care to copy all userspace data into kernel internal structure before even starting to validate the submission.
<lynxeye>
Do you have a devcoredump from one of the failing jobs? If so I would like to take a look at this.
<tomeu>
Yeah, and the user space data structures seem to be fine as well
<tomeu>
I don't have one, because it's very hard to reproduce now after the last fix
<tomeu>
If I add just some extra logging, it runs for hours without reproducing
<tomeu>
But I do see a batch faulting on an address that corresponds to a BO that is referenced in a submit ioctl that is currently being processed
<lynxeye>
Huh, I think I see a potential issue with missing barriers: until now I assumed that the mb() before making the cmds visible to the GPU also guaranteed that pagetable updates are visible, but that's not necessarily the case as the updates happen from the submit thread, but cmd submission happens from the scheduler thread. So there is a potential SMP issue there. Let me look into this.
mvlad has joined #etnaviv
pcercuei has joined #etnaviv
<tomeu>
after adding some more logging, I'm having trouble reproducing the issue in which a job accesses an address for a BO in the next batch
<tomeu>
but I'm seeing accesses to other values in the cmdstream, such as uniform values
<tomeu>
lynxeye: so I'm wondering if a submit ioctl can overwrite the cmd stream of a previously submitted job
<lynxeye>
tomeu: that should be impossible, the cmd buffer itself is also copied from userspace and put into a preallocated memory region in kernel memory.
<tomeu>
lynxeye: right, looks like we are overwriting that preallocated region:
<tomeu>
so submissions can overwrite a cmdstream that the HW is currently executing
<lynxeye>
tomeu: that ffff800083cad000 is cmdbuf.vaddr?
<tomeu>
yep, which is what gets executed
<tomeu>
I'm trying to reproduce with some logging on the reclaiming logic, but this is very sensitive to timing...

<lynxeye>
are you sure there is no irq in between signaling the job finish of the queued job?
<lynxeye>
maybe the NPU does actually signal job completion before it is really done
<lynxeye>
the PE->FE stall before the VIVS_GL_EVENT does make sure that we don't fire the irq before the GPU is really done with the job, but maybe that's insufficient for the NPU and it needs other means of synchronization there?
<tomeu>
I think there wasn't, but I have overwritten those logs and now it's taking a while to reproduce
<tomeu>
will give that a look
<tomeu>
grr, cannot reproduce at all now...
SpieringsAE has quit [Quit: SpieringsAE]
SpieringsAE has joined #etnaviv
SpieringsAE has quit []
<tomeu>
lynxeye: changing subjects: I see that Ahmad sent some changes regarding overdrive. Do I understand correctly that with mainline the NPU will run in overdrive on all/most imx8mp boards?
<lynxeye>
tomeu: Yes, most boards are using the overdrive SoC voltage anyway, as that's required for the DDR-4000 mode, so there is no point in running the peripherals at nominal clock rate.
<lynxeye>
What happens if there are more operations than what fits in this field? And why does this mutate to a fixed 0x1f when there is more than one TP operation per job?
<tomeu>
lynxeye: that is set only when one enables parallelism, which is disabled by default because it didn't work reliably, not even with the proprietary driver
<tomeu>
the proprietary driver also has that disabled by default
<tomeu>
some simpler models like mobilenet seemed to work, but more complex ones didn't
<tomeu>
maybe there is a HW bug that prevents enabling that by default
<lynxeye>
tomeu: Ah, okay. It just piqued my interest when looking for potential synchronization issues. First time looking at the ML code in any detail really.
mvlad has quit [Remote host closed the connection]