alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
pjakobsson has quit [Ping timeout: 480 seconds]
pjakobsson has joined #etnaviv
mwalle has joined #etnaviv
frieder has joined #etnaviv
lynxeye has joined #etnaviv
mvlad has joined #etnaviv
pcercuei has joined #etnaviv
alarumbe has joined #etnaviv
<tomeu>
lynxeye: I think the splits might just increase the probability of hitting some race
<tomeu>
and regarding why it happens only when the parallel processes are starting up, could it be related to the powering on or the setting up of the MMU for each of these processes?
<tomeu>
I'm going now through the logs and trying to find any commonalities between the runs that triggered the fault
<tomeu>
btw, sometimes it's a "slave not present" fault, sometimes it's a "page not present" fault
<tomeu>
of 61 runs that triggered this, I get this distribution among fault type and address:
<tomeu>
46 (page not present) addr 0xfae50000
<tomeu>
15 (slave not present) addr 0xfa3a7000
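A tally like the one above can be produced by counting (fault type, address) pairs. The following is a minimal sketch, not the script tomeu used: `tally_faults` is a hypothetical helper, and the input list is built by hand from the counts quoted above rather than parsed from real kernel logs.

```python
from collections import Counter


def tally_faults(faults):
    """Count each (fault type, address) pair, most frequent first."""
    return Counter(faults).most_common()


# Hypothetical input: one tuple per failing run, reproducing the counts quoted above.
faults = (
    [("page not present", 0xFAE50000)] * 46
    + [("slave not present", 0xFA3A7000)] * 15
)

for (kind, addr), count in tally_faults(faults):
    print(f"{count:3d} ({kind}) addr {addr:#010x}")
```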
<lynxeye>
tomeu: Are those valid addresses? I.e. are there buffers mapped at this location?
<tomeu>
lynxeye: not starting at that location, but 0xfae4e000 is allocated, as well as 0xfae53000
<tomeu>
guess I should check what is allocated there and with what sizes
<lynxeye>
tomeu: if you capture a devcoredump of one of those failing runs, the BO map contained in the dump should tell you which BOs are currently mapped and their size
<tomeu>
hmm, it's an output tensor in a NN job
<tomeu>
that's 0xfae4e000-0xfae504c0
<tomeu>
wonder if it can be significant that it seems to fault on a page boundary
<tomeu>
0xfae50000
<tomeu>
that BO is allocated from 0xfae4e000 to 0xfae53000
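The address arithmetic behind these observations can be checked with a few lines of Python. This is only a sketch: `describe_fault` is a hypothetical helper, a 4 KiB page size is assumed (the log does not state it), and the tensor and BO end addresses are treated as exclusive bounds.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB MMU pages, not stated in the log


def describe_fault(fault, start, end, page_size=PAGE_SIZE):
    """Locate a faulting address relative to a [start, end) range."""
    offset = fault - start
    return {
        "inside": start <= fault < end,
        "offset": offset,
        "page_index": offset // page_size,
        "page_aligned": fault % page_size == 0,
    }


FAULT = 0xFAE50000

# BO range quoted above: 0xfae4e000 to 0xfae53000
bo = describe_fault(FAULT, 0xFAE4E000, 0xFAE53000)
# Output tensor range quoted above: 0xfae4e000-0xfae504c0
tensor = describe_fault(FAULT, 0xFAE4E000, 0xFAE504C0)

print(bo)
print(tensor["inside"])
```

Under these assumptions the fault lands inside both the BO and the tensor, at offset 0x2000, which is the start of the BO's third page.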
<tomeu>
I don't see anything wrong with the lifecycle of these BOs, so I will go back to checking for a race
frieder has quit [Remote host closed the connection]
<lynxeye>
Hard to tell if this is simply the first location in the output tensor that is touched by the NPU.
italove8 has joined #etnaviv
pcercuei has quit [Quit: Lost terminal]
lynxeye has quit [Quit: Leaving.]
mvlad has quit [Remote host closed the connection]