ChanServ changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://oftc.irclog.whitequark.org/etnaviv
alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
SpieringsAE has joined #etnaviv
SpieringsAE has quit [Quit: SpieringsAE]
SpieringsAE has joined #etnaviv
<tomeu> lynxeye: I've looked further into this, and I think what is going on is that sometimes the MMU isn't fully set up when a job starts executing
<tomeu> which would explain why sometimes I see a "slave not present" and sometimes a "page not present"
<tomeu> and the BO affected by this is sometimes a shader, or an input or output
frieder has joined #etnaviv
<tomeu> when I start 4 inferences at the same time, I rarely see a fault. But with 5 at the same time, most runs see a fault happen.
<tomeu> I can confirm that the faults happen only when we have had to switch to a different PTA
<tomeu> and the fact that it happens only on the first run points to the NPU seeing an incomplete page table
<tomeu> lynxeye: how would you ensure that the MMU in the NPU accesses the latest data in the page tables?
lynxeye has joined #etnaviv
<tomeu> allocating the page tables as coherent or flushing the cpu write cache with wmb() hasn't helped
<tomeu> so I wonder if the MMU in the NPU isn't using a stale read cache
<lynxeye> tomeu: fwiw, the barrier for the page table writes is the mb() in etnaviv_buffer_replace_wait()
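(A minimal sketch of the ordering pattern referenced above, paraphrased rather than copied from etnaviv_buffer_replace_wait(): all prior CPU writes, including page-table updates, are made visible with a full barrier before the WAIT the front end is spinning on is turned into a LINK to the new commands.)

    /* hedged sketch, not verbatim kernel code: the FE spins on a WAIT/LINK
     * pair; everything written before the mb() (page tables, new command
     * words) is visible before the WAIT opcode is rewritten into a LINK,
     * so the FE cannot fetch the new commands too early */
    static void replace_wait_sketch(u32 *waitlink, u32 link_cmd, u32 link_target)
    {
        waitlink[1] = link_target;  /* operand word of the future LINK   */
        mb();                       /* order all prior writes against... */
        waitlink[0] = link_cmd;     /* ...the opcode flip WAIT -> LINK   */
    }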
<lynxeye> so the issue doesn't only happen when the GPU starts up, but also when you switch between different address spaces?
<tomeu> only when I switch between different address spaces for the first time in a client's lifetime
<tomeu> I think it doesn't happen if I force a flush for every job, I'm verifying this right now
<tomeu> - bool need_flush = switch_mmu_context || gpu->flush_seq != new_flush_seq;
<tomeu> + bool need_flush = true; //switch_mmu_context || gpu->flush_seq != new_flush_seq;
<tomeu> seems to fix it
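(For context, a paraphrase of the logic the diff above touches in etnaviv_buffer_queue(); not verbatim kernel code:)

    bool switch_mmu_context = gpu->mmu_context != mmu_context;
    unsigned int new_flush_seq = READ_ONCE(gpu->mmu_context->flush_seq);
    bool need_flush = switch_mmu_context || gpu->flush_seq != new_flush_seq;
    /* forcing need_flush = true makes every queued buffer carry a full
     * MMU/TLB flush, which is why it hides the problem */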
<tomeu> so, what I think happens is that I start up 8 processes at the same time, and each is going to submit, N times, a cmdstream that has been split into 3 chunks
<tomeu> the first chunk always executes fine for all processes
<tomeu> but sometimes, the second chunk will fail at an address that was mapped after the first chunk executed
<tomeu> so I think the MMU has stale TLB entries and doesn't realize that the BO has been mapped
<tomeu> guess I need to debug the flush_seq logic now
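(A hedged sketch of the producer side of that logic, as described in the discussion rather than copied from the driver: whenever a mapping in an address space changes, the MMU context bumps flush_seq so a later buffer queue for that context knows the TLB may be stale.)

    static void mapping_changed_sketch(struct etnaviv_iommu_context *context)
    {
        /* page-table entries for the new mapping were just written */
        context->flush_seq++;   /* consumed by need_flush in etnaviv_buffer_queue() */
    }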
<lynxeye> I think I can see how the MMU flush sequence would race in the case you describe above.
<tomeu> yeah, this may be the fix:
<tomeu> otherwise, we switch contexts, but leave the flush_seq for the previous context
<tomeu> which in this case can be ahead of the seq for the new context
<lynxeye> nope, that shouldn't happen: when we need to switch contexts, need_flush is also true, so the flush sequence gets set a few lines below
pcercuei has joined #etnaviv
<lynxeye> I wonder if we might record a flush sequence that corresponds to page table changes that are not yet visible to the GPU and thus skip the necessary flush on a later buffer queue. Since buffer queue and MMU updates might happen from different threads, this might be a possibility, but I currently don't see how the flush_seq update on the MMU context might reorder against the page table update, since the mb() before queuing the command buffer should make all updates visible to the GPU.
<lynxeye> WTF: The MMU context used in the READ_ONCE is just plain wrong.
<lynxeye> This should be the new context aka mmu_context, not the current context aka gpu->mmu_context
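(In diff form, the change lynxeye is describing for etnaviv_buffer_queue(), paraphrased from the discussion rather than taken from the final patch:)

    - unsigned int new_flush_seq = READ_ONCE(gpu->mmu_context->flush_seq);
    + unsigned int new_flush_seq = READ_ONCE(mmu_context->flush_seq);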
pcercuei has quit [Quit: brb]
<lynxeye> so in fact we do record a wrong flush_seq after the address space switch
<tomeu> yeah, I also need that change
<tomeu> I'm testing right now
pcercuei has joined #etnaviv
<tomeu> so far so good
SpieringsAE has quit [Quit: SpieringsAE]
<tomeu> lynxeye: it got much better, but I'm still seeing faults and I think there may be a concurrency issue:
<tomeu> looks like we are mapping 0xfa95c000 after the job that depends on it has run
<tomeu> hrm, but etnaviv_sched_push_job shouldn't have been called yet for that submit. will debug
<lynxeye> tomeu: looks odd. The mapping is established on the submit, so even before the job gets queued to the scheduler. So there should be no way for the scheduler to push the job to the GPU before the buffer is mapped. Maybe this is a userspace issue where only the second submit has the reference to the buffer, but the first submit is already using the softpin address?
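(Roughly the ordering lynxeye is describing for the submit ioctl; the helper names below are placeholders standing in for the actual steps in etnaviv_gem_submit.c, not real functions:)

    /* hedged sketch of the submit path ordering, not verbatim: */
    lookup_and_pin_bos(submit);       /* every BO listed in the submit      */
    map_and_apply_relocs(submit);     /* mappings exist from this point on  */
    push_job_to_scheduler(submit);    /* only now can the job reach the NPU */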
<tomeu> Yeah, I'm trying to think how that could be possible, because I add relocs to the submit as I reference them in the cmdstream.
<lynxeye> does this happen with the forced flush due to running out of cmdbuf space, or does it also happen without such flushes?
<tomeu> I have only seen instability with the forced flushes
<lynxeye> if it's due to forced flushes, maybe your submissions get broken up at points where you don't expect it, like commands getting added to one cmdbuf but the references to the BOs only being applied to the one after the flush.
<lynxeye> The way to avoid this is to reserve enough space in the cmdstream upfront via etna_cmd_stream_reserve() for all the operations you want to add to one cmdbuf atomically.
<lynxeye> If the cmdbuf doesn't have enough space for the size you are asking for, the forced flush will be triggered at reserve time and you'll roll over to the next cmdbuf
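(A usage sketch of that pattern with the etna_cmd_stream API; op_num_words, the emitted word and the BO are placeholders:)

    /* reserve everything one operation needs up front, so a forced
     * flush/rollover can only happen here, never between the command
     * words and the relocs that belong to them */
    etna_cmd_stream_reserve(stream, op_num_words);

    etna_cmd_stream_emit(stream, command_word);        /* placeholder words */
    etna_cmd_stream_reloc(stream, &(struct etna_reloc) {
        .bo = input_bo,                                /* placeholder BO    */
        .flags = ETNA_RELOC_READ,
        .offset = 0,
    });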
frieder has quit [Remote host closed the connection]
lynxeye has quit [Quit: Leaving.]
<tomeu> I added forced flushes to the ml-specific code, splitting between operations
<tomeu> so doesn't seem to be that
<tomeu> what I have just noticed is that if I add some writes to /dev/kmsg when adding the relocs from each operation, I cannot reproduce it
<tomeu> so it seems as if in order to reproduce the problem you need to start adding relocs right after a previous submission
<tomeu> so I'm wondering if anything gets overwritten while the kernel is still reading the ioctl from userspace
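(If that were the case, it would show up in the copy-in the submit ioctl performs while it runs; a hedged paraphrase of that step, not verbatim etnaviv_gem_submit.c: the BO and reloc arrays are copied from user memory during the ioctl, so userspace must keep them stable until it returns.)

    relocs = kvmalloc_array(args->nr_relocs, sizeof(*relocs), GFP_KERNEL);
    if (copy_from_user(relocs, u64_to_user_ptr(args->relocs),
                       args->nr_relocs * sizeof(*relocs)))
        return -EFAULT;   /* a concurrent rewrite of the array races with this copy */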