ChanServ changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://oftc.irclog.whitequark.org/etnaviv
alarumbe has quit [Quit: ZNC 1.8.2+deb3.1+deb12u1 - https://znc.in]
SpieringsAE has joined #etnaviv
SpieringsAE has quit [Quit: SpieringsAE]
SpieringsAE has joined #etnaviv
<tomeu> lynxeye: I've looked further into this, and I think what is going on is that sometimes the MMU isn't fully set up when a job starts executing
<tomeu> which would explain why sometimes I see a "slave not present" and sometimes a "page not present"
<tomeu> and the BO affected by this is sometimes a shader, or an input or output
frieder has joined #etnaviv
<tomeu> when I start 4 inferences at the same time, I rarely see a fault. But with 5 at the same time, most runs see a fault happen.
<tomeu> I can confirm that the faults happen only when we have had to switch to a different PTA
<tomeu> and the fact that it happens only on the first run points to the NPU seeing an incomplete page table
<tomeu> lynxeye: how would you ensure that the MMU in the NPU accesses the latest data in the page tables?
lynxeye has joined #etnaviv
<tomeu> allocating the page tables as coherent or flushing the cpu write cache with wmb() hasn't helped
<tomeu> so I wonder if the MMU in the NPU isn't using a stale read cache
<lynxeye> tomeu: fwiw, the barrier for the page table writes is the mb() in etnaviv_buffer_replace_wait()
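(A minimal sketch of the ordering pattern referenced above, paraphrased rather than copied from etnaviv_buffer_replace_wait(): all prior CPU writes, including page-table updates, are made visible with a full barrier before the WAIT the front end is spinning on is turned into a LINK to the new commands.)

    /* hedged sketch, not verbatim kernel code: the FE spins on a WAIT/LINK
     * pair; everything written before the mb() (page tables, new command
     * words) is visible before the WAIT opcode is rewritten into a LINK,
     * so the FE cannot fetch the new commands too early */
    static void replace_wait_sketch(u32 *waitlink, u32 link_cmd, u32 link_target)
    {
        waitlink[1] = link_target;  /* operand word of the future LINK   */
        mb();                       /* order all prior writes against... */
        waitlink[0] = link_cmd;     /* ...the opcode flip WAIT -> LINK   */
    }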
<lynxeye> so the issue doesn't only happen when the GPU starts up, but also when you switch between different address spaces?
<tomeu> only when I switch between different address spaces for the first time in a client's lifetime
<tomeu> I think it doesn't happen if I force a flush for every job, I'm verifying this right now
<tomeu> - bool need_flush = switch_mmu_context || gpu->flush_seq != new_flush_seq;
<tomeu> + bool need_flush = true; //switch_mmu_context || gpu->flush_seq != new_flush_seq;
<tomeu> seems to fix it
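(For context, a paraphrase of the logic the diff above touches in etnaviv_buffer_queue(); not verbatim kernel code:)

    bool switch_mmu_context = gpu->mmu_context != mmu_context;
    unsigned int new_flush_seq = READ_ONCE(gpu->mmu_context->flush_seq);
    bool need_flush = switch_mmu_context || gpu->flush_seq != new_flush_seq;
    /* forcing need_flush = true makes every queued buffer carry a full
     * MMU/TLB flush, which is why it hides the problem */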
<tomeu> so, what I think happens is that I start up 8 processes at the same time, and each is going to submit, N times, a cmdstream that has been split into 3 chunks
<tomeu> the first chunk always executes fine for all processes
<tomeu> but sometimes, the second chunk will fail at an address that was mapped after the first chunk executed
<tomeu> so I think the MMU has stale TLB entries and doesn't realize that the BO has been mapped
<tomeu> guess I need to debug the flush_seq logic now
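(A hedged sketch of the producer side of that logic, as described in the discussion rather than copied from the driver: whenever a mapping in an address space changes, the MMU context bumps flush_seq so a later buffer queue for that context knows the TLB may be stale.)

    static void mapping_changed_sketch(struct etnaviv_iommu_context *context)
    {
        /* page-table entries for the new mapping were just written */
        context->flush_seq++;   /* consumed by need_flush in etnaviv_buffer_queue() */
    }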
<lynxeye> I think I can see how the MMU flush sequence would race in the case you describe above.
<tomeu> yeah, this may be the fix:
<tomeu> otherwise, we switch contexts, but leave the flush_seq for the previous context
<tomeu> which in this case can be ahead of the seq for the new context
<lynxeye> nope, that shouldn't happen: when we need to switch contexts, need_flush is also true, so the flush sequence gets set a few lines below
pcercuei has joined #etnaviv
<lynxeye> I wonder if we might record a flush sequence that corresponds to page table changes that are not yet visible to the GPU and thus skip the necessary flush on a later buffer queue. Since buffer queue and MMU updates might happen from different threads, this might be a possibility, but I currently don't see how the flush_seq update on the MMU context might reorder against the page table update, since the mb() before queuing the command buffer should make all updates visible to the GPU.
<lynxeye> WTF: The MMU context used in the READ_ONCE is just plain wrong.
<lynxeye> This should be the new context aka mmu_context, not the current context aka gpu->mmu_context
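(In diff form, the change lynxeye is describing for etnaviv_buffer_queue(), paraphrased from the discussion rather than taken from the final patch:)

    - unsigned int new_flush_seq = READ_ONCE(gpu->mmu_context->flush_seq);
    + unsigned int new_flush_seq = READ_ONCE(mmu_context->flush_seq);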
pcercuei has quit [Quit: brb]
<lynxeye> so in fact we do record a wrong flush_seq after the address space switch
<tomeu> yeah, I also need that change
<tomeu> I'm testing right now
pcercuei has joined #etnaviv
<tomeu> so far so good
SpieringsAE has quit [Quit: SpieringsAE]
<tomeu> lynxeye: it got much better, but I'm still seeing faults and I think there may be a concurrency issue:
<tomeu> looks like we are mapping 0xfa95c000 after the job that depends on it has run
<tomeu> hrm, but etnaviv_sched_push_job shouldn't have been called yet for that submit. will debug
<lynxeye> tomeu: looks odd. The mapping is established on the submit, so even before the job gets queued to the scheduler. So there should be no way for the scheduler to push the job to the GPU before the buffer is mapped. Maybe this is a userspace issue where only the second submit has the reference to the buffer, but the first submit is already using the softpin address?
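(Roughly the ordering lynxeye is describing for the submit ioctl; the helper names below are placeholders standing in for the actual steps in etnaviv_gem_submit.c, not real functions:)

    /* hedged sketch of the submit path ordering, not verbatim: */
    lookup_and_pin_bos(submit);       /* every BO listed in the submit      */
    map_and_apply_relocs(submit);     /* mappings exist from this point on  */
    push_job_to_scheduler(submit);    /* only now can the job reach the NPU */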
<tomeu> Yeah, I'm trying to think how that could be possible, because I add relocs to the submit as I reference them in the cmdstream.
<lynxeye> does this happen with the forced flush due to running out of cmdbuf space, or does it also happen without such flushes?
<tomeu> I have only seen instability with the forced flushes
<lynxeye> if it's due to forced flushes, maybe your submissions get broken up at points where you don't expect it, like commands getting added to one cmdbuf but the references to the BOs only being applied to the one after the flush.
<lynxeye> The way to avoid this is to reserve enough space in the cmdstream upfront via etna_cmd_stream_reserve() for all the operations you want to add to one cmdbuf atomically.
<lynxeye> If the cmdbuf doesn't have enough space for the size you are asking for, the forced flush will be triggered at reserve time and you'll roll over to the next cmdbuf
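(A usage sketch of that pattern with the etna_cmd_stream API; op_num_words, the emitted word and the BO are placeholders:)

    /* reserve everything one operation needs up front, so a forced
     * flush/rollover can only happen here, never between the command
     * words and the relocs that belong to them */
    etna_cmd_stream_reserve(stream, op_num_words);

    etna_cmd_stream_emit(stream, command_word);        /* placeholder words */
    etna_cmd_stream_reloc(stream, &(struct etna_reloc) {
        .bo = input_bo,                                /* placeholder BO    */
        .flags = ETNA_RELOC_READ,
        .offset = 0,
    });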
frieder has quit [Remote host closed the connection]
lynxeye has quit [Quit: Leaving.]
<tomeu> I added forced flushes to the ml-specific code, splitting between operations
<tomeu> so doesn't seem to be that
<tomeu> what I have just noticed is that if I add some writes to /dev/kmsg when adding the relocs from each operation, I cannot reproduce it
<tomeu> so it seems as if in order to reproduce the problem you need to start adding relocs right after a previous submission
<tomeu> so I'm wondering if anything gets overwritten while the kernel is still reading the ioctl from userspace
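(If that were the case, it would show up in the copy-in the submit ioctl performs while it runs; a hedged paraphrase of that step, not verbatim etnaviv_gem_submit.c: the BO and reloc arrays are copied from user memory during the ioctl, so userspace must keep them stable until it returns.)

    relocs = kvmalloc_array(args->nr_relocs, sizeof(*relocs), GFP_KERNEL);
    if (copy_from_user(relocs, u64_to_user_ptr(args->relocs),
                       args->nr_relocs * sizeof(*relocs)))
        return -EFAULT;   /* a concurrent rewrite of the array races with this copy */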