<tomeu>
so, what I think happens is that I start up 8 processes at the same time and each is going to submit N times a cmdstream that has been split into 3 chunks
<tomeu>
the first chunk always executes fine for all processes
<tomeu>
but sometimes, the second chunk will fail at an address that was mapped after the first chunk executed
<tomeu>
so I think the MMU is working from stale TLB entries and doesn't realize that the BO has been mapped
<tomeu>
guess I need to debug now the flush_seq logic
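(For context, the flush_seq scheme being debugged works roughly as in the sketch below: every map/unmap into an MMU context bumps a per-context sequence counter, and before queuing a cmdbuf the kernel compares that counter against the last value it emitted a TLB flush for. The names here are simplified stand-ins, not the literal etnaviv driver code.)

    struct mmu_context {
        unsigned int flush_seq;          /* bumped on every map/unmap */
    };

    struct gpu {
        struct mmu_context *mmu_context; /* context currently live on the HW */
        unsigned int flush_seq;          /* last sequence a TLB flush was emitted for */
    };

    /* map/unmap path: any page table change invalidates cached translations */
    static void mmu_map(struct mmu_context *ctx /*, bo, iova, ... */)
    {
        /* ... write PTEs ... */
        ctx->flush_seq++;
    }

    /* buffer queue path: decide whether a TLB flush must precede the cmdbuf */
    static bool need_tlb_flush(struct gpu *gpu, struct mmu_context *new_ctx)
    {
        bool switch_context = gpu->mmu_context != new_ctx;

        return switch_context || gpu->flush_seq != new_ctx->flush_seq;
    }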
<lynxeye>
I think I can see how the MMU flush sequence would race in the case you describe above.
<tomeu>
otherwise, we switch contexts, but leave the flush_seq for the previous context
<tomeu>
which in this case can be ahead of the seq for the new context
<lynxeye>
nope, that shouldn't happen: when we need to switch contexts, need_flush is also true, so the flush sequence gets set a few lines below
<lynxeye>
I wonder if we might record a flush sequence that corresponds to page table changes that are not yet visible to the GPU and thus skip the necessary flush on a later buffer queue. Since buffer queue and MMU updates might happen from different threads, this might be a possibility, but I currently don't see how the flush_seq update on the MMU context might reorder against the page table update, since the mb before queuing the command should make all updates visible to the GPU.
<lynxeye>
WTF: The MMU context used in the READ_ONCE is just plain wrong.
<lynxeye>
This should be the new context aka mmu_context, not the current context aka gpu->mmu_context
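(In code, the one-identifier fix lynxeye is describing would look roughly like this; paraphrased from the chat, not the exact driver lines:)

    -    new_flush_seq = READ_ONCE(gpu->mmu_context->flush_seq); /* current context */
    +    new_flush_seq = READ_ONCE(mmu_context->flush_seq);      /* new context */

With the old line, a submit that switches address spaces records the previous context's flush sequence, which, as tomeu noted above, can be ahead of the sequence for the new context, so a later flush gets skipped.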
<lynxeye>
so in fact we do record a wrong flush_seq after the address space switch
<tomeu>
yeah, I also need that change
<tomeu>
I'm testing right now
<tomeu>
so far so good
<tomeu>
lynxeye: it got much better, but I'm still seeing faults and I think there may be a concurrency issue:
<tomeu>
looks like we are mapping 0xfa95c000 after the job that depends on it has run
<tomeu>
hrm, but etnaviv_sched_push_job shouldn't have been called yet for that submit. will debug
<lynxeye>
tomeu: looks odd. The mapping is established on the submit, so even before the job gets queued to the scheduler. So there should be no way for the scheduler to push the job to the GPU before the buffer is mapped. Maybe this is a userspace issue where only the second submit has the reference to the buffer, but the first submit is already using the softpin address?
<tomeu>
Yeah, I'm trying to think how that could be possible, because I add relocs to the submit as I reference them in the cmdstream.
<lynxeye>
does this happen with the forced flush due to running out of cmdbuf space, or does it also happen without such flushes?
<tomeu>
I have only seen instability with the forced flushes
<lynxeye>
if it's due to forced flushes, maybe your submissions get broken up at points where you don't expect it, like commands getting added to one cmdbuf while the references to the BOs only get applied to the one after the flush.
<lynxeye>
The way to avoid this is to reserve enough space in the cmdstream upfront via etna_cmd_stream_reserve() for all the operations you want to add to one cmdbuf atomically.
<lynxeye>
If the cmdbuf doesn't have enough space for the size you are asking for, the forced flush will be triggered at reserve time and you'll roll over to the next cmdbuf
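(To make lynxeye's suggestion concrete, a minimal sketch using the public libdrm-etnaviv API; etna_cmd_stream_reserve/emit/reloc are the real entry points, while OP_HEADER_DWORD and the dword counts are made up for illustration:)

    #include <etnaviv_drmif.h>

    #define OP_HEADER_DWORD 0xdeadbeef  /* hypothetical command header */

    /*
     * Reserve every dword of one logical operation before emitting any of
     * them: if the cmdbuf is too full, the forced flush then happens at
     * reserve time, so a command can never end up in a different cmdbuf
     * than the reloc it depends on.
     */
    static void emit_op(struct etna_cmd_stream *stream, struct etna_bo *bo)
    {
        etna_cmd_stream_reserve(stream, 2);  /* header + relocated address */

        etna_cmd_stream_emit(stream, OP_HEADER_DWORD);
        etna_cmd_stream_reloc(stream, &(struct etna_reloc){
            .bo = bo,
            .flags = ETNA_RELOC_READ,
            .offset = 0,
        });
    }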
lynxeye has quit [Quit: Leaving.]
<tomeu>
I added forced flushes to the ml-specific code, splitting between operations
<tomeu>
so doesn't seem to be that
<tomeu>
what I have just noticed is that if I add some writes to /dev/kmsg when adding the relocs from each operation, I cannot reproduce it
<tomeu>
so it seems as if in order to reproduce the problem you need to start adding relocs right after a previous submission
<tomeu>
so I'm wondering if anything gets overwritten while the kernel is still reading the ioctl from userspace