April 2019
Sorry, my phone scrolled to put the submit button where the keyboard used to be.
Anyway, I'd expect the function to tend to crash on the first time through the loop if it was L3 cache related, since that's when the instructions are coming through L3.
I really appreciate these experiments @Falkentyne, they're very helpful and informative!
April 2019 - last edited April 2019
I don't know if this helps, but if the core voltage is far too low, Apex will sometimes throw a WHEA-logged correctable "CPU TLB" ("Translation Lookaside Buffer") error. (The game doesn't crash at the same time this error happens, however.)
What would cause Apex Legends to throw out this error?
This is all I can find:
https://en.wikipedia.org/wiki/Translation_lookaside_buffer
I really wish I had learned programming. What Apex is doing is very interesting.
Everyone thought it was AVX instructions but it seems to be SSE2 or other things.
April 2019
@Falkentyne, this crash is interesting. We released 1.1.1 yesterday, and this crash is with that newer executable. Looking at the disassembly, this function is identical between 1.1.0 and 1.1.1; not even the registers got shuffled, and it's even at the same offset in the executable.
However, the crash is a new crash. We haven't seen that offset before with 1.1.0.
Also, the address implicated in the crash dump is not the start of an instruction. The instruction starts one byte earlier.
The incorrect instruction at 2F2DD9 actually just skips a prefix that turns a vector SSE instruction into a scalar SSE instruction. In this case, it would be mulss if it started at 2F2DD8, but because it skipped the first byte of the instruction, it turns into mulps. So it will do a vector multiply instead of a scalar multiply. When it does that, it requires 16-byte alignment instead of 4-byte alignment for memory reads. The address it's reading has 4-byte alignment but not 16, so doing mulps will crash due to unaligned access, whereas mulss (which we wanted) would work.
For reasons I don't know, I've always seen unaligned memory accesses in SSE instructions get reported as read or write access violations of memory location FFFFFFFFFFFFFFFF. It doesn't matter what memory it tried to read or write, it always reports it at that other address.
So it looks like the instruction pointer got off-by-one in this crash, which ended up causing an unaligned memory read of a valid address, which gets reported as a memory read of an invalid address. But the hardware bug was the off-by-one in the instruction pointer register, RIP.
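The decoding difference described above can be sketched in a few lines of Python. This is a toy decoder, not a real disassembler: it recognizes only the two encodings in question (F3 0F 59 for mulss, 0F 59 for mulps). The ModRM byte, the helper names, and the example address are hypothetical and not taken from the game binary.

```python
# Toy illustration only: recognizes just the MULSS / MULPS opcodes.
# The ModRM byte (0x01) and the example address are made up for this sketch.

def decode(code: bytes, offset: int) -> str:
    """Return the mnemonic found when decoding starts at `offset`."""
    window = code[offset:]
    if window[:3] == b"\xF3\x0F\x59":
        return "mulss"   # F3 prefix selects the scalar single-precision form
    if window[:2] == b"\x0F\x59":
        return "mulps"   # same opcode without the prefix: packed (vector) form
    return "unknown"

def faults_on_alignment(mnemonic: str, address: int) -> bool:
    """True if the memory operand would fault purely because of alignment.

    Assumes the legacy (non-VEX) SSE encoding with alignment checking off:
    mulps needs a 16-byte-aligned memory operand, while mulss reads a
    single 4-byte float and has no such requirement.
    """
    return mnemonic == "mulps" and address % 16 != 0

# F3 0F 59 01 encodes "mulss xmm0, dword ptr [rcx]" in 64-bit mode.
code = bytes([0xF3, 0x0F, 0x59, 0x01])
address = 0x1004  # hypothetical operand address: 4-byte aligned, not 16-byte aligned

intended = decode(code, 0)  # instruction pointer where it should be
skewed = decode(code, 1)    # instruction pointer one byte too far

print(intended, faults_on_alignment(intended, address))
print(skewed, faults_on_alignment(skewed, address))
```

Starting one byte late drops the F3 prefix, so the same remaining bytes decode as the packed form, and the packed form is the one that cares about 16-byte alignment.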
April 2019
@OrioStorm Attached my crash log. Running an i9-9900K @5.0GHz
Voltages:
VCore - 1.35v
VCCIO - 1.328v
VCCSA - 1.264v
So I don't think voltages are the issue as @Falkentyne suspects.
Have an ASUS Maximus XI Hero motherboard, will try stock BIOS settings tonight.
April 2019
Game never crashed before. Now this is happening every 2 games. It just crashes to desktop. Log:
crash:
{
    R5Apex: 000000000085A313
    EXCEPTION_ACCESS_VIOLATION(read): 000009B2B788F670
}
cpu: "AMD Ryzen 5 2600X Six-Core Processor "
ram: 16 // GB
callstack:
{
    KERNELBASE: 000000000008667C
    ntdll: 00000000000A810B
    ntdll: 000000000008FD56
    ntdll: 00000000000A46AF
    ntdll: 0000000000004BEF
    ntdll: 00000000000A341E
    R5Apex: 000000000085A313
    R5Apex: 000000000068D81E
    R5Apex: 000000000068F91B
    R5Apex: 00000000004C116D
    R5Apex: 00000000004C0A0A
    R5Apex: 00000000004C0FC7
    R5Apex: 00000000004BF767
    R5Apex: 00000000004C1502
    KERNEL32: 0000000000017974
    ntdll: 000000000006A271
}
registers:
{
    rax = 0x000009B2B788F660
    rbx = 1
    rcx = 0x000001B2C2681F90
    rdx = 1
    rsp = 0x00000023FE30F820
    rbp = 0x042B3085
    rsi = 0x000009B2B788F660
    rdi = 0x000001B2C2681F90
    r8 = 1
    r9 = 209
    r10 = 0
    r11 = 21
    r12 = 20
    r13 = 0
    r14 = 0xFFFF000000000000
    r15 = 0x2B00000000000000
    rip = 0x00007FF6A015A313
    xmm0 = [ [0.001, 0, 0, 0], [0x3A83126F, 0x00000000, 0x00000000, 0x00000000] ]
    xmm1 = [ [93.531677, 0, 0, 0], [0x42BB1038, 0x00000000, 0x00000000, 0x00000000] ]
    xmm2 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm3 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm4 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm5 = [ [0, -nan, -nan, -nan], [0, -1, -1, -1] ]
    xmm6 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm7 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm8 = [ [0.010810852, 0, 0, 0], [0x3C312000, 0x00000000, 0x00000000, 0x00000000] ]
    xmm9 = [ [93.542488, 0, 0, 0], [0x42BB15C1, 0x00000000, 0x00000000, 0x00000000] ]
    xmm10 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm11 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm12 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm13 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm14 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
    xmm15 = [ [0, 0, 0, 0], [0, 0, 0, 0] ]
}
build_id: 1554860081
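For anyone trying to read a log like the one above, here is a small sketch of the arithmetic that relates its fields. The numbers are copied from the log; the variable names are mine, and the observations in the comments are just what the numbers show, not a diagnosis of the crash.

```python
# Values copied from the crash log above; names are illustrative only.
rip = 0x00007FF6A015A313            # absolute instruction pointer at the crash
crash_offset = 0x85A313             # "R5Apex: 000000000085A313" (offset inside the module)
fault_address = 0x000009B2B788F670  # address in EXCEPTION_ACCESS_VIOLATION(read)
rax = 0x000009B2B788F660

# Subtracting the module-relative offset from rip gives the address
# the executable was loaded at for this run.
module_base = rip - crash_offset
print(hex(module_base))

# The faulting read is a fixed distance past the pointer held in rax
# (rsi holds the same value), which is consistent with reading a field
# through that pointer.
print(fault_address - rax)
```

The module-relative offset is the part that stays comparable between machines, since the load address can differ from run to run.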
April 2019 - last edited April 2019
I played 2 hours last night after resetting my Asus BIOS to the default optimized settings, with only the XMP profile enabled to match my RAM timings.
I did not crash.
I was only able to play “stable” previously using Nvidia’s 416 drivers on my 2080 Ti and capping my FPS at 100. I would crash very quickly if I didn’t cap the FPS.
I upgraded to 425 drivers and was able to play unlimited FPS for the duration of the two hours while using stock BIOS settings.
On stock settings, my i9-9900K does not boost above 4.7 GHz. My next step, if I’m able to play without crashing for another night or two, is to introduce overclocking and start moving up in 100 MHz increments from 4.7 GHz. 5.0 GHz on my machine is not stable with Apex (regardless of voltage).
April 2019
Please keep me posted.
And you were right. I had three hours of stability, but then suddenly the "internal parity errors" started happening again, and not long after, the random crashes.
The only fix was to increase CPU voltage -excessively- high.
Oriostorm, is Apex doing something timing related that is causing this instability?
Because it really should not be happening.
I set my CPU to 5.2 GHz (hyperthreading off) at 1.335v and ran Battlefield 5 (Firestorm) for an hour. BF5 was using >75% of the CPU cores, temps and power draw were much higher, and there were no errors or crashes. I then ran the Blender Classroom render stress test (you can google this) and the Blender BMW stress test, and both completed with flying colors. The only programs that had problems at this point were Apex Legends (those crashes and Internal Parity Error) and Prime95 AVX 1344K (clock watchdog timeout BSOD). Prime95 small FFT with AVX disabled ran forever. (Prime95 29.8 build 1).
Increasing CPU voltage to 1.385v completely stopped the Prime95 1344K AVX tests from crashing. These ran fine now.
But Apex Legends was still randomly acting up (exception error, or CTD sometimes with no error, or internal parity error with no crash).
1.395v-1.40v was needed to stop this--far far beyond what any other program or application needed! Something is really strange here.
And what's even worse---Apex Legends uses FAR FAR far less CPU resources than Blender, Battlefield 5, Prime95 (AVX disabled), etc.
Oriostorm, is there any way, if you have time, to write a very small 'stress test' code sample that can extensively test some of the instructions you are using for Apex Legends, in a repeated intense loop of cycles, so that we can run the executable program and it can catch any SSE2 error in a bugtrap? This may help track down the problem. This might also help determine why some users with stock clocks are also getting these crashes (in most cases, pure stock clocks prevent these problems). It shouldn't take that long to write a code sample that we can download, and it may help determine what's going on.
It really is inconceivable that Apex is needing far more voltage than something like Blender or Prime95....
Could EasyAntiCheat be causing these timing errors?
By the way, this "bizarre" behavior seems to get reversed (meaning things start acting properly) if you run at base clocks (4.7 ghz) and then downvolt the CPU enough so it's unstable.
Doing this and Apex Legends runs nice and happily while Prime95 with AVX disabled generates a BSOD crash. This happened at about 4.7 ghz (HT enabled) at 1.065v.
You have to go even lower on the voltage before Apex starts generating parity errors.
April 2019
@Sptz87, thank you for the crash log. This actually isn't the Intel CPU crash everybody else in this thread has been getting, so we can actually do something about it in our code! Yay!
It looks like the game crashed updating animation on a worker thread, because it needed to look up something in memory that had been freed.
You can try setting "cl_parallel_clientside_animations 0" and see if that works around the crash until we can get it fixed in a future patch. Please let me know if that helps!
April 2019
@Falkentyne, I honestly don't know how Easy AntiCheat works (I wasn't involved in integrating that), so I can't speak to whether it's a factor.
I do think it's related to the actual instruction sequence, and possibly its offset in memory. I can try making a standalone test program, that's a good idea. But it's not as trivial as you might think. This function does connect with the rest of our engine, so we have to sever those connections to make a standalone program, but we have to do it in a way that doesn't change the generated assembly.
If that doesn't work, I guess I could write an assembly language file that just has this function exactly as it is in the live game, and try to have external code feed it the data it expects in a way that won't crash.
Another tricky thing is that the path through this code depends on branches based on what you can see in the game. If a particular path through the code is causing the bug, you'd need to replicate the live data that causes the crash. That data can be different for every viewpoint in King's Canyon, so even replicating all of the code may not replicate the crash if we don't replicate the control flow decisions. There are things we can try to force different control flows, but it's not guaranteed to repro.
Still, it's worth a shot!