In stage 2, I discovered that operations involving the movement of data from memory caused the biggest performance hits. After analyzing the code in the most expensive functions, it was clear that the x86_64 and arm_64 architectures faced different kinds of memory access performance issues.
On the x86_64 front, a common pattern emerged: both x86_64 machines ran into the most performance issues in the function "DecodeCurrentAccessUnit". Looking deeper into that function, the operation that caused the most trouble was "rep" in conjunction with "movsq". Without really knowing what "rep" does, I did some research, and it turns out that it is a repeat prefix: combined with "movsq" it copies a block of quadwords from one memory location to another, which is the typical pattern behind string and buffer copies. Considering that "DecodeCurrentAccessUnit" has quite a few string operations for logging, I'm led to believe that the string operations are causing most of the performance issues. Because of this, I suggest that string operations be reduced to improve performance, or, where they are needed, that the number of characters be kept as small as possible. I believe this optimization would work because if the "rep movsq" copies are causing performance issues, then executing them less often should minimize the damage from that operation.
x86_64 resources researched:
Fedora VM Profiling Results:
https://gist.github.com/Megawats777/171d81b7e9a907002d3418b36a446d71
Xerxes Profiling Results:
https://gist.github.com/Megawats777/ca297149ef90c72621adb113a439966c
Link to "DecodeCurrentAccessUnit" function:
In regards to the arm_64 architecture, the function that both the Aarchie and Bbetty systems had trouble with was "WelsResidualBlockCavlc". Within that function, any operation involving "ldr" took at least 2 seconds to execute. Since the purpose of "ldr" (and its byte-sized variant "ldrb") is to load values from memory into registers, it became clear that operations assigning values to variables may have been causing the performance hits within the function. Because of this, I believe that if inline asm were used for value assignment, we would have more explicit control over which registers hold which variables, which in turn should grant greater performance, as you would be able to prioritize which values should be accessed quickly. However, due to the number of local variables used in the function, figuring out which registers should hold which variables may prove too challenging. In addition, using inline asm would also make the code significantly less portable.
Link to "WelsResidualBlockCavlc" function:
Aarchie Profiling Results:
https://gist.github.com/Megawats777/40a3dd389c2fa9e9b9ad5494762dcecd
Bbetty Profiling Results:
https://gist.github.com/Megawats777/46bf0beba2a85dc420456f3f632f0746
The easiest solution to these performance problems would, in the end, be to increase the compiler optimization level, as back in stage 1 there were noticeable, if not significant, improvements in execution time as the optimization level got higher.
Stage 1 Results:
https://gist.github.com/Megawats777/6ce232ad0c186427a71ee1f0f52d8f7d
Reflecting on the process for stage 3 made me realize that performance issues may not come from the most obvious or exciting parts of a program. Sometimes they come from something mundane, like variable assignment and string usage. As for the solution, I learned that what resolves the problem may not be something exciting like overhauling your structure; instead, the fix may end up being adjusting how a couple of lines work or limiting how many characters you use in a string.
Thank you for reading this series and I'll see you on the other side.