Wednesday, April 15, 2020

Final Project - Stage 3 Report

This is a report for stage 3 of the final SPO600 project. Before reading this post, it is vital that you read the reports for stages 1 and 2. The task for stage 3 is to take all the information gathered in the first two stages and use it to propose ways to make the software we selected faster.

In stage 2 it was discovered that operations involving the movement of data to and from memory caused the biggest performance hits. After analyzing the code in the most expensive functions, it was clear that the x86_64 and arm_64 architectures each faced a different kind of memory access performance issue.

On the x86_64 front, one common pattern was that both x86_64 machines spent the most time in the function "DecodeCurrentAccessUnit". Looking deeper into that function, the operation that caused the most issues was "rep" in conjunction with "movsq". Without really knowing what "rep" does, I did some research and it turns out that it is a prefix that repeats the instruction that follows it; combined with "movsq", it performs bulk memory copies, which is how string copies are typically implemented. Considering that "DecodeCurrentAccessUnit" has quite a few string operations for logging, I'm led to believe that the string operations are causing most of the performance issues. I therefore suggest that, in order to improve performance, string operations be reduced, or where they are needed, that the number of characters used be kept as small as possible. I believe this optimization would work because if the "rep" operation is causing performance issues, then having it executed less often should minimize the damage it does.
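
As a rough sketch of what "reducing string operations" could look like (the names below are made up for illustration and do not come from the project's source), the idea is to only build a log message when it will actually be emitted, so the copies behind "rep movsq" never run on the hot path:

#include <cstdio>
#include <string>

// Hypothetical log levels and verbosity setting; the project's real
// logging API may look different.
enum LogLevel { kError = 0, kInfo = 1, kDebug = 2 };
static LogLevel g_logLevel = kError;

static void DecodeOneUnit(int unitIdx) {
    // Before: a message string was assembled on every call, and the
    // copies behind that assembly are where "rep movsq" shows up.
    // After: the string is only built when it will actually be printed,
    // so the common case skips the string traffic entirely.
    if (g_logLevel >= kDebug) {
        std::string msg = "decoding access unit " + std::to_string(unitIdx);
        std::fprintf(stderr, "%s\n", msg.c_str());
    }
    // ...actual decoding work would go here...
}

int main() {
    for (int i = 0; i < 3; ++i) {
        DecodeOneUnit(i); // with g_logLevel = kError, no strings are built
    }
}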

x86_64: Resources researched:



Link to "DecodeCurrentAccessUnit" function:



In regard to the arm_64 architecture, the function that both the Aarchie and Bbetty systems had trouble with was "WelsResidualBlockCavlc". Within that function, any operation that involved "ldr" took at least 2 seconds to execute. Since the purpose of "ldrb" is to load a value from memory into a register, it became clear that any operation involving the assignment of values to variables may have caused performance hits within the function. I therefore believe that if inline asm were used for value assignment, we would gain more explicit control over which registers hold which variables, which in turn should grant greater performance, since you would be able to prioritize which values should be accessed quickly. However, due to the number of local variables used in the function, figuring out which registers hold which variables may prove too challenging. In addition, using inline asm would also make the code significantly less portable.
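
As a hedged sketch of what that could look like (the function and variable names are invented, not taken from "WelsResidualBlockCavlc"), GCC's extended inline asm lets you state exactly which instruction performs the load while the compiler tracks the registers for you:

#include <cstdint>
#include <cstdio>

// Hypothetical example: perform a byte load ourselves with "ldrb" so we
// control exactly which instruction touches memory. Only meaningful on
// arm_64 with GCC/Clang; everywhere else we fall back to plain C++.
static inline uint8_t LoadCoefficientByte(const uint8_t* src) {
#if defined(__aarch64__) && defined(__GNUC__)
    uint64_t value;
    asm("ldrb %w0, [%1]"   // load one byte, zero-extended
        : "=r"(value)      // output: the compiler picks the register
        : "r"(src)         // input: pointer to the byte
        : "memory");       // stay safe around other memory accesses
    return static_cast<uint8_t>(value);
#else
    return *src;           // portable fallback keeps the code buildable
#endif
}

int main() {
    uint8_t buf[4] = {7, 8, 9, 10};
    std::printf("%u\n", LoadCoefficientByte(&buf[2])); // prints 9
}

The same mechanism also supports pinning a variable to a named register (for example, register uint64_t v asm("x9");), which is what the idea of prioritizing certain values would build on.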


Link to "WelsResidualBlockCavlc" function:


The easiest solution to these performance problems would be, in the end, to increase the compiler optimization level, as back in stage 1 there were noticeable, if not significant, improvements in execution time as the optimization levels got higher.
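
For instance (the file name here is just an illustration, not the project's actual build recipe), with GCC the optimization level is a single compile flag:

# baseline build, no optimization
g++ -O0 -c decoder_core.cpp

# let the compiler optimize aggressively
g++ -O3 -c decoder_core.cpp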

Stage 1 Results:
https://gist.github.com/Megawats777/6ce232ad0c186427a71ee1f0f52d8f7d



Reflecting on the process for stage 3 made me realize that performance issues may not come from the most obvious or exciting aspects of a program. Sometimes they come from something mundane like variable assignment or string usage. As for the solution, I learned that what resolves the problem may not be something exciting like overhauling your program's structure; instead, it may end up being a matter of adjusting how a couple of lines work or limiting how many characters you use in a string.


Thank you for reading this series and I'll see you on the other side.



