Wednesday, April 15, 2020

Final Project - Stage 3 Report

This is a report for stage 3 of the final SPO600 project. Before reading this post it is vital that you read the reports for stage 1 and stage 2. The task for stage 3 is to take all the information gathered in the first two stages and use it to come up with ways to make the software we selected faster.

In stage 2 it was discovered that operations involving the movement of data to and from memory caused the biggest performance hits. After analyzing the code in the most expensive functions, it was clear that the x86_64 and arm_64 architectures faced different kinds of memory access performance issues.

On the x86_64 front, one common pattern was that both x86_64 machines spent the most time in the function "DecodeCurrentAccessUnit". Looking deeper into that function, the operation that caused the most issues was "rep" in conjunction with "movsq". Not really knowing what "rep" does, I did some research into it: it is a prefix that repeats the instruction attached to it, and "rep movsq" is typically used for copying blocks of data, such as the contents of strings. Considering that "DecodeCurrentAccessUnit" performs quite a few string operations for logging, I'm led to believe that the string operations are causing most of the performance issues. I therefore suggest that, in order to improve performance, string operations should be reduced, or where they are needed, the number of characters used should be kept as small as possible. I believe this optimization would work because if the "rep" operation is causing performance issues, then having it executed less often should minimize the damage from that operation.
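To make that suggestion concrete, here is a minimal sketch of the idea, assuming a GCC/C++ setup; it is not actual OpenH264 code, and log_level, LOG_DETAIL, emit_log and report_frame_status are made-up names used only for illustration. The point is simply to avoid building log strings unless they will actually be emitted, and to keep the buffers that do get copied as small as possible.

    #include <cstdio>

    // Hypothetical sketch, not OpenH264 code: skip building log strings unless
    // detailed logging is actually enabled, so the copying behind "rep movsq"
    // style instructions happens less often.
    enum LogLevel { LOG_ERROR = 0, LOG_INFO = 1, LOG_DETAIL = 2 };

    static LogLevel log_level = LOG_ERROR;

    static void emit_log(const char* msg) { std::fputs(msg, stderr); }

    void report_frame_status(int frame_index, int nal_count)
    {
        // Only pay for snprintf (and the data movement it implies) when the
        // message will actually be shown; a small buffer keeps the copy short.
        if (log_level >= LOG_DETAIL) {
            char buffer[64];
            std::snprintf(buffer, sizeof(buffer),
                          "frame %d: %d NAL units\n", frame_index, nal_count);
            emit_log(buffer);
        }
    }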

x86_64: Resources researched:



Link to "DecodeCurrentAccessUnit" function:



In regards to the arm_64 architecture, the function that both the Aarchie and Bbetty systems had trouble with was "WelsResidualBlockCavlc". Within that function, any operation that involved "ldr" took at least 2 seconds to execute. Since the purpose of "ldr" (and its byte variant "ldrb") is to load values from memory into registers, it became clear that any operation involving the assignment of values to variables may have caused performance hits within the function. Because of this, I believe that if inline asm were used for value assignment, it would give more explicit control over which registers hold which variables, which in turn should grant greater performance, since you would be able to prioritize which values should be accessed quickly. However, due to the number of local variables used in the function, figuring out which registers hold which variables may prove too challenging. In addition, using inline asm would also make the code significantly less portable.
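As a rough illustration of what that could look like, here is a small sketch using GCC extended inline assembly on AArch64; it is not the actual WelsResidualBlockCavlc code, and load_coeff, residual and coeff_index are made-up names. Whether doing the load by hand actually beats what the compiler already generates would have to be measured, which ties back to the complexity and portability concerns above.

    #include <cstdint>

    // Sketch only: perform a 16-bit load ourselves instead of leaving it
    // entirely to the compiler. The "=r" constraint lets GCC pick the output
    // register; it could be tightened further to name a specific register.
    static inline int16_t load_coeff(const int16_t* residual, int coeff_index)
    {
        int16_t value;
        asm volatile("ldrh %w0, [%1]"               // ldrh: load one halfword
                     : "=r"(value)                  // output: the loaded value
                     : "r"(residual + coeff_index)  // input: address to load from
                     : "memory");                   // tell GCC we touch memory
        return value;
    }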


Link to "WelsResidualBlockCavlc" function:


In the end, the easiest solution to these performance problems would be to increase the compiler optimization level, as back in stage 1 there were noticeable, if not significant, improvements in execution time as the optimization level got higher.
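For reference, if rebuilding the whole project at a higher level were ever impractical, GCC also allows raising the level for a single hot function through its optimize attribute. The sketch below assumes GCC, and hot_copy_loop is a placeholder name rather than an OpenH264 symbol.

    // Sketch: ask GCC to compile just this one function at -O3, leaving the
    // rest of the build at its existing optimization level.
    __attribute__((optimize("O3")))
    void hot_copy_loop(const unsigned char* src, unsigned char* dst, int n)
    {
        for (int i = 0; i < n; ++i) {
            dst[i] = src[i];  // stands in for the real per-sample work
        }
    }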

Stage 1 Results:
https://gist.github.com/Megawats777/6ce232ad0c186427a71ee1f0f52d8f7d



Reflecting on the process for stage 3 made me realize that performance issues may not come from the most obvious or exciting aspects of a program. Sometimes they come from something mundane like variable assignment or string usage. As for the solution, I learned that what resolves the problem may not be something exciting like overhauling your structure; instead, it may end up being adjusting how a couple of lines work or limiting how many characters you use in a string.


Thank you for reading this series and I'll see you on the other side.




Wednesday, April 8, 2020

Final Project - Stage 2 Report

This post is about my profiling results for stage 2 of the final project. It is strongly recommended that you read the report from the first stage here in order to gain the context required for this post.

The task for stage 2 was to profile the software and identify the performance issues that arise when performing the use case created in stage 1. The profiling tool I used was "perf", which follows the profiling technique of sampling, and the machines tested were the same ones used for stage 1, a mix of x86_64 and arm_64 computers. In regards to the profiling tool, I initially wanted to use "gprof" in addition to "perf", but even though I set the proper compiler parameters in "make", I was still not able to get the files required by "gprof" to be generated.

The testing procedure was to run the use case from stage 1, but this time with "perf" running in the background. This was done twice on every machine tested in order to check the consistency of the results.



Profile Results:

Aarchie: https://gist.github.com/Megawats777/40a3dd389c2fa9e9b9ad5494762dcecd

Bbetty: https://gist.github.com/Megawats777/46bf0beba2a85dc420456f3f632f0746

FedoraVM: https://gist.github.com/Megawats777/171d81b7e9a907002d3418b36a446d71

Xerxes: https://gist.github.com/Megawats777/ca297149ef90c72621adb113a439966c


Several things stood out to me after reading over the profiling results. First, any operation involving the movement of values to and from memory caused large spikes in CPU time, and this was common on all machines tested. For example, on the Xerxes machine around 36 seconds were spent on the "movsq" operation. Another thing that stood out when comparing the Xerxes and FedoraVM machines was that each had a different set of functions deemed expensive. On Xerxes the most expensive functions were "DecodeCurrentAccessUnit" and "WelsResidualBlockCavlc", while on FedoraVM the main costs came from "__memset_sse2_unaligned_erms" and "DecodeCurrentAccessUnit". The main idea I'm getting from this difference is that FedoraVM suffers more heavily from memory movement operations than Xerxes does, since on Xerxes the main costs come from functions that seem more related to the decoding process itself. Finally, one result that stood out was that both arm_64 machines, Aarchie and Bbetty, had their main costs come from the functions "__memcpy_generic" and "__GI_memset_generic". This leads me to believe that these arm_64 machines also suffer greatly from memory movement operations compared to the decoding process.


In summary, after doing these tests I learned the value of profiling, as it tells you not only whether something is slow but also how it is slow. And after these profiling tests it's clear that the main bottleneck that needs to be addressed is how memory is moved around.


Thank you for reading and I'll see you in stage 3.