Wednesday, April 8, 2020

Final Project - Stage 2 Report

This post covers my profiling results for stage 2 of the final project. I strongly recommend reading the report from the first stage beforehand, since it provides the context this post relies on.

The task for stage 2 was to profile the use case created in stage 1 and identify the performance issues that arise when running it. The profiling tool I used was "perf", which profiles by sampling. The machines tested were the same ones used for stage 1, a mix of x86_64 and AArch64 (64-bit ARM) computers. I initially wanted to use "gprof" in addition to "perf", but even though I set the proper compiler parameters in "make", I was still not able to get the files that "gprof" requires to be generated.

The testing procedure was to run the use case from stage 1 again, this time with "perf" running in the background. I did this twice on every machine to check whether the results were consistent.
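The procedure above could be scripted roughly as follows. This is a minimal sketch under assumptions, not the exact commands used for this report: the workload command "./decoder input.bin" and the "perf-runN.data" file names are hypothetical placeholders, and the sketch launches the workload directly under "perf record" rather than attaching to it in the background. It relies only on the "-o" option of "perf record" and the "-i" and "--stdio" options of "perf report".

```python
import subprocess


def perf_record_cmd(workload, data_file):
    """Build a `perf record` command that samples `workload` into `data_file`."""
    return ["perf", "record", "-o", data_file, "--", *workload]


def perf_report_cmd(data_file):
    """Build a `perf report` command that prints the profile as plain text."""
    return ["perf", "report", "-i", data_file, "--stdio"]


def profile_runs(workload, runs=2, execute=False):
    """Return the commands for `runs` profiling passes; run them if `execute` is True."""
    commands = []
    for i in range(1, runs + 1):
        data_file = f"perf-run{i}.data"  # hypothetical file name
        for cmd in (perf_record_cmd(workload, data_file), perf_report_cmd(data_file)):
            commands.append(cmd)
            if execute:
                subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical workload; substitute the actual stage 1 use case command.
    for cmd in profile_runs(["./decoder", "input.bin"]):
        print(" ".join(cmd))
```

By default the script only prints the commands, so they can be checked before anything is profiled. Passing execute=True runs them on a machine where "perf" is installed.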



Profile Results:

Aarchie: https://gist.github.com/Megawats777/40a3dd389c2fa9e9b9ad5494762dcecd

Bbetty: https://gist.github.com/Megawats777/46bf0beba2a85dc420456f3f632f0746

FedoraVM: https://gist.github.com/Megawats777/171d81b7e9a907002d3418b36a446d71

Xerxes: https://gist.github.com/Megawats777/ca297149ef90c72621adb113a439966c


Several things stood out to me after reading over the profiling results. First, any operation that moves values to and from memory caused large spikes in CPU time, and this was common to all machines tested. For example, on Xerxes around 36 seconds were spent on the "movsq" operation.

Second, when comparing Xerxes and FedoraVM, the two machines had different sets of functions deemed expensive. On Xerxes the most expensive functions were "DecodeCurrentAccessUnit" and "WelsResidualBlockCavlc", while on FedoraVM the main costs came from "__memset_sse2_unaligned_erms" and "DecodeCurrentAccessUnit". My reading of this difference is that FedoraVM struggles more with memory movement operations than Xerxes does, since the main costs on Xerxes come from functions that seem more closely related to the decoding process itself.

Finally, both AArch64 machines, Aarchie and Bbetty, had their main costs come from the functions "__memcpy_generic" and "__GI_memset_generic". This leads me to believe that these machines also spend far more time on memory movement operations than on the decoding process.
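One way to make this comparison less of an eyeball exercise is to add up the overhead that "perf report --stdio" attributes to the memory routines named above. The following is a rough sketch, assuming the usual four-column text layout (overhead, command, shared object, symbol). The percentages in SAMPLE are made-up placeholders for illustration, not values taken from the gists, and the "decoder" command name is hypothetical.

```python
import re

# Symbols named in this post that appear to be memory copy/fill routines.
MEMORY_SYMBOLS = {
    "__memset_sse2_unaligned_erms",
    "__memcpy_generic",
    "__GI_memset_generic",
}

# Matches rows such as: "  12.34%  cmd  lib.so  [.] symbol_name"
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(.+?)\s*$")


def parse_report(text):
    """Return (overhead_percent, symbol) pairs from `perf report --stdio` text."""
    rows = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip header and comment lines
        match = ROW.match(line)
        if match:
            rows.append((float(match.group(1)), match.group(5)))
    return rows


def memory_share(rows):
    """Sum the overhead of rows whose symbol is listed in MEMORY_SYMBOLS."""
    return sum(pct for pct, sym in rows if sym in MEMORY_SYMBOLS)


# Placeholder data for illustration only; NOT the measured results.
SAMPLE = """\
# Overhead  Command  Shared Object  Symbol
    40.00%  decoder  libc.so.6      [.] __memcpy_generic
    25.00%  decoder  libc.so.6      [.] __GI_memset_generic
    10.00%  decoder  decoder        [.] DecodeCurrentAccessUnit
"""

if __name__ == "__main__":
    rows = parse_report(SAMPLE)
    print(f"memory routines: {memory_share(rows):.2f}% of samples")
```

Running the same function over the saved report from each machine would give one number per machine to compare, although the set of symbols treated as memory routines is a judgment call and would need extending for other C libraries.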


In summary, these tests taught me the value of profiling: it tells you not only whether something is slow but also how it is slow. After these profiling tests, it is clear that the main bottleneck that needs to be addressed is how memory is moved around.


Thank you for reading and I'll see you in stage 3. 
