Surfing the Internet, I stumbled upon http://www.sciencemark.org where you
can download a benchmark program that (among other things) can benchmark
different x86 memcpy implementations. Running that benchmark on my machine
revealed that the fastest implementation was roughly twice as fast as the
"rep movsl" implementation (lib/string/i386/memcpy_asm.s) that ReactOS uses.
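
For reference, the "rep movsl" approach amounts to something like the
following 32-bit GCC inline assembly sketch; this is an illustration only,
not the exact code in lib/string/i386/memcpy_asm.s:

    #include <stddef.h>

    /* Illustrative "rep movsl"-style copy: move whole dwords first, then the
       remaining bytes.  Not the actual ReactOS routine. */
    void *memcpy_repmovs(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        size_t dwords = n / 4, bytes = n % 4;

        __asm__ __volatile__(
            "cld\n\t"
            "rep movsl\n\t"          /* copy n / 4 dwords */
            "movl %3, %%ecx\n\t"
            "rep movsb"              /* copy the remaining n % 4 bytes */
            : "+D"(dst), "+S"(src), "+c"(dwords)
            : "r"(bytes)
            : "memory", "cc");
        return ret;
    }
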
To test the alternate implementations in a ReactOS setting, I first
instrumented the existing memcpy implementation to log the arguments it was
being called with. I then booted ReactOS, started a background compile in it
(to generate some I/O) and played a game of Solitaire (to generate graphics
operations). After losing the game, I shut down ReactOS. I then extracted
the memcpy calls made roughly between the start of Explorer (to get rid of
one-time startup effects) and shutdown. The resulting call profile is
attached below.
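
The instrumentation itself is not part of this note; conceptually it is just
a wrapper that records the size of every call, along these lines (a sketch,
with the log destination (DbgPrint, a file, whatever is handy) left open):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch of the instrumentation: record the size of every memcpy call so
       the sizes can later be replayed against different implementations. */
    void *memcpy_logged(void *dst, const void *src, size_t n)
    {
        fprintf(stderr, "memcpy %lu\n", (unsigned long)n);  /* size is all we need */
        return memcpy(dst, src, n);
    }
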
I then used that profile to make calls to the existing memcpy and an
alternate implementation (I selected the "MMX register copy with SSE
prefetching"), taking care to use different source and destination regions
to remove caching effects. The profile consisted of roughly 250000 calls to
memcpy; I found that I had to execute the profile 10000 times to get
"reasonable" time values.
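
The test program itself is not included either; it does little more than
replay the recorded sizes against a given implementation and time the whole
run, roughly like this (a sketch with made-up names; clock() stands in for
whatever timer is convenient):

    #include <stddef.h>
    #include <time.h>

    #define RUNS 10000          /* execute the whole profile this many times */

    /* 'profile' holds the recorded call sizes, 'copy' is the implementation
       under test.  'dst' and 'src' must each be at least as large as the
       biggest recorded size. */
    double time_profile(void *(*copy)(void *, const void *, size_t),
                        char *dst, const char *src,
                        const size_t *profile, size_t profile_len)
    {
        clock_t start = clock();

        for (int run = 0; run < RUNS; run++)
            for (size_t i = 0; i < profile_len; i++)
                copy(dst, src, profile[i]);

        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }
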
To compensate for the overhead of the test program, I also ran a test where
the whole memcpy routine consisted of a single instruction: "ret". The test
results, after applying a correction for the overhead:

rep movsl        70.5 sec
mmx registers    58.3 sec
Speed increase:  17%

(Test machine: AMD Athlon MP 2800+ running Linux).

Although the relative speed increase is nice (17%), we also have to look at
the absolute speed increase. Remember that the 70.5 sec for the "rep movsl"
case was obtained by running the whole profile 10000 times. This means that
all the memcpy calls executed during the profiling run of ReactOS together
took only 0.00705 seconds. So the conclusion has to be that we're simply not
spending a significant amount of time in memcpy (by the way, our memcpy
implementation is shared between kernel and user mode; of the total of
250000 memcpy calls, about 90% were made from kernel mode and 10% from user
mode), so optimizing memcpy (although possible) will not result in
significantly better performance of ReactOS as a whole.

Just for fun, I then used only the part of the profile where the memory area
was larger than 128 bytes. The MMX implementation actually only runs for
sizes over 128 bytes; for smaller sizes it defers to the "rep movsl"
implementation. According to the profile, the vast majority of memcpy calls
are made with a size smaller than 128 bytes (96.8%).

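That size-based dispatch amounts to something like this (again a sketch;
memcpy_repmovs and mmx_copy_large are illustrative names, not the actual
Science Mark routines):

    #include <stddef.h>

    void *memcpy_repmovs(void *dst, const void *src, size_t n);  /* plain "rep movsl" copy  */
    void *mmx_copy_large(void *dst, const void *src, size_t n);  /* MMX regs + SSE prefetch */

    /* Small copies stay on the "rep movsl" path; the 128-byte threshold comes
       from the description above, the rest is illustrative. */
    void *memcpy_alt(void *dst, const void *src, size_t n)
    {
        if (n < 128)
            return memcpy_repmovs(dst, src, n);
        return mmx_copy_large(dst, src, n);
    }
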
rep movsl        52.9 sec
mmx registers    27.1 sec
Speed increase:  48%

This is more or less in line with the results I got from the membench
benchmark from http://www.sciencemark.org.

Final conclusion: Although optimizing memcpy is useful (and feasible) for
transfer of large blocks, the usage pattern in ReactOS consists mostly of
small blocks. The resulting absolute speed increase doesn't justify the
increased code complexity.

2005/12/03 GvG