I'm working on an app that does a lot of asynchronous stream processing using Rx (Reactive Extensions) and IObservable<T>. We were discussing the other day whether to replace the implementations that only return a single data item via an Rx stream with the more standard TPL (Task Parallel Library) Task<T> approach.
We came to the conclusion we wouldn't make the switch, for a couple of reasons: firstly, code consistency, as we're using Rx everywhere for async, so why change; and secondly, possibly more importantly, because performance isn't an issue (at the moment).
Then I thought....
What's the performance difference between IObservable<T> and Task<T> for a single async invocation?
A simple console app should do: run IObservable<T> vs Task<T>, and the quickest wins. Reminds me of Celebrity Deathmatch...
Firstly we need something to test, calculating the first 100 primes:
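The original code listing didn't survive the move to this page. Judging by the CalcPrimes(M) call quoted in the comments below, the workload was presumably something like this simple trial-division version (a sketch, not the original):

```csharp
using System;
using System.Collections.Generic;

static class Primes
{
    // Returns the first 'count' primes by trial division against the
    // primes found so far.
    public static IList<int> CalcPrimes(int count)
    {
        var primes = new List<int>();
        for (int candidate = 2; primes.Count < count; candidate++)
        {
            bool isPrime = true;
            foreach (int p in primes)
            {
                if (p * p > candidate) break;          // no divisor can exist beyond sqrt
                if (candidate % p == 0) { isPrime = false; break; }
            }
            if (isPrime) primes.Add(candidate);
        }
        return primes;
    }
}
```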
All I need now is a couple of methods, one for each async implementation...
firstly IObservable<T>:
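The Rx listing is also missing. A sketch of what it likely resembled, using Observable.Start to run the computation and publish its single result (the method name CalcPrimesObservable, and the use of the CalcPrimes routine from above, are my assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq; // NuGet: System.Reactive (Rx)

static class RxVersion
{
    // Wraps the synchronous prime calculation in an IObservable<T> that
    // produces one value and then completes. Observable.Start schedules
    // the delegate on the default scheduler.
    public static IObservable<IList<int>> CalcPrimesObservable(int count)
    {
        return Observable.Start(() => Primes.CalcPrimes(count));
    }
}
```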
secondly Task<T>:
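And likewise for the TPL side, using Task.Run (on .NET 4 this would have been Task.Factory.StartNew); again the method name is an assumption:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

static class TplVersion
{
    // Schedules the same synchronous CalcPrimes call on the thread pool
    // and exposes the single result as a Task<T>.
    public static Task<IList<int>> CalcPrimesTask(int count)
    {
        return Task.Run(() => Primes.CalcPrimes(count));
    }
}
```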
All I need now is a test program:
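The test program is missing too. Going by the discussion in the comments (the observable was awaited while the task was blocked on with Wait()), it was presumably something along these lines, with Stopwatch doing the timing:

```csharp
using System;
using System.Diagnostics;
using System.Reactive.Linq; // provides the awaiter for IObservable<T>
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var sw = Stopwatch.StartNew();
        // Awaiting an IObservable<T> completes with its last (here, only) value.
        var viaRx = await RxVersion.CalcPrimesObservable(100);
        Console.WriteLine("IObservable<T>: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        var task = TplVersion.CalcPrimesTask(100);
        task.Wait(); // blocking wait, as in the original comparison
        Console.WriteLine("Task<T>:        {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Note that this reproduces the asymmetry called out in the comments: the Rx path goes through the async/await machinery while the Task path does a plain blocking wait.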
So which test method is quicker?
The answer is Task<T>; in fact, it's quicker by a factor of more than 10:
Even if I swap the order, Task<T> still outperforms:
I've given this a try with the latest Rx v2.0 RTM binaries and couldn't repro the issue when making a few tweaks:
1. The comparison above uses await in one case (Rx) but not in the other (which uses Task with Wait instead). This biases the comparison, because the async method machinery is involved in the former case but not in the latter. So, I've used await in both cases.
2. Using a bigger sample size for the number of primes computed, to get out of the noise of tens of milliseconds (which is close to the magical 15.6 ms anyway), and putting the whole thing in a loop to repeat and compute the average run time.
3. Having a warm-up phase, eliminating the skew due to mscorlib and other BCL assemblies already being loaded while Rx has to come in fresh. Also, the former assemblies are NGEN'd while Rx isn't, so giving both warm-up time eliminates JIT overheads.
When doing all of this, the results I'm seeing are vastly different:
- First iteration: Task wins by several tens of percent (but not by a factor of 10 as shown above).
- Subsequent iterations: IObservable and Task are in the same ballpark, give or take a percent on either side (mostly on the Rx side, which could be explained by our awaiter type being a class rather than a struct).
The code used boils down to:
sw.Start();
for (int i = 0; i < N; i++)
{
    // Run(...) stands for Task.Run or Observable.Start respectively
    await Run(() => CalcPrimes(M));
}
sw.Stop();
with N and M being sufficiently large. When M is small, Task may take advantage of the fast path of the await code more aggressively (then again, IObservable is optimized under the assumption that event streams are typically lengthy); Rx will do so as well, but the thresholds may be different.
In general, I'm very wary of micro-benchmarks like these. Set a performance goal for the bigger system, measure pieces of code with relevant latencies and compute sizes, and, if the goals aren't met, find the bottleneck using profilers such as the one in Visual Studio.
Bart, thanks for the detailed reply. I totally agree about micro-benchmarks, and yes, come to think of it, this was a micro-benchmark.
Keep up the good work, the 'team' must be very busy :)