Montavista


Cache: the key to multicore performance

October 10th, 2008

If you attended my multicore webinar last month, then you know that I’m a big fan of looking out the window at existing open source applications and seeing how they tackle performance issues. The webinar used the Apache HTTP server as an example. A recent Intel engineering study uses a modified version of Snort as a test application for improving multicore performance.

The Intel study, on which Lori Matassa has commented extensively, reported performance scaling of 6.2x on a 4-core system. This is notable because the speedup generally expected from each added core is slightly under 1x. For example, moving an application to a 4-core system and getting a 3.8x performance boost would be in line with expectations, and would likely take some work to attain.
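
To put those two figures side by side, here is a quick back-of-the-envelope sketch. The helper name scaling_efficiency is my own and does not come from the Intel study; it simply divides the observed speedup by the core count, so 1.0 would mean perfectly linear scaling.

```python
def scaling_efficiency(speedup: float, cores: int) -> float:
    """Speedup delivered per core; 1.0 means perfectly linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# The two cases discussed above, both on a 4-core system.
typical = scaling_efficiency(3.8, 4)   # the "expected" result
reported = scaling_efficiency(6.2, 4)  # the figure reported in the study

print(f"typical:  {typical:.2f} per core")
print(f"reported: {reported:.2f} per core")
```

The expected case works out to roughly 0.95 per core and the reported case to roughly 1.55 per core. Anything above 1.0 is superlinear, so the extra has to come from something other than raw core count.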

The Intel engineering study is notable because it indicates something that experienced practitioners have always known: efficient cache usage is a critical factor in multicore performance. It simply makes no sense to play games elsewhere until you’ve got a grasp on cache efficiency and have maximized that aspect of your system performance.

The Intel engineering study did something interesting with “flow pinning”. Each TCP flow through the system was handled, for the lifetime of the flow, by a single assigned core. This improves cache efficiency by optimizing locality of reference.
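
The idea can be sketched in a few lines. This is a minimal illustration of the general technique, not the code from the Intel study: the function name core_for_flow, the use of CRC32 as the hash, and the sample addresses are all assumptions of mine. It maps a flow's endpoints to a core index so that every packet of the same flow, in either direction, is handled by the same core.

```python
import zlib


def core_for_flow(src_ip: str, src_port: int,
                  dst_ip: str, dst_port: int,
                  num_cores: int) -> int:
    """Map a TCP flow to a core index in the range [0, num_cores)."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    # Sort the two endpoints so that both directions of one connection
    # produce the same key and therefore land on the same core.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode("ascii")
    # CRC32 is used only as a cheap, stable hash (an assumption; any
    # deterministic hash with a reasonable spread would do).
    return zlib.crc32(key) % num_cores


NUM_CORES = 4  # hypothetical core count, matching the 4-core example

# Hypothetical client and server addresses, one call per direction.
client_to_server = core_for_flow("10.0.0.5", 40312, "192.168.1.9", 80, NUM_CORES)
server_to_client = core_for_flow("192.168.1.9", 80, "10.0.0.5", 40312, NUM_CORES)

# Both directions of the flow are assigned to the same core.
assert client_to_server == server_to_client
```

Because the mapping depends only on the flow's endpoints, the per-flow state stays with one core for the lifetime of the flow instead of bouncing between caches. In a real system the dispatch would probably sit much closer to the network interface, and each worker would itself be bound to its core (on Linux, for example, with os.sched_setaffinity).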

The Intel paper also prompts some thoughts regarding those who are migrating RTOS applications from their dead-end platform to the new funky Linux world. A vogue thought these days is that a virtualization platform can be used to just run your RTOS side-by-side with the new Linux platform. My concern is that taking an RTOS-based application that is not multicore-aware and just moving it to a new multicore processor implies, at best, no gain in cache efficiency and, at worst, a decrease. The RTOS-based application never had multiple cores and hence has no awareness of, or ability to do, flow pinning as discussed in the paper.

Potentially a better approach is to migrate the RTOS application’s algorithms to a consistent Linux platform and then do the cache optimization work once you’ve eliminated other variables and have design flexibility. If you want to learn more about RTOS-to-Linux migration, we’ve got an upcoming webinar on that, too.

Take a read… the paper is compelling. A 6.2x performance boost on 4 cores is impressive.

2 Responses to “Cache: the key to multicore performance”

  1. Jakob Engblom Says:

    A small note on using virtualization to move RTOSes onto a multicore: often, that virtualization layer will put each RTOS on its own core in the set, essentially pinning it to that core. This is really necessary, as trying to run the RTOS on multiple different cores over its lifetime (or even worse, on a shared-memory, truly concurrent SMP abstraction) is almost guaranteed to break a bunch of assumptions.

    So it is hard to see how it could hurt to put a legacy stack onto a single core, and then have Linux or another OS share the other cores. From an engineering standpoint it is a very efficient approach, and also one that retains the key value in terms of latency, robustness, and similar traits that make traditional RTOSes technically better at many things than a more general-purpose, best-effort OS like Linux.

    I think a common pattern for using a four-core or eight-core multicore device is to gather software stacks that used to run on four or eight physically separate processors onto a single die. This means a much reduced bill of materials and package size, while still not requiring a total rewrite of the software.

    A completely homogeneous OS across all cores on a multicore is going to be pretty rare, I think, as it seems to make more sense to run multiple OS instances, some specialized for control, some for line processing, and some for operations and maintenance.

    /jakob

  2. Brad Dixon Says:

    Jakob:

    Great response… thank you for reading.

    I think that your response is in essence an echo of my original post: that keeping an eye on efficient cache usage is paramount. [Again, hardly an original or earth-shattering thought.] Pinning an RTOS to a specific core does help to use the cache efficiently. In reply to you… no, it doesn’t hurt to put a legacy stack on a single core. Potentially it doesn’t help as much as it could, however.

    The Intel study was interesting because it indicated that, for this particular workload, one could utilize the cache even more efficiently by segregating the data, not the algorithm, to a single assigned core/cache unit. I hear so much about pinning an OS of whatever sort to a core that reading about a different approach was refreshing and notable.

    Your use case is absolutely en vogue. We have customers who follow that approach frequently, and I presume they do so because it has benefits.

    I do wonder, though, as the core count scales upward over the next decade, when we will reach the point where we’ve run out of boxes and algorithms to consolidate onto a single multicore (eventually many-core) device. Will there be a second multicore transition hurdle as developers finally break down their co-resident code silos and create actual multicore algorithms? I presume, like all things, it will happen piece by piece.


©2010 MontaVista Software, LLC. All Rights Reserved