Optimizing blocksize of data based on memory architecture

By: rwessel (rwessel.delete@this.yahoo.com), March 3, 2022 3:44 am
Room: Moderated Discussions
rocky (rocky.rwt.delete@this.gmail.com) on March 2, 2022 11:28 pm wrote:
> I have a stream of data that I wish to capture as a log and persist in memory. Messages (records)
> will be small. There will be filter/aggregation operations performed on the log, similar to
> a time-series database. Old messages should be wiped as new messages come in (FIFO).
>
> I could allocate a large block of memory and treat it as a circular queue. As new messages
> come in, they simply overwrite old messages. Conceptually, this is very simple.
>
> One alternative is to maintain a circular queue of pointers, each pointing to a distinct
> block of memory (N bytes). My thinking is that, to take advantage of important bus widths
> in memory hierarchies, the blocked approach could be significantly better. Such widths
> include: DRAM bus width, memory prefetch width, and cache line size.
>
> I'm sure this is a well-explored problem space, so I thought I'd check with the experts here. In my
> mind, the queue of pointers could allow for overlap between (a) fetching blocks from main memory into
> cache lines and (b) computation on the data now-loaded into the cache, but I also wouldn't be surprised
> to learn these types of problems are often just bottlenecked by main memory bandwidth.


If you're doing these filtering operations sequentially, then prefetching of that data into cache happens automatically on all modern large processors. Explicit prefetching can sometimes be useful as well, but it's almost never worth it for a straight-sequential access pattern. Automatic sequential prefetching is almost always very, very good, and sequential passes through memory tend to result in the highest possible bandwidth. (NUMA-ish systems present some exceptions to that, but let's assume you're on more common hardware.)
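To make that concrete, here's a minimal sketch of the flat circular-buffer layout in C++. Everything here is illustrative: the Record type, the filter predicate, and the head/count convention are assumptions, not anything from your actual system. The one real point is that the wrap is split into two strictly linear passes, so each pass is a plain sequential walk the hardware prefetcher can track.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {            // hypothetical fixed-size message
    uint64_t timestamp;
    uint64_t value;
};

// Scan the live window of a flat circular buffer, oldest to newest.
// Assumes head < buf.size() and count <= buf.size().
uint64_t scan(const std::vector<Record>& buf, size_t head, size_t count) {
    uint64_t hits = 0;
    const size_t first = std::min(count, buf.size() - head);
    for (size_t i = head; i < head + first; ++i)
        hits += (buf[i].value > 42);      // stand-in filter predicate
    for (size_t i = 0; i < count - first; ++i)   // wrapped portion
        hits += (buf[i].value > 42);
    return hits;
}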

More random memory accesses (your pointer-based approach) will almost always perform substantially worse, since the hardware can't prefetch across items. Explicit prefetching often does help in that case, precisely because the processor can't do it automatically.
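A sketch of that explicit prefetching, using the GCC/Clang __builtin_prefetch builtin and the hypothetical Record type from above. The lookahead distance is a tuning knob you'd have to measure; eight is just a plausible starting point.

#include <cstddef>
#include <cstdint>

// Walk a queue of pointers to scattered records. The hardware can't
// predict the next record's address, so we hint it ourselves a few
// items ahead of where we're currently working.
uint64_t scan_indirect(const Record* const* items, size_t count) {
    uint64_t hits = 0;
    const size_t lookahead = 8;           // tuning knob: measure this
    for (size_t i = 0; i < count; ++i) {
        if (i + lookahead < count)
            __builtin_prefetch(items[i + lookahead], 0 /*read*/, 1 /*low temporal locality*/);
        hits += (items[i]->value > 42);   // same stand-in filter
    }
    return hits;
}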

If you're doing this filtering on multiple cores, the situation becomes more complex. If individual items are small, they're likely to be quick to filter, so coordinating across CPUs carries considerable overhead. If the cores are all on the same die, you may still get considerable prefetching. If you can group the items in some way, you can set individual cores off processing clusters of them, and keep what each core does basically sequential. IOW, keep an auxiliary table of pointers to where large (perhaps megabyte-ish) clusters of items start and spin those clusters off to the different cores, or keep the circular queue as a list of large (again, megabyte-ish) blocks, each of which is processed sequentially. A sketch of the second variant follows.
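One way the block-list variant could look, assuming the Record type above and std::thread for the workers; the ~megabyte block size and the stride-based block assignment are illustrative choices, not requirements:

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Block {
    std::vector<Record> records;   // perhaps a megabyte's worth of records
};

uint64_t parallel_scan(const std::vector<Block>& blocks, unsigned nthreads) {
    std::vector<uint64_t> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&blocks, &partial, nthreads, t] {
            uint64_t local = 0;    // accumulate locally to avoid false sharing
            for (size_t b = t; b < blocks.size(); b += nthreads)
                for (const Record& r : blocks[b].records)  // sequential within a block
                    local += (r.value > 42);
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    uint64_t total = 0;
    for (uint64_t p : partial) total += p;
    return total;
}

Each thread touches whole blocks and walks them front to back, so the per-core access pattern stays sequential and the only cross-core coordination is the final join.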

So the slightly oversimplified answer is that if you want to blast through as much memory as possible, do it sequentially.

Also, if you're not measuring, you're almost certainly doing it wrong.
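If it helps get started, a trivial std::chrono harness is enough to compare the flat and pointer-based layouts on your actual data; run each variant many times and look at the distribution, not a single number:

#include <chrono>
#include <cstdio>

// Time one call; steady_clock is the appropriate clock for intervals.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage (the volatile sink keeps the compiler from eliding the scan):
//   volatile uint64_t sink;
//   printf("flat scan: %.3f ms\n",
//          time_ms([&] { sink = scan(buf, head, count); }));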