By: Mark (nospamplease.delete@this.nothere.net), March 3, 2022 10:17 am
Room: Moderated Discussions
rocky (rocky.rwt.delete@this.gmail.com) on March 2, 2022 11:28 pm wrote:
> I have a stream of data that I wish to capture as a log and persist in memory. Messages (records)
> will be small. There will be filter/aggregation operations performed on the log, similar to
> a time-series database. Old messages should be wiped as new messages come in (FIFO).
>
> I could allocate a large block of memory and treat it as a circular queue. As new messages
> come in, they simply overwrite old messages. Conceptually, this is very simple.
>
> One alternative is to maintain a circular queue of pointers, each pointing to a distinct
> block of memory (N bytes). My thinking is that, to take advantage of important bus widths
> in memory hierarchies, the blocked approach could be significantly better. Such widths
> include: DRAM bus width, memory prefetch width, and cache line size.
>
> I'm sure this is a well-explored problem space, so I thought I'd check with the experts here. In my
> mind, the queue of pointers could allow for overlap between (a) fetching blocks from main memory into
> cache lines and (b) computation on the data now-loaded into the cache, but I also wouldn't be surprised
> to learn these types of problems are often just bottlenecked by main memory bandwidth.
You must have some very high performance goals.
The plain circular buffer can run at memory bandwidth speed. Code has to be quite tight to be limited by sequential memory bandwidth; just look at how much effort goes into memcpy(). If your code is as simple as reading or writing one 4-byte int at a time, the speed limit will be set by CPU cycles, not memory. Scanning the buffer may hit bandwidth limits sooner, since your code will likely look at a few words and move on to the next record.
Memory optimization may be irrelevant, with the code that writes or consumes the records setting the pace.
The table-of-pointers approach will be slower unless it is more like an index, containing the pointer plus just enough extra information to pick out relevant records in a search. While the CPU's out-of-order machinery can allow multiple iterations of the indirection and the per-record work to run in parallel, this path is heavy compared to simple linear prefetching: it will not achieve as much memory-level parallelism, and it takes a lot of main-memory reads in flight to approach bandwidth limits.
Topic | Posted By | Date |
---|---|---|
Optimizing blocksize of data based on memory architecture | rocky | 2022/03/03 12:28 AM |
Optimizing blocksize of data based on memory architecture | rwessel | 2022/03/03 03:44 AM |
Optimizing blocksize of data based on memory architecture | rocky | 2022/03/03 01:09 PM |
Optimizing blocksize of data based on memory architecture | Jörn Engel | 2022/03/03 03:28 PM |
Optimizing blocksize of data based on memory architecture | Brendan | 2022/03/03 05:54 PM |
Optimizing blocksize of data based on memory architecture | Mark | 2022/03/03 10:17 AM |
Optimizing blocksize of data based on memory architecture | rocky | 2022/03/03 01:04 PM |
Optimizing blocksize of data based on memory architecture | Mark | 2022/03/03 02:21 PM |
What data rate? | Mark Roulo | 2022/03/03 12:23 PM |
What data rate? | rocky | 2022/03/03 12:54 PM |
What data rate? | rwessel | 2022/03/04 01:26 PM |
Optimizing blocksize of data based on memory architecture | Anon | 2022/03/04 01:59 PM |