By: rocky (rocky.rwt.delete.delete@this.this.gmail.com), March 3, 2022 1:09 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on March 3, 2022 2:44 am wrote:
> rocky (rocky.rwt.delete@this.gmail.com) on March 2, 2022 11:28 pm wrote:
>
> If you're doing these filtering operations sequentially, then prefetching of that data into cache happens
> on all modern large processors. It's almost never worth it for a straight-sequential access pattern, but
> explicit prefetching can sometimes be useful as well. Automatic sequential prefetching is almost always
> very, very good, and sequential passes through memory tend to result in the highest possible bandwidth (NUMA-ish
> systems present some exceptions to that, but let's assume you're on more common hardware)
Everything will be very sequential, so point taken: I should try to leverage that sequentiality as much as possible.
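To make sure I'm picturing it right, the filtering pass would be a plain stride-1 loop like the sketch below (the `Sample` layout and the threshold predicate are just hypothetical stand-ins for my actual filter), and the hardware prefetcher handles the rest:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sample type: a timestamped reading.
struct Sample { int64_t ts; double value; };

// One straight-sequential pass over the data. The stride-1 access pattern
// is what the automatic sequential prefetcher recognizes, so no explicit
// prefetch hints should be needed here.
std::vector<Sample> filter_sequential(const std::vector<Sample>& in,
                                      double threshold) {
    std::vector<Sample> out;
    out.reserve(in.size());  // avoid reallocation stalls mid-pass
    for (const Sample& s : in)
        if (s.value >= threshold)
            out.push_back(s);
    return out;
}
```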
> If you're doing this filtering on multiple cores the situation becomes more complex, although if individual
> items are small they're likely going to be quick to filter, and so coordination across CPUs is going
> to have considerable overhead.
Since it's time-series data with no stateful calculations involved (e.g. convolution), the work is embarrassingly parallel and divides nicely among multiple cores. If all the data is in memory, I can split it into working sets and assign one to each core in a machine.
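Concretely, I'm imagining something like this sketch (a sum stands in for my real filter; all names are illustrative): one contiguous chunk per thread, so each worker's pass stays stride-1 and prefetcher-friendly, and workers never write into each other's ranges.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Partition N samples into one contiguous chunk per core. Each worker
// scans its chunk sequentially; in real code the per-thread accumulators
// should be padded to cache-line size to avoid false sharing.
double parallel_sum(const std::vector<double>& data, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    const size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const size_t lo = std::min(data.size(), t * chunk);
            const size_t hi = std::min(data.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)  // sequential within the chunk
                partial[t] += data[i];
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```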
> If these cores are all on the same die, you may still get considerable
> prefetching. If you can group the items in some way you can set individual cores off processing clusters
> of these items, and leave what each core does basically sequential. IOW, keep an auxiliary table
> of pointers to where clusters of large (perhaps megabyte-ish) of items start, and then spin off those
> clusters to the different cores, or keep the circular queue as a collection of large (again megabyte-ish)
> blocks in a list, each of which is processed sequentially.
Tracking offsets into a circular queue, right?
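So roughly this shape, if I've read you correctly: the ring is a list of large fixed-size blocks plus an auxiliary table of per-block start offsets, and a core claims a whole block and scans it sequentially. (The 1 MiB block size and all names here are hypothetical, just following the "megabyte-ish" suggestion.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Megabyte-ish block granularity keeps cross-core coordination rare:
// cores hand off whole blocks, not individual items.
constexpr size_t kBlockBytes = 1 << 20;

struct BlockRing {
    std::vector<std::vector<uint8_t>> blocks;  // ring storage, one big block each
    std::vector<size_t> start_offset;          // aux table: valid-data start per block
    size_t head = 0;                           // next block index to hand out

    explicit BlockRing(size_t nblocks)
        : blocks(nblocks, std::vector<uint8_t>(kBlockBytes)),
          start_offset(nblocks, 0) {}

    // Hand the next whole block to a worker; the worker then processes it
    // purely sequentially from its recorded start offset.
    size_t claim_next() {
        const size_t b = head;
        head = (head + 1) % blocks.size();  // wrap around: it's a circular queue
        return b;
    }
};
```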
> So the slightly oversimplified answer is that if you want to blast
> through as much memory as possible, do it sequentially.
>
> Also, if you're not measuring, you're almost certainly doing it wrong.
ack, ack