Keep the data in faster memory and use an appropriate access pattern.
architectures have different types of memory and these have different access
costs. Registers, caches, and scratchpads are cheaper to access than local
memory, but have smaller capacity. When data is loaded into a register, cache
line, or memory page, use an access pattern that will use all the data before
moving to the next chunk. When memory is banked, use a stride that avoids all
the compute cores trying to access the same memory bank simultaneously.