Von Neumann Architecture
Memory contents are fetched into various kinds of buffers to speed up memory access by the CPU.
Registers in a processor
DRAM Memory
- Each cell stores a bit as charge in a capacitor.
- Capacitors discharge over time, hence they need to be continuously powered and recharged (refreshed).
- Capacitors take time to charge and discharge, which creates a bottleneck for the CPU.
Evolution of DRAM types over time:
- SDRAM: Synchronous DRAM
- RDRAM: Rambus (company)
- DDR DRAM: Double data rate - data is transferred on both the rising and falling edges of the clock.
DIMM chips: Dual inline memory modules
There are two ranks, with 8 chips each; one rank is on each side of the module. An access activates all 8 chips of a rank at once.
The DIMM has a data bus and a separate command/address bus shared by the ranks; another signal (the chip select) chooses which rank is active.
This is efficient because each DRAM chip then needs only 8 data pins, contributing 8 of the 64 data bits. This reduces chip size and pin count (pins are costly).
However, it activates all 8 DRAM chips of a rank even when we don't need them all, so it is power inefficient.
Each DRAM chip has 8 banks, and each bank has 8 arrays. Each array holds 64 bits, and the data is laid out so that each of the 8 arrays stores one bit of any given byte.
Each array has a RAS (row address strobe) and a CAS (column address strobe). There are 8 rows and 8 columns per array. Each row holds 8 bits, but not from the same byte; the arrays work in parallel.
The RAS selects a row and the CAS selects a column to fetch one bit. The 8 bits, one from each array, combine to form a byte. A bank therefore stores 64 bytes, spread across its arrays in parallel.
In each DRAM chip, only one bank is active at a time. The bank index is the same across all the chips.
Each bank sends out a byte (we'll talk about bursts later).
It takes 9 bits to address a single bit in this layout: 3 for the array (8 per bank), 3 for the row, and 3 for the column.
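A minimal C sketch of splitting such a 9-bit address into its three fields; the field order (array, then row, then column, from most to least significant) is an assumption made for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t addr = 0x155;                 /* any 9-bit value               */
    unsigned array = (addr >> 6) & 0x7;    /* 3 bits: which of the 8 arrays */
    unsigned row   = (addr >> 3) & 0x7;    /* 3 bits: row within the array  */
    unsigned col   =  addr       & 0x7;    /* 3 bits: column within the row */
    printf("array=%u row=%u col=%u\n", array, row, col);
    return 0;
}
```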
Steps for accessing a bit:
- Row address sent through RAS. Activates the entire row.
- The charge from the entire row is captured by the sense amplifiers.
- The CAS selects the required bit; the row's charge is then driven back from the sense amplifiers to recharge the capacitors (reading discharged them).
Step 1 is the slowest. To improve that:
Read multiple columns out of the sense amplifiers (after step 2) into a buffer in one go. This is called a burst.
Burst sizes:
- DDR2: 4 bytes
- DDR3: 8 bytes
- DDR4: 8 bytes (DDR5: 16 bytes)
The row buffer (which sits below the bank) still sends out one byte per cycle, so a burst of 8 bytes takes 8 cycles per chip. Each DRAM chip uses one bank at a time, and there are 8 DRAM chips, so for a burst of 8 bytes per chip the total data sent in one burst is 64 bytes.
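Sanity check on the arithmetic: \(8 \text{ chips} \times 8 \text{ bytes per chip per burst} = 64\) bytes per burst, which conveniently matches a 64-byte cache line.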
To improve throughput further, we do bank interleaving:
|        | T0       | T1         | T2         | T3         | T4         | T5       |
|--------|----------|------------|------------|------------|------------|----------|
| Bank 0 | read req | read burst | recovery   |            |            |          |
| Bank 1 |          | read req   | read burst | recovery   |            |          |
| Bank 2 |          |            | read req   | read burst | recovery   |          |
| Bank 3 |          |            |            | read req   | read burst | recovery |
Each row is a separate bank; staggering the requests across banks keeps the data bus saturated.
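A small C sketch that prints this schedule, assuming (hypothetically) one cycle each for the read request, the read burst, and the recovery:

```c
#include <stdio.h>

/* Toy model of bank interleaving: issuing requests to consecutive banks one
 * cycle apart keeps the data bus delivering a burst on every cycle once the
 * pipeline is full. */
int main(void) {
    const int banks = 4, phases = 3;
    const char *phase[] = { "read req", "read burst", "recovery" };

    for (int b = 0; b < banks; b++) {
        printf("bank %d:", b);
        for (int t = 0; t < banks + phases - 1; t++) {
            int p = t - b;                      /* bank b starts at cycle b */
            printf(" %-10s", (p >= 0 && p < phases) ? phase[p] : "-");
        }
        printf("\n");
    }
    return 0;
}
```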
DRAM refresh happens at the bank level. Reading a row discharges its capacitors, so the charge captured in the sense amplifiers (just before the column access strobe) has to be driven back into the row the bits came from, and this write-back costs energy.
Row hammer attack: repeatedly activating two rows that have one row in between can flip bits in that middle row due to electrical interference/charge leakage between adjacent cells. Saltanat's work. EDIT: I've been lied to. She works on TEEs for legacy applications.
Cache
Each core has separate L1 cache for instructions and data. L2 cache is common for both and mostly per-core (can also be shared across multiple cores). L3 cache is shared across all the cores.
Each cache cell is an SRAM cell: it uses 6 transistors and is much faster than DRAM, but also much costlier.
Each memory block in DRAM is mapped to a cache line. A cache line is usually 32 or 64 bytes.
Direct Cache mapping
If there are L cache lines and A blocks in main memory (with A much larger than L), then memory block k is mapped to cache line k mod L.
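For example, with \(L = 1024\) cache lines, memory block 3000 maps to cache line \(3000 \bmod 1024 = 952\).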
The remaining address bits are kept as tag bits, which are used to check whether the cached block is actually the requested one.
If the tags don't match, we have a cache miss and the block is fetched from the DRAM chips again.
Example:
Assume a cache with 1024 cache lines -> 10 bits for indexing the line.
Cache line size is 1 word (4 bytes) -> 2 offset bits within the word.
Address size = 64 bits => tag size = 64 - 10 - 2 = 52 bits.
Together these account for the full 64-bit address.
Contents of each cache line:
| Valid | Tag | Data |
|---|---|---|
| 1 bit | 52 bits | 32 bits |
The valid bit checks if the cache line is valid or not. Can be invalidated due to concurrency.
To check for a hit, the tag bits of the address and the stored tag are compared (XORed bit by bit), and the valid bit is checked; if the tags match and the line is valid, it is a hit. The data is then read out, and the offset selects the exact byte (if the cache is byte addressable).
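A minimal C sketch of this direct-mapped lookup using the example parameters above (1024 lines, one 4-byte word per line, 52-bit tag); the names and structure are illustrative, not a description of real hardware.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 1024                      /* 10 index bits */

struct cache_line {
    bool     valid;                         /* 1 valid bit               */
    uint64_t tag;                           /* 52 tag bits               */
    uint32_t data;                          /* 32 bits = one 4-byte word */
};

static struct cache_line cache[NUM_LINES];

/* Returns true on a hit and writes the cached word into *word. */
bool lookup(uint64_t addr, uint32_t *word) {
    uint64_t index = (addr >> 2) & (NUM_LINES - 1);   /* bits 2..11 select the line */
    uint64_t tag   =  addr >> 12;                     /* bits 12..63 are the tag    */
    struct cache_line *line = &cache[index];

    if (line->valid && line->tag == tag) {            /* tag compare + valid check  */
        *word = line->data;                           /* offset bits 1..0 would pick
                                                         the byte within the word   */
        return true;
    }
    return false;                                     /* miss: fetch from DRAM and
                                                         refill this line           */
}
```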
- Miss penalty
This is the time taken to fill a cache line when there is a miss. There are two ways to lessen the miss penalty:
- Early Restart: the cache line is filled sequentially as usual, but as soon as the required word/byte has been written into the cache it is sent to the CPU to work on. This is especially useful for instructions, as they are mostly used in sequence.
- Critical Word First: the word that is required right now is fetched first and sent to the CPU, and the rest of the line is fetched in the background. This is helpful for data, which is mostly accessed randomly.
- Writing back to memory
- Write through: write to both the cache and DRAM. Very slow, since writes to DRAM take a long time.
- Write buffer: write to DRAM through a buffer in which writes are queued. In the meantime the cache is also updated and the processor continues as usual. If the cache line is later flushed, that is fine: the buffered write is drained to DRAM before the data is needed again.
- Write back: the cache line is written back to DRAM only when it is flushed (evicted).
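A rough C sketch contrasting the three policies on a write hit; dram_write and enqueue_write are hypothetical helpers standing in for the memory system.

```c
#include <stdint.h>
#include <stdbool.h>

struct line { bool valid, dirty; uint64_t tag; uint32_t data; };

/* Hypothetical back-end helpers, stubbed out here. */
static void dram_write(uint64_t addr, uint32_t word)    { (void)addr; (void)word; }
static void enqueue_write(uint64_t addr, uint32_t word) { (void)addr; (void)word; }

/* Write through: update the cache and DRAM; the CPU waits for the slow DRAM write. */
void write_through(struct line *l, uint64_t addr, uint32_t word) {
    l->data = word;
    dram_write(addr, word);
}

/* Write buffer: update the cache, queue the DRAM write, and let the CPU continue. */
void write_buffered(struct line *l, uint64_t addr, uint32_t word) {
    l->data = word;
    enqueue_write(addr, word);
}

/* Write back: update only the cache and mark the line dirty. */
void write_back(struct line *l, uint32_t word) {
    l->data = word;
    l->dirty = true;
}

/* On a flush (eviction), a dirty line must first be written back to DRAM. */
void evict(struct line *l, uint64_t addr) {
    if (l->valid && l->dirty)
        dram_write(addr, l->data);
    l->valid = false;
}
```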
- Cache thrashing
When many memory blocks map to the same cache line, that single line is flushed and refilled repeatedly while all the other cache lines sit idle. This can occur if you iterate through memory with a stride k that maps every access to the same line (a sketch follows below). To counter this, two other mapping schemes are used.
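A C sketch of such a pattern, assuming (for concreteness only) a 32 KB direct-mapped cache: every access lands on the same cache line, so each one evicts the previous block.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE (32 * 1024)                  /* assumed direct-mapped cache size */
#define N 8                                     /* number of conflicting blocks     */
#define STRIDE (CACHE_SIZE / sizeof(uint32_t))  /* elements per 32 KB               */

/* Every arr[i * STRIDE] has the same (address mod CACHE_SIZE), so in a
 * direct-mapped cache all N blocks compete for one line and evict each other
 * on every pass, while the rest of the cache stays idle. */
static uint32_t arr[N * STRIDE];

int main(void) {
    uint32_t sum = 0;
    for (int pass = 0; pass < 1000; pass++)
        for (int i = 0; i < N; i++)
            sum += arr[i * STRIDE];             /* conflict miss every time */
    printf("%u\n", sum);
    return 0;
}
```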
Fully associative Cache mapping
All the address bits except the byte offset are used as tag bits, and any memory block in DRAM can go into any cache line. Cache hits become slower because all the cache lines have to be searched, but this massively reduces cache thrashing. To improve the lookup time, we use set associative cache mapping.
Set associative Cache mapping
Tradeoff between hardware cost and cache misses. A memory block can be mapped to any cache line within one set of cache lines: if there are S cache sets, then block A maps to set A mod S and can occupy any line in that set.
Let there be 32 KB of cache with each line being 64 bytes. Hence \(2^{15}\) bytes with \(2^6\) bytes per line, so there are \(2^9\) lines in total.
If the cache is 8-way set associative, then there are \(2^9 / 2^3 = 2^6\) cache sets.
Now the least significant 6 bits are used for offset mapping for the bytes of the cache line. The next 6 bits are used for identifying the set. The rest of the bits are tag bits.
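The same breakdown written out in C for this 32 KB, 8-way, 64-byte-line cache:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t addr = 0xdeadbeefULL;          /* any 64-bit address                     */

    uint64_t offset = addr & 0x3F;          /* low 6 bits: byte within the 64 B line  */
    uint64_t set    = (addr >> 6) & 0x3F;   /* next 6 bits: one of the 2^6 sets       */
    uint64_t tag    = addr >> 12;           /* remaining 52 bits: tag                 */

    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```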
Split and Unified L1 Cache
- Split: leads to better performance, since instructions and data can be fetched independently of each other. However, it has the drawback that the instruction cache may be used much more heavily than the data cache, so the fixed split can waste capacity.
- Unified: leads to better utilization of cache resources.
