Data Scrambling Purpose - cpu-architecture

Can someone please explain to me what data scrambling is when it comes to a memory controller? According to Wikipedia, it somehow masks the user data with random patterns to prevent reverse engineering of a DRAM. But it is also used to address electrical problems. Can someone please elaborate on these features of data scrambling? Thanks!

The Wikipedia article claimed:
Memory controllers integrated into certain Intel Core processors also
provide memory scrambling as a feature that turns user data written to
the main memory into pseudo-random patterns.[6][7] As such, memory
scrambling prevents forensic and reverse-engineering analysis based on
DRAM data remanence, by effectively rendering various types of cold
boot attacks ineffective. However, this feature has been designed to
address DRAM-related electrical problems, not to prevent security
issues, so it may not be rigorously cryptographically secure.[8]
However, I think that this claim is somewhat misleading because it implies that the purpose of data scrambling is to prevent reverse engineering. In fact the cited sources (listed as [6][7] in the quote) say the following:
The memory controller incorporates a DDR3 Data Scrambling feature to
minimize the impact of excessive di/dt on the platform DDR3 VRs due to
successive 1s and 0s on the data bus. Past experience has demonstrated
that traffic on the data bus is not random and can have energy
concentrated at specific spectral harmonics creating high di/dt that
is generally limited by data patterns that excite resonance between
the package inductance and on-die capacitances. As a result, the
memory controller uses a data scrambling feature to create
pseudo-random patterns on the DDR3 data bus to reduce the impact of
any excessive di/dt.
Basically, the purpose of scrambling is to limit fluctuations in the current drawn on the DRAM data bus. There is nothing in the cited source to support the claim that it is designed to prevent reverse engineering, though I suppose it is reasonable to assume that it might make reverse engineering more difficult. I'm not an expert in this area so I don't know for sure.
I have edited the Wikipedia article to remove the improperly sourced claim. I suppose someone could add it back, but if so, hopefully they can provide better sourcing.
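For intuition only, here is a rough sketch of the general idea (my own illustration, not Intel's actual algorithm, which as far as I know is not publicly documented): the controller XORs each data word with a pseudo-random pattern derived from its address, so even highly repetitive stored values look random on the DDR data bus, and applying the same XOR on a read recovers the original data. The LFSR polynomial and seeding below are made up for the example.

    /* Conceptual sketch only: address-seeded scrambling as an XOR with a
     * pseudo-random pattern. Real controllers use their own (undocumented)
     * generator; the 16-bit Galois LFSR here is just for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t prbs16(uint16_t seed)
    {
        uint16_t lfsr = seed ? seed : 0xACE1u;   /* avoid the all-zero state */
        for (int i = 0; i < 16; i++)
            lfsr = (lfsr >> 1) ^ (-(lfsr & 1u) & 0xB400u);
        return lfsr;
    }

    static uint16_t scramble(uint16_t data, uint32_t addr)
    {
        /* the same function scrambles on write and descrambles on read */
        return data ^ prbs16((uint16_t)(addr ^ (addr >> 16)));
    }

    int main(void)
    {
        uint32_t addr = 0x1000;
        uint16_t on_bus = scramble(0x0000, addr);   /* an "all zeros" write */
        printf("on the bus: %04x\n", on_bus);       /* looks pseudo-random  */
        printf("read back : %04x\n", scramble(on_bus, addr));  /* 0000 again */
        return 0;
    }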

It's not reverse engineering of the DRAM, it's reverse engineering of the data in the DRAM that scrambling is designed to prevent (e.g. forensics like cold-boot attacks), according to that article.
The electrical-properties thing made me think of Row Hammer. Scrambling might make that harder, but I don't know if that's what the author of that paragraph had in mind.

Related

SIMD programming: hybrid approach for data structure layout

The Intel Optimization Reference Manual
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
discusses the advantage of Structure-Of-Arrays (SoA) data layout for SIMD processing compared to the traditional Array-Of-Structures (AoS) layout. This is clear.
However, there's one argument I don't understand. On page 4-23 it says "SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays X, Y, and Z (see Example 4-20) would require three separate data streams. This can require the use of more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access efficiency." To mitigate this problem they recommend a hybrid approach (Example 4-22).
Can somebody please explain the "three separate data streams", the "prefetches" and "additional address generation calculations", and "impact on DRAM page access efficiency"?
At https://stackoverflow.com/a/40169187/3852630 Peter Cordes discusses two effects: three different data streams for X, Y, and Z would tie up three registers for the addresses, and if the three arrays were mapped to the same cache lines, frequent cache eviction would be a problem. However, registers are not a scarce resource on modern CPUs, and multi-way caches should mitigate the cache problem.
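To make the terminology concrete, here is how I picture the three layouts (my own sketch, not taken from the manual; the X/Y/Z field names follow Example 4-20, and the sizes and block factor are arbitrary). With SoA, a loop that needs x, y, and z walks three separate address streams, meaning three base addresses, three prefetch patterns, and potentially three open DRAM pages, whereas the hybrid (AoSoA) layout folds them back into one stream while keeping each field contiguous enough for SIMD:

    #include <stddef.h>

    #define N     1024
    #define BLOCK 8                /* e.g. one SIMD register of floats */

    /* Array of Structures: one stream, but "all the x values" are strided */
    struct PointAoS { float x, y, z; };
    struct PointAoS aos[N];

    /* Structure of Arrays: each field contiguous, but three streams */
    struct PointsSoA { float x[N], y[N], z[N]; };
    struct PointsSoA soa;

    /* Hybrid (AoSoA): BLOCK x values, then BLOCK y, then BLOCK z, repeated */
    struct PointBlock { float x[BLOCK], y[BLOCK], z[BLOCK]; };
    struct PointBlock hybrid[N / BLOCK];

    float len2_soa(size_t i)
    {
        /* touches three streams: &soa.x[i], &soa.y[i], &soa.z[i] */
        return soa.x[i] * soa.x[i] + soa.y[i] * soa.y[i] + soa.z[i] * soa.z[i];
    }

    float len2_hybrid(size_t i)
    {
        /* one stream: everything for element i lives in hybrid[i / BLOCK] */
        struct PointBlock *b = &hybrid[i / BLOCK];
        size_t j = i % BLOCK;
        return b->x[j] * b->x[j] + b->y[j] * b->y[j] + b->z[j] * b->z[j];
    }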

Does the Harvard architecture have the von Neumann bottleneck?

From the naming and this article I feel the answer is no, but I don't understand why. The bottleneck is how fast you can fetch data from memory. Whether you can fetch an instruction at the same time doesn't seem to matter. Don't you still have to wait until the data arrive? Suppose fetching data takes 100 CPU cycles and executing an instruction takes 1; the ability to do that 1 cycle in advance doesn't seem to be a huge improvement. What am I missing here?
Context: I came across this article saying the Spectre bug is not going to be fixed because of speculative execution. I think speculative execution, for example branch prediction, makes sense for a Harvard architecture too. Am I right? I understand speculative execution is more beneficial for a von Neumann architecture, but by how much? Can someone give a rough number? To what extent can we say that Spectre will stay because of the von Neumann architecture?
The term "von Neumann bottleneck" isn't just talking about Harvard vs. von Neumann architectures. It's talking about the entire idea of stored-program computers, which John von Neumann invented.
(Depending on context, some people may use it to mean the competition between code-fetch and data access; that does exacerbate the overall memory bottleneck without split caches. Or perhaps I'm mixing up terminology and the more general memory bottleneck for processors I discuss in the rest of this answer shouldn't be called the von Neumann bottleneck, although it is a real thing. See the memory wall section in Modern Microprocessors: A 90-Minute Guide!)
The von Neumann bottleneck applies equally to both kinds of stored-program computers. And even to fixed-function (not stored-program) processors that keep data in RAM. (Old GPUs without programmable shaders are basically fixed-function but can still have memory bottlenecks accessing data).
Usually it's most relevant when looping over big arrays or pointer-based data structures like linked lists, so the code fits in an instruction cache and doesn't have to be fetched during data access anyway. (Computers too old to even have caches were just plain slow, and I'm not interested in arguing semantics of whether slowness even when there is temporal and/or spatial locality is a von Neumann bottleneck for them or not.)
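To make the array / linked-list case concrete, here's a toy illustration (mine, not from any of the cited sources): a pointer-chasing loop where nearly all the time goes to waiting for data from memory, while the loop's code stays hot in L1i, so the bottleneck is entirely on the data side regardless of whether code and data share a bus.

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* Each node's address depends on the previous load, so traversal time
     * is dominated by memory latency; the add itself is essentially free.
     * The few instructions in this loop stay resident in L1i. */
    long sum_list(const struct node *head)
    {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }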
https://whatis.techtarget.com/definition/von-Neumann-bottleneck points out that caching and prefetching is part of how we work around the von Neumann bottleneck, and that faster / wider busses make the bottleneck wider. But only stuff like Processor-in-Memory / https://en.wikipedia.org/wiki/Computational_RAM truly solves it, where an ALU is attached to memory cells directly, so there is no central bottleneck between computation and storage, and computational capacity scales with storage size. But von Neumann with a CPU and separate RAM works well enough for most things that it's not going away any time soon (given large caches and smart hardware prefetching, and out-of-order execution and/or SMT to hide memory latency.)
John von Neumann was a pioneer in early computing, and it's not surprising his name is attached to two different concepts.
Harvard vs. von Neumann is about whether program memory is in a separate address space (and a separate bus); that's an implementation detail for stored-program computers.
Spectre: yes, Spectre is just about data access and branch prediction, not accessing code as data or vice versa. If you can get a Spectre attack into program memory in a Harvard architecture in the first place (e.g. by running a normal program that makes system calls), then it can run the same as on a von Neumann.
I understand speculative execution is more beneficial for von Neumann architecture, but by how much?
What? No. There's no connection here at all. Of course, all high-performance modern CPUs are von Neumann. (With split L1i / L1d caches, but program and data memory are not separate, sharing the same address space and physical storage. A split L1 cache is often called "modified Harvard", which makes some sense on ISAs other than x86, where L1i isn't coherent with data caches, so you need special flushing instructions before you can execute newly-stored bytes as code. x86 has coherent instruction caches, so it's very much an implementation detail.)
Some embedded CPUs are true Harvard, with program memory connected to Flash and data address space mapped to RAM. But often those CPUs are pretty low performance. Pipelined but in-order, and only using branch prediction for instruction prefetch.
But if you did build a very high-performance CPU with fully separate program and data memories (so copying from one to the other would have to go through the CPU), there'd be basically zero difference from modern high-performance CPUs. L1i cache misses are rare, and whether they compete with data access is not very significant.
I guess you'd have split caches all the way down, though; normally modern CPUs have unified L2 and L3 caches, so depending on the workload (big code size or not) more or less of L2 and L3 can end up holding code. Maybe you'd still use unified caches with one extra bit in the tag to distinguish code addresses from data addresses, allowing your big outer caches to be competitively shared between the two address-spaces.
The Harvard architecture, with separate instruction and data memories, is a mitigation of the von Neumann bottleneck. Backus' original definition of the bottleneck addresses a slightly more general problem than just instruction or data fetch and talks about the CPU/memory interface. In the paragraph before the money quote, Backus talks about looking at the actual traffic on this bus:
Ironically, a large part of the traffic in the bottleneck is not useful data but merely names of the data, as well as operations and data used only to compute such names.
In a Harvard architecture with a separated I/D bus, that will not change. It will still largely consist of names.
So the answer is a hard no. The Harvard architecture mitigates the von Neumann bottleneck but it doesn't solve it. Bluntly, it's a faster von Neumann bottleneck.

Factors that Impact Translation Time

I have run across models where translation time has become a serious issue (the model simulates quickly but takes far too long to translate), and I could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs a vectorized method - a basic model testing this didn't seem to impact anything
using input variables vs parameters
impact of annotations (e.g., Evaluate=true)
or tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
use of many connect() - this seems to be a factor (perhaps primary) as it forces the translator to do all the heavy lifting
Any insight is greatly appreciated.
Clearly the answer to this question is naturally open-ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference), building them up from many simple component models and then using connect equations to link them in the appropriate order is not the best way to produce the models. Experience has shown that this method significantly increases the translation time, to unbearable lengths. It is better to create distributed models with the same approach that is used in the MSL DynamicPipe (not exactly like it, but similar).
Changing the approach as described is significantly faster in translation time (orders of magnitude for larger models, >~100,000 equations) than using connect statements as the number of distributed elements grows. This was tested using Dymola 2017 and 2017FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

For a Single-Cycle CPU, How Much Energy Is Required to Execute an ADD Command?

The question is just what the title says. I wonder about this. Can any expert help?
OK, this was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You could probably compute this, if you looked only at the execution of that instruction for a certain silicon technology and a certain ALU design, by calculating gate leakage, switching currents, voltages, etc. But that isn't a useful value, because there is always something going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since the pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
Statistically measure running a given add over and over again (a rough sketch of this appears after the summary below). Remember that there are many different types of adds, such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD), to name a few. Limits: OSes and other apps are always there, though you may be able to run on bare metal if you know how; it varies with different hardware, silicon technologies, architectures, etc.; probably not useful because it is so far from reality that it means little; limits of measurement equipment (in-processor PMUs, wall meters, an interposer socket, etc.); memory hierarchy; and more.
Statistically measuring an integer/floating-point/double-based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: still not real; still varies with architecture, silicon technology, hardware, etc.; measurement-equipment limits; etc.
Statistically measuring a real application. Limits: same as above, but it at least means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to define the constraints of your answer / experiment well, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add but it doesn't really mean anything anymore. A value that means anything is way more complicated but is useful and requires a lot of work to find.
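For what it's worth, here is the skeleton of what the first approach above boils down to (a hypothetical sketch, not a finished experiment): run an enormous number of adds, measure energy externally over the run with whatever equipment you have (wall meter, interposer, or the processor's own energy counters), and divide the energy delta by the iteration count. The number you get still folds in fetch, decode, retirement, caches, clocks, leakage, the OS, and the loop overhead itself, which is exactly the "Limits" problem described above.

    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 1000000000ULL      /* large enough to register on a meter */

    int main(void)
    {
        volatile uint64_t acc = 0;   /* volatile keeps the adds "real" */
        /* Start your energy measurement here (external meter, etc.). */
        for (uint64_t i = 0; i < ITERS; i++)
            acc += i;                /* the add nominally being measured */
        /* Stop the measurement here; joules_delta / ITERS is a crude and
         * inflated "energy per add" for this chip, clocks, and workload. */
        printf("acc = %llu\n", (unsigned long long)acc);
        return 0;
    }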
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.

Estimating possible # of actors in Scala

How can I estimate the number of actors that a Scala program can handle?
For context, I'm contemplating what is essentially a neural net that will be creating and forgetting cells at a high rate. I'm considering making each cell an actor, but there will be millions of them. I'm trying to decide whether this design is worth pursuing, but I can't estimate the limit on the number of actors. My intent is that it should run entirely on one system, so distributed limits don't apply.
For that matter, I haven't definitively settled on Scala, if there's some better choice, but the cells do have state, as in, e.g., their connections to other cells, the weights of the connections, etc. Though this COULD be done as "Each cell is final. Changes mean replacing the current cell with a new one bearing the same id#."
P.S.: I don't know Scala. I'm considering picking it up to do this project. I'm also considering lots of other alternatives, including Java, Object Pascal, and Ada. But actors seem a better map to what I'm after than thread pools (and Java can't handle enough threads to make a thread-per-cell design feasible).
P.P.S.: At all times, most of the actors will be quiescent, but there will need to be a way of cycling through the entire collection of them. If there isn't one built into the language, then this can be managed via first/next links within each cell. (Both links are needed, to allow cells in the middle to be extracted for release.)
With a neural net simulation, the real question is how much of the computational effort will be spent communicating, and how much will be spent computing something within a cell? If most of the effort is in communication then actors are perhaps a good choice for correctness, but not a good choice at all for efficiency (even with Akka, which performs reasonably well; AsyncFP might do the trick, though). Millions of neurons sounds slow--efficiency is probably a significant concern. If the neurons have some pretty heavy-duty computations to do themselves, then the communications overhead is no big deal.
If communication is the bottleneck, and you have lots of tiny messages, then you should design a custom data structure to hold the network, plus custom thread handling that will take advantage of all the processors you have and minimize the amount of locking that you must do. For example, if you have space, each neuron could hold an array of input values from the neurons linked to it; when calculating its output it would just read that array directly with no locking, and the input neurons would update those values, also with no locking, as they went. You can then just dump all your neurons into one big pool and have a master distribute them in chunks of, I don't know, maybe ten thousand at a time, each to its own thread. Scala will work fine for this sort of thing, but expect to do a lot of low-level work yourself, or wait for a really long time for the simulation to finish.
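To show the shape of what I mean, here's a rough sketch in C (purely illustrative; the same layout maps onto Scala arrays plus a fixed thread pool, and every name and size here is made up). Each neuron owns one input slot per incoming connection, each slot has exactly one writer, so readers and writers never need a lock, and a handful of worker threads sweep the pool in chunks:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_NEURONS 1000000
    #define CHUNK       10000        /* "ten thousand at a time" */
    #define NUM_THREADS 8

    struct neuron {
        float *inputs;       /* one slot per incoming connection; written  */
        int    num_inputs;   /* only by the source neuron and read here,   */
        float  output;       /* so no locking is needed on either side     */
    };

    static struct neuron net[NUM_NEURONS];

    static void update(struct neuron *n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n->num_inputs; i++)
            sum += n->inputs[i];              /* plain unlocked reads */
        n->output = sum > 0.0f ? sum : 0.0f;  /* stand-in activation  */
    }

    /* Each worker sweeps fixed chunks of the pool; a real version would
     * hand out chunks from a shared counter rather than striding statically. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long start = id * CHUNK; start < NUM_NEURONS;
             start += (long)NUM_THREADS * CHUNK) {
            long end = start + CHUNK;
            if (end > NUM_NEURONS) end = NUM_NEURONS;
            for (long i = start; i < end; i++)
                update(&net[i]);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        printf("one full pass over %d neurons done\n", NUM_NEURONS);
        return 0;
    }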