Using SIMD instructions in an application targeting multiple platforms and operating systems


So, no matter how much I read about SIMD instructions, there is something basic I still can't understand properly and would, therefore, love to have some (conceptual) explanation or suggestions about.
I understand that the many SIMD implementations vary from one CPU architecture to another (MMX, SSE, SSE2, etc.). However, considering that since the mid-2000s there seems to have been greater convergence between SIMD instruction sets across Intel and AMD (and Apple has started using Intel), I don't get the following.
Simply put, if an application has specific SIMD code (e.g. for a vectorized math library), would it run equally on both Intel and AMD CPUs (and therefore on Windows and Linux computers) and also on iOS without any modification?
Or would specific code have to be implemented for each CPU architecture/operating system targeted by the application, so that a different build of the application is provided for each type of user?

For Intel/AMD there can be some convergence, depending on how hard you want to push the performance envelope. iOS devices are ARM-based, though, and use NEON SIMD rather than Intel/AMD's SSE/AVX, so there is no binary compatibility and only minimal compatibility at the source level (e.g. via macros or template libraries). See this question for some cross-platform solutions.
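To illustrate the macro approach to source-level compatibility mentioned above, here is a minimal sketch in C. It assumes a compiler that predefines `__SSE2__` (or the MSVC equivalents) on x86 and `__ARM_NEON` on ARM. The names `vec4f`, `vec4f_load`, `vec4f_store`, `vec4f_add` and `add_arrays` are made up for this example and do not come from any library. The same source builds for each target, but every target still needs its own compiled binary.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  /* x86 (Intel and AMD): SSE/SSE2 intrinsics */
  #include <emmintrin.h>
  typedef __m128 vec4f;
  static inline vec4f vec4f_load(const float *p)     { return _mm_loadu_ps(p); }
  static inline void  vec4f_store(float *p, vec4f v) { _mm_storeu_ps(p, v); }
  static inline vec4f vec4f_add(vec4f a, vec4f b)    { return _mm_add_ps(a, b); }
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  /* ARM (e.g. iOS devices): NEON intrinsics */
  #include <arm_neon.h>
  typedef float32x4_t vec4f;
  static inline vec4f vec4f_load(const float *p)     { return vld1q_f32(p); }
  static inline void  vec4f_store(float *p, vec4f v) { vst1q_f32(p, v); }
  static inline vec4f vec4f_add(vec4f a, vec4f b)    { return vaddq_f32(a, b); }
#else
  /* Scalar fallback for any other target (hypothetical wrapper type) */
  typedef struct { float lane[4]; } vec4f;
  static inline vec4f vec4f_load(const float *p) {
      vec4f v;
      for (int i = 0; i < 4; ++i) v.lane[i] = p[i];
      return v;
  }
  static inline void vec4f_store(float *p, vec4f v) {
      for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
  }
  static inline vec4f vec4f_add(vec4f a, vec4f b) {
      vec4f r;
      for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
      return r;
  }
#endif

/* out[i] = a[i] + b[i] for n floats: four lanes at a time, then a scalar tail. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        vec4f_store(out + i, vec4f_add(vec4f_load(a + i), vec4f_load(b + i)));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Only the thin wrapper layer differs per architecture; code written against `vec4f_*` stays the same. Wider instruction sets such as AVX would need their own branch and vector width.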

Related

Do interpreted languages need an operating system to work?

Do interpreted languages such as Java and Python need an operating system to work?
For example, on a bare-metal ARM microcontroller, can an interpreter be installed such that we can have both compiled code such as C and interpreted code such as Python working together, or is an OS needed to support this?

Of course you can write an interpreter that runs on bare metal; it is just that if the platform does not have an OS, any run-time support the language needs must be part of the interpreter. In some cases that goes so far that the interpreter might essentially be an OS: if it provides the services needed to operate a system, it could be called an operating system.
It is perhaps not as simple as interpreted vs compiled. Java, for example, runs on a virtual machine and is "compiled" to bytecode. The bytecode is interpreted (or just-in-time compiled in some cases), rather than the Java source directly. In an embedded system, it is possible that you would deploy cross-compiled bytecode on the target rather than the source. Certainly, however, JVMs exist for bare metal. Some support multi-threading through a third-party RTOS; others either have that support built in or do not support threading at all.
There are interpreters for cut-down subsets of JavaScript and Python that run on bare-metal microcontrollers. I am not sure about full implementations, but it is technically possible given sufficient run-time support even if not explicitly implemented. Fully supporting some of these languages, along with all the standard and third-party libraries and frameworks a developer might expect, may require so much run-time support and so many resources that it is simpler to deploy an OS, so implementations for resource-constrained systems are often subsets or have restricted libraries.

Java needs a VM - virtual machine. It isn't interpreted, but executes byte code. Interpreted would mean grabbing the source in run-time as it goes, like BASIC.
When Java was new and exciting around year 2000, everyone thought it would be the next big general-purpose language, replacing C++. The syntax was so clean, it was "pure OO" and not some "filthy hybrid".
It was the major buzz word of the time. Schools stopped teaching C and C++. MCU manufacturers started to make chips with Java VM in hardware. Microsoft made their own Java "standard". Everyone was high on the Java hype.
Then, as the Internet hype as a whole collapsed in 2002, it took the Java hype with it. In the sober hangover afterwards, people started to realize that things like byte code, VMs and garbage collection probably don't belong on bare-metal systems.
They went back to using compiled C for hardware-related programming. Or in fact they never stopped, since Java never quite made it there, save for some oddball exotic architectures.
Java remained in use only in the areas where it was suitable, namely web, desktop and mobile development. And so it got a second golden age when the smartphone hype struck around 2010.

No. See for example picoJava, which is one of several solutions for running Java natively. You can't get closer to bare metal than running bytecode on the CPU.

No. Some 8-bit computers had interpreted languages in ROM despite not having anything reasonably resembling a modern operating system. The Apple 2 is one example. You could boot the system without any disks or tapes, and it would go straight to a BASIC prompt, where you could write basic (no pun intended) programs.
Note that "operating system" is a somewhat vague term when speaking about those days: these 8-bit computers did have some level of firmware, and this firmware did provide some OS-type functionality like access to basic peripherals. In those days, what we now know as an OS was more commonly called a "DOS", a Disk Operating System. MS-DOS is one of them, as is Apple's ProDOS. These DOSes evolved into our modern-day operating systems (e.g. Windows 95 was built on top of MS-DOS, while modern Windows versions derive from a separate branch that was largely re-implemented with more modern techniques), so one could claim that these ancestors are the closest those machines had to what we now call an OS.
But what is an interpreter but a piece of software?
In a more theoretical sense, an interpreter is simply software - a program that takes input and produces output. Suppose you were to implement a custom solid-state Turing Machine. In this case, your "input" would be the program to be interpreted, and the "output" would be the program's behavior. If "software" can run without an operating system, then an interpreter can.
Is this model a little simplified? Of course. The difference is a matter of degree, not nature. Add very basic user input and output capabilities (e.g. a TTY) and you have the foundation to implement all, or nearly all, of the basic functionality of a language such as Java byte code, Python, or BASIC. The main things you would be missing are libraries and whatnot that depend on things like screen manipulation, multiprocessing, and networking, but you could handle them with time too.
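To make the "an interpreter is just a program" point concrete, here is a minimal sketch of a stack-based bytecode interpreter in C that uses no OS services at all: no heap, no files, no stdio. The opcode set, the `run` function and the `store_last` output sink are all invented for this example. On real hardware the sink would be replaced by something like a UART write routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode set, made up for this sketch. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_EMIT, OP_HALT };

#define STACK_SLOTS 16

/* Output goes through a caller-supplied function, standing in for a TTY/UART. */
typedef void (*emit_fn)(int32_t value, void *ctx);

/* Example sink: remembers the last emitted value in *ctx. */
void store_last(int32_t value, void *ctx)
{
    *(int32_t *)ctx = value;
}

/* Runs the program; returns 0 on a clean OP_HALT, -1 on a malformed program. */
int run(const uint8_t *code, size_t len, emit_fn emit, void *ctx)
{
    int32_t stack[STACK_SLOTS];   /* fixed-size stack, no dynamic allocation */
    size_t sp = 0;                /* number of values currently on the stack */
    size_t pc = 0;                /* index of the next byte to decode */

    while (pc < len) {
        switch (code[pc++]) {
        case OP_PUSH:             /* operand: one signed byte */
            if (pc >= len || sp >= STACK_SLOTS) return -1;
            stack[sp++] = (int8_t)code[pc++];
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_EMIT:             /* pop a value and hand it to the output sink */
            if (sp < 1) return -1;
            emit(stack[--sp], ctx);
            break;
        case OP_HALT:
            return 0;
        default:
            return -1;            /* unknown opcode */
        }
    }
    return -1;                    /* ran off the end without OP_HALT */
}
```

For example, the byte sequence `OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_EMIT, OP_HALT` computes (2 + 3) * 4 and emits the result. Everything a fuller language would need beyond this (memory management, I/O, libraries) is the run-time support that has to live in the interpreter when there is no OS underneath.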

Firmware Development

I want to clarify before the question that I am not an established professional programmer in any position at any firm. This is solely to satisfy curiosity, and will not pertain to any task or project at this time.
As I understand it, firmware is software placed on hardware to grant it autonomous functionality from instructions, which are given through some form of input, as long as the input stream is readable, which is made possible through drivers. Drivers are software packages with pre-written reference libraries that recognize a specific set of instructions for each possible function in the attached device.
NOTE: not quoted, so I'm aware that this could be inaccurate.
What I want to know is how firmware or drivers are placed on devices without installation through an OS or a storage medium such as a DVD or USB drive. Specifically, I mean firmware installed by manufacturers, like the BIOS and keyboard drivers that are present on all computers. I'm assuming these are less reliant, or not reliant at all, on compilation in order to function properly, which is the sole reason I'm asking this question.
Can firmware be developed without compilation?
References
Demystifying Firmware
C++ Kernel Development
Starting Firmware Development
These just explain that an OS is a type of firmware, and that firmware is primarily developed in C with Assembly and C++ as plausible alternatives; pertaining to kernel development as well.

Yes, especially in the larger components. An example involving Lua is http://nodelua.org/doc/index/
However, firmware development is typically an extremely memory (and frequently CPU) constrained environment.
C (or traditionally, assembler) is often preferred because it can produce extremely small executables, and is very efficient in stack usage. This matters when you're counting memory in bytes, or kilobytes.
Using a non-compiled language means you need to include a tiny interpreter, and you might not be able to set aside enough memory for this.
You've made an edit, wherein you suggest that an "OS is a type of firmware".
This can be true, in a manner of speaking.
Often firmware itself can consist of an operating system, with components. As an example, the firmware in some home internet routers will contain an OS (which might very well be Linux!), yet it is still regarded as firmware. There is a bit of a grey area between a computer that is an "embedded device with firmware" and a "regular computer with regular software", but generally firmware is a computer system running in a very constrained environment, often with very specific uses.
NetBSD includes Lua in its kernel. Many systems have been developed that do not use Assembly (except for a small part), C, or C++, but instead use some other language, though it is typically still compiled for size and performance reasons.
As for the actual transfer of firmware (whatever the form it may be in), this depends on the device in question.
Some devices require that the firmware be burned into the components (in ROM, though there are various types of ROM and some can be rewritten).
Other devices require that the firmware be transferred when the device is turned on.
And yet others have SD cards or battery-backed RAM or whatever that allow storing the firmware across reboots.

Is GPU and SIMD likely to be implemented in .NET / Java VMs?

For some time now, mainstream compute hardware has sported SIMD instructions (MMX, SSE, 3D-Now, etc) and more recently we're seeing AMD bringing 480-stream GPUs into the same die as the CPU.
Functional languages like F#, Scala and Clojure are also gaining traction, with one common attraction being how much easier concurrent programming is in these languages.
Are there any plans for the Java VM or .NET CLR to start providing access to parallel compute hardware resources, so that functional languages can mature to leverage the hardware?
It seems as though the VMs are currently the bottleneck against high performance computing, with SIMD and GPU access being delegated to third-party libraries and post-compilers (tidepowered.net, OpenTK, ScalaCL, Brahma, etc.).
Does anyone know of any plans / roadmaps on the part of Microsoft / Oracle / the open-source community to bring their VMs up to date with the new hardware and programming paradigms?
Is there a good reason why vendors are being so sluggish on the uptake?
Edit:
To address feedback so far: it's true that GPU programming is complex and, done wrong, worsens performance. But it's well known that parallelism is the future of computing, so the crux of this question is that it doesn't help for hardware and programming languages to embrace a parallel paradigm if the runtimes sitting between the applications and the hardware don't support it. Why aren't we seeing this on the VM vendors' radars / roadmaps?

Do you mean JavaCL and ScalaCL? They both try to bring CUDA/GPU programming to the Java VM.

The Mono runtime includes support for some SIMD instructions already; see http://docs.go-mono.com/index.aspx?link=N%3aMono.Simd
For Microsoft's implementation of the CLR you can use XNA, which allows you to run shaders etc., or the Accelerator library https://research.microsoft.com/en-us/projects/accelerator/ which provides an interface for running GPGPU calculations.

Java has been making strong headway in the parallelism arena for some time, first with the java.util.concurrent package and now with the fork/join framework. Hopefully, in the future, languages like Clojure and Scala will provide great high level abstractions to leverage fork-join.
GPGPU programming offers significant performance gains only for very specialized problems. .Net and Java are general purpose programming languages. Plus, who wants to do CUDA-style programming in a language like Java?

Zach Tellman's Penumbra framework enables GPU programming in Clojure (both for graphics and general purpose programming).
It's somewhat experimental but I think the theoretical motivation is very sound:
Harness the GPU / specialised SIMD instructions for the serious number crunching on large data sets
Use a very high level language that is strong at metaprogramming / DSL definition (e.g. Clojure) to orchestrate the operations at the overall level and generate the appropriate lower-level code where needed (e.g. with generous use of macro expansions)

The state of programming and compiling for multicore systems

I'm doing some research on multicore processors; specifically I'm looking at writing code for multicore processors and also compiling code for multicore processors.
I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.
I am aware of the following efforts (some of these don't seem directly related to multicore architectures, but seem to have more to do with parallel-programming models, multi-threading, and concurrency):
Erlang (I know that Erlang includes constructs to facilitate concurrency, but I am not sure how exactly it is being leveraged for multicore architectures)
OpenMP (seems mostly related to multiprocessing and leveraging the power of clusters)
Unified Parallel C
Cilk
Intel Threading Building Blocks (this seems to be directly related to multicore systems; makes sense as it comes from Intel. In addition to defining certain programming constructs, it also seems to have features that tell the compiler to optimize the code for multicore architectures)
In general, from what little experience I have with multithreaded programming, I know that programming with concurrency and parallelism in mind is definitely difficult. I am also aware that multithreaded programming and multicore programming are two different things. In multithreaded programming on a single-CPU system, you are ensuring that the CPU does not remain idle; as far as I know, you cannot truly perform operations in parallel there. (As James pointed out, the OS can schedule different threads to run on different cores, but I'm more interested in describing the parallel operations from the language itself, or via the compiler.) In multicore systems, you should be able to perform truly parallel operations.
So it seems to me that currently the problems facing multicore programming are:
Multicore programming is a difficult concept that requires significant skill
There are no native constructs in today's programming languages that provide a good abstraction to program for a multicore environment
Other than Intel's TBB library I haven't found efforts in other programming-languages to leverage the power of multicore architectures for compilation (for example, I don't know if the Java or C# compiler optimizes the bytecode for multicore systems or even if the JIT compiler does that)
I'm interested in knowing what other problems there might be, and if there are any solutions in the works to address these problems. Links to research papers (and things of that nature) would be helpful. Thanks!
EDIT
If I had to condense my question down to one sentence, it would be this: What are the problems that face multicore programming today and what research is going on in the field to solve these problems?
UPDATE
It also seems to me that there are three levels where multicore needs to be concerned:
Language level: Constructs/concepts/frameworks that abstract parallelization and concurrency and make it easy for programmers to express the same
Compiler level: If the compiler is aware of what architecture it is compiling for, it can optimize the compiled code for that architecture.
OS level: The OS optimizes the running process and perhaps schedules different threads/processes to run on different cores.
I've searched on ACM and IEEE and have found a few papers. Most of them talk about how difficult it is to think concurrently and also how current languages don't have a proper way to express concurrency. Some have gone so far as to claim that the current model of concurrency that we have (threads) is not a good way to handle concurrency (even on multiple cores). I'm interested in hearing other views.

I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.
Inertia. (BTW: that's pretty much the answer to all "what does prevent the widespread adoption" questions, whether that be models of parallel programming, garbage collection, type safety or fuel-efficient automobiles.)
We have known since the 1960s that the threads+locks model is fundamentally broken. By ~1980, we had about a dozen better models. And yet, the vast majority of languages that are in use today (including languages that were newly created from scratch long after 1980), offer only threads+locks.

The major problems with multicore programming are the same as with writing any other concurrent application. But whereas it used to be uncommon to have multiple CPUs in a computer, it is now hard to find a modern computer with only one core, so there are new challenges in taking advantage of multicore, multi-CPU architectures.
But, this problem is an old problem, whenever computer architectures go beyond compilers then it seems the fallback solution is to move back toward functional programming, as that programming paradigm, if strictly followed, can make very parallelizable programs, as you don't have any global mutable variables, for example.
But, not all problems can be done easily using FP, so the goal then is how to easily get other programming paradigms to be easy to use on multicores.
The first thing is that many programmers have avoided writing good multithreaded applications, so there isn't a large pool of well-prepared developers, and many have learned habits that will make this kind of coding harder for them.
But, as with most changes to the cpu, you can look at how to change the compiler, and for that you can look at Scala, Haskell, Erlang and F#.
For libraries, you can look at the parallel framework extensions by MS as a way to make concurrent programming easier.
My copy is at work, but either IEEE Spectrum or IEEE Computer recently had articles on multicore programming issues, so look at what IEEE and ACM articles have been written on these issues to get more ideas as to what is being looked at.
I think the biggest impediment will be the difficulty to get programmers to change their language as FP is very different than OOP.
One place for research besides developing languages that will work well this way, is how to handle multiple threads accessing memory, but, as with much in this area, Haskell seems to be at the forefront in testing ideas for this, so you can look at what is going on with Haskell.
Ultimately there will be new languages, and it may be that we have DSLs to help abstract the developer more, but how to educate programmers on this will be a challenge.
UPDATE:
You may find Chapter 24. Concurrent and multicore programming of interest, http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.html

One of the answers mentioned the Parallel Extensions for the .NET Framework, and since you mentioned C#, it's definitely something I would investigate. Microsoft has done some interesting things there, though I have to think many of their efforts seem more suited to language enhancements in C# than to a separate and distinct library for concurrent programming. But I think their efforts are worth applauding, and we should respect that we're early here. (Disclaimer: I used to be the marketing director for Visual Studio about 3 years ago.)
The Intel Threading Building Blocks are also quite interesting (Intel recently released a new version, and I'm excited to head down to Intel Developer Forum next week to learn more about how to use it properly).
Lastly, I work for Corensic, a software quality startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. A 30-day trial edition is available for Windows and Linux, so you might want to check it out. (www.corensic.com)
In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.
If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and already are seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.

The bottleneck of any high-performance application (written in C or C++) designed to make efficient use of more than one processor/core is the memory system (caches and RAM). A single core usually saturates the memory system with its reads and writes, so it is easy to see why adding extra cores and threads can cause an application to run slower. If a queue of people can pass through a door one at a time, adding extra queues will not only clog the door but also make the passage of any one individual through the door less efficient.
The key to any multi-core application is optimizing and economizing on memory accesses. This means structuring data and code so that each core works as much as possible inside its own caches, where it doesn't disturb the other cores with accesses to the common cache (L3) or RAM. Once in a while a core needs to venture there, but the trick is to reduce those situations as much as possible. In particular, data needs to be structured around and adapted to cache lines and their sizes (currently 64 bytes), and code needs to be compact and not call and jump all over the place, which also disrupts pipelines.
My experience is that efficient solutions are unique to the application in question. The generic guidelines (above) are a basis on which to construct code but the tweak changes resulting from profiling conclusions will not be obvious to those who were not themselves involved in the optimizing work.
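As one small illustration of structuring data around cache lines, here is a sketch in C11 that gives each worker its own counter on its own cache line, so that cores updating different counters do not keep invalidating each other's lines. The 64-byte line size is the figure quoted above and is an assumption about the target CPU. The names `padded_counter`, `record_events` and `total_events` are invented for this example.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on most current x86 parts. */
#define CACHE_LINE_BYTES 64

/* Aligning the member to the line size also pads the struct out to a full line. */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint64_t count;
} padded_counter;

_Static_assert(sizeof(padded_counter) == CACHE_LINE_BYTES,
               "each counter should occupy exactly one cache line");

/* Each worker only ever touches its own slot, hence its own cache line. */
void record_events(padded_counter *slots, size_t worker, uint64_t n)
{
    slots[worker].count += n;
}

/* Summing across workers is the rare moment where lines are shared. */
uint64_t total_events(const padded_counter *slots, size_t workers)
{
    uint64_t total = 0;
    for (size_t i = 0; i < workers; ++i) {
        total += slots[i].count;
    }
    return total;
}
```

The trade-off is memory: eight bytes of payload cost 64 bytes of storage. Whether that pays off is something only profiling of the specific application can tell.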

Look up fork/join frameworks and work-stealing runtimes. Two names for the same, or at least related, approaches, which is to recursively subdivide large tasks into lightweight units, such that all available parallelism is exploited, without having to know in advance how much parallelism there is. The idea is that it should run at serial speed on a uniprocessor, but get a linear speedup with multiple cores.
Sort of a horizontal analogue of cache-oblivious algorithms if you look at it right.
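The recursive subdivision described above can be sketched with OpenMP tasks in C. This is only one possible way to express fork/join, chosen here because it needs no extra library code. Whether the tasks are scheduled by work stealing depends on the OpenMP runtime, and the cutoff value is an arbitrary assumption. The function names are invented for this example.

```c
#include <stddef.h>

/* Assumption: below this size, splitting costs more than it saves. */
#define SERIAL_CUTOFF 1024

/* Sums data[lo..hi) by recursively splitting the range in two. */
static long sum_range(const int *data, size_t lo, size_t hi)
{
    if (hi - lo <= SERIAL_CUTOFF) {
        long s = 0;
        for (size_t i = lo; i < hi; ++i) {
            s += data[i];
        }
        return s;
    }

    size_t mid = lo + (hi - lo) / 2;
    long left = 0;
    long right = 0;

    /* Fork: the left half becomes a task that any idle thread may pick up. */
    #pragma omp task shared(left)
    left = sum_range(data, lo, mid);

    /* The current thread keeps working on the right half itself. */
    right = sum_range(data, mid, hi);

    /* Join: wait for the forked child task before combining results. */
    #pragma omp taskwait
    return left + right;
}

long parallel_sum(const int *data, size_t n)
{
    long total = 0;

    /* One thread starts the recursion; the rest of the team executes tasks. */
    #pragma omp parallel
    #pragma omp single
    total = sum_range(data, 0, n);

    return total;
}
```

With GCC or Clang this is built with `-fopenmp`. Without that flag the pragmas are ignored and the same code runs serially, which matches the idea that the program should not need to know in advance how much parallelism is available.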
But I'd say the main problem facing multicore programming is that the great majority of computations remain stubbornly serial. There's just no way to throw multiple cores at those computations and make them stick.

Comparison of embedded operating systems?

I've been involved in embedded operating systems of one flavor or another, and have generally had to work with whatever the legacy system had. Now I have the chance to start from scratch on a new embedded project.
The primary constraints on the system are:
It needs a web-based interface.
Inputs are required to be processed in real-time (so a true RTOS is needed).
The memory available is 32MB of RAM and FLASH.
The operating systems that the team has used previously are VxWorks, ThreadX, uCos, pSOS, and Windows CE.
Does anyone have a comparison or trade study regarding operating system choice?
Are there any other operating systems that we should consider? (We've had eCos and RT-Linux suggested).
Edit - Thanks for all the responses to date. A pity I can't flag all as "accepted".

I think it would be wise to evaluate carefully what you mean by "RTOS". I have worked for years at a large company that builds high-performance embedded systems, and they refer to them as "real-time", although that's not what they really are. They are low-latency and have deterministic schedulers, and 9 times out of 10, that's what people are really after when they say RTOS.
True real-time requires hardware support and is likely not what you really mean. If all you want is low latency and deterministic scheduling (again, I think this is what people mean 90% of the time when they say "real-time"), then any Linux distribution would work just fine for you. You could probably even get by with Windows (I'm not sure how you control the Windows scheduler though...).
Again, just be careful what you mean by "Real-time".

It all depends on how much time your team has been allocated to learn a "new" RTOS.
Are there any reasons you don't want to use something that people already have experience with?
I have plenty of experience with VxWorks and I like it, but disregard my opinion as I work for Wind River.
uC/OS-II has the advantage of being fully documented (as in the source code is actually explained) in Labrosse's book. I don't know about web support, though.
I know pSOS is no longer available.
You can also take a look at this list of RTOSes

I worked with QNX many years ago, and have nothing but great things to say about it. Even back then, QNX 4 (which is positively chunky compared to the Neutrino microkernel) was perfectly suited for low memory situations (though 32MB is oodles compared to the 1-2MB that we had to play with), and while I didn't explicitly play with any web-based stuff, I know Apache was available.

I purchased some development hardware from NetBurner.
It has been very easy to work with and very well documented. It is an RTOS running uCLinux. The company is great to work with.

It might be a wise decision to select an OS that your team is experienced with. However I would like to promote two good open source options:
eCos (as you mentioned)
RTEMS
Both have a lot of features and drivers for a wide variety of architectures. You haven't mentioned what architecture you will be using. They provide POSIX layers which is nice if you want to stay as portable as possible.
Also the license for both eCos and RTEMS is GPL but with an exception so that the executable that is produced by linking against the kernel is not covered by GPL.
The communities are very active and there are companies which provide commercial support and development.

We've been very happy with the Keil RTX system....light and fast and meets all of our tight real time constraints. It also has some nice debugging features built in to monitor stack overflow, etc.

I have been pretty happy with Windows CE, although it is 'heavier'.

Posting to agree with Ben Collins: you really need to determine if you have a soft real-time requirement (primarily for human interaction) or a hard real-time requirement (for interfacing with timing-sensitive devices).

Soft can also mean that you can tolerate some hiccups every once in a while.
What are the reliability requirements? My experience with more general-purpose operating systems like Linux in embedded is that they tend to experience random hiccups due to their smart average-case optimizations, which try to avoid starvation and similar problems for individual tasks.

VxWorks is good:
good documentation;
friendly development tools;
low latency;
deterministic scheduling.
However, I suspect that Wind River will shift its main attention to Linux, and that Wind River Linux will eat into the market for Wind River VxWorks.
A smaller market means less demand for engineers.

Here is the latest study. The last one was done more than 8 years ago so this is most relevant. The tables can be used to add additional RTOS choices. You'll note that this comparison is focused on lighter machines but is equally applicable to heavier machines provided virtual memory is not required.
http://www.embedded.com/design/operating-systems/4425751/Comparing-microcontroller-real-time-operating-systems
