Do all 64 bit intel architectures support SSSE3/SSE4.1/SSE4.2 instructions? - x86-64

I searched the web and the Intel software manuals, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, SSE4.1, SSE4.2, AVX, etc. I want to know the minimum set of SIMD instructions I can rely on in my program. Please help.

An x64 native (AMD64 or Intel 64) processor is only mandated to support SSE and SSE2.
SSE3 is supported by Intel Pentium 4 processors (“Prescott”), AMD Athlon 64 (“revision E”), AMD Phenom, and later processors. This means most, but not quite all, x64 capable CPUs should support SSE3.
Supplemental SSE3 (SSSE3) is supported by Intel Core 2 Duo, Intel Core i7/i5/i3, Intel Atom, AMD Bulldozer, AMD Bobcat, and later processors.
SSE4.1 is supported on Intel Core 2 (“Penryn”), Intel Core i7 (“Nehalem”), Intel Atom (Silvermont core), AMD Bulldozer, AMD Jaguar, and later processors.
SSE 4.1 and SSE4.2 are supported on Intel Core i7 (“Nehalem”), Intel Atom (Silvermont core), AMD Bulldozer, AMD Jaguar, and later processors.
AVX is supported by Intel “Sandy Bridge”, AMD Bulldozer, AMD Jaguar, and later processors.
See this blog series.
A CPU with x64 native support but no SSE3 support is going to be 'first-generation' 64-bit which isn't supported by Windows 8.1 x64 native due to the requirements for CMPXCHG16b, PrefetchW, and LAHF/SAHF; so in practice SSE3 is highly likely in newer machines. SSSE3 or later is more restrictive depending on exactly who you are aiming at. For example, the Valve Hardware Survey puts SSE4.1 at 77%, SSE 4.2 at 72% (anything from AMD or Intel with SSE4.1 is going to also have SSE3 and SSSE3).
UPDATE: Per the comment below, the support for SSE3 for PC gamers per the Valve survey is now 100%. SSSE3, SSE4.1, and SSE4.2 are all in the 97-98% range. AVX is around 92%--the current generation gaming consoles from Sony & Microsoft support up through AVX. The primary value of AVX is that you can use the /arch:AVX switch which allows all SSE code-generation to use the 3-operand VEX prefix which makes register scheduling more efficient. See this blog post.
AVX2 is approaching 75% which is really good, but still potentially a blocker for a game to rely on without a fallback path. AVX2 is supported by Intel “Haswell”, AMD Excavator, and later processors. See this blog post.
Windows on ARM: Note that the x86 emulation for Windows on ARM64 only supports up to SSE4.1, and the x64 emulation in Windows 11 only supports up to SSE 4.2. AVX/AVX2 is not supported for these platforms.

I have been trying to figure this out because I failed to compile third-party software that uses SSE. I found that this might be helpful:
cat /proc/cpuinfo
Then pay attention to the flags section
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
I can see:
sse4_1 sse4_2
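If you want to detect this automatically from a script, the following might be useful:
cat /proc/cpuinfo | grep flags | uniq | sed 's/.\+: //' | tr ' ' '\n' | grep -o "sse.*"
sse
sse2
sse3
sse4_1
sse4_2
Note that the sse3 line in this output is really the tail of the ssse3 flag (grep -o prints only the matching part of each line), so it indicates SSSE3. Linux reports SSE3 itself as pni.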
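The same check can be done by matching whole flag names instead of substrings. The following is only a minimal sketch, assuming a Linux-style /proc/cpuinfo; the function names (parse_flags, simd_support, read_cpuinfo) and the flag table are illustrative and not part of any library. SSE3 is looked up as pni, which is the name Linux uses for it.

```python
# Linux flag names for the SIMD extensions discussed in this thread.
# SSE3 is reported as "pni" (Prescott New Instructions), not "sse3".
SIMD_FLAGS = {
    "sse": "SSE",
    "sse2": "SSE2",
    "pni": "SSE3",
    "ssse3": "SSSE3",
    "sse4_1": "SSE4.1",
    "sse4_2": "SSE4.2",
    "avx": "AVX",
    "avx2": "AVX2",
}


def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Compare the whole key so lines such as "vmx flags" are skipped.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text):
    """Map each extension name to True/False using whole-flag matches."""
    flags = parse_flags(cpuinfo_text)
    return {name: flag in flags for flag, name in SIMD_FLAGS.items()}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux machine, simd_support(read_cpuinfo()) returns a dictionary with one entry per extension. Because whole flags are compared, ssse3 is never mistaken for sse3.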

Related

Is there an equivalent register to Intel's MSR_SMI_COUNT on AMD architecture?

On recent Intel CPUs it's possible to count the number of SMIs that have occurred, by reading msr 0x34.
I have checked the manuals at https://developer.amd.com/resources/developer-guides-manuals/ for an equivalent register or function, without success.
AMD Zen specifies the LsSmiRx performance counter for System Management Interrupts (SMIs):
PMCx02B [SMIs Received] (Core::X86::Pmc::Core::LsSmiRx)
Counts the number of SMIs received.
(Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh. Rev 3.03, 2018, page 153)
On Linux, you can monitor it like this:
# perf stat -e ls_smi_rx -I 60000
Every minute, this command prints the count of newly triggered SMIs, aggregated over all CPUs.
That means for monitoring - unlike with the MSR_SMI_COUNT register available on Intel CPUs - you have to actively program a PMU register (to observe the LsSmiRx event).
NB: The above referenced AMD documentation confirms that AMD Zen doesn't support the SMI_COUNT MSR (0x34), since it isn't included in the list of available MSRs (in Chapter 2.1.10, page 77).
No, but SMI count is available as a PMC (performance counter) on AMD processors.
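For comparison, the Intel-side read that the question refers to can be sketched as follows. This is an illustration under stated assumptions, not a tested tool: it assumes Linux with the msr kernel module loaded (which exposes /dev/cpu/N/msr), root privileges, and an Intel CPU that implements MSR 0x34. The helper names read_msr and smi_count are made up for this example. As noted above, this MSR is not available on AMD Zen.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # Intel MSR address mentioned in the question


def read_msr(msr_path, address):
    """Read one 64-bit MSR value; the file offset selects the MSR address."""
    fd = os.open(msr_path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from " + msr_path)
    # MSR values are returned as 8 little-endian bytes.
    return struct.unpack("<Q", raw)[0]


def smi_count(msr_path="/dev/cpu/0/msr"):
    """Return the SMI count held in the low 32 bits of MSR 0x34."""
    return read_msr(msr_path, MSR_SMI_COUNT) & 0xFFFFFFFF
```

Calling smi_count() twice and subtracting the results gives the number of SMIs seen by CPU 0 in between, without programming a performance counter.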

ERROR: No OpenCL platforms found, check OpenCL installation

I tried to run a Matlab program on a GPU (CentOS 7.3).
This Matlab program uses Caffe.
When I run it from the command line with:
matlab -nodisplay -r "demo, quit"
it runs okay.
When I run it with LSF command:
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" matlab -nodisplay -r "demo, quit"
I get the error:
ERROR: No OpenCL platforms found, check OpenCL installation
I compared LD_LIBRARY_PATH in both cases; the values are the same.
What can be the problem?
Any ideas are welcome!
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 8.0.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K40m
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 375.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 09:00.0
Max compute units 15
Max clock frequency 745MHz
Compute Capability (NV) 3.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 245760 (240KiB)
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
My best guess would be that the bsub command from LSF schedules the job on another machine (compute node) in a cluster, where OpenCL is not installed.
Having OpenCL/CUDA on the frontend but not on the compute nodes of a cluster is something I've witnessed quite a few times. Even if the parts of the filesystem that hold the libraries are shared, the folder /etc/OpenCL/vendors, which OpenCL's ICD mechanism uses, must be present on the node.
You could try running clinfo via bsub (if you didn't already), or use bsub to execute ls /etc/OpenCL/vendors.
If you're not sure whether the LSF-submitted jobs run on the same machine, use the hostname command with and without bsub.
Hope that helps.
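The two checks suggested above (hostname and the contents of /etc/OpenCL/vendors) can be combined in one small script. This is a rough sketch, assuming the usual Linux ICD layout in which each .icd file contains the name or path of a vendor library; the function names list_icd_files and report are hypothetical.

```python
import os
import socket


def list_icd_files(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: library named inside it}, or {} if none exist."""
    if not os.path.isdir(vendors_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as f:
                result[name] = f.read().strip()
    return result


def report(vendors_dir="/etc/OpenCL/vendors"):
    """Build a one-line summary that is easy to compare between hosts."""
    icds = list_icd_files(vendors_dir)
    host = socket.gethostname()
    if not icds:
        return f"{host}: no ICD files in {vendors_dir}"
    listing = ", ".join(f"{k} -> {v}" for k, v in icds.items())
    return f"{host}: {listing}"
```

Printing report() once from the login shell and once from a job submitted with bsub shows whether the job lands on a different host and whether that host has any ICD files registered.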

openocd **Error: libusb_open() failed with LIBUSB_ERROR_NOT_SUPPORTED**

I am trying to set up an Eclipse development environment to work with the STM32F303 Nucleo-32 board.
http://www.st.com/en/evaluation-tools/nucleo-f303k8.html
So far, all the tools seem to be correctly installed and working: I have successfully compiled and started the OpenOCD debugger for the STM32F4 Discovery, and when I connect the STM32F303 Nucleo-32 board and start OpenOCD, the LED on the board indicates that it is connected. (I have flashed the board.)
The thing is, OpenOCD gets lost when I do step-by-step debugging, and this seems related to the message OpenOCD prints on startup (look for the bold line):
Open On-Chip Debugger 0.9.0 (2015-05-19-12:09)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
adapter speed: 1000 kHz
adapter_nsrst_delay: 100
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
none separate
srst_only separate srst_nogate srst_open_drain connect_deassert_srst
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : clock speed 950 kHz
**Error: libusb_open() failed with LIBUSB_ERROR_NOT_SUPPORTED**
Info : STLINK v2 JTAG v27 API v2 SWIM v15 VID 0x0483 PID 0x374B
Info : using stlink api v2
Info : Target voltage: 3.239293
Info : stm32f3x.cpu: hardware has 6 breakpoints, 4 watchpoints
Does anyone know how to fix the USB driver in this case, or is it possible that something else is causing the problem?
If you experience this issue on Linux, you have to configure udev rules for the device. Find the 99-openocd.rules file included with the OpenOCD source distribution under Contributions. Connect your ST-Link USB device and run 'lsusb' from a terminal; it will list something like this:
Bus 004 Device 009: ID 0483:3748 STMicroelectronics ST-LINK/V2
Note the value after ID, and check 99-openocd.rules for a matching entry that supports the device. In the above case it is this one:
# STLink v2
ATTRS{idVendor}=="0483", ATTRS{idProduct}=="3748", MODE="664", GROUP="plugdev"
Copy this file to your /etc/udev/rules.d configuration directory and reboot your machine. Then try debugging again.
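The lookup described above (take the ID pair from lsusb and search the rules file for it) can be sketched like this. It is an illustration only: the function names usb_ids, rule_matches and has_rule are made up, and the matching is a plain text comparison against ATTRS{idVendor}/ATTRS{idProduct} entries rather than a real udev rule parser. Note that the board in the question reports PID 0x374B, so that is the product ID to look for in that case.

```python
import re

# Matches the "ID vvvv:pppp" part of an lsusb line.
LSUSB_RE = re.compile(r"ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})")


def usb_ids(lsusb_output):
    """Return a list of (vendor, product) ID pairs, lowercased."""
    return [(v.lower(), p.lower()) for v, p in LSUSB_RE.findall(lsusb_output)]


def rule_matches(rule_line, vendor, product):
    """True if a udev rule line names this lowercase vendor/product pair."""
    if rule_line.lstrip().startswith("#"):
        return False  # comment line
    v = re.search(r'ATTRS\{idVendor\}=="([0-9a-fA-F]{4})"', rule_line)
    p = re.search(r'ATTRS\{idProduct\}=="([0-9a-fA-F]{4})"', rule_line)
    return bool(v and p
                and v.group(1).lower() == vendor
                and p.group(1).lower() == product)


def has_rule(rules_text, vendor, product):
    """True if any line of the rules text matches the given ID pair."""
    return any(rule_matches(line, vendor, product)
               for line in rules_text.splitlines())
```

Feeding the lsusb line shown above into usb_ids yields the pair 0483/3748, and has_rule then tells you whether the rules file already covers that device or needs an extra entry.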
The OpenOCD distribution includes some libusb drivers and recommends running the zadig.exe tool to activate them. This will solve your problem.
http://zadig.akeo.ie/
As #silverdr mentioned in the comments, disconnecting the device and reconnecting it worked for me.

Accessing real frame buffer of PCI card

I am trying to access the framebuffer on my system's VGA controller card.
lspci -vn gives:
00:02.0 0300: 8086:2a02 (rev 0c) (prog-if 00 [VGA controller])
Subsystem: 1028:022f
Flags: bus master, fast devsel, latency 0, IRQ 45
Memory at fea00000 (64-bit, non-prefetchable) [size=1M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
I/O ports at eff8 [size=8]
Expansion ROM at <unassigned> [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 3
Kernel driver in use: i915
Now, I access the device and I get:
fb_base = pci_resource_start( devp, 0 ); **output: FEA00000**
fb_size = pci_resource_len( devp, 0 ); **output: 1MB**
So the range of the framebuffer is FEA00000 - FEB00000.
But according to the lspci -vn output, this region is non-prefetchable.
Does that mean I am not pointing to the framebuffer at all?
Is my framebuffer at address E0000000?
The driver currently using the resource is the Intel i915.
So maybe when I request the region or the IRQ, it can clash if that driver does not share it.
If I remove i915 with rmmod in order to insmod my driver, will my screen go blank?
Please help.
Thanks.

Psychophysics Toolbox Matlab on Ubuntu Installation

I am trying to run code in Matlab that uses the Psychtoolbox and OpenGL. The commands that throw the error described below are:
PsychJavaTrouble
AssertOpenGL
Here are my specs:
OS: Ubuntu 14.04 LTS, 64bit
Processor: Intel Core i5-2450M CPU @ 2.50GHz x 4
Graphics: Intel Sandybridge Mobile
Matlab Version: Matlab 64-Bit (Version 3.0.11 - Build date: Apr 6 2014)
Psychophysics version installed: 3
Installation methodology:
1. sudo apt-get install psychtoolbox in Terminal
2. updated it via UpdatePsychToolbox command in Matlab console
Here is the error message:
PsychJavaTrouble: Will now try to add the PsychJava folder to Matlabs dynamic
classpath...
Warning: "/home/lillian/Desktop/Matlab/Mona_Lisa/Psychtoolbox/PsychJava" is already
specified on static java path.
> In javaclasspath>local_validate_dynamic_path at 285
In javaclasspath>local_javapath at 182
In javaclasspath at 119
In javaaddpath at 71
In PsychJavaTrouble at 86
In ReverseCorrelationFaces at 2
PsychJavaTrouble: Added PsychJava folder to dynamic class path. Psychtoolbox Java
commands should work now!
PTB-INFO: Display ':0' : X-Screen 0 : Assigning primary output as 0 with RandR-CRTC
0 and GPU-CRTC 0.
PTB-INFO: This is Psychtoolbox-3 for GNU/Linux X11, under Matlab 64-Bit (Version
3.0.11 - Build date: Apr 6 2014).
PTB-INFO: No low-level controllable GPU on screenId 0. Beamposition timestamping and
other special functions disabled.
PTB-INFO: Failed to enable realtime-scheduling [Operation not permitted]!
PTB-DEBUG:PsychOSGetSwapCompletionTimestamp: Invalid return values ust = 0, msc = 0
from call with success return code (sbc = 304)! Failing with rc = -2.
PTB-DEBUG:PsychOSGetSwapCompletionTimestamp: This likely means a driver bug or
malfunction, or that timestamping support has been disabled by the user in the
driver!
PTB-INFO: OpenGL-Renderer is Intel Open Source Technology Center :: Mesa DRI
Intel(R) Sandybridge Mobile :: 3.0 Mesa 10.1.3
PTB-INFO: VBL startline = 768 , VBL Endline = -1
PTB-INFO: Will try to use OS-Builtin OpenML sync control support for accurate Flip
timestamping.
PTB-INFO: Measured monitor refresh interval from VBLsync = 16.685075 ms [59.933804
Hz]. (297 valid samples taken, stddev=0.310528 ms.)
PTB-INFO: Reported monitor refresh interval from operating system = 16.646968 ms
[60.070999 Hz].
PTB-INFO: Small deviations between reported values are normal and no reason to
worry.
WARNING: Couldn't compute a reliable estimate of monitor refresh interval! Trouble
with VBL syncing?!?
----- ! PTB - ERROR: SYNCHRONIZATION FAILURE ! ----
One or more internal checks (see Warnings above) indicate that synchronization
of Psychtoolbox to the vertical retrace (VBL) is not working on your setup.
This will seriously impair proper stimulus presentation and stimulus presentation
timing!
Please read 'help SyncTrouble' for information about how to solve or work-around the
problem.
You can force Psychtoolbox to continue, despite the severe problems, by adding the
command
Screen('Preference', 'SkipSyncTests', 1); at the top of your script, if you really
know what you are doing.
Error using Screen
See error message printed above.
Error in ReverseCorrelationFaces (line 81)
window=Screen('OpenWindow', windowNum);
What am I missing? A package? Is my hardware not okay? I can't figure this error out.
So... buried deep inside the DownloadPsychtoolbox.m file found here (see installation instructions here) is the instruction that Psychtoolbox apparently requires a special SDK. Super annoying. I will never use this toolbox again because it's so much drama to use. But this is what was missing and what caused the Screen call to fail:
Missing SDK download link:
http://docs.gstreamer.com/display/GstSDK/Installing+on+Windows