I am currently trying to record video on my Lenovo laptop with its built-in webcam using FFmpeg on Windows 10. One of my goals is to keep CPU usage as low as possible, which is why I want to push the H.264 encoding to the GPU.
It gets a bit tricky with my laptop, because it has two GPUs. The first is an Intel HD 5500 graphics unit that is part of the CPU; it is most likely used for non-demanding applications like office work to save energy. The other is an AMD R5 M330 that is used for graphics-intensive applications like gaming.
Currently, I am using the following command to encode the webcam stream on the Intel HD GPU:
ffmpeg -f dshow -vcodec mjpeg -video_size 1280x720 -framerate 30 -i video="Lenovo EasyCamera":audio="Mikrofon (Realtek High Definition Audio)" -c:v h264_qsv -g 60 -q 28 -look_ahead 0 -preset:v faster -c:a aac -q:a 0.6 -r 30 output.mp4
This works so far, but it seems this GPU does not have enough power to keep up with the frame rate at higher bitrates or with a high number of I-frames. The video starts lagging and skipping frames. If I use CPU encoding, everything runs smoothly.
Since my laptop has that second, much more powerful AMD GPU, it would be worth trying to encode on that one, but I can't find any information about how to encode on AMD hardware on Windows 10. So my question is: what does the ffmpeg command look like to use AMD hardware for H.264 encoding?
Current versions of ffmpeg now support hardware encoding for the major GPU vendors. Here are the options for the h264_amf (H.264) and hevc_amf (H.265 or HEVC) encoders which you would use with an AMD graphics card, taken from ffmpeg -h full:
h264_amf AVOptions:
-usage <int> E..V.... Encoder Usage (from 0 to 3) (default transcoding)
transcoding E..V.... Generic Transcoding
ultralowlatency E..V....
lowlatency E..V....
webcam E..V.... Webcam
-profile <int> E..V.... Profile (from 66 to 257) (default main)
main E..V....
high E..V....
constrained_baseline E..V....
constrained_high E..V....
-level <int> E..V.... Profile Level (from 0 to 62) (default auto)
auto E..V....
1.0 E..V....
1.1 E..V....
1.2 E..V....
1.3 E..V....
2.0 E..V....
2.1 E..V....
2.2 E..V....
3.0 E..V....
3.1 E..V....
3.2 E..V....
4.0 E..V....
4.1 E..V....
4.2 E..V....
5.0 E..V....
5.1 E..V....
5.2 E..V....
6.0 E..V....
6.1 E..V....
6.2 E..V....
-quality <int> E..V.... Quality Preference (from 0 to 2) (default speed)
speed E..V.... Prefer Speed
balanced E..V.... Balanced
quality E..V.... Prefer Quality
-rc <int> E..V.... Rate Control Method (from -1 to 3) (default -1)
cqp E..V.... Constant Quantization Parameter
cbr E..V.... Constant Bitrate
vbr_peak E..V.... Peak Contrained Variable Bitrate
vbr_latency E..V.... Latency Constrained Variable Bitrate
-enforce_hrd <boolean> E..V.... Enforce HRD (default false)
-filler_data <boolean> E..V.... Filler Data Enable (default false)
-vbaq <boolean> E..V.... Enable VBAQ (default false)
-frame_skipping <boolean> E..V.... Rate Control Based Frame Skip (default false)
-qp_i <int> E..V.... Quantization Parameter for I-Frame (from -1 to 51) (default -1)
-qp_p <int> E..V.... Quantization Parameter for P-Frame (from -1 to 51) (default -1)
-qp_b <int> E..V.... Quantization Parameter for B-Frame (from -1 to 51) (default -1)
-preanalysis <boolean> E..V.... Pre-Analysis Mode (default false)
-max_au_size <int> E..V.... Maximum Access Unit Size for rate control (in bits) (from 0 to INT_MAX) (default 0)
-header_spacing <int> E..V.... Header Insertion Spacing (from -1 to 1000) (default -1)
-bf_delta_qp <int> E..V.... B-Picture Delta QP (from -10 to 10) (default 4)
-bf_ref <boolean> E..V.... Enable Reference to B-Frames (default true)
-bf_ref_delta_qp <int> E..V.... Reference B-Picture Delta QP (from -10 to 10) (default 4)
-intra_refresh_mb <int> E..V.... Intra Refresh MBs Number Per Slot in Macroblocks (from 0 to INT_MAX) (default 0)
-coder <int> E..V.... Coding Type (from 0 to 2) (default auto)
auto E..V.... Automatic
cavlc E..V.... Context Adaptive Variable-Length Coding
cabac E..V.... Context Adaptive Binary Arithmetic Coding
-me_half_pel <boolean> E..V.... Enable ME Half Pixel (default true)
-me_quarter_pel <boolean> E..V.... Enable ME Quarter Pixel (default true)
-aud <boolean> E..V.... Inserts AU Delimiter NAL unit (default false)
-log_to_dbg <boolean> E..V.... Enable AMF logging to debug output (default false)
hevc_amf AVOptions:
-usage <int> E..V.... Set the encoding usage (from 0 to 3) (default transcoding)
transcoding E..V....
ultralowlatency E..V....
lowlatency E..V....
webcam E..V....
-profile <int> E..V.... Set the profile (default main) (from 1 to 1) (default main)
main E..V....
-profile_tier <int> E..V.... Set the profile tier (default main) (from 0 to 1) (default main)
main E..V....
high E..V....
-level <int> E..V.... Set the encoding level (default auto) (from 0 to 186) (default auto)
auto E..V....
1.0 E..V....
2.0 E..V....
2.1 E..V....
3.0 E..V....
3.1 E..V....
4.0 E..V....
4.1 E..V....
5.0 E..V....
5.1 E..V....
5.2 E..V....
6.0 E..V....
6.1 E..V....
6.2 E..V....
-quality <int> E..V.... Set the encoding quality (from 0 to 10) (default speed)
balanced E..V....
speed E..V....
quality E..V....
-rc <int> E..V.... Set the rate control mode (from -1 to 3) (default -1)
cqp E..V.... Constant Quantization Parameter
cbr E..V.... Constant Bitrate
vbr_peak E..V.... Peak Contrained Variable Bitrate
vbr_latency E..V.... Latency Constrained Variable Bitrate
-header_insertion_mode <int> E..V.... Set header insertion mode (from 0 to 2) (default none)
none E..V....
gop E..V....
idr E..V....
-gops_per_idr <int> E..V.... GOPs per IDR 0-no IDR will be inserted (from 0 to INT_MAX) (default 60)
-preanalysis <boolean> E..V.... Enable preanalysis (default false)
-vbaq <boolean> E..V.... Enable VBAQ (default false)
-enforce_hrd <boolean> E..V.... Enforce HRD (default false)
-filler_data <boolean> E..V.... Filler Data Enable (default false)
-max_au_size <int> E..V.... Maximum Access Unit Size for rate control (in bits) (from 0 to INT_MAX) (default 0)
-min_qp_i <int> E..V.... min quantization parameter for I-frame (from -1 to 51) (default -1)
-max_qp_i <int> E..V.... max quantization parameter for I-frame (from -1 to 51) (default -1)
-min_qp_p <int> E..V.... min quantization parameter for P-frame (from -1 to 51) (default -1)
-max_qp_p <int> E..V.... max quantization parameter for P-frame (from -1 to 51) (default -1)
-qp_p <int> E..V.... quantization parameter for P-frame (from -1 to 51) (default -1)
-qp_i <int> E..V.... quantization parameter for I-frame (from -1 to 51) (default -1)
-skip_frame <boolean> E..V.... Rate Control Based Frame Skip (default false)
-me_half_pel <boolean> E..V.... Enable ME Half Pixel (default true)
-me_quarter_pel <boolean> E..V.... Enable ME Quarter Pixel (default true)
-aud <boolean> E..V.... Inserts AU Delimiter NAL unit (default false)
-log_to_dbg <boolean> E..V.... Enable AMF logging to debug output (default false)
For example, ffmpeg -i input.mkv -c:v hevc_amf -rc cqp -qp_p 0 -qp_i 0 -c:a copy output.mkv would be lossless. Note that while it's much faster, the file sizes will be significantly larger than with libx264 or libx265 for the same quality - that's just how hardware encoders are at the present time. You will likely want to record lossless with a hardware encoder for the speed and then later use a software encoder like libx264 or libx265 to reduce the file size.
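For the webcam capture in the question, a starting point might look like this (a sketch only: the device names are copied from the question, and the rate-control numbers are illustrative rather than tuned values):
ffmpeg -f dshow -vcodec mjpeg -video_size 1280x720 -framerate 30 -i video="Lenovo EasyCamera":audio="Mikrofon (Realtek High Definition Audio)" -c:v h264_amf -usage transcoding -quality quality -rc vbr_peak -b:v 6M -maxrate 8M -g 60 -c:a aac -b:a 160k output.mp4
If you instead record near-lossless with -rc cqp and low QP values, a later software pass such as ffmpeg -i output.mp4 -c:v libx264 -crf 20 -c:a copy smaller.mp4 can shrink the file afterwards, as described above.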
I'm trying to use the Arducam libcamera to get video from my Arducam 16 MP Autofocus camera on a Raspberry Pi 4, but I'm running into the issue that the quality is very low when retrieving 1920x1080 video (and images). In fact, with any resolution other than the maximum the camera offers, the quality is very low.
libcamera was installed by following the Arducam guide.
Using libcamera-still --list-cameras, I get the following supported modes:
0 : imx519 [4656x3496] (/base/soc/i2c0mux/i2c@1/imx519@1a)
Modes: 'SRGGB10_CSI2P' : 1280x720 [120.00 fps - (1048, 1042)/2560x1440 crop]
1920x1080 [60.05 fps - (408, 674)/3840x2160 crop]
2328x1748 [30.00 fps - (0, 0)/4656x3496 crop]
3840x2160 [18.00 fps - (408, 672)/3840x2160 crop]
4656x3496 [9.00 fps - (0, 0)/4656x3496 crop]
To demonstrate the problem, I run the following command to capture two images: one at 4656x3496 and one half as big (2328x1748):
libcamera-still --immediate --shutter 50000 --gain 1.5 --width 2328 --height 1748 -e png -o org_2328x1748.png && libcamera-still --immediate --shutter 50000 --gain 1.5 --width 4656 --height 3496 -e png -o org_4656x3496.png
If I then scale the 4656x3496 image to 2328x1748 on my computer, the result is a much sharper image than the 2328x1748 image delivered by the camera. This is true regardless of which program I use to scale the image. I don't understand why that happens. I've also noticed that the file size of the 4656x3496 image (15.6 MiB) is a lot bigger than 4x the file size of the 2328x1748 image (2.7 MiB). I think the small image should have been closer to 4 MiB in size.
Arducam support says it is a matter of finding the right libcamera arguments, so I'm hoping someone can be of help.
The output from executing the libcamera-still command is the following:
pi@rpi:~ $ libcamera-still --immediate --shutter 50000 --gain 1.5 --width 2328 --height 1748 -e png -o org_2328x1748.png && libcamera-still --immediate --shutter 50000 --gain 1.5 --width 4656 --height 3496 -e png -o org_4656x3496.png
Preview window unavailable
[3:26:37.815366618] [4847] INFO Camera camera_manager.cpp:293 libcamera v0.0.0+3730-67300b62
[3:26:37.851216285] [4849] WARN CameraSensorProperties camera_sensor_properties.cpp:174 No static properties available for 'imx519'
[3:26:37.851259785] [4849] WARN CameraSensorProperties camera_sensor_properties.cpp:176 Please consider updating the camera sensor properties database
[3:26:37.868448913] [4849] WARN RPI raspberrypi.cpp:1274 Mismatch between Unicam and CamHelper for embedded data usage!
[3:26:37.868923575] [4849] ERROR DelayedControls delayed_controls.cpp:87 Delay request for control id 0x009a090a but control is not exposed by device /dev/v4l-subdev0
[3:26:37.869202646] [4849] INFO RPI raspberrypi.cpp:1398 Registered camera /base/soc/i2c0mux/i2c@1/imx519@1a to Unicam device /dev/media3 and ISP device /dev/media0
[3:26:37.870104137] [4847] INFO Camera camera.cpp:1029 configuring streams: (0) 2328x1748-BGR888 (1) 2328x1748-SRGGB10_CSI2P
[3:26:37.870498504] [4849] INFO RPI raspberrypi.cpp:763 Sensor: /base/soc/i2c0mux/i2c@1/imx519@1a - Selected sensor format: 2328x1748-SRGGB10_1X10 - Selected unicam format: 2328x1748-pRAA
Still capture image received
Preview window unavailable
[3:26:39.086635744] [4855] INFO Camera camera_manager.cpp:293 libcamera v0.0.0+3730-67300b62
[3:26:39.123343254] [4858] WARN CameraSensorProperties camera_sensor_properties.cpp:174 No static properties available for 'imx519'
[3:26:39.123386606] [4858] WARN CameraSensorProperties camera_sensor_properties.cpp:176 Please consider updating the camera sensor properties database
[3:26:39.140987785] [4858] WARN RPI raspberrypi.cpp:1274 Mismatch between Unicam and CamHelper for embedded data usage!
[3:26:39.141479410] [4858] ERROR DelayedControls delayed_controls.cpp:87 Delay request for control id 0x009a090a but control is not exposed by device /dev/v4l-subdev0
[3:26:39.141723259] [4858] INFO RPI raspberrypi.cpp:1398 Registered camera /base/soc/i2c0mux/i2c@1/imx519@1a to Unicam device /dev/media3 and ISP device /dev/media0
[3:26:39.142604010] [4855] INFO Camera camera.cpp:1029 configuring streams: (0) 4656x3496-BGR888 (1) 4656x3496-SRGGB10_CSI2P
[3:26:39.142994210] [4858] INFO RPI raspberrypi.cpp:763 Sensor: /base/soc/i2c0mux/i2c@1/imx519@1a - Selected sensor format: 4656x3496-SRGGB10_1X10 - Selected unicam format: 4656x3496-pRAA
Still capture image received
The images can be seen in this Google Drive folder. It contains the original images as well as the large image scaled down to 2328x1748 (scaled_4656x3496.png) with MS Paint. Notice that it is very sharp compared to org_2328x1748.png.
Adding --mode 4656:3496 results in high-quality lower-resolution images. For example:
libcamera-still --immediate --width 2328 --height 1748 -e png -o 2328x1748.png --mode 4656:3496
libcamera-still --immediate --width 1920 --height 1080 -e png -o 1920x1080.png --mode 4656:3496
It also works for libcamera-vid. The framerate is limited by the selected sensor mode. So 9 fps is the max if the selected mode is 4656:3496, as specified by --list-cameras:
libcamera-vid -o 1080p_3496_mode.h264 --width 1920 --height 1080 --framerate 9 --mode 4656:3496
I ended up using 3840:2160 mode which delivers sufficient quality and supports 18 fps.
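For reference, the equivalent command would look like this (a sketch mirroring the 1080p example above, with an illustrative output filename):
libcamera-vid -o 1080p_2160_mode.h264 --width 1920 --height 1080 --framerate 18 --mode 3840:2160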
NOPM values were captured with HammerDB v4.3 scripts (schema_tpcc.tcl and test_tpcc.tcl) for multiple trials.
The expected performance deviation between the trials should be less than 2%, but I observed more.
Hardware configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
OS: RHEL 8.4
RAM size: 512 GB
SSD: 1 TB
Postgresql.conf
autovacuum_max_workers = 16
autovacuum_vacuum_cost_limit = 3000
checkpoint_completion_target = 0.9
checkpoint_timeout = '15min'
cpu_tuple_cost = 0.03
effective_cache_size = '350GB'
listen_addresses = '*'
maintenance_work_mem = '2GB'
max_connections = 1000
max_wal_size = '128GB'
random_page_cost = 1.1
shared_buffers = '128GB'
wal_buffers = '1GB'
work_mem = '128MB'
effective_io_concurrency = 200
HammerDB Scripts
>>cat schema.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 400
diset tpcc pg_num_vu 50
print dict
buildschema
waittocomplete
Run the test, i.e. start with 1 VU, then 2, 4, etc. Results from two trials (a sketch of a possible driver script follows the table):
| Virtual Users | Trial-1 (NOPM) | Trial-2 (NOPM) | % diff  |
|---------------|---------------|---------------|---------|
| 12 | 99390 | 92913 | 6.516752|
| 140 | 561429 | 525408 | 6.415949|
| 192 | 636016 | 499574 | 21.4526 |
| 230 | 621644 | 701882 | 12.9074 |
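The driver script test_tpcc.tcl itself is not shown above; the following is only a minimal sketch of what a timed driver script could look like (the rampup, duration and virtual-user values are placeholders, not the values actually used):
>>cat test.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_driver timed
diset tpcc pg_rampup 2
diset tpcc pg_duration 5
loadscript
vuset vu 12
vucreate
vurun
waittocomplete
vudestroy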
There is already a comprehensive answer to this question on HammerDB discussions.
You make an assumption that PostgreSQL will scale linearly for an intensive OLTP workload on 256 logical CPUs of a particular type. However, if a workload experiences high contention then performance will not be as expected on a particular hardware/software combination due to locking and latching - this is to be expected. Your experience may be different on different hardware (with the same database) and/or a different database (on the same hardware). For example, you may find a higher core count results in lower performance as the additional cores increase contention, lowering throughput.
You need to follow the advice in the discussions post and analyze the wait events, either with the HammerDB v4.3 graphical metrics viewer for pg_active_session_history or with SQL directly. This will point you to the exact cause of contention for a particular hardware/software combination (LWLock is highlighted in pink in the viewer; look for it in the query output). If this does not enable you to diagnose the issue directly, then engaging a PostgreSQL consultant to explain the issue would be the next step.
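For the SQL route, a query along these lines aggregates the sampled wait events (a sketch only, assuming the pg_active_session_history table with the usual wait_event_type/wait_event columns):
SELECT wait_event_type, wait_event, count(*) AS samples
FROM pg_active_session_history
GROUP BY wait_event_type, wait_event
ORDER BY samples DESC
LIMIT 20;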
I have an issue regarding the performance of Drools on different machines.
I made a very simple JMH benchmark test:
package ge.magticom.rules.benchmark;
import ge.magticom.rules.benchmark.Subscriber
rule "bali.free.smsparty"
activation-group "main"
salience 4492
when
$subs:Subscriber(chargingProfileID == 2)
then
$subs.setResult(15);
end
rule "bali.free.smsparty5"
activation-group "main"
salience 4492
when
$subs:Subscriber(chargingProfileID == 3)
then
$subs.setResult(14);
end
@Benchmark
public Subscriber send() throws Exception {
    Subscriber subscriber = new Subscriber();
    subscriber.setChargingProfileID(5);
    StatelessKieSession session = ruleBase.newStatelessKieSession();
    ArrayList<Object> objs = new ArrayList<Object>();
    objs.add(subscriber);
    session.execute(objs);
    return subscriber;
}
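(The ruleBase field is not shown in the snippet; a minimal sketch of how it could be initialized, assuming the DRL is packaged on the classpath. This setup is illustrative, not the poster's actual code:)
@State(Scope.Benchmark)
public class RulesBenchmark {
    KieBase ruleBase;

    @Setup(Level.Trial)
    public void setUp() {
        // illustrative: obtain the default KieBase from the classpath KieContainer
        KieServices ks = KieServices.Factory.get();
        ruleBase = ks.getKieClasspathContainer().getKieBase();
    }
    // ... send() benchmark method as above
}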
On my home development machine:
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
(Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads), 64 GB memory, with JDK 11, and the performance is very good.
With 7 threads I get nearly 2M operations per second (stateless):
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 2154292.750 ± 149405.498 ops/s
But on the preproduction server, which is an Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 112 threads and 1 TB RAM, I get half the performance (even when increasing the thread count):
NAME="Oracle Linux Server"
VERSION="8.4"
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 5 1084939.195 ± 107897.663 ops/s
I'm trying to test our billing system with Java 11 and Drools 7.54.0.Final.
Our system was based on JRockit Real Time 1.6 and Drools version 4.0.3. We are moving the system from Sun Solaris SPARC to an Intel-based system.
Running the same rules with JRockit 1.6, I got an even worse performance issue between the home and preproduction environments:
Home test benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 692054.563 ± 3507.519 ops/s
Preproduction benchmark:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.send thrpt 20 382283.288 ± 6405.953 ops/s
As you can see, it's nearly half the performance for very simple rules.
But for real rules, such as our online charging system, the performance is even worse.
On the home environment I got:
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 152.846 ± 87.076 ops/s
This means one message contains nearly 100 iterations,
so in 00:01:49 the benchmark processed 16287 sessions with 430590 rule-call events. A single rule call takes about 2.33 milliseconds on average, which is not great, but not as bad as on preproduction.
On the preproduction server:
Benchmark Mode Cnt Score Error Units
WorkerBenchmark.send thrpt 5 35.013 ± 9.565 ops/s
In 00:01:54 I got only 3723 sessions, containing 98571 rule-call events in total. Each call takes 10.7299 ms on average.
While running all these benchmarks nothing else was running on the preproduction system, whereas the home environment had a lot of development tools running, and the tests were run from IntelliJ IDEA.
Can you suggest anything that may cause such a difference in performance? I tried different Java versions and vendors; these results are based on Oracle JDK 11.0.8.
Here are the kernel params of the preproduction server:
fs.file-max = 6815744
kernel.sem = 2250 32000 100 128
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.shmmax = 4398046511104
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
This is just a very wild guess since I definitely don't have enough information, but are the two environments using the same garbage collectors, configured in the same way? Maybe you're using ParallelGC (which in my experience is better for the pure throughput you're measuring) on one side and G1 on the other?
Thanks for the answer.
I used several GC configurations; none of them was ParallelGC. I don't think GC is the problem. I used ZGC in the final tests and GC pause times are not above 5 ms (also tested with Java 16, where pause times are below 100 microseconds):
@Fork(value = 2, jvmArgs = {"--illegal-access=permit", "-Xms10G", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
"-Xmx10G", "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10", "-XX:+UseZGC", "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"})
java -version
java version "11.0.8" 2020-07-14 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.8+10-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode)
Here is a flame graph generated with async-profiler.
As you can see, on the home environment the Java process is using 95% of the whole time, but on the server only 65%. The throughput difference is also obvious:
RulesBenchmark.send thrpt 5 1612318.098 ± 64712.672 ops/s
Home Result FlameGraph.html
RulesBenchmark.send thrpt 5 775498.081 ± 72237.890 ops/s
Server Flame Graph.html
I am trying to run a MATLAB program on a GPU (CentOS 7.3).
This MATLAB program uses Caffe.
When I run it from the command line with:
matlab -nodisplay -r "demo, quit"
it runs okay.
When I run it with LSF command:
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" matlab -nodisplay -r "demo, quit"
I get the error:
ERROR: No OpenCL platforms found, check OpenCL installation
I compared LD_LIBRARY_PATH: it is the same in both cases.
What can be the problem?
Any ideas are welcome!
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 8.0.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K40m
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 375.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 09:00.0
Max compute units 15
Max clock frequency 745MHz
Compute Capability (NV) 3.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 245760 (240KiB)
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
My best guess would be that the bsub command from LSF schedules the job on another machine (a compute node) in the cluster, where OpenCL is not installed.
Having OpenCL/CUDA on the frontend but not on the compute nodes of a cluster is something I've witnessed quite a few times. Even if parts of the filesystem with the libraries are shared, the folder /etc/OpenCL/vendors, used for OpenCL's ICD mechanism, must be present.
You could try running clinfo via bsub (if you didn't already), or use bsub to execute ls /etc/OpenCL/vendors.
If you're not sure whether the LSF-submitted jobs run on the same machine, use the hostname command with and without bsub.
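For example (a sketch reusing the resource string from the question; the -I flag simply runs the job interactively so the output is printed back to your terminal):
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" -I hostname
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" -I clinfo
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" -I ls /etc/OpenCL/vendors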
Hope that helps.
I am trying to access the framebuffer on my systems VGA controller card.
lspci -vn gives:
00:02.0 0300: 8086:2a02 (rev 0c) (prog-if 00 [VGA controller])
Subsystem: 1028:022f
Flags: bus master, fast devsel, latency 0, IRQ 45
Memory at fea00000 (64-bit, non-prefetchable) [size=1M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
I/O ports at eff8 [size=8]
Expansion ROM at <unassigned> [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 3
Kernel driver in use: i915
Now, I access the device and I get:
fb_base = pci_resource_start( devp, 0 );  /* output: FEA00000 */
fb_size = pci_resource_len( devp, 0 );    /* output: 1MB */
So the range of the framebuffer is FEA00000 - FEB00000.
But from the lspci -vn output, this region is non-prefetchable.
Does that mean I am not pointing to the framebuffer at all?
Is my framebuffer at the prefetchable address E0000000 instead?
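For reference, this is a minimal sketch of how one could dump all the BARs from the driver to compare them (devp is the pci_dev pointer from the snippet above; everything else is illustrative):
#include <linux/pci.h>

/* Illustrative helper: print start, length and flags of each standard BAR,
 * so the prefetchable region (the likely framebuffer aperture) stands out. */
static void dump_bars(struct pci_dev *devp)
{
        int bar;

        for (bar = 0; bar < 6; bar++) {
                resource_size_t start = pci_resource_start(devp, bar);
                resource_size_t len = pci_resource_len(devp, bar);
                unsigned long flags = pci_resource_flags(devp, bar);

                if (!len)
                        continue;
                pr_info("BAR %d: start=%pa len=%pa %s\n", bar, &start, &len,
                        (flags & IORESOURCE_PREFETCH) ? "prefetchable" : "non-prefetchable");
        }
}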
The driver currently using the resource is the Intel i915,
so maybe when I request the region or IRQ it can clash if it is not shared by that driver.
If I rmmod the i915 driver in order to insmod my own, will my screen go blank?
Please help.
Thanks.