PCIe Configuration Space vs ECAM

Is the PCIe ECAM exactly the same as the "PCI-Compatible Configuration Registers" only mapped to memory instead of I/O?
It seems to me that PCIe uses the same Configuration Mechanism as conventional PCI: [1]
| 31 | 30 - 24 | 23 - 16 | 15 - 11 | 10 - 8 | 7 - 2 | 1 - 0 |
|--------|----------|--------|-----------|-------------|-------------|----|
| Enable | Reserved | Bus Nr | Device Nr | Function Nr | Register Nr | 00 |
But in PCIe you can use the reserved bytes to address more registers of a function.
Is this correct?
In section 7.2.1 [2] the ECAM is defined as:
| 27 - 20 | 19 - 15 | 14 - 12 | 11 - 8 | 7 - 2 | 1 - 0 |
|---------|---------|-------------|------------------|-------------|-------------|
| Bus Nr | Dev Nr | Function Nr | Ext. Register Nr | Register Nr | Byte Enable |
It looks very similar to the conventional configuration.
The formerly reserved bits have essentially been moved down to extend the register number in PCIe.
So can I use it like the old mechanism, only addressing it in memory space instead of I/O space?
[1] https://wiki.osdev.org/PCI#Configuration_Space_Access_Mechanism_.231
[2] PCI Express Base Specification, Rev. 4.0, Version 1.0

You're mixing apples and oranges in your comparison. The first address decoding is provided by a host bridge component on PC-AT architecture systems (*). It's a way of using the Intel processor's I/O address space to interface to the PCI bus configuration space mechanism. It can also be used on a PCIe system, because the PCIe host bridge component provides the same interface to PCIe devices. However, everything below the host bridge is implemented quite differently between PCI and PCIe.
Meanwhile the second decoding scheme you showed can only be used within the memory-mapped block through which PCIe provides access to its extended configuration space. And only after that block has been mapped into the physical address space in a system-dependent way.
So while they have a similar function, no, you cannot use them in the same way. You can either:
- access the first 256 bytes of any PCI or PCIe device's configuration space using the first mechanism, but then you must use the first addressing scheme, or
- access the entire extended configuration space of any PCIe device (including the first 256 bytes) using the second mechanism, but then you must use the second addressing scheme.
(*) The "I/O space interface to PCI bus configuration via 0xCF8 / 0xCFC" really is part of the Intel / PC-AT architecture. Other system architectures (MIPS for example) don't have separate I/O address spaces, and host bridges designed for them have different mechanisms to generate PCIe configuration space accesses (or they simply use the memory-mapped mechanism directly).

Related

Is there any way to reduce the PostgreSQL performance deviation between the multiple iterations?

NOPM values were captured with HammerDB v4.3 scripts (schema_tpcc.tcl and
test_tpcc.tcl) over multiple trials.
The expected performance deviation between trials is less than 2%, but I observed more.
Hardware configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
OS: RHEL 8.4
RAM size: 512 GB
SSD: 1 TB
postgresql.conf
autovacuum_max_workers = 16
autovacuum_vacuum_cost_limit = 3000
checkpoint_completion_target = 0.9
checkpoint_timeout = '15min'
cpu_tuple_cost = 0.03
effective_cache_size = '350GB'
listen_addresses = '*'
maintenance_work_mem = '2GB'
max_connections = 1000
max_wal_size = '128GB'
random_page_cost = 1.1
shared_buffers = '128GB'
wal_buffers = '1GB'
work_mem = '128MB'
effective_io_concurrency = 200
HammerDB Scripts
>>cat schema.tcl
#!/bin/tclsh
dbset db pg
diset connection pg_host localhost
diset connection pg_port 5432
diset tpcc pg_count_ware 400
diset tpcc pg_num_vu 50
print dict
buildschema
waittocomplete
Run the test, i.e. start with 1 VU, then 2, 4, etc.
| Virtual Users | Trial-1 (NOPM) | Trial-2 (NOPM) | %diff |
|---------------|---------------|---------------|---------|
| 12 | 99390 | 92913 | 6.516752|
| 140 | 561429 | 525408 | 6.415949|
| 192 | 636016 | 499574 | 21.4526 |
| 230 | 621644 | 701882 | 12.9074 |
There is already a comprehensive answer to this question on HammerDB discussions.
You make an assumption that PostgreSQL will scale linearly for an intensive OLTP workload on 256 logical CPUs of a particular type. However, if a workload experiences high contention, then performance will not be as expected on a particular hardware/software combination due to locking and latching; this is to be expected. Your experience may be different on different hardware (with the same database) and/or a different database (on the same hardware). For example, you may find that a higher core count results in lower performance, as the additional cores increase contention and lower throughput.
You need to follow the advice in the discussions post and analyze the wait events, either with the HammerDB v4.3 graphical metrics viewer for pg_active_session_history or with SQL directly (a sketch of such a query is shown below). This will point you to the exact cause of contention on your particular hardware/software combination: LWLock waits are highlighted in pink in the viewer, or look for them in the query output. If this does not enable you to diagnose the issue directly, then employing a PostgreSQL consultant to explain the issue for you would be necessary.
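If you go the SQL route, a minimal sketch of such a query, assuming the pgsentinel extension that provides pg_active_session_history is installed (it is what the HammerDB metrics viewer reads), could be:

SELECT wait_event_type, wait_event, count(*) AS samples
FROM pg_active_session_history
GROUP BY wait_event_type, wait_event
ORDER BY samples DESC
LIMIT 10;

A large share of samples with wait_event_type = 'LWLock' would confirm the kind of contention described above.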

When the write message length is more than 1024 B (MTU), it fails in SoftRoCE mode

When the message length I write is more than 1024 B (the MTU), it fails in SoftRoCE mode; please help me check why.
I am using the standard tool ib_write_lat to test:
ib_write_lat -s 1024 -n 5 works.
ib_write_lat -s 1025 -n 5 fails.
My SoftRoCE version is the one in Red Hat Enterprise Linux Server release 7.4 (Maipo).
Is it a bug in SoftRoCE?
No, it isn't a bug. I had similar problems.
What MTU did you configure on your interface?
I expect that you have an MTU of 1500 bytes configured (or left the default value); this results in RoCE using 1024. If you configure your interface MTU to 4200, you can use the ib_write_lat command with message sizes up to 4096 bytes.
The InfiniBand protocol Maximum Transmission Unit (MTU) defines several fixed MTU sizes: 256, 512, 1024, 2048 or 4096 bytes.
A RoCE-based application that uses RDMA over Ethernet should take into account that the RoCE MTU is smaller than the Ethernet MTU (normally 1500 by default).
https://community.mellanox.com/docs/DOC-1447
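As a rough illustration of that relationship (not code from any driver; the ~100-byte header allowance is an assumed round figure, since the exact overhead depends on RoCE v1 vs v2), picking the active RoCE MTU from the interface MTU can be sketched like this:

#include <stdio.h>

/* Pick the largest fixed InfiniBand MTU that fits inside the configured
 * Ethernet interface MTU, leaving an assumed ~100 bytes for Ethernet/IP/UDP/
 * RoCE transport headers. This reproduces the behaviour described above:
 * an interface MTU of 1500 yields RoCE MTU 1024, 4200 yields 4096. */
int roce_active_mtu(int eth_mtu)
{
    static const int ib_mtus[] = { 4096, 2048, 1024, 512, 256 };
    const int overhead = 100;                 /* assumed header headroom */

    for (unsigned i = 0; i < sizeof ib_mtus / sizeof ib_mtus[0]; i++)
        if (ib_mtus[i] + overhead <= eth_mtu)
            return ib_mtus[i];
    return 256;                               /* minimum InfiniBand MTU */
}

int main(void)
{
    printf("eth MTU 1500 -> RoCE MTU %d\n", roce_active_mtu(1500)); /* 1024 */
    printf("eth MTU 4200 -> RoCE MTU %d\n", roce_active_mtu(4200)); /* 4096 */
    return 0;
}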

short driver from Linux Device Drivers book

I am trying to run short from the Linux Device Drivers book, a driver which by default uses the parallel interface of a PC at I/O address base 0x378. I am specifically using the /dev/short0 device.
Quoting from the book
/dev/short0 writes to and reads from the 8-bit port located at the I/O address base (0x378 unless changed at load time).
The write operation (in the default behavior) essentially does this:
while (count--) {
    outb(*(ptr++), port);  /* write the next byte to the I/O port */
    wmb();                 /* write memory barrier: keep the port writes ordered */
}
The ptr variable holds a pointer to the data the user has requested to be written to the device. Only the last byte of course survives, as preceding bytes get overwritten. The read operation works similarly, by using inb instead of outb.
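For reference, the corresponding read loop can be sketched like this (a sketch based on the book's description rather than a verbatim quote), reusing the same count, ptr and port variables with a read barrier in place of wmb():

while (count--) {
    *(ptr++) = inb(port);  /* read one byte from the I/O port into the buffer */
    rmb();                 /* read memory barrier, mirroring wmb() above */
}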
Quoting also from the book
If you choose to read from an output port, you most likely get back the last value written to the port (this applies to the parallel interface and to most other digital I/O circuits in common use)
So when I do
$ echo -n "a" > /dev/short0
$ dd if=/dev/short0 bs=1 count=1 | od -t x1
as suggested in the book, I expect to get back the ASCII code for 'a' in hex, but what I get is 0xff:
1+0 records in
1+0 records out
1 byte (1 B) copied, 0,000155485 s, 6,4 kB/s
0000000 ff
0000001
I have verified, by adding some printks and using dmesg, that the relevant code of the driver actually gets executed, and beyond that I'm stuck. What are some possible reasons for this not working? Or where should I look next to find out why it is not working?
For what it's worth, the I/O address range 0x378-0x37a is initially allocated by the parport module, so I rmmod it along with a few other modules that use parport before I load the short module. Finally, on my system uname -a gives
Linux Crete 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:31:42 UTC 2014 i686 i686 i686 GNU/Linux

ERROR: No OpenCL platforms found, check OpenCL installation

I tried to run a Matlab program on the GPU (CentOS 7.3).
This Matlab program uses Caffe.
When I run it from the command line with:
matlab -nodisplay -r "demo, quit"
it runs okay.
When I run it with LSF command:
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" matlab -nodisplay -r "demo, quit"
I get the error:
ERROR: No OpenCL platforms found, check OpenCL installation
I compared the LD_LIBRARY_PATH; it is the same in both cases.
What can be the problem?
Any ideas are welcome!
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 8.0.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K40m
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 375.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 09:00.0
Max compute units 15
Max clock frequency 745MHz
Compute Capability (NV) 3.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 245760 (240KiB)
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
My best guess would be that the bsub command from LSF schedules the job on another machine (compute node) in the cluster, where OpenCL is not installed.
Having OpenCL/CUDA on the frontend but not on the compute nodes of a cluster is something I've witnessed quite a few times. Even if the parts of the filesystem holding the libraries are shared, the folder /etc/OpenCL/vendors, used for OpenCL's ICD mechanism, must be present on the node.
You could try running clinfo via bsub (if you didn't already), or use bsub to execute ls /etc/OpenCL/vendors.
If you're not sure whether or not the LSF-submitted jobs run on the same machine, use the hostname command with and without bsub.
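Another quick check is submitting a minimal OpenCL platform probe through bsub instead of the full Matlab job. This is just a sketch; the file name and the gcc probe.c -lOpenCL link line are assumptions about your toolchain:

#include <stdio.h>
#include <CL/cl.h>

/* Minimal probe: report how many OpenCL platforms the ICD loader can find
 * on whichever node the LSF job actually lands on. */
int main(void)
{
    cl_uint num_platforms = 0;
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);

    if (err != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platforms found (err=%d)\n", err);
        return 1;
    }
    printf("%u OpenCL platform(s) found\n", num_platforms);
    return 0;
}

Submitting that binary with the same bsub options as the Matlab job (and hostname alongside it) should tell you quickly whether the compute node sees any platform at all.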
Hope that helps.

pgbouncer free_servers - how to increase them

The current settings of a pgbouncer server are the following, and what I don't understand is the 'free_servers' value given by the show lists command when connecting to pgbouncer. Is it a (soft or hard) limit on the number of connections to the PostgreSQL databases used with this instance of pgbouncer?
Configuration:
max_client_conn = 2048
default_pool_size = 1024
min_pool_size = 10
reserve_pool_size = 500
reserve_pool_timeout = 1
server_idle_timeout = 600
listen_backlog = 1024
show lists gives:
pgbouncer=# show lists ;
list | items
---------------+--------
databases | 6
pools | 3
free_clients | 185
used_clients | 15
free_servers | 70
used_servers | 30
It seems that there is a limit at 30 + 70 = 100 servers, but I couldn't find it even after reviewing the configuration values with show config, and the documentation doesn't explain which setting to change to increase free_servers.
pgbouncer version : 1.7.2
EDIT:
I've just discovered that, for a pool of 6 web servers configured to hit the same PG database, 3 of them can have 200 backend connections (server connections), while the other 3 can only make and maintain 100 connections (as described in the first part). But the pgbouncer configuration file is exactly the same on all of them, and the servers are cloned VMs. The version of pgbouncer is also the same.
So far, I still haven't found any documentation on the internet explaining where this limitation comes from.
This data is just some internal information for PgBouncer.
Server information is stored inside an array-backed list data structure which is pre-allocated up to a certain size, in this case 100 slots. used_servers = 30 and free_servers = 70 mean that 30 slots are currently in use and 70 slots are free. PgBouncer automatically increases the size of the list when it is full, hence there is no configuration setting for this.
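To illustrate the pattern described above (this is a generic sketch, not PgBouncer's actual source code, and all names are invented for the example), a pre-allocated slot pool that grows automatically when exhausted looks roughly like this:

#include <stdlib.h>

/* Generic illustration of a pre-allocated slot pool that grows on demand.
 * The reported used/free counts correspond to used and (total - used). */
struct slot_pool {
    void  **slots;   /* pre-allocated slot array                 */
    size_t  used;    /* "used_servers" in show lists             */
    size_t  total;   /* used + free = current capacity           */
};

int pool_init(struct slot_pool *p, size_t initial)
{
    p->slots = calloc(initial, sizeof *p->slots);
    if (!p->slots)
        return -1;
    p->used  = 0;
    p->total = initial;          /* e.g. 100: free slots start at 100 */
    return 0;
}

long pool_take(struct slot_pool *p)   /* returns a slot index, or -1 */
{
    if (p->used == p->total) {        /* pool exhausted: grow it automatically */
        size_t  new_total = p->total * 2;
        void  **tmp = realloc(p->slots, new_total * sizeof *p->slots);
        if (!tmp)
            return -1;
        p->slots = tmp;
        p->total = new_total;
    }
    return (long)p->used++;           /* free slots = total - used */
}

Because the pool grows on its own the moment every slot is in use, the 100 you observed is a starting capacity rather than a hard connection limit.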