How can a usage counter in Solaris 10 /proc filesystem decrease?

I'm trying to determine the CPU utilization of specific LWPs in specific processes in Solaris 10 using data from the /proc filesystem. The problem I have is that sometimes a utilization counter decreases.
Here's the gist of it:
// we'll be reading from the file named /proc/<pid>/lwp/<lwpid>/lwpusage
std::stringstream filename;
filename << "/proc/" << pid << "/lwp/" << lwpid << "/lwpusage";
int fd = open(filename.str().c_str(), O_RDONLY);
// error checking
while (1)
{
    prusage_t usage;
    ssize_t readResult = pread(fd, &usage, sizeof(prusage_t), 0);
    // error checking
    std::cout << "sec=" << usage.pr_stime.tv_sec
              << " nsec=" << usage.pr_stime.tv_nsec << std::endl;
    // wait
}
close(fd);
The nanosecond values reported in the prusage_t struct are derived from timestamps recorded each time an LWP changes state. This feature is called microstate accounting. Sounds good, but every so often the "system call cpu time" counter decreases by roughly 1-10 milliseconds.
Update: it's not just the "system call cpu time" counter; I've since seen other counters decreasing as well.
Another curiosity is that it always seems to be exactly one sample that's bogus - never two near each other. All the other samples are monotonically increasing at the expected rate. This seems to rule out the possibility that the counter is somehow reset in the kernel.
Any clues as to what's going on here?
> uname -a
SunOS cdc-build-sol10u7 5.10 Generic_139556-08 i86pc i386 i86pc

If you are on a multicore machine, you might check whether this is occurring when the process is migrated from one processor core to another. While your processes are running, prstat will show the CPU on which each is running. To minimize lock contention, frequently updated data is sometimes kept in a processor-specific memory area and only later synchronized with the copies held for other processors.
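If you want to test that theory, one option (a minimal sketch, assuming the lwpsinfo file and its pr_onpro field as documented in proc(4)) is to read the neighbouring lwpsinfo file in the same loop and log the processor the LWP last ran on next to each usage sample; if the bogus samples coincide with a change in pr_onpro, migration is a likely culprit:
// Sketch: read lwpsinfo alongside lwpusage and log the CPU the LWP last ran on.
// Assumes /proc/<pid>/lwp/<lwpid>/lwpsinfo and lwpsinfo_t.pr_onpro per proc(4).
std::stringstream infoname;
infoname << "/proc/" << pid << "/lwp/" << lwpid << "/lwpsinfo";
int info_fd = open(infoname.str().c_str(), O_RDONLY);
// error checking
lwpsinfo_t info;
if (pread(info_fd, &info, sizeof(lwpsinfo_t), 0) == sizeof(lwpsinfo_t))
{
    std::cout << "last cpu=" << info.pr_onpro << std::endl;
}
close(info_fd);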

Just a guess: you might want to temporarily disable NTP and see if the problem still appears.

Related

QSPI connection on STM32 microcontrollers with other peripherals instead of Flash memories

I will start a project which needs the QSPI protocol. The component I will use is a 16-bit ADC which supports QSPI with all combinations of clock phase and polarity. Unfortunately, I couldn't find a source on the internet that shows QSPI on STM32 working with components other than Flash memories. Now, my question: can I use the STM32's QSPI peripheral to communicate with other devices that support QSPI? Or is it only meant to be used with memories?
The ADC component I want to use is: ADS9224R (16-bit, 3MSPS)
Here is an excerpt from the datasheet (page 33) that illustrates this device supports the full QSPI protocol.
Many thanks
The STM32 QSPI peripheral can work in several modes. The memory-mapped mode is specifically designed for memories. The indirect mode, however, can be used for any peripheral. In this mode you can specify the format of the commands that are exchanged: the presence of an instruction, of an address, of data, etc.
See register QUADSPI_CCR.
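For reference, here is a minimal sketch of an indirect-mode read using the STM32 HAL, assuming a CubeMX-generated hqspi handle; the instruction opcode, dummy-cycle count and line widths are placeholders you would take from the ADS9224R datasheet:
/* Sketch: indirect-mode read from a non-memory QSPI slave using the STM32 HAL.
 * The opcode (0xAB), dummy cycles and line widths below are placeholders -
 * substitute the values required by your device's datasheet. */
QSPI_CommandTypeDef cmd = {0};
uint8_t rx[4];

cmd.InstructionMode   = QSPI_INSTRUCTION_1_LINE;   /* command on 1 line */
cmd.Instruction       = 0xAB;                      /* placeholder opcode */
cmd.AddressMode       = QSPI_ADDRESS_NONE;         /* no address phase */
cmd.AlternateByteMode = QSPI_ALTERNATE_BYTES_NONE;
cmd.DataMode          = QSPI_DATA_4_LINES;         /* data on 4 lines */
cmd.DummyCycles       = 6;                         /* placeholder */
cmd.NbData            = sizeof(rx);

if (HAL_QSPI_Command(&hqspi, &cmd, HAL_QSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK ||
    HAL_QSPI_Receive(&hqspi, rx, HAL_QSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
{
    /* handle error */
}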
QUADSPI supports indirect mode, where for each data transaction you manually specify the command, the number of address bytes, the number of data bytes, the number of lines used for each part of the communication, and so on. I don't know whether HAL supports all of that; it would probably be more efficient to work directly with the QUADSPI registers - there are simply too many levers and controls you need to set up, and if the library is missing something, things may not work as you want, and QUADSPI is pretty unpleasant to debug. Luckily, after the initial setup, you probably won't need to change its settings very much.
In fact, some time ago, when I was learning QUADSPI, I wrote my own indirect read/write routines for a QUADSPI flash - purely a demo program for myself. With a bit of tweaking it shouldn't be hard to adapt. From my personal experience, QUADSPI is a little hard at first; I spent a couple of weeks debugging it with a logic analyzer until I got it to work. Or maybe that was due to my general inexperience.
Below you can find one of my functions, which can be used after the initial setup of QUADSPI. My other communication functions are around the same length. You only need to set a few registers. Be careful with the order of your register manipulations - there is no "start communication" flag/bit/command; communication starts automatically when you set certain parameters in specific registers. This is explicitly stated in the reference manual's QUADSPI section, which was the only documentation I used to write my code. There is surprisingly little information on QUADSPI available on the Internet, and even less at the register level.
Here is a piece from my basic example code on registers:
void QSPI_readMemoryBytesQuad(uint32_t address, uint32_t length, uint8_t destination[]) {
    while (QUADSPI->SR & QUADSPI_SR_BUSY); // Make sure no operation is going on
    QUADSPI->FCR = QUADSPI_FCR_CTOF | QUADSPI_FCR_CSMF | QUADSPI_FCR_CTCF | QUADSPI_FCR_CTEF; // Clear all flags
    QUADSPI->DLR = length - 1U; // Set number of bytes to read
    QUADSPI->CR = (QUADSPI->CR & ~(QUADSPI_CR_FTHRES)) | (0x00 << QUADSPI_CR_FTHRES_Pos); // Set FIFO threshold to 1
    /*
     * Set communication configuration register
     * Functional mode: Indirect read
     * Data mode: 4 Lines
     * Instruction mode: 4 Lines
     * Address mode: 4 Lines
     * Address size: 24 Bits
     * Dummy cycles: 6 Cycles
     * Instruction: Quad Output Fast Read
     *
     * Set 24-bit Address
     */
    QUADSPI->CCR =
        (QSPI_FMODE_INDIRECT_READ << QUADSPI_CCR_FMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_DMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_IMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_ADMODE_Pos) |
        (QSPI_ADSIZE_24 << QUADSPI_CCR_ADSIZE_Pos) |
        (0x06 << QUADSPI_CCR_DCYC_Pos) |
        (MT25QL128ABA1EW9_COMMAND_QUAD_OUTPUT_FAST_READ << QUADSPI_CCR_INSTRUCTION_Pos);
    QUADSPI->AR = (0xFFFFFF) & address;
    /* ---------- Communication starts automatically ---------- */
    while (QUADSPI->SR & QUADSPI_SR_BUSY) {
        if (QUADSPI->SR & QUADSPI_SR_FTF) {
            *destination = *((uint8_t*) &(QUADSPI->DR)); // Read a byte from the data register (byte access)
            destination++;
        }
    }
    QUADSPI->FCR = QUADSPI_FCR_CTOF | QUADSPI_FCR_CSMF | QUADSPI_FCR_CTCF | QUADSPI_FCR_CTEF; // Clear flags
}
It is a little crude, but it may be a good starting point for you, and it's well-tested and definitely works. You can find all my functions here (GitHub). Combine it with reading the QUADSPI section of the reference manual, and you should start to get a grasp of how to make it work.
Your job will be to determine what kind of commands you need to send to your QSPI slave device, and in what format. That information is available in the device's datasheet. Make sure you send the command, the address and every other part of the transaction on the correct number of QUADSPI lines. For example, sometimes you need the command on 1 line and the data on all 4, all within the same transaction. Make sure you set the dummy cycles if they are required for an operation. Pay special attention to how you read the data you receive via QUADSPI. You can read it in 32-bit words at once (if the incoming data is a whole number of 32-bit words). In my case - in the function provided here - I read it byte by byte, hence the scary-looking *destination = *((uint8_t*) &(QUADSPI->DR));, where I take the address of the data register, cast it to a pointer to uint8_t and dereference it. Otherwise, if you read DR simply as QUADSPI->DR, the MCU reads a 32-bit word for every byte that arrives, and QUADSPI goes crazy, hangs, shows various errors and triggers FIFO threshold flags. Just be mindful of how you read that register.
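For completeness, here is a hedged sketch of the word-wise variant mentioned above; it assumes the length programmed into DLR is a multiple of 4, that destination is 4-byte aligned, and that FTHRES has been set to 3 so FTF means at least one full word is waiting in the FIFO:
/* Sketch (untested): drain the QUADSPI FIFO in 32-bit words instead of bytes.
 * Assumes the transfer length in DLR is a multiple of 4 bytes, destination is
 * 4-byte aligned, and FTHRES = 3 so FTF means >= 4 bytes are available. */
uint32_t *dest32 = (uint32_t *) destination;
while (QUADSPI->SR & QUADSPI_SR_BUSY) {
    if (QUADSPI->SR & QUADSPI_SR_FTF) {
        *dest32++ = QUADSPI->DR; // 32-bit access pops 4 bytes from the FIFO
    }
}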

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper, it is written that the 8-byte sequential write latency of Optane PM is 90 ns with clwb and 62 ns with ntstore, and that sequential reads take 169 ns.
But in my test with an Intel 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course, there are differences between my test method and the paper's, but the results are so much worse that they seem unreasonable. And my test is closer to actual usage.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become a bottleneck, causing stalls and making the measured latency inaccurate? If this is the case, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are testing the average latency instead of the best-case latency, as the FAST '20 paper does.
ntstore is more expensive than clwb, so its latency is higher. I guess there's a typo in your first paragraph.
Appended on 4.14:
Q: Are there tools to detect a possible bottleneck in the WPQ or buffers?
A: You can get a baseline when the PM is idle and use this baseline to indicate a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which counts the accumulated number of WPQ entries in each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ: UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. “The analysis of inter-process interference on a hybrid memory system.” Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU-side cost of doing one store + clwb + mfence for a cached write (see footnote 1). So, the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as the data making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours; wikichip quotes an Intel image showing this.
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts. So a store + clwb + mfence loop would test the cache-cold case if you don't do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully support clwb properly, but at least CSL got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring the throughput of a loop, not the latency of one store plus mfence in a previously idle CPU pipeline.
Separate from that, rewriting the same line repeatedly seems to be slower than writing sequentially; for example, this Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)
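To see how much of the gap comes from those two effects, one quick experiment is to modify your loop so each iteration touches a different cache line, and to record the minimum per-iteration time rather than the average. A rough sketch under those assumptions (it reuses the p mapping from your program; the __rdtscp cycle counts still need converting to nanoseconds using your TSC frequency):
// Sketch: stride by one cache line (64 bytes = 8 int64_t) so each clwb hits a
// fresh line, and keep the minimum per-iteration rdtscp delta as a rough
// best-case figure instead of the loop average.
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

uint64_t best_case_cycles(int64_t *p, int loop_num)
{
    unsigned aux;
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < loop_num; i++)
    {
        int64_t *line = p + (size_t)i * 8;  // 8 * 8 bytes = 64-byte stride
        uint64_t t0 = __rdtscp(&aux);       // waits for earlier instructions to retire
        *line = 0x2222;
        _mm_clwb(line);
        _mm_sfence();
        uint64_t t1 = __rdtscp(&aux);
        best = min_u64(best, t1 - t0);
    }
    return best;                            // convert to ns with your TSC frequency
}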
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become a bottleneck, causing stalls and making the measured latency inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

How to minimize latency when reading audio with ALSA?

When trying to acquire some signals in the frequency domain, I've encountered the issue of snd_pcm_readi() taking a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
Most of the time, snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds. However, every fourth or fifth call to snd_pcm_readi() takes approximately 0.028 seconds. This is a huge difference, and it causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried experimenting with the period size, but it is unclear to me what exactly it does, even after re-reading the documentation multiple times. I don't use an interrupt-driven design; I simply call snd_pcm_readi() and it blocks until it returns -- with data.
I can only assume that the reason it blocks for a variable amount of time is that snd_pcm_readi() pulls data from the hardware buffer, which sometimes already has data readily available for transfer to the "application buffer" (which I'm maintaining); at other times there is additional work to do in kernel space or on the hardware side, so the call takes longer to return.
What purpose does the "period size" serve when I'm not using an interrupt-driven design? Can my problem be fixed by manipulating the period size, or should I do something else?
I want each call to snd_pcm_readi() to take approximately the same amount of time. I'm not asking for a real-time-compliant API, which I don't imagine ALSA even attempts to be; however, a difference in call time on the order of 500x (which is what I'm seeing!) is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.
Typically when reading and writing audio, the period size specifies how much data ALSA reserves in the DMA hardware. Normally the period size determines your latency: for example, while you are filling one buffer to be written out through DMA to the I2S hardware, another DMA buffer is already being written out.
If your period size is too small, the CPU doesn't have time to write the audio out within the scheduled execution slot it is given. Typically people aim for a minimum of 500 us or 1 ms of latency. If you are doing heavy computation, you may want to choose 5 ms or 10 ms of latency. You may choose even more latency if you are on a less powerful embedded system.
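If you are not setting these explicitly yet, below is a sketch of how an explicit period and buffer size can be requested through the hw_params API; the rate, format, channel count and frame counts are placeholder values, and pcm is assumed to be your already-opened capture handle:
#include <alsa/asoundlib.h>

/* Sketch: request an explicit period/buffer size so capture latency is bounded.
 * 'pcm' is an already-opened capture handle; the rate, format, channels and
 * frame counts below are placeholders. */
static int configure_capture(snd_pcm_t *pcm)
{
    snd_pcm_hw_params_t *hw;
    snd_pcm_uframes_t period_frames = 48;      /* ~1 ms at 48 kHz */
    snd_pcm_uframes_t buffer_frames = 4 * 48;  /* a few periods of headroom */
    unsigned int rate = 48000;
    int dir = 0;

    snd_pcm_hw_params_alloca(&hw);
    snd_pcm_hw_params_any(pcm, hw);
    snd_pcm_hw_params_set_access(pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
    snd_pcm_hw_params_set_format(pcm, hw, SND_PCM_FORMAT_S16_LE);
    snd_pcm_hw_params_set_channels(pcm, hw, 2);
    snd_pcm_hw_params_set_rate_near(pcm, hw, &rate, &dir);
    snd_pcm_hw_params_set_period_size_near(pcm, hw, &period_frames, &dir);
    snd_pcm_hw_params_set_buffer_size_near(pcm, hw, &buffer_frames);
    return snd_pcm_hw_params(pcm, hw);         /* < 0 on error */
}
/* Then read one period per call: each snd_pcm_readi(pcm, buf, period_frames)
 * should return roughly every period_frames / rate seconds. */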
If you want to push the limits of the system, you can request that the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One method for increasing the priority, taken from the gtkIOStream ALSA C++ OO classes (the changeThreadPriority method), is as follows:
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority) {
    int ret;
    pthread_t thisThread = pthread_self(); // get the current thread
    struct sched_param origParams, params;
    int origPolicy, policy = SCHED_FIFO, newPolicy = 0;

    if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);

    if (priority < 0) // maximum priority
        params.sched_priority = sched_get_priority_max(policy);
    else
        params.sched_priority = priority;

    if (params.sched_priority > sched_get_priority_max(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
    if (params.sched_priority < sched_get_priority_min(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");

    if ((ret = pthread_setschedparam(thisThread, policy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");
    if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    if (policy != newPolicy)
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");

    printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
    return 0;
}

What is the meaning of the CANBUS function mode initialization settings for STM32?

I want to understand the meaning of the following function mode definitions. There is an explanation in the library, but I don't understand it because the explanations are very short and not sufficient. I searched the net but couldn't find any information about them.
CAN_InitStructure.CAN_TTCM = DISABLE;
CAN_InitStructure.CAN_ABOM = DISABLE;
CAN_InitStructure.CAN_AWUM = DISABLE;
CAN_InitStructure.CAN_NART = ENABLE;
CAN_InitStructure.CAN_RFLM = DISABLE;
CAN_InitStructure.CAN_TXFP = ENABLE;
These are the names of bits located in the CAN master control register (CAN_MCR), so the proper source for their meaning is the reference manual. My answer below is somewhat copy & paste from the reference manual, but I will try to explain these bits in detail.
TTCM (Time triggered communication mode): This bit activates the Time Triggered Communication (TTCAN) mode, which is an extension to the CAN standard. I don't know much about TTCAN, but as I understand, it assigns time windows to messages to satisfy some real-time requirements. So, normally this bit should remain 0.
ABOM (Automatic bus-off management): If the transmit error counter (TEC) becomes greater than 255, the CAN hardware switches to the bus-off state. To recover, it must wait for the recovery sequence: 128 occurrences of 11 consecutive recessive bits. Only after that may the CAN hardware return to the normal operating state. This bit controls the return behavior. If it's 1, returning to the normal state is automatic. Otherwise, software must make the request, provided that the recovery sequence has been observed.
AWUM (Automatic wakeup mode): The CAN module can be in one of 3 modes: Initialization mode, normal mode or sleep (low power) mode. Sleep mode is requested by the software. However, you have 2 options to exit sleep mode. If this bit is 0, then you have to exit sleep mode manually. You may enable CAN wakeup interrupt to inform you about bus activity, then exit the sleep mode in ISR. But if this bit is 1, the hardware returns to normal mode automatically when it detects bus activity.
NART (No automatic retransmission): Normally, CAN hardware retries to transmit a message if its previous attempts fail, because of arbitration lost etc. But if you make this bit 1, the transmitter does not retry. This is required when you use Time Triggered Communication (TTCAN). Otherwise, you should keep this bit 0.
RFLM (Receive FIFO locked mode): Your receive FIFOs are 3 messages deep, meaning they can store at most 3 messages before they overrun. This bit controls what happens on a FIFO overrun. The default behavior is to keep the oldest 2 messages and the newest one; for example, if you receive 5 messages, the FIFO keeps messages 1, 2 & 5. However, if you set this bit to 1, the FIFO keeps messages 1, 2 & 3 and discards new arrivals.
TXFP (Transmit FIFO priority): You have 3 transmit mailboxes. When you fill more than one, the hardware must decide which one to transmit first. Normally, one can assume that a message with a lower ID number is more important and should be transmitted first. But if you want to transfer them in a first-comes-first-served fashion for some reason, you need to make this bit 1. Of course, this is just a local priority. On the physical bus, the messages with lower ID always have priority.
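Putting it together, here is a sketch of a fairly common configuration using the same SPL structure: automatic bus-off recovery and wakeup enabled, automatic retransmission kept, everything else at its default. The bit-timing values are placeholders you would compute for your own APB clock and bus bit rate:
/* Sketch: typical bxCAN setup with the SPL. Timing values (prescaler, SJW,
 * BS1, BS2) are placeholders - derive them from your APB1 clock and the bus
 * bit rate you need. */
CAN_InitTypeDef CAN_InitStructure;
CAN_StructInit(&CAN_InitStructure);

CAN_InitStructure.CAN_TTCM = DISABLE;        /* no time-triggered mode */
CAN_InitStructure.CAN_ABOM = ENABLE;         /* recover from bus-off automatically */
CAN_InitStructure.CAN_AWUM = ENABLE;         /* wake from sleep on bus activity */
CAN_InitStructure.CAN_NART = DISABLE;        /* keep automatic retransmission */
CAN_InitStructure.CAN_RFLM = DISABLE;        /* on overrun, newest arrival replaces the last stored message */
CAN_InitStructure.CAN_TXFP = DISABLE;        /* transmit priority by message ID */
CAN_InitStructure.CAN_Mode = CAN_Mode_Normal;
CAN_InitStructure.CAN_SJW  = CAN_SJW_1tq;    /* placeholder timing */
CAN_InitStructure.CAN_BS1  = CAN_BS1_6tq;
CAN_InitStructure.CAN_BS2  = CAN_BS2_7tq;
CAN_InitStructure.CAN_Prescaler = 6;         /* placeholder */

if (CAN_Init(CAN1, &CAN_InitStructure) != CAN_InitStatus_Success)
{
    /* handle initialization failure */
}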

Serial driver limitations on iMX processor

I'm developing on an embedded Linux device that uses an ARM iMX6 processor. The main purpose is to read an incoming serial stream from an external source.
Due to the atypical nature of the serial stream, I've run into a few roadblocks with the Linux serial driver for iMX processors - but nothing that is beyond the capability of the iMX6 itself. For example, the incoming serial stream uses inverted logic. The iMX6 has a specific register setting to invert the RX signal, but from what I can tell, the Linux driver does not expose it.
Another complication is that the incoming serial data arrives in 3 ms bursts. The external source transmits continuously for 3 ms, then is idle for 3 ms, then 3 ms of data, then idle, and so on. In order to sync up with the first byte of each burst, it's very useful to be able to detect when the line is idle. Again, the iMX6 has a register bit specifically for indicating that the RX line is idle, but the Linux driver doesn't expose it.
I am also very confused about how buffering works in the driver. I know the iMX6 has a 32-byte FIFO, but I can't tell whether the driver uses that FIFO or buffers in external RAM. I'm running into an issue where the read call hangs for about a second every so often when I'm in blocking mode, which should never happen because the data stream is continuous.
For reference, here's how I configured the serial port in my C code and read 50 bytes (I've changed it to non-blocking for now):
#include <stropts.h>
#include <asm/termios.h>
#include <unistd.h>
#include <fcntl.h>

int main()
{
    int fd;
    struct termios2 terminal;
    unsigned char v[50];

    fd = open("/dev/ttymxc2", O_RDONLY | O_NOCTTY | O_NONBLOCK);
    ioctl(fd, TCGETS2, &terminal);
    terminal.c_cflag |= (CLOCAL | CREAD);
    terminal.c_cflag |= PARENB;            // enable parity
    terminal.c_cflag &= ~PARODD;           // even parity
    terminal.c_cflag |= CSTOPB;            // 2 stop bits
    terminal.c_cflag &= ~CSIZE;
    terminal.c_cflag |= CS8;
    terminal.c_lflag &= ~(ICANON | IEXTEN | ECHO | ECHOE | ISIG);
    terminal.c_oflag &= ~OPOST;
    terminal.c_cflag &= ~CBAUD;
    terminal.c_cflag |= BOTHER;            // custom baud rate
    terminal.c_ispeed = 100000;            // 100 kbaud
    terminal.c_ospeed = 100000;
    ioctl(fd, TCSETS2, &terminal);
    ...
    for (int i = 0; i < 50; i++)
    {
        read(fd, v + i, 1);
    }
    ...
}
So I have two questions:
What is the "proper" way to get the capability out of the serial port that the processor has available but the driver doesn't expose? I can't imagine I'm the first person to want to use such basic functionality of the processor, but I don't want to reinvent the wheel. Do I need to get into writing my own drivers?
Does comprehensive documentation on the iMX serial driver exist anywhere? The code is poorly commented and I get lost quickly trying to find my way around it. For example, I don't know where to start investigating the buffering problem that causes it to hang when receiving a continuous stream of data.
I've forgone the serial driver entirely and instead wrote some functions to access the register memory directly (modeled after the devmem2.c source code). Now I can directly set the INVR bit to invert the RX signal, use the IDLE bit to detect when the line has gone idle, and retrieve incoming data bytes as soon as they arrive, without delay.
I found something on another forum about how the UART DMA needs the RX line to be idle for at least 8 ms before it services the buffer. That was apparently the cause of the 1-second lag I was experiencing.
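For anyone taking the same route, the core of the devmem2-style access is just an mmap() of /dev/mem at the UART's physical base address. A hedged sketch follows; UART_BASE, REG_OFFSET and BIT_MASK are placeholders you must take from the i.MX6 reference manual for the UART instance you're using, and opening /dev/mem requires root (or equivalent) permissions:
/* Sketch: map an i.MX6 UART register block via /dev/mem and set a bit.
 * UART_BASE, REG_OFFSET and BIT_MASK are placeholders - take the real values
 * (the UART base address, the register holding e.g. INVR, and its bit
 * position) from the i.MX6 reference manual for your UART instance. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define UART_BASE  0x021E8000UL   /* placeholder physical base address */
#define MAP_SIZE   4096UL
#define REG_OFFSET 0x00A0UL       /* placeholder register offset */
#define BIT_MASK   (1U << 9)      /* placeholder bit position */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *regs = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, UART_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *reg = regs + REG_OFFSET / sizeof(uint32_t);
    *reg |= BIT_MASK;             /* read-modify-write to set the bit */
    printf("register now 0x%08x\n", *reg);

    munmap((void *)regs, MAP_SIZE);
    close(fd);
    return 0;
}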