Can mirthconnect 3.0.1.7051 choke on a large file without having heap error?
We are using Mirth Connect 3.0.1.7051 on a Linux machine (Red Hat, I think). The machine has 16 GB of RAM, and the Java heap (in Mirth) is set to 12 GB. We have a JavaScript Reader channel attempting to read a delimited text file and convert it to XML (after checking some things). Other settings: durable message delivery; remove content and attachments on completion; polling type = interval; polling frequency = 2 minutes; source queue off (respond after processing); destination/channel type = Channel Writer; queue retry = always; retry interval = 10,000 ms.
The delimited text file is close to 80 KB. At first, when the machine had much less memory and the heap was set lower, Mirth would fail to read the file and log a heap error. Now, with more memory, it does not throw a heap error, but Mirth stops reading the input file somewhere in the middle and then starts reading it again. The result is that two incomplete XML files are produced, with some overlapping data, and no apparent errors in the log.

Related

How is data copied from user space to kernel space and vice versa during I/O tasks?

I'm taking an operating systems course, and on slide 32 of
https://people.eecs.berkeley.edu/~kubitron/courses/cs162-S19/sp19/static/lectures/3.pdf
the professor briefly said that fread and fwrite maintain a user-space buffer and are thus more efficient than directly calling the read/write system functions, since they can save disk accesses, but he didn't explain why.
Imagine these two scenarios: we need to read/write 16 bytes, and the user buffer is 4 bytes. Scenario one uses fread/fwrite; scenario two uses read/write directly, processing one byte at a time.
My questions are:
Since fread calls read underneath, how many read function calls will be invoked respectively?
Is the data transfer between the user-space buffer and the kernel-space buffer, whether a single byte or 1 MB, done entirely by the kernel, with no user/kernel mode switch involved during the transfer?
How many disk accesses are performed respectively? Won't the kernel buffer come into play during scenario two?
The read function ssize_t read(int fd, void *buf, size_t count) also has buffer and count parameters; can these replace the role of the user-space buffer?
Since fread calls read underneath, how many read function calls will be invoked respectively?
Because fread() is mostly just slapping a buffer (in user-space, likely in a shared library) in front of read(), the "best case number of read() system calls" will depend on the size of the buffer.
For example; with an 8 KiB buffer; if you read 6 bytes with a single fread(), or if you read 6 individual bytes with 6 fread() calls; then read() will probably be called once (to get up to 8 KiB of data into the buffer).
However; read() may return less data than was requested (and this is very common for some cases - e.g. stdin if the user doesn't type fast enough). This means that fread() might use read() to try to fill its buffer, but read() might only read a few bytes; so fread() needs to call read() again later when it needs more data in its buffer. For a worst case (where read() only happens to return 1 byte each time) reading 6 bytes with a single fread() may cause read() to be called 6 times.
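To make that concrete, here is a minimal sketch (in C, and nothing like real libc code) of the kind of user-space buffering fread() does in front of read(); my_fread, the 8 KiB buffer size, and the single-stream globals are all made up for illustration:

    /* Simplified sketch of a buffered reader; real stdio keeps one buffer
     * per FILE stream rather than these globals. */
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE 8192            /* like an 8 KiB stdio buffer */

    static char   buf[BUF_SIZE];
    static size_t buf_len = 0;       /* bytes currently buffered */
    static size_t buf_pos = 0;       /* next unread byte in the buffer */

    /* Read up to `count` bytes from `fd` into `dest`, buffering in user space. */
    ssize_t my_fread(int fd, void *dest, size_t count) {
        size_t copied = 0;
        while (copied < count) {
            if (buf_pos == buf_len) {                 /* buffer empty: refill */
                ssize_t n = read(fd, buf, BUF_SIZE);  /* one system call */
                if (n <= 0)                           /* EOF or error */
                    return copied > 0 ? (ssize_t)copied : n;
                buf_len = (size_t)n;                  /* may be a short read! */
                buf_pos = 0;
            }
            size_t avail = buf_len - buf_pos;
            size_t want  = count - copied;
            size_t take  = avail < want ? avail : want;
            memcpy((char *)dest + copied, buf + buf_pos, take);
            buf_pos += take;
            copied  += take;
        }
        return (ssize_t)copied;
    }

Reading 6 bytes through my_fread() costs at most a memcpy() when the buffer already holds the data; read() is only invoked when the buffer runs dry.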
Is the data transfer between the user-space buffer and the kernel-space buffer, whether a single byte or 1 MB, done entirely by the kernel, with no user/kernel mode switch involved during the transfer?
Often, read() (in the C standard library) calls some kind of "sys_read()" function provided by the kernel. In this case there's a switch to kernel when "sys_read()" is called, then the kernel does whatever it needs to do to obtain and transfer the data, then there's one switch back from kernel to user-space.
However; nothing says that's how a kernel must work. E.g. a kernel could only provide a "sys_mmap()" (and not provide any "sys_read()") and the read() (in the C standard library) could use "sys_mmap()". For another example; with an exo-kernel, file systems might be implemented as shared libraries (with "file system cache" in shared memory) so a read() done by the C library (of a file's data that is in the "file system cache") may not involve the kernel at all.
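As a rough sketch of the mmap() route (assuming a POSIX system; example.txt is a placeholder filename), a file can be consumed without ever calling read():

    /* Sketch: reading a file's contents via mmap() instead of read(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("example.txt", O_RDONLY);   /* placeholder filename */
        if (fd == -1) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

        /* Map the whole file; afterwards the file's bytes are plain memory,
         * and "reading" them causes page faults rather than read() calls. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        fwrite(data, 1, st.st_size, stdout);      /* use the mapped bytes */

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }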
How many disk accesses are performed respectively? Won't the kernel buffer come into play during scenario two?
There are too many possibilities. E.g.:
a) If you're reading from a pipe (where the data is in a buffer in the kernel and was previously written by a different process) then there will be no disk accesses (because the data was never on any disk to begin with).
b) If you're reading from a file and the OS cached the file's data already; then there may be no disk accesses.
c) If you're reading from a file and the OS cached the file's data already; but the file system needs to update meta-data (e.g. an "accessed time" field in the file's directory entry) then there may be multiple disk accesses that have nothing to do with the file's data.
d) If you're reading from a file and the OS hasn't cached the file's data; then at least one disk access will be necessary. It doesn't matter if it's caused by fread() attempting to read a whole buffer, read() trying to read all 6 bytes at once, or the OS fetching a whole disk block because of the first "read() of one byte" in a series of six separate "read() of one byte" requests. If the OS does no caching at all, then six separate "read() of one byte" requests will be at least 6 separate disk accesses.
e) File system code may need to access some parts of the disk to determine where the file's data actually is before it can read the file's data; and the requested file data may be split between multiple blocks/sectors on the disk; so reading 2 or more bytes from a file (regardless of whether it was caused by fread() or "read() of 2 or more bytes") could cause several disk accesses.
f) With a RAID 5/6 array involving 2 or more physical disks (where reading a "logical block" involves reading the block from one disk and also reading the parity info from a different disk), the number of disk accesses can be doubled.
The read function ssize_t read(int fd, void *buf, size_t count) also has buffer and count parameters; can these replace the role of the user-space buffer?
Yes; but if you're using it to replace the role of a user space buffer then you're mostly just implementing your own duplicate of fread().
It's more common to use fread() when you want to treat the data as a stream of bytes, and read() (or maybe mmap()) when you do not want to treat the data as a stream of bytes.
For a random example; maybe you're working with a BMP file; so you read the "guaranteed to be 14 bytes by the file format's spec" header; then check/decode/process the header; then (after determining where it is in the file, how big it is and what format it's in) you might seek() to the pixel data and read all of it into an array (then maybe spawn 8 threads to process the pixel data in the array).
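Here is a hedged sketch of that BMP example, assuming a POSIX system (image.bmp is a placeholder filename): read the fixed 14-byte file header with a single read(), decode the pixel-data offset stored at byte 10, then seek straight to the pixel data:

    /* Sketch: read the 14-byte BMP file header, then jump to the pixels. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("image.bmp", O_RDONLY);    /* placeholder filename */
        if (fd == -1) { perror("open"); return 1; }

        unsigned char header[14];
        if (read(fd, header, 14) != 14) {        /* the guaranteed 14-byte header */
            fprintf(stderr, "short read on header\n");
            return 1;
        }
        if (header[0] != 'B' || header[1] != 'M') {   /* check the magic number */
            fprintf(stderr, "not a BMP file\n");
            return 1;
        }

        /* Decode the pixel-data offset: little-endian uint32 at byte 10. */
        uint32_t pixel_offset = header[10] | header[11] << 8
                              | header[12] << 16 | (uint32_t)header[13] << 24;

        lseek(fd, pixel_offset, SEEK_SET);       /* jump straight to pixel data */
        /* ... read the pixel data into an array, then hand it to worker threads ... */

        close(fd);
        return 0;
    }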

Streamsets: SpoolDIR_01 Failed to process file

Hi, I'm trying to run a pipeline to process a very large file (about 4 million records). Every time it reaches around 270,000 records it fails, stops processing any more records, and returns this error:
'/FileLocation/FiLeNAME..DAT' at position '93167616': com.streamsets.pipeline.lib.dirspooler.BadSpoolFileException: com.streamsets.pipeline.api.ext.io.OverrunException: Reader exceeded the read limit '131072'.
If anyone else has experienced a similar issue, please help. Thank you.
I have checked the lines where the pipeline stops, but there seems to be nothing obvious there. I tried another file and it still doesn't work.
'/FileLocation/FiLeNAME..DAT' at position '93167616': com.streamsets.pipeline.lib.dirspooler.BadSpoolFileException: com.streamsets.pipeline.api.ext.io.OverrunException: Reader exceeded the read limit '131072'.
Looks like you're hitting the maximum record size. This limit is in place to guard against badly formatted data causing 'out of memory' errors.
Check your data format configuration and increase Max Record Length, Maximum Object Length, Max Line Length, etc., depending on the data format you are using.
See the Directory Origin documentation for more detail. Note in particular that you may have to edit sdc.properties if the records you are parsing are bigger than the system-wide limit of 1048576 bytes.
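For reference, a sketch of that system-wide override in sdc.properties; I believe the property is named parser.limit, but verify against the documentation for your SDC version:

    # sdc.properties -- system-wide parser limit (default 1048576 bytes)
    # property name assumed from memory; check your SDC version's docs
    parser.limit=5242880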
I recently received this error message as well. When I come up against such size limits in StreamSets, I'll often set the limit to something ridiculously large, then set the maximum value to the exact value reported in the subsequent error message.
I find it really unfortunate that StreamSets then fails to process the rest of the file when an extra-long record is encountered. This seems counter-intuitive to me for a tool used to process vast amounts of data.

Ignore data coming in to TCP socket

Some protocols like HTTP can specify a message length, then send a (possibly very long) message. No other messages can be received while this message is being sent (something HTTP/2.0 tried to solve), so if you decide to ignore the message, you can't just continue waiting for messages without pulling its data.
Normally I read() up to the length of the message repeatedly into a junk buffer and just ignore the bytes. But that involves copying possibly millions of bytes from kernel space into user space (at 1 copy per memory page, so not millions of copies). Isn't there some way to just tell the kernel to discard the bytes instead of providing them?
It seemed like an obvious question, but the only answers I've been able to come up with are oddly resource-heavy: either using splice() to dump the bytes into a pipe and then emptying the pipe, or opening "/dev/null" and using sendfile() to send the bytes there. I could do that, and reserve a (single) file descriptor for flushing data out of clogged connections without reading, but isn't there just an... ignore(descriptor, length) function?
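For what it's worth, here is a Linux-only sketch of the splice() variant described above; discard_bytes is a made-up helper name, and the pipe plus /dev/null descriptors are opened once and reused:

    /* Sketch: discard `len` bytes from a socket without copying them into
     * user space, by splicing socket -> pipe -> /dev/null (Linux only). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Discard exactly `len` bytes from `sock`. Returns 0 on success, -1 on error. */
    int discard_bytes(int sock, size_t len) {
        static int devnull = -1;
        static int pipefd[2] = { -1, -1 };

        if (devnull == -1 && (devnull = open("/dev/null", O_WRONLY)) == -1)
            return -1;
        if (pipefd[0] == -1 && pipe(pipefd) == -1)
            return -1;

        while (len > 0) {
            /* socket -> pipe: no user-space copy */
            ssize_t in = splice(sock, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
            if (in <= 0)
                return -1;
            /* pipe -> /dev/null: drains the pipe again
             * (a production version would loop on partial writes here) */
            ssize_t out = splice(pipefd[0], NULL, devnull, NULL, (size_t)in,
                                 SPLICE_F_MOVE);
            if (out != in)
                return -1;
            len -= (size_t)in;
        }
        return 0;
    }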

What will happen if a process run by Eclipse is terminated while the process is doing IO (writing)?

Suppose the process is writing ~10 KB of data to disk, and during that time terminate is issued (the red button in Eclipse). Will the write be successful? Do I need to check for integrity issues if I terminate it this way?
I was processing millions of documents, and I have to write one file for each document; they are small (< 10 KB usually). While the code was running, I found a change that could improve efficiency, so I terminated the process and made the changes. Then I came up with this question.
The writing of the file will terminate with the process, leaving the file with partial content.
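If partially written files are a concern, one common mitigation (not specific to Eclipse) is to write each document to a temporary file and rename() it into place once the write completes, since rename() is atomic within a filesystem. A minimal sketch in C, with placeholder file names:

    /* Sketch: write-then-rename so readers never see a half-written file.
     * "out.txt" and "out.txt.tmp" are placeholder names; fsync() for crash
     * durability is omitted for brevity. */
    #include <stdio.h>

    int write_atomically(const char *path, const char *tmp_path,
                         const void *data, size_t len) {
        FILE *f = fopen(tmp_path, "wb");
        if (!f)
            return -1;
        if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
        if (fclose(f) != 0)
            return -1;
        /* rename() atomically replaces the target within one filesystem; a
         * process killed before this point leaves only the temp file behind. */
        return rename(tmp_path, path);
    }

    int main(void) {
        const char msg[] = "document contents\n";
        return write_atomically("out.txt", "out.txt.tmp", msg, sizeof msg - 1) ? 1 : 0;
    }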

What happens when I write data to a blocking socket faster than the other side reads?

Suppose I write data really fast [I have all the data in memory] to a blocking socket.
Further suppose the other side reads data very slowly [like sleeping 1 second between each read].
What is the expected behavior on the writing side in this case?
Would the write operation block until the other side reads enough data, or would the write return an error like connection reset?
For a blocking socket, the send() call will block until all the data has been copied into the networking stack's buffer for that connection. It does not have to be received by the other side. The size of this buffer is implementation dependent.
Data is cleared from the buffer when the remote side acknowledges it. This is an OS thing and is not dependent upon the remote application actually reading the data. The size of this buffer is also implementation dependent.
When the remote buffer is full, it tells your local stack to stop sending. When data is cleared from the remote buffer (by being read by the remote application) then the remote system will inform the local system to send more data.
In both cases, small systems (like embedded systems) may have buffers of a few KB or smaller and modern servers may have buffers of a few MB or larger.
Once space is available in the local buffer, more data from your send() call will be copied. Once all of that data has been copied, your call will return.
You won't get a "connection reset" error (from the OS -- libraries may do anything) unless the connection actually does get reset.
So... It really doesn't matter how quickly the remote application is reading data until you've sent as much data as both local & remote buffer sizes combined. After that, you'll only be able to send() as quickly as the remote side will recv().
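As a small demonstration of this (a sketch using a Unix-domain socketpair for brevity; a TCP connection behaves the same way, with the remote buffer added to the picture), the first few send() calls return immediately because they merely fill the kernel buffer, and later ones stall until the slow reader frees space:

    /* Sketch: a slow reader makes a blocking send() stall once buffers fill. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
            perror("socketpair");
            return 1;
        }

        /* Shrink the send buffer so blocking shows up quickly; the kernel
         * may round or double this value (implementation dependent). */
        int sndbuf = 16 * 1024;
        setsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);

        if (fork() == 0) {                  /* child: the slow reader */
            char rbuf[4096];
            close(sv[0]);
            while (read(sv[1], rbuf, sizeof rbuf) > 0)
                sleep(1);                   /* read at most 4 KiB per second */
            _exit(0);
        }
        close(sv[1]);

        char chunk[8192];
        memset(chunk, 'x', sizeof chunk);
        for (int i = 0; i < 10; i++) {
            time_t start = time(NULL);
            ssize_t n = send(sv[0], chunk, sizeof chunk, 0);
            /* Early sends return instantly (they only fill kernel buffers);
             * once the buffers are full, send() blocks until space frees up. */
            printf("send #%d returned %zd after ~%ld s\n",
                   i, n, (long)(time(NULL) - start));
        }
        return 0;
    }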
The output (send) buffer fills up, and send() blocks until enough of the buffer is freed to enqueue the data.
As the send(2) manual page says:
    When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in non-blocking I/O mode.
Look at this: http://manpages.ubuntu.com/manpages/lucid/man2/send.2.html
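And, as a sketch of the non-blocking alternative the man page mentions (assuming a connected socket descriptor `sock`), a full send buffer then shows up as -1 with EAGAIN/EWOULDBLOCK instead of a blocked call:

    /* Sketch: switch a socket to non-blocking mode with fcntl(). */
    #include <fcntl.h>

    int make_nonblocking(int sock) {
        int flags = fcntl(sock, F_GETFL, 0);
        if (flags == -1)
            return -1;
        return fcntl(sock, F_SETFL, flags | O_NONBLOCK);
    }

    /* Usage sketch: after make_nonblocking(sock),
     *   ssize_t n = send(sock, buf, len, 0);
     *   if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
     *       ... wait (e.g. with poll()) until the socket is writable ...
     */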