Missing either data or entire lines on serial port read - perl

I am trying to read streaming serial data at 115200 baud and can't seem to pick it all up.
Using the input method ($data = $Port->input), I get the first 14 or 15 characters of every line. I need the whole line.
Using the read method ($data = $Port->read(4096)) and adjusting the read_interval, I can either get partial copies of every line with
$Port->read_interval(1);
or complete copies of every third line with
$Port->read_interval(2);
I need all of every line.
Here is the code:
use Win32::SerialPort;

my $App_Main_Port = Win32::SerialPort->start($Test_cfgfile);
$App_Main_Port->read_interval(1);
$App_Main_Port->read_char_time(1);
while (1) {
    # my $data = $App_Main_Port->input;
    my $data = $App_Main_Port->read(4096);
    print "$data\n";
}
By adjusting the read interval I get the results mentioned above. I started with the default value of 100 for both the read_interval and read_char_time parameters, but then I only get every third line.
Tx for any insight!
Chris

Serial port reads often just return what's buffered in the serial port at that moment. A 16550 UART only buffers 16 bytes total, and that's what most PCs emulate.
From the Win32::SerialPort page, it appears you need to call ->read() in a list context, so you can find out how many bytes actually got read. If you really do want to block until you've received all 4K characters, try a loop like this:
# read in 4K bytes
my ($data, $temp, $count_in, $total_in);
$total_in = 0;
while ($total_in < 4096) {
    ($count_in, $temp) = $App_Main_Port->read(4096 - $total_in);
    $data .= $temp;
    $total_in += $count_in;
}
That said, your real problem sounds like it could be a flow control issue. Even with such a loop, you might still get dropped characters if you don't have proper flow control. You might experiment with handshake("dtr") or handshake("rts").
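For instance (a sketch, assuming your configuration file doesn't already enable it and that your cable actually carries the control lines), you could turn on hardware flow control before entering the read loop:
$App_Main_Port->handshake("rts");   # or "dtr"
$App_Main_Port->write_settings or die "Bad port settings\n";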

Using PowerShell to remove illegal CRLF from CSV rows

Gentle Reader,
I have a year's worth of vendor CSV files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are malformed, and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.
I'm restricted to using either C# (as a script task in SSIS) or PowerShell.
Each file has no header, but the schema is known and built into the SSIS package connection.
Each file has approximately 35k rows, with roughly a few dozen malformed rows per file.
Each properly formed row consists of 122 columns, i.e. 121 commas.
Rows are NOT text qualified.
Example: (data cleaned of PII)
555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]
PowerShell's Get-Content (I think...) reads the file into an array where each row is identified by the CRLF terminator. This means (again, I think) that malformed rows will be treated as elements of the array without respect to how many "columns" they hold.
C#'s StreamReader also uses CRLF as a marker, but a StreamReader object has a few methods available, like Peek and Read, that may be useful.
Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process malformed CSV files such that CRLFs that are not true end-of-line markers are removed.
Thank you.
Based on @vonPryz's design, but in (Native¹) PowerShell:
$Delimiters = 121
Get-Content .\OldFile.csv |ForEach-Object { $Line = '' } {
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line |Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line
        $Line = ''
    }
} |Set-Content .\NewFile.Csv
¹) I guess performance might be improved by avoiding += and using .NET methods along with text streams.
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. And since it will be you who wrote the garbage data into the database, congratulations: it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have in writing an agreement that you didn't break the data and that it was broken to start with. I've seen such problems in ETL processing all too often.
Quick and dirty pseudocode, without error handling, edge-case processing, substring index assumptions, performance guarantees, and whatnot, goes like so:
while(dataInFile)
    line = readline()

    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two lines are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read the next line too
            line = line.removeCrLf() + readline()
            goto :parseLine
The idea is that you first read a line and count how many commas there are. If the count matches what's expected, the row is not broken; store it in a buffer containing good data.
If you have too many commas, then the row contains at least two different elements. Find the index where the first element ends, extract it, and store it in the good-data buffer. Then remove the already-processed part of the line and start again.
If you have too few commas, then the row has been split by a newline. Read the next line from the file, join it with the current line, and start the parsing again from counting the commas.
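For readers not bound by the asker's C#/PowerShell restriction, the same comma-counting join is only a few lines of Perl. A sketch with hypothetical filenames, joining continuation lines with a comma exactly as the PowerShell version above does:
use strict;
use warnings;

my $delimiters = 121;                              # commas in a well-formed row
my $line = '';

open my $in,  '<', 'OldFile.csv' or die $!;
open my $out, '>', 'NewFile.csv' or die $!;

while ( my $part = <$in> ) {
    chomp $part;
    $line = length($line) ? "$line,$part" : $part; # glue the broken piece back on
    my $commas = () = $line =~ /,/g;               # commas accumulated so far
    if ( $commas >= $delimiters ) {
        print {$out} "$line\n";                    # enough commas: emit the repaired row
        $line = '';
    }
}
close $in;
close $out;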

perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present

I'm using MARC::Lint to lint some MARC records, but every now and then I'm getting an error (on about 1% of the files):
utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.
The problem is that I've tried different methods but cannot find "\xCA" in the file...
My script is:
#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;
use open OUT => ':utf8';

my $lint = MARC::Lint->new;
my $filename = shift;

my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );

    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while
and the file can be downloaded here: http://eroux.fr/I14376.mrc
Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?
The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.
The problem is a bad data file.
The file contains a "directory" of where the information is located in the file. The following is a human-readable rendition of the directory for the file you provided:
tagno|offset|len # Offsets are from the start of the data portion.
001|00000|0017 # Lengths include the single-byte field terminator.
006|00017|0019 # Offsets and lengths are in bytes.
007|00036|0015
008|00051|0041
035|00092|0021
035|00113|0021
040|00134|0018
050|00152|0022
066|00174|0009
245|00183|0101
246|00284|0135
264|00419|0086
300|00505|0034
336|00539|0026
337|00565|0026
338|00591|0036
546|00627|0016
500|00643|0112
505|00755|9999 <--
506|29349|0051
520|29400|0087
533|29487|0115
542|29602|0070
588|29672|0070
653|29742|0013
710|29755|0038
720|29793|0130
776|29923|0066
856|29989|0061
880|30050|0181
880|30231|0262
Notice the length of the field with tag 505: 9999. This is the maximum value supported (because the length is stored as four decimal digits). The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes in size.
What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BC, the encoding of ʼ.) Later, when the module attempts to decode that text, an error is thrown. (CA must be followed by another byte to be valid.)
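You can reproduce the failure in isolation; a minimal sketch (the byte strings below are just the sequence named above, not anything read from the file):
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $whole = decode('UTF-8', "\xCA\xBC", FB_CROAK);  # the complete sequence decodes to ʼ
my $cut   = decode('UTF-8', "\xCA",     FB_CROAK);  # the truncated one dies:
                                                    # utf8 "\xCA" does not map to Unicode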
Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
Still, the module should handle this better. When no end-of-field marker is found where the length says one should be, it could fall back to reading until it finds the end-of-field marker instead of trusting the length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism for reporting these problems ($marc->warnings).
In fact, if it hadn't died (say if the cut happened to occur in between characters instead of in the middle of one), $marc->warnings would have returned the following message:
field does not end in end of field character in tag 505 in record 1

How to "jump" to a line of a file, rather than read file line by line, using Perl

I am opening a file containing a single but very long column. I want to retrieve from it just a short segment, starting at a specified line and ending at another specified line. Currently, my script is reading the file line by line until the desired lines are found. I am using:
my ( $from, $to ) = ( some line number, some larger line number );
my $count = 1;
my @seq = ();
while ( <SEQUENCE> ) {
    print "$_ for $count\n";
    $count++;
    if ( $count >= $from && $count <= $to ) {
        push( @seq, $_ );
    }
}
print "seq is: @seq\n";
Input looks like:
A
G
T
C
A
G
T
C
.
.
.
How might I "jump" to where I want to be?
You'll need to use seek to move to the correct portion of the file. ref: http://perldoc.perl.org/functions/seek.html
This works on bytes, not on lines, so generally, if you need line-based seeking, it's not an option. However, since you're working with fixed-length lines (2 or 3 bytes, depending on your platform's EOL encoding), you can multiply the line length by the line number you want (0-indexed) and you'll be at the correct location for reading.
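A minimal sketch, assuming one base per line plus a single "\n" (a 2-byte record; use 3 for CRLF files) and hypothetical line numbers:
use strict;
use warnings;

my ( $from, $to ) = ( 1000, 1010 );        # hypothetical 1-based line numbers
my $reclen = 2;                            # 1 data byte + 1 for "\n"; 3 on Windows (CRLF)

open my $fh, '<', 'sequence.txt' or die "Can't open: $!";
seek $fh, ( $from - 1 ) * $reclen, 0;      # 0 = SEEK_SET: offset from start of file

my @seq;
for my $lineno ( $from .. $to ) {
    my $rec = <$fh>;
    last unless defined $rec;
    chomp $rec;
    push @seq, $rec;
}
print "seq is: @seq\n";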
If you happen to know that all the lines are of exactly the same length (accounting for line-ending characters, generally 1 byte on Unix/Linux and 2 on Windows), you can use seek to go directly to a specified point in the file.
The seek function lets you specify a file position in bytes/characters, not in lines. In the general case, the only way to go to a specified line number is to read from the beginning and skip that many lines (minus one).
Unless you have an index mapping line numbers to byte offsets; then you can look up the specified line number in the index and use seek to jump to that location. To do this, you have to build the index separately (a process that will require reading through the entire file) and make sure the index is always up to date. If the file changes frequently, this is likely to be impractical.
I'm not aware of any existing tools for building and using such an index, but I wouldn't be surprised if they exist. But it should be easy enough to roll your own.
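For example, a rough sketch with the index kept in memory (filename hypothetical):
use strict;
use warnings;

open my $fh, '<', 'sequence.txt' or die "Can't open: $!";

# Build the index: $offset[$n] is the byte offset where line $n starts (1-based).
my @offset = ( undef, 0 );                 # line 1 starts at byte 0
push @offset, tell $fh while <$fh>;        # after reading line N, tell() gives the start of line N+1

# Use the index: jump straight to a line without rescanning.
my $want = 5000;                           # hypothetical target line
die "no such line\n" if $want >= $#offset; # the final entry is just the EOF offset
seek $fh, $offset[$want], 0;               # 0 = SEEK_SET
my $line = <$fh>;
print "line $want: $line";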
But unless scanning the file to find the line number you want is a significant performance bottleneck, I wouldn't bother with the extra complexity.

Can I construct in Perl a network-type message that has an odd byte length

$str = "0xa"; #my hex number
$m = pack("n",hex("$str")); --> the output is 000a
$m = pack("c",hex("$str")); --> the output is 0a
I need the result to be only a. The bottom line is that, with pack, I can send messages on a socket that fill a whole number of bytes (like A675). If I try to send A675B, then with pack I end up with A6750B.
A675 is two bytes. A675B is two and a half bytes. Sockets don't support sending anything smaller than a byte. You could send a flag that tells the receiver to ignore one nybble of the message, but that's about it.
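For example, one possible convention (a sketch, not any standard protocol): prefix each message with its length in nibbles, let pack("H*") pad the odd nibble with zero, and have the receiver trim it back off:
use strict;
use warnings;

# Sender: "A675B" is 5 nibbles; pack("H*") pads it to the 3 bytes A6 75 B0.
my $hex     = "A675B";
my $payload = pack( "n H*", length($hex), $hex );   # 2-byte nibble count, then padded data

# Receiver: unpack, then keep only the advertised number of nibbles.
my ( $count, $hexout ) = unpack( "n H*", $payload );
$hexout = substr( $hexout, 0, $count );             # "a675b" -- the pad nibble is dropped
print "$hexout\n";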

What's the simplest way of adding one to a binary string in Perl?

I have a variable that contains a 4-byte, network-order IPv4 address (this was created using pack and the integer representation). I have another variable, also a 4-byte network-order value, holding a subnet mask. I'm trying to AND them together and add one to get the first IP in the subnet.
To get the ASCII representation, I can do inet_ntoa($ip & $netmask) to get the base address, but it's an error to do inet_ntoa(($ip & $netmask) + 1); I get a message like:
Argument "\n\r&\0" isn't numeric in addition (+) at test.pm line 95.
So what's happening, as best as I can tell, is that Perl looks at the 4 bytes, sees that they don't represent a numeric string, and refuses to add 1.
Another way of putting it: what I want to do is add 1 to the least significant byte, which I know is the 4th byte. That is, I want to take the string \n\r&\0 and end up with the string \n\r&\1. What's the simplest way of doing that?
Is there a way to do this without having to unpack and re-pack the variable?
What's happening is that you make a byte string with $ip & $netmask, and then try to treat it as a number. This is not going to work as such. What you have to feed to inet_ntoa is:
pack("N", unpack("N", $ip&$netmask) + 1)
I don't think there is a simpler way to do it.
You're confusing integers and strings. Perhaps the following code will help:
use Socket;

my $ip      = pack("C4", 192,168,250,66);     # why not inet_aton("192.168.250.66")?
my $netmask = pack("C4", 255,255,255,0);

my $ipi      = unpack("N", $ip);              # byte string -> 32-bit integer
my $netmaski = unpack("N", $netmask);

my $ip1 = pack("N", ($ipi & $netmaski) + 1);  # mask, add 1, back to a byte string
print inet_ntoa($ip1), "\n";
Which outputs:
192.168.250.1