Perl write to file returns huge weird stacktrace

I have the following problem: when I try to save a file that contains a semicolon in its name, it returns a huge and weird "stacktrace" of characters on the page. I've tried to escape, trim, and replace those semicolons, but the result is still the same. I use the following regex:
$value =~ s/([^a-zA-Z0-9_\-.]|;)/uc sprintf("%%%02x",ord($1))/eg;
(I've even added the |; part separately.)
So, when I open the file for writing and call the print function, it outputs lots of weird stuff, like this:
PK!}�3y�[Content_Types].xml ���/�h9\�?�0���cz��:� �s_����o���>�T�� (it is a huge one, this is just a part of it).
Is there any way I could avoid this?
Thank you in advance!
EDIT:
Just interested - what is the PK responsible for in this string? I mean, I can understand that those chars are just the contents of the file, but what is PK? And why does it show the content type?
EDIT 2.0:
I'm uploading a .docx file - when the name doesn't contain a semicolon, it all works fine. This is the code for saving the file:
open(QSTR, ">", $dest_file) or die "can't open output file: $dest_file";
binmode(QSTR);   # the contents are binary (.docx), so avoid newline translation
print QSTR $value;
close(QSTR);
EDIT 3.0
This is a .cgi script that is called after posting some data to the server. It has to save some info about the uploaded file (name, contents, size) to a temp file as key-value pairs. Any file whose name contains a semicolon causes this error.
EDIT 4.0
Found the cause:
CGI's param function treats the semicolon as a parameter delimiter when parsing the uploaded params! Is there any way to escape it in the file header?
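One way to escape it (a sketch, not the poster's code, using the URI::Escape module; $filename is a hypothetical variable holding the raw name):

use URI::Escape qw(uri_escape);

# uri_escape's default unsafe set includes ";", so it becomes "%3B".
# CGI splits parameters on [&;] before URL-decoding each pair, so the
# encoded semicolon should survive the split and be decoded back afterwards.
my $safe_name = uri_escape($filename);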

The PK at the start of the file means it is a ZIP-compressed file, which is what a .docx is (PK are the initials of Phil Katz, author of the ZIP format).
One guess: is ; not a valid character in filenames at the destination?
Your regexp also has a redundant part: inside a character class the dot is literal, and ; is already excluded by the negated class, so the extra |; changes nothing. Try this:
$value =~ s/([^a-zA-Z0-9_\-.]|;)/uc sprintf("%%%02x",ord($1))/eg;
Try this:
#replace every invalid char (including the semicolon) with an underscore
$value =~ s/[^a-zA-Z0-9_.-]/_/g;
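For example, with a made-up filename, just to show the effect:

my $value = "my;file.docx";
$value =~ s/[^a-zA-Z0-9_.-]/_/g;   # $value is now "my_file.docx"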

Related

My Perl variable to variable substitutions do not work

I have a substitution to make in a Perl script which I cannot seem to get working. I have a string in a text file which has the form:
T+30H
The string T+30H has to be written into many files and has to change from file to file. It is sometimes two and sometimes three digits. First I define the variable:
my $wrffcr=qr{T+\d+H};
After reading the file containing the string, I have the following substitution command (starting with the file capture):
@scrptlines = <$NCLSCRPT>;
foreach $scrptlines (@scrptlines) {
    $scrptlines =~ s/$wrffcr/T+$fcrange2[$jj]H/g;
}
$fcrange2[$jj] is defined, and I confirm its value by printing it just before the above four lines of code:
print "$fcrange2[$jj]\n";
When I run my script, nothing changes for this particular substitution. I suspect it is to do with the way I define the string to be substituted.
I will appreciate any assistance.
Zilore Mumba
Watch out for the first + in my $wrffcr=qr{T+\d+H};. It makes the pattern match one or more Ts, not a T followed by a +. You probably want
my $wrffcr=qr{T\+\d+H};
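A quick check of the corrected pattern (the input line and range value are made up; $range stands in for $fcrange2[$jj]):

my $wrffcr = qr{T\+\d+H};
my $line   = "wrf_run_T+30H.ncl";
my $range  = 48;                       # stand-in for $fcrange2[$jj]
$line =~ s/$wrffcr/T+${range}H/g;      # $line is now "wrf_run_T+48H.ncl"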

perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present

I'm using MARC::Lint to lint some MARC records, but every now and then I get an error (on about 1% of the files):
utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.
The problem is that I've tried different methods but cannot find "\xCA" in the file...
My script is:
#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;
use open OUT => ':utf8';

my $lint = MARC::Lint->new;
my $filename = shift;
my $file = MARC::File::USMARC->in( $filename );

while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while
and the file can be downloaded here: http://eroux.fr/I14376.mrc
Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?
The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.
The problem is a bad data file.
The file contains a "directory" of where the information is located in the file. The following is a human-readable rendition of the directory for the file you provided:
tagno|offset|len # Offsets are from the start of the data portion.
001|00000|0017 # Lengths include the single-byte field terminator.
006|00017|0019 # Offset and lengths are in bytes.
007|00036|0015
008|00051|0041
035|00092|0021
035|00113|0021
040|00134|0018
050|00152|0022
066|00174|0009
245|00183|0101
246|00284|0135
264|00419|0086
300|00505|0034
336|00539|0026
337|00565|0026
338|00591|0036
546|00627|0016
500|00643|0112
505|00755|9999 <--
506|29349|0051
520|29400|0087
533|29487|0115
542|29602|0070
588|29672|0070
653|29742|0013
710|29755|0038
720|29793|0130
776|29923|0066
856|29989|0061
880|30050|0181
880|30231|0262
Notice the length of the field with tag 505: 9999. This is the maximum value supported (because the length is stored as four decimal digits). The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes in size.
What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BA, the encoding of ʼ.) Later, when the module attempts to decode that text, an error is thrown. (CA must be followed by another byte to be valid.)
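You can reproduce the error outside MARC::Lint with any multi-byte UTF-8 sequence cut in half (a minimal sketch using the core Encode module; the bytes here are illustrative):

use Encode qw(decode FB_CROAK);
binmode(STDOUT, ':utf8');

my $full = "\xCA\xBA";   # a complete two-byte UTF-8 sequence
my $cut  = "\xCA";       # the same sequence truncated mid-character

print decode('UTF-8', $full, FB_CROAK), "\n";  # decodes fine
print decode('UTF-8', $cut,  FB_CROAK), "\n";  # dies: utf8 "\xCA" does not map to Unicode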
Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
Still, the module should handle this better. When no end-of-field marker is found where one is expected, it could read until it finds the marker instead of relying on the stored length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism to report these problems ($marc->warnings).
In fact, if it hadn't died (say if the cut happened to occur in between characters instead of in the middle of one), $marc->warnings would have returned the following message:
field does not end in end of field character in tag 505 in record 1
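(For comparison, non-fatal decoding is what Encode does by default: with no CHECK argument, malformed bytes are replaced with U+FFFD instead of throwing. A one-line sketch, where $bytes is hypothetical:)

use Encode qw(decode);
my $text = decode('UTF-8', $bytes);   # bad bytes become U+FFFD, no exception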

Removing quotes that seem to be nonexistent in MATLAB

I have an application that translates the .csv files I currently have into the correct format I need. But the files that I have seem to have double quotes ('"') around every field, which will not work with the program. As a result, I'm using this command to remove them:
for m = 1:currentsize(1)
    for n = 1:currentsize(2)
        replacement{m,n} = strrep(current{m,n}, '"', '');
    end
end
I'm not entirely sure this works, though, as it spits this back at me as it runs:
Warning: Inputs must be character arrays or cell arrays of strings.
When I open the file in MATLAB, it seems to have only the single quotes around it, which is normal for any other file. However, when I open it in Notepad++, it seems to have double quotes ('"') around everything. Is there another way to remove these double quotes? My code doesn't seem to do anything. Also, after using xlswrite to write the replacement cell array to a .csv file, the file appears corrupted. Any idea why?
So, my questions are:
Is there any way to remove the quotes in a more efficient manner or without rewriting to a CSV?
and
What exactly is causing the corruption in the xlswrite function? The variable replacement seems perfectly normal.
Thanks in advance!
Regarding the "corrupted" file. That's not a corrupted file, that's an xls file (not xlsx). You could verify this opening the text file in a hex editor to compare the signature. This happens when you chose no file extension or ask excel to write a file which can't be encoded into csv. I assume it's some character which causes the problems, try to isolate the critical line writing only parts of the cell.
Regarding your warning: without the actual input I can only guess what's wrong. Again, I can only give some advice for debugging the problem yourself. Run this code:
lastwarn('')
for m = 1:currentsize(1)
    for n = 1:currentsize(2)
        replacement{m,n} = strrep(current{m,n}, '"', '');
        if ~isempty(lastwarn)
            keyboard
            lastwarn('')
        end
    end
end
This will launch the debugger whenever a warning is raised, allowing you to check the content of current{m,n}.
It is mentioned in the documentation of strrep that:
The strrep function does not find empty strings for replacement. That is, when origStr and oldSubstr both contain the empty string (''), strrep does not replace '' with the contents of newSubstr.
You can use the user-contributed substr function like this:
for m = 1:currentsize(1)
    for n = 1:currentsize(2)
        replacement{m,n} = substr(current{m,n}, 2, -1);
    end
end
Please use the xlswrite function like this:
[status,message] = xlswrite(___)
and check status: if it is zero, the function did not succeed and an error message is returned in message as a string.

How can I save Perl/Expect output that contains mixed ASCII content?

I have a Perl script that uses the Expect module to log in to a remote system. I'm getting the final output of the interaction with the before method:
$exp->before();
I'm saving this to a text file. When I cat the file it displays fine in the terminal, but when I open the text file in an editor or try to process it, the formatting is bizarre:
[H[2J[1;19HCIRCULATION ACTIVITY by TERMINAL (Nov 6,14)[11;1H
Is there a better way to save the output?
When I run enca, the file is identified as:
7bit ASCII characters
Surrounded by/intermixed with non-text data
You can remove the non-ASCII chars:
$str1 =~ s/[^[:ascii:]]//g;
print "$str1\n";
I was able to remove the ANSI escape codes from my output by using the Text::ANSI::Util library's ta_strip() function:
use Text::ANSI::Util qw(ta_strip);

my $ansi_string  = $exp->before();
my $clean_string = ta_strip($ansi_string);

Perl: Problem with changing encoding in the middle of reading a file

I am using Perl to load some 'macro' files. These macros can, however, be encoded in various encodings, so a directive is defined for users writing their macros (e.g.
#encoding iso-8859-2
at the beginning of the macro).
Every time this directive is encountered in the macro, a function that sets the encoding is called; it looks something like this:
sub change_encoding {
    my ($file_handle, $encoding) = @_;
    $file_handle->flush();
    binmode($file_handle);                          # get rid of IO layers
    binmode($file_handle, ":encoding($encoding)");
}
The problem is that when I read the macro using the standard
while ($line = <$file_handle>) {
    process_macro($line);
}
I get messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples, and I was able to get half of the string as \xXY codes and the other half as correctly decoded characters, like here:
sub macro5_fn {
    print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}
If I put more comments before the function, all the characters are OK:
sub macro5_fn {
    print "žluťoučký kůň úpěl ďábelské ódy\n";
}
Simply put, the number of correctly decoded characters depends on the distance of those characters from the #encoding directive; the ones that are close are not decoded correctly.
It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?
Thank you for your answers.
The problem is that <> reads more than just one line, so the next line or so is interpreted under the old encoding before you ever see the #encoding directive for the new one.
Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
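A minimal sketch of that approach ($macro_file and the default encoding are assumptions; process_macro is the question's routine):

use Encode qw(decode);

open my $fh, '<:raw', $macro_file or die "can't open $macro_file: $!";
my $encoding = 'utf-8';                     # assumed default until a directive is seen
while (my $raw = <$fh>) {
    # <$fh> still splits on "\n" bytes, which is safe for ISO-8859-* and UTF-8
    if ($raw =~ /^#encoding\s+(\S+)/) {
        $encoding = $1;                     # takes effect from the very next line
        next;
    }
    process_macro(decode($encoding, $raw));
}
close $fh;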