Perl and reading files with different encodings - perl

I am using a Perl script to read in a file, but I'm not sure what encoding the file is in. My file is a list of book titles, and each book has other info associated with it (author, publication date, etc.), so each title sits within a discrete chunk of data for that book. I iterate through the file line by line until I find a match for the regular expression '/Book Title: (.*)/' and take what's in the parentheses. Then I create a separate .txt file named after the book. However, on my Unix server, when I look at the name of the file, it's not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'.
What is this '^M'? Is that a weird end-of-line encoding I'm not taking into account? I tried chomp but it doesn't seem to be working. What is the best file encoding for working with Perl?

It's the additional carriage return character that Windows systems insert before line feed characters (M is the 13th letter, hence ASCII 13 is visualised as ^M).
It has nothing to do with file encoding; it's just the line ending policy biting you. Perl is usually good at handling line ending characters correctly, but on a Unix machine chomp() removes only the trailing "\n", so the "\r" of a Windows line ending stays on the line and ends up inside your capture. You have to remove it yourself: use s/\r?\n\z// instead of chomp() to strip both characters.
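As a minimal sketch of that advice (the helper names strip_eol and book_title are made up for illustration, and the sample line is invented), the line ending can be removed before the regular expression runs, so the capture never sees the carriage return:

```perl
use strict;
use warnings;

# Remove a trailing LF, CRLF, or lone CR from a line.
# chomp alone only removes "\n" (the default value of $/), so "\r" survives.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z|\r\z//;
    return $line;
}

# Pull the title out of a "Book Title: ..." line, or return undef.
sub book_title {
    my ($line) = @_;
    $line = strip_eol($line);
    return $line =~ /Book Title: (.*)/ ? $1 : undef;
}

my $title = book_title("Book Title: LordOfTheFlies\r\n");
print "$title.txt\n";    # prints LordOfTheFlies.txt
```

The same substitution works on lines read from a filehandle in a while loop; it is harmless on files that already use plain "\n" endings.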

Before processing the file, you need to know the encoding of the file, which is determined by the producer of the file.
That "^M" is control-M, a carriage return, which Unix line endings do not use. It looks like the file was created on Windows and transferred to Unix. The carriage returns can also survive an FTP transfer when a text file is transferred in binary mode, because binary mode skips the line ending conversion.

Try chop after 'chomp'. Chomp removes the newline character, and chop then removes whatever character is last, so only use it when you are sure that character is the carriage return. s/\r// is also good.
For your general question, you might want to use an appropriate module for the file type you have, to make your life easier with Perl.

Related

Perl Command Understanding

I have been working on a product code to resolve an issue but am stuck on a line of code
Can anyone help me understand what exactly does this command do?
perl -MText::CSV -lne 'BEGIN{$p = Text::CSV->new()} print join "|", $p->fields() if $p->parse($_)' /home/daily/${FULL_FILENAME} > /home/output.txt
I think it's to copy the file to my home location with some transformations, but I'm not sure exactly.
This is a slightly broken program that translates a comma-separated values (CSV) file to a pipe-separated values file.
The particular command-line switches are documented in perlrun. This is a "one-liner", so you can read about those to see what's going on there.
The Text::CSV module deals with CSV files, and the program is parsing a line from the file and re-outputting as a pipe-separated file.
But this program deals with each line as a complete record. That might be fine for you, but at some point you might end up with a literal value that has a newline in it, like a,"b\nc",d. Now reading line by line breaks the program, since the quotes appear to be unclosed within the first line. Not only that, it blindly concatenates the parsed fields without considering whether any of the fields should be quoted. It might be unlikely that a pipe character would be in the data, but the problem isn't its rarity but the consequences and costliness when it does show up.
The rewrite.pl example script in the related module Text::CSV_XS is a tool that could replace this one-liner. It properly reads the input and knows how to properly translate it.
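To make the quoting problem concrete, here is a small core-Perl sketch. The function name join_psv is hypothetical, and the rule it applies (wrap the field in double quotes and double any embedded quote) is the usual CSV convention rather than anything taken from the one-liner:

```perl
use strict;
use warnings;

# Join fields with "|", quoting any field that contains the separator,
# a double quote, or a line break.
sub join_psv {
    my @fields = @_;
    for my $f (@fields) {
        if ($f =~ /[|"\r\n]/) {
            (my $q = $f) =~ s/"/""/g;    # double embedded quotes
            $f = qq{"$q"};               # then wrap the field
        }
    }
    return join '|', @fields;
}

print join_psv('a', 'b|c', 'd'), "\n";    # prints a|"b|c"|d
```

A plain join "|" would have printed a|b|c|d here, which a reader cannot tell apart from a four-field record. In practice Text::CSV can do the quoting itself if a second object is created with sep_char set to "|" and used for output.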

Emacs, hex editor and determining that a text file is in DOS format

Here's the simplified version of my problem: I have two text files, different data but identical first line and generated by the same program, although possibly on different OS's. When emacs reads one of them it says it is in DOS format, while it does not when reading the other.
I used several hex editors (Bless, GHex, Okteta on Kubuntu) and in all of them I see the same thing, which is that every line ends with the sequence 0D 0A (CR LF) for both files, including the last line.
So my question is: how does Emacs determine what is a DOS file and what is not, and is there something else in the file that the hex editor would not show, or add?
Both files have the same name, in different directories. Also I came upon this problem because I have C++ code that parses strings and failed on the file that emacs lists as DOS, so the issue is really with the file content.
Last note: you will notice there is no C/C++ tag. I'm not looking for advice on how to modify my C++ code to handle the situation. I know how to do it.
Thanks for your help
Emacs handles DOS files by converting each CRLF to LF when reading the file, and then each LF back into CRLF when writing it out. So if there were a lone LF in the file, reading and then writing it would add a CR even if the buffer had not been modified. For this reason, if there is such a lone LF hidden in the middle of the file, Emacs will treat the file not as DOS but as a Unix file.
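A lone LF is easy to miss in a hex editor when the file is large. A rough way to hunt for one is to count the line ending kinds and record where the first lone LF sits. This is only a sketch: the function name eol_stats and the sample string are made up, and it says nothing about how Emacs itself is implemented.

```perl
use strict;
use warnings;

# Count CRLF pairs, lone LFs, and lone CRs in a byte string, and record
# the byte offset of the first lone LF (undef if there is none).
sub eol_stats {
    my ($bytes) = @_;
    my %n = (crlf => 0, lf => 0, cr => 0, first_lone_lf => undef);
    while ($bytes =~ /(\r\n|\r|\n)/g) {
        my $end = pos($bytes);
        if    ($1 eq "\r\n") { $n{crlf}++ }
        elsif ($1 eq "\r")   { $n{cr}++ }
        else {
            $n{lf}++;
            $n{first_lone_lf} //= $end - 1;
        }
    }
    return \%n;
}

my $s = eol_stats("one\r\ntwo\nthree\r\n");
print "CRLF=$s->{crlf} LF=$s->{lf} first lone LF at byte $s->{first_lone_lf}\n";
# prints CRLF=2 LF=1 first lone LF at byte 8
```

For a real file, open it with the ':raw' layer and read it in one go (for example with a localized $/) so that no line ending translation happens before the count.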

how can we identify notepad file?

How can we tell apart Notepad files that were created on two different computers? Is there any way to get information about which computer a file was created on, or whether it was created on XP or Linux?
If you right click on the file, you should be able to see the permissions and attributes of the file.
Check the end of each line. Under GNU/Linux lines end with \n (ASCII 0x0A), while under Microsoft Windows they end with \r\n (ASCII 0x0D 0x0A).
Wikipedia: https://en.wikipedia.org/wiki/Newline
Found this: http://bit.ly/J258Mr
It is about identifying a Word document, but some of the info is relevant:
"To see on which computer the document had been created, open the Word
document in a hex editor and look for "PID_GUID". This is followed by
a globally unique identifier that, depending upon the version of Word
used, may contain the MAC address of the system on which the file was
created.
Checking the user properties (as already mentioned) is a good way to
see who the creator of the original file was...so, if the document was
not created from scratch and was instead originally created on another
system, then the user information will be for the original file.
Another way to locate the "culprit" in this case is to parse the
contents of the NTUSER.DAT files for each user on each computer. While
this sounds like a lot of work, it really isn't...b/c you're only
looking for a couple of pieces of information. Specifically, you're
interested in the MRU keys for the version of Word being used, as well
as perhaps the RecentDocs keys."
The one thing I can think of off the top of my head is inspecting the newline characters in your file (I'm assuming your files do have multiple lines). If the file was generated on Windows, a newline is the combination of carriage return and line feed characters (CR+LF), whereas a simple line feed (LF) would be a hint that the file was generated on a Linux machine.
Right-click on the file -> Details. You can see the computer name where it was created and the date.

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is, in essence, a text file, but it somehow got screwed up upon saving. (I think there are a few strange bytes at the front of the file accidentally.)
I am able to open the file, and it makes sense, using head or cat, but not using any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I can try to create a script to read this in line by line, forcing the script to read it in text format, then write it. But I wanted something quick, since I won't be doing this over & over.
Thanks!
Mike
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this perl one liner here: http://www.perlmonks.org/?node_id=619792
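The one-liner above deletes every non-ASCII byte in the file. If the damage really is limited to a few bytes at the front, a narrower sketch is to look at that leading run first and remove only it. The helper names below are hypothetical, and the sample data (a UTF-8 byte order mark in front of "hello") is an assumption about what such junk might look like:

```perl
use strict;
use warnings;

# Return the run of non-ASCII bytes at the start of a byte string as hex,
# e.g. "EF BB BF" for a UTF-8 byte order mark, or "" if there is none.
sub leading_junk_hex {
    my ($bytes) = @_;
    my $junk = $bytes =~ /\A([^\x00-\x7F]+)/ ? $1 : '';
    return uc(join(' ', unpack('(H2)*', $junk)));
}

# Drop that leading run and leave the rest of the data untouched.
sub strip_leading_junk {
    my ($bytes) = @_;
    $bytes =~ s/\A[^\x00-\x7F]+//;
    return $bytes;
}

my $data = "\xEF\xBB\xBFhello\n";
print leading_junk_hex($data), "\n";    # prints EF BB BF
print strip_leading_junk($data);        # prints hello
```

Both functions expect raw bytes, so a real file should be opened with the ':raw' layer. Control characters below 0x80 count as ASCII and are not touched by either this sketch or the one-liner.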
Try M-x find-file-literally in Emacs.
You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got a different line ending in the file (eg: carriage returns on a *nix system) or it got saved in an unexpected encoding.
You could use strings to grab the printable characters in the file. You might have to play with the --encoding option, though I have only ever used it to grab ASCII strings from executable files.

Rename file containing '©' character

We received as input in our application (running on Windows) a list of files. These files were automatically extracted from a database with a script.
Apparently some of the names contain special characters (like accents), and these characters are rendered as '©' on our side.
How can we programmatically rename these text files (around 900'000) to get rid of this character?
We can neither change the source nor re-extract the files.
The problem is that because of this character another program involved with our system does not accept the files.
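Have a look at the Unix command rename. It allows you to apply a Perl regex to the names of a bunch of files. In this case you might want something like the following, which deletes every character that is not a letter, digit, dot, underscore or hyphen (the /g flag applies it to the whole name, and keeping the dot preserves the extension):
$ rename 's/[^a-zA-Z0-9._-]//g' *
In Debian the rename command is part of the perl package. It should also be available on CPAN.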
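Before touching 900'000 files it may be worth previewing what such a substitution does. The following is a dry-run sketch only: clean_name is a hypothetical helper, the file names are invented, and nothing is renamed.

```perl
use strict;
use warnings;

# Delete every character outside a conservative ASCII set. The dot, the
# underscore and the hyphen are kept so that extensions survive.
sub clean_name {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9._-]//g;
    return $name;
}

# Dry run on made-up names: print the plan instead of calling rename().
for my $old ("r\xC3\xA9sum\xC3\xA9.txt", "plain.txt") {
    my $new = clean_name($old);
    print "would rename to $new\n" if $new ne $old;
}
# prints: would rename to rsum.txt
```

Two different original names can collapse to the same cleaned name, so a real run should check that the target does not already exist (for example with -e) before calling rename.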
I ended up creating a new script that reads the input files and searches for special characters in their names.
It was quite easy indeed:
string newName = filename.Replace("©", "e");
Since the '©' is in the filename, the script (in C#) is able to recognize it and replace the match accordingly. In this way I can loop through all the folders and subfolders, simply reading each filename and changing the special characters.
Thank you all for the contributions!