How does Perl know a file is binary?

I know you can use the file test operator -B to test if a file is binary, but how does Perl implement this internally?

From perldoc -f -B:
The -T and -B switches work as follows. The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set. If too many strange characters (>30%) are found, it’s a -B file; otherwise it’s a -T file. Also, any file containing null in the first block is considered a binary file. If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on a null file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in "next unless -f $file && -T $file".

According to Chapter 11 of the book Learning Perl:
The answer is **Perl cheats**: it opens the file, looks at the first few thousand bytes, and makes an educated guess. If it sees a lot of null bytes, unusual control characters, and bytes with the high bit set, then that looks like a binary file. If there’s not much weird stuff, then it looks like text. It sometimes guesses wrong. If a text file has a lot of Swedish or French words (which may have characters represented with the high bit set, as some ISO-8859-something variant, or perhaps even a Unicode version), it may fool Perl into declaring it binary. So it’s not perfect, but if you need to separate your source code from compiled files, or HTML files from PNGs, these tests should do the trick.
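The documented rule can be approximated from the shell. The following is only a rough sketch of the heuristic as perldoc describes it, not Perl's actual implementation: the function name looks_binary, the 512-byte block size, and the choice of which bytes count as "ordinary" (tab, newline, carriage return, printable ASCII) are all assumptions made for illustration.

```shell
# Hypothetical helper: exit status 0 if FILE looks binary, 1 if it looks like text.
looks_binary() {
    file=$1
    block=512   # assumed block size; the size Perl really examines may differ

    # Number of bytes actually examined.
    total=$(( $(head -c "$block" "$file" | wc -c) ))
    # Perl reports an empty file as both -T and -B; this sketch says "binary".
    [ "$total" -eq 0 ] && return 0

    # Any NUL byte in the block means binary.
    nuls=$(( $(head -c "$block" "$file" | LC_ALL=C tr -d -c '\000' | wc -c) ))
    [ "$nuls" -gt 0 ] && return 0

    # Count ordinary bytes: tab, LF, CR, and printable ASCII (octal 040-176).
    plain=$(( $(head -c "$block" "$file" | LC_ALL=C tr -d -c '\011\012\015\040-\176' | wc -c) ))
    odd=$(( total - plain ))

    # More than 30% odd bytes: call it binary.
    [ $(( odd * 100 )) -gt $(( total * 30 )) ]
}
```

A call such as looks_binary somefile && echo binary would then mimic -B on that file. Like the real test, this treats every high-bit byte as odd, so text in ISO-8859 or UTF-8 encodings can be misjudged.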

Related

Cleaning up text files with sed?

I have a bunch of text files that need cleaning up. Example
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine tuned with several options. See man strings.
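If strings is not available, or you would rather keep the file's line structure, a cruder alternative is to delete every byte that is not ordinary text. This is a minimal sketch, assuming the wanted content is plain ASCII; the function name strip_nonprintable is made up for this example.

```shell
# Hypothetical helper: copy stdin to stdout, keeping only tab, newline,
# carriage return, and printable ASCII (octal 040-176).
strip_nonprintable() {
    LC_ALL=C tr -d -c '\011\012\015\040-\176'
}
```

It would be used as strip_nonprintable < mysteryfile > cleaned.txt. Unlike strings, it does not discard short runs of printable junk, so some noise that happens to be printable will survive.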

How to rewrite a file from a shell script without any danger of truncating the file if out of disk space?

This handy perl one liner replaces all occurrences of "foo" with "bar" in a file called test.txt:
perl -pi -e 's/foo/bar/g' test.txt
This is very useful, but ...
If the file system where test.txt resides has run out of disk space, test.txt will be truncated to a zero-byte file.
Is there a simple, race-condition-free way to avoid this truncation occurring?
I would like the test.txt file to remain unchanged and the command to return an error if the file system is out of space.
Ideally the solution should be easily used from a shell script without requiring additional software to be installed (beyond "standard" UNIX tools like sed and perl).
Thanks!
In general, this can’t be done. Remember that the out-of-space condition can hit anywhere along the sequence of actions that give the appearance of in-place editing. Once the filesystem is full, perl may not be able to undo previous actions in order to restore the original state.
A safer way to use the -i switch is to use a nonempty backup suffix, e.g.,
perl -pi.bak -e 's/foo/bar/g' test.txt
This way, if something goes wrong along the way, you still have your original data.
If you want to roll your own, be sure to check the value returned from the close system call. As the Linux manual page notes,
Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.
As with everything else in life, leave yourself more margin for error. Disk is cheap. Dig out the pocket change from your couch cushions and go buy yourself half a terabyte or so.
From perldoc perlrun:
-i[extension]
specifies that files processed by the "<>" construct are to be edited in-place.
It does this by renaming the input file, opening the output file by the original
name, and selecting that output file as the default for print() statements. The
extension, if supplied, is used to modify the name of the old file to make a
backup copy, following these rules:
If no extension is supplied, no backup is made and the current file is
overwritten.
[…]
Rephrased:
The backup filename is determined from the value of the -i-switch, if one is given.
The original file is renamed to the new filename, and opened for the script. Renaming is atomic on most filesystems.
A file with the name of the original file is opened for writing. The file will start with length zero, but is not identical to the original file (which has a different name now).
After the script has finished, and if no explicit backup extension was provided, the backup file is deleted. The original file is then lost.
Should the system run out of drive space, then the new file is endangered, not the original file which was never copied or moved (at least on filesystems with an inode-like concept).
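For those who want to roll their own, one common pattern is to write the new content to a temporary file in the same directory and rename it over the original only if every step succeeded. The sketch below is one way to do that, not a guaranteed fix: safe_rewrite is a hypothetical name, and it assumes that sed exits with a non-zero status when a write fails (GNU sed does; check your own implementation).

```shell
# Hypothetical helper: safe_rewrite FILE SED_SCRIPT
# Leaves FILE untouched unless the rewritten copy was produced successfully.
safe_rewrite() {
    file=$1
    script=$2

    # Temporary file next to the original, so the final mv is a rename
    # within one filesystem rather than a copy.
    tmp=$(mktemp "$file.XXXXXX") || return 1

    # Replace the original only if sed succeeded and the rename worked.
    if sed "$script" "$file" > "$tmp" && mv "$tmp" "$file"; then
        return 0
    fi

    # Something failed: discard the partial copy and report the error.
    rm -f "$tmp"
    return 1
}
```

For the example in the question this would be called as safe_rewrite test.txt 's/foo/bar/g'. Note that mktemp creates the new file with restrictive permissions, so the original file's mode and ownership are not preserved by this sketch.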

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is, in essence, a text file, but it somehow got screwed up upon saving. (I think there are a few strange bytes at the front of the file accidentally.)
I am able to open the file, and it makes sense, using head or cat, but not using any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I can try to create a script to read this in line by line, forcing the script to read it in text format, then write it. But I wanted something quick, since I won't be doing this over & over.
Thanks!
Mike
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this perl one liner here: http://www.perlmonks.org/?node_id=619792
Try M-x find-file-literally in Emacs.
You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got a different line ending in the file (eg: carriage returns on a *nix system) or it got saved in an unexpected encoding.
You could use strings to grab "printable characters in file". You might have to play with the --encoding though I have only ever used it to grab ascii strings from executable files.

UNIX tty command and file command?

I am new to UNIX and when I was reading a book about UNIX, I came across following two problems that I didn't understand. I would really appreciate your help.
1) Look up the man page for the file command, and then use it on all files in the /dev directory. Can you group these files into two categories?
2) Run the tty command, and note the device name of your terminal. Now use this device name (/dev/pts/6) in the command cp /etc/passwd /dev/pts/6. What do you observe?
Fair question really... it's so easy for us to take so much for granted.
To read the manual page for the command called file, just type...
man file
...which will present a lot of information that will probably be quite confusing, but you'll get used to this stuff pretty quick if you keep at it. Crucially, file is a program that tries to categorise the files you ask it to. If you type...
file /dev/*
...that will do what the question asked, and invoke file with a list of the files in the /dev/ subdirectory. The list is actually prepared by the "shell" program that you're typing into, which then executes the file program and passes it the list. file then outputs some description of the files. On my computer, and where [SHELL-PROMPT] will be different on your computer, I typed file /dev/* and part of the output looked like:
[SHELL-PROMPT] file /dev/*
...lots of stuff...
/dev/cevt: character special (255/176)
/dev/console: character special (5/1)
/dev/core: symbolic link to `/proc/kcore'
/dev/cpqci: character special (10/209)
/dev/cpqhealth: directory
/dev/crom: character special (255/180)
...lots of stuff...
/dev/md8: block special (9/8)
/dev/md9: block special (9/9)
/dev/mem: character special (1/1)
/dev/mice: character special (13/63)
/dev/mouse0: character special (13/32)
/dev/mptctl: character special (10/220)
/dev/net: directory
/dev/nflog: character special (36/5)
/dev/null: character special (1/3)
/dev/parport0: character special (99/0)
...lots of stuff...
There's a filesystem entry for each directory/file combination (known as a path) in the left column, and file is describing the content in the right. Those descriptions may not make a lot of sense, but you can see some patterns: some entries are "block special", others are "character special", and some are directory, which implies you may find more files underneath (e.g. ls /dev/net/*). The numbers after "special" files are just operating system identifiers (the major and minor device numbers) that differentiate the files mentioned.
The import of this is that input and output from some devices connected to the computer is being made possible as if the device was a file in the filesystem. That "file" abstraction is being used as a general model for input and output. So, /dev/tty for example is a tty - or terminal - device. Any data you try to read from there will actually be taken from the keyboard you're using to type into the shell (in the simple case), and anything you write there will become visible in the same terminal you're typing into. /dev/null is another interesting one: you can read from it and write to it, but it's an imaginary thing that never actually provides data (it just indicates an End-of-File condition) and throws away any data written into it. You can keep reading from /dev/random and it will produce random values each time... good if you need random numbers or file content for encryption or some kind of statistical work.
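The two categories the exercise is after can also be checked directly with the shell's own file tests, -b for block special and -c for character special. A small sketch, where classify_dev is a name invented for this example:

```shell
# Hypothetical helper: print "block", "character", or "other" for each path given.
classify_dev() {
    for f in "$@"; do
        if [ -b "$f" ]; then
            echo "block $f"
        elif [ -c "$f" ]; then
            echo "character $f"
        else
            # Directories, symbolic links to regular files, and so on.
            echo "other $f"
        fi
    done
}
```

Running classify_dev /dev/* | sort groups the entries by category, which makes the block/character split easier to see than in the raw file output.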
2) Run the tty command, and note the device name of your terminal. Now use this device name (/dev/pts/6) in the command cp /etc/passwd /dev/pts/6. What do you observe?
By typing "tty" you can ask for the device representing your terminal...
[SHELL-PROMPT] tty
/dev/pts/11
But, I just said /dev/tty is another name for the same thing, so there's normally no need to use the "tty" program to find this more specific name. Still, if you create a couple terminal windows to your host, and type tty in each, you will see that each shell is connected to a different pseudo-terminal device. Still, each shell - and program run from the shell - can by default also refer to its own terminal input and output device as /dev/tty... it's a convenient context-sensitive name. The command...
cp /etc/passwd /dev/pts/6
...where you replace 6 with whatever your tty program really reported (e.g. 11 in my case), does the same thing as...
cp /etc/passwd /dev/tty
...it just reads the contents of the file /etc/passwd and writes them out on your screen. Now, the problem is that /etc/passwd looks like a lot of unintelligible junk to the average person - it's no wonder you couldn't make sense of what was happening. Try this instead...
echo "i said hello" > /tmp/hello.file
cp /tmp/hello.file /dev/tty
...and you'll see how to direct some specific, recognisable content into a new file (in this case in the tmp "temporary" directory, where the file will typically disappear when you reboot your PC), and then copy that file content back to your screen.
(If you have logged on in two terminal windows, you can even go into one shell and copy the file to the /dev/pts/NN device reported by the other shell, effectively sending a message to the other window. You can even bypass the file and echo 'boo' > /dev/pts/NN. You'll only have permissions to do this if the same userid is logged into both windows.)
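Since the write only works when you have permission on the other terminal's device, it can be worth checking first. A minimal sketch, assuming you pass in the device path that tty printed in the other window; send_to_terminal is a hypothetical name:

```shell
# Hypothetical helper: send_to_terminal DEVICE MESSAGE
send_to_terminal() {
    dev=$1
    msg=$2

    # Refuse anything that is not a writable character device.
    if [ ! -c "$dev" ] || [ ! -w "$dev" ]; then
        echo "cannot write to $dev" >&2
        return 1
    fi

    # Writing to the device makes the text appear in that terminal.
    printf '%s\n' "$msg" > "$dev"
}
```

For example, send_to_terminal /dev/pts/11 boo would print boo in the window whose tty command reported /dev/pts/11.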

Read and delete text between two strings in perl

I need a way to read and delete text between two different strings found in some file, then delete the two strings. Like a "cut command." I would like to have the text stored in a variable.
I saw the post about reading text between two strings, but I could not figure out how to delete it as well.
I intend to execute the stored text in bash. Efficiency is desirable. This script is not going to be used on large files, but it may be executed many times sequentially so the faster the script works the better.
The stored text will usually have special characters.
Thanks
Specify the beginning and ending strings via the environment, and the file to use on the perl command line:
export START_STRING='abc def'
export END_STRING='ghi jkl'
perl -0777 -i -wpe 's/\Q$ENV{START_STRING}\E(.*)\Q$ENV{END_STRING}\E//s; print STDERR $1' file_to_use 2>savedtext
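If Perl is not an option, much the same cut can be sketched with awk, using index() so that the two strings are matched literally and special characters need no escaping. This is an illustrative alternative rather than a drop-in equivalent: cut_between and its argument order are invented here, it stops at the first occurrence of the end string (the Perl pattern above is greedy and runs to the last one), and it adds a final newline if the file lacked one.

```shell
# Hypothetical helper: cut_between FILE START END SAVED
# Removes START...END from FILE and stores the text between them in SAVED.
cut_between() {
    file=$1
    tmp=$(mktemp "$file.XXXXXX") || return 1

    # The strings travel via the environment so awk does not interpret
    # backslashes in them, as it would with -v assignments.
    if START_STRING=$2 END_STRING=$3 SAVED_FILE=$4 awk '
        { buf = buf $0 "\n" }              # slurp the whole file
        END {
            s = ENVIRON["START_STRING"]
            e = ENVIRON["END_STRING"]
            out = ENVIRON["SAVED_FILE"]
            i = index(buf, s)
            if (i == 0) { printf "%s", buf; exit }   # no start string: no change
            rest = substr(buf, i + length(s))
            j = index(rest, e)
            if (j == 0) { printf "%s", buf; exit }   # no end string: no change
            printf "%s", substr(rest, 1, j - 1) > out
            printf "%s%s", substr(buf, 1, i - 1), substr(rest, j + length(e))
        }
    ' "$file" > "$tmp" && mv "$tmp" "$file"; then
        return 0
    fi

    rm -f "$tmp"
    return 1
}
```

With the strings from the example above, the call would be cut_between file_to_use 'abc def' 'ghi jkl' savedtext.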