How are perl's -T and -B implemented?

What does perl's -T function really do? From the man page on perlfunc:
-T File is an ASCII text file (heuristic guess).
-B File is a "binary" file (opposite of -T).
Is the -B option simply equivalent to ! -T, or is it merely an inversion of the heuristic, such that a file may sometimes be true for both -B and -T? Does the heuristic have, say, a threshold for control characters? Does it ignore tabs, EOLs, EOFs and NULs?

From the same page:
The -T and -B switches work as follows.
The first block or so of the file is examined to see if it is valid UTF-8 that includes non-ASCII characters. If so, it's a -T file. Otherwise, that same portion of the file is examined for odd characters such as strange control codes or characters with the high bit set. If more than a third of the characters are strange, it's a -B file; otherwise it's a -T file. Also, any file containing a zero byte in the examined portion is considered a binary file. (If executed within the scope of a use locale which includes LC_CTYPE, odd characters are anything that isn't a printable nor space in the current locale.) If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on an empty file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in next unless -f $file && -T $file.

Related

Converting only non utf-8 files to utf-8

I have a set of md files, some of them are utf-8 encoded, and others are not (windows-1256 actually).
I want to convert only non-utf-8 files to utf-8.
The following script can partly do the job:
for file in *.md;
do
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
done
I still need to exclude the original utf-8 files from this process (maybe using the file command?). Try the following command to understand what I mean:
file --mime-encoding *
Notice that although the file command isn't smart enough to detect the right character set of non-utf-8 files, it's enough in this case that it can distinguish between utf-8 and non-utf-8 files.
Thanks in advance for help.
You can use for example an if statement:
if file --mime-encoding "$file" | grep -v -q utf-8 ; then
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
fi
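If grep doesn't select any line, it returns a status code indicating failure. With -v it selects only lines that do not contain utf-8, and -q suppresses its output, so the condition is true only for non-utf-8 files. The if statement tests that status code.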
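Putting the loop from the question and the test from the answer together might look like the sketch below. It assumes the file and iconv utilities are installed; the function names and the .utf8.md output suffix are hypothetical choices made for this example, and treating us-ascii as "already fine" is an added assumption (pure ASCII is valid UTF-8).

```shell
# is_utf8_compatible ENCODING: succeed for encodings that need no conversion.
# Plain ASCII is already valid UTF-8, so it is skipped as well (an assumption
# of this sketch; the answer above only tests for "utf-8").
is_utf8_compatible() {
    case "$1" in
        utf-8|us-ascii) return 0 ;;
        *)              return 1 ;;
    esac
}

# convert_md_files: convert every non-UTF-8 *.md file in the current directory.
convert_md_files() {
    for file in *.md; do
        [ -f "$file" ] || continue                 # also skips an unmatched glob
        enc=$(file -b --mime-encoding "$file")     # -b: print the encoding only
        if ! is_utf8_compatible "$enc"; then
            iconv -f windows-1256 -t utf-8 "$file" > "${file%.md}.utf8.md"
        fi
    done
}
```

Run convert_md_files from the directory that holds the files. Comparing the encoding name exactly, rather than grepping the whole output line, avoids a false match when a file name happens to contain "utf-8".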

Why does "sed -n -i" delete existing file contents?

Running Fedora 25 server edition. sed --version gives me sed (GNU sed) 4.2.2 along with the usual copyright and contact info. I've created a text file with sudo vi ./potential_sed_bug. Vi shows the contents of this file (with :set list enabled) as:
don't$
delete$
me$
please$
I then run the following command:
sudo sed -n -i.bak /please/a\testing ./potential_sed_bug
Before we discuss the results; here is what the sed man page says:
-n, --quiet, --silent
suppress automatic printing of pattern space
and
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied). The default operation mode is to break symbolic and hard links. This can be changed with --follow-symlinks and --copy.
I've also looked at other sed command references to learn how to append with sed. Based on my understanding from the research I've done, the resulting file content should be:
don't
delete
me
please
testing
However, running sudo cat ./potential_sed_bug gives me the following output:
testing
In light of this discrepancy, is my understanding of the command I ran incorrect or is there a bug with sed/the environment?
tl;dr
Don't use -n with -i: unless you use explicit output commands in your sed script, nothing will be written to your file.
Using -i produces no stdout (terminal) output, so there's nothing extra you need to do to make your command quiet.
By default, sed automatically prints the (possibly modified) input lines to whatever its output target is, whether implied or explicitly specified: by default, to stdout (the terminal, unless redirected); with -i, to a temporary file that ultimately replaces the input file.
In both cases, -n suppresses this automatic printing, so that - unless you use explicit output functions such as p or, in your case, a - nothing gets printed to stdout / written to the temporary file.
Note that the automatic printing applies to the so-called pattern space, which is where the (possibly modified) input is held; explicit output functions such as p, a, i and c do not print to the pattern space (for potential subsequent modification), they print directly to the target stream / file, which is why a\testing was able to produce output, despite the use of -n.
Note that with -i, sed's implicit printing / explicit output commands only print to the temporary file, and not also to stdout, so a command using -i is invariably quiet with respect to stdout (terminal) output - there's nothing extra you need to do.
To give a concrete example (GNU sed syntax):
Since the use of -i is incidental to the question, I've omitted it for simplicity. Note that -i prints to a temporary file first, which, on completion, replaces the original. This comes with pitfalls, notably the potential destruction of symlinks; see the lower half of this answer of mine.
# Print input (by default), and append literal 'testing' after
# lines that contain 'please'.
$ sed '/please/ a testing' <<<$'yes\nplease\nmore'
yes
please
testing
more
# Adding `-n` suppresses the default printing, so only `testing` is printed.
# Note that the sequence of processing is exactly the same as without `-n`:
# If and when a line with 'please' is found, 'testing' is appended *at that time*.
$ sed -n '/please/ a testing' <<<$'yes\nplease\nmore'
testing
# Adding an unconditional `p` (print) call undoes the effect of `-n`.
$ sed -n 'p; /please/ a testing' <<<$'yes\nplease\nmore'
yes
please
testing
more
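The examples above leave out -i. A minimal sketch that puts it back, assuming GNU sed (the one-line a text form and -i.bak are used as in the question) and throwaway files under a mktemp directory; the file names with_n and without_n are made up for this illustration:

```shell
# Two identical copies of the four-line file from the question.
work=$(mktemp -d)
printf '%s\n' "don't" delete me please > "$work/with_n"
cp "$work/with_n" "$work/without_n"

# With -n, only the explicitly appended text reaches the rewritten file.
sed -n -i.bak '/please/a testing' "$work/with_n"

# Without -n, every input line is auto-printed into the rewritten file too.
sed -i.bak '/please/a testing' "$work/without_n"

echo "--- with -n:";    cat "$work/with_n"
echo "--- without -n:"; cat "$work/without_n"
echo "--- backup:";     cat "$work/with_n.bak"
```

The .bak copies keep the original four lines in both cases, which is how the file from the question can be recovered.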

How to convert unicode to ASCII?

I must remove Unicode characters from many files (many cpp files!) and I'm looking for a script or something similar to remove them. The files are in many folders!
If you have it, you should be able to use iconv (the command-line tool, not the C function). Something like this:
$ for a in $(find . -name '*.cpp') ; do iconv -f utf-8 -t ascii -c "$a" > "$a.ascii" ; done
The -c option to iconv causes it to drop characters it can't convert. Then you'd verify the result, and go over them again, renaming the ".ascii" files to the plain filenames, overwriting the Unicode input files:
$ for a in $(find . -name '*.ascii') ; do mv "$a" "${a%.ascii}" ; done
Note that both of these commands are untested and assume file names without whitespace; verify by adding echo after the do in each to make sure they seem sane.
Open the srt file in Gaupol, click on file, click on save as, drop menu for character encoding, select UTF-8, save the file.

How to search and replace in text files only?

I have a directory containing a bunch of files, some text some binary, with no consistent naming. I want to search and replace a string in text files only. So I went with:
perl -i -pne 's#/some/text/to/replace#/replacement/text#' *
Remove the -i option and you will see that binary files get caught. How do I modify this one-liner to skip binary files?
ack -n --text --sort -f . | xargs perl -i -pne 's…'
Abusing ack goes much quicker than writing your own solution with -T.
Well, this is all based on what your definition of a text file is. Perl 5 has the -T filetest operator that will tell you if a filename or filehandle is a text file (using Perl 5's definition):
perl -i -pne 'BEGIN{@ARGV=grep-T,@ARGV}s#regex#replacement#' *
The BEGIN block will filter out any files that don't pass the -T test, so they won't even be read (except for their first block because that is what -T uses to determine if they are text).
From perldoc -f -X:
The -T and -B switches work as follows. The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set. If too many strange characters (>30%) are found, it's a -B file; otherwise it's a -T file. Also, any file containing a zero byte in the first block is considered a binary file. If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on an empty file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in next unless -f $file && -T $file.
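A hedged sketch of the BEGIN-block filter in action, assuming a POSIX shell with perl, mktemp and cmp. The sample files notes and blob are invented for the demonstration. Two details go beyond the one-liner above and are this sketch's own choices: the -f test recommended by the documentation is added, and the script exits early if no file survives the filter, since an empty @ARGV would make perl read standard input instead.

```shell
demo=$(mktemp -d)
printf 'see /some/text/to/replace here\n'   > "$demo/notes"   # text file
printf '\0\001\002/some/text/to/replace\0'  > "$demo/blob"    # has zero bytes
cp "$demo/blob" "$demo/blob.orig"

(
    cd "$demo" || exit 1
    perl -i -pe '
        BEGIN {
            @ARGV = grep { -f $_ && -T $_ } @ARGV;   # keep plain text files only
            exit 0 unless @ARGV;                     # otherwise perl would read STDIN
        }
        s#/some/text/to/replace#/replacement/text#;
    ' -- *
)

cat "$demo/notes"
cmp -s "$demo/blob" "$demo/blob.orig" && echo "blob left untouched"
```

Only notes should be rewritten; blob contains zero bytes, so -T rejects it and perl never opens it for in-place editing.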

How can I check if a file exists in Perl?

I have a relative path
$base_path = "input/myMock.TGZ";
myMock.TGZ is the file name located in input folder.
The filename can change. But the path is always stored in $base_path.
I need to check if the file exists in $base_path.
Test whether something exists at a given path using the -e file-test operator.
print "$base_path exists!\n" if -e $base_path;
However, this test is probably broader than you intend. The code above will generate output if a plain file exists at that path, but it will also fire for a directory, a named pipe, a symlink, or a more exotic possibility. See the documentation for details.
Given the extension of .TGZ in your question, it seems that you expect a plain file rather than the alternatives. The -f file-test operator asks whether a path leads to a plain file.
print "$base_path is a plain file!\n" if -f $base_path;
The perlfunc documentation lists Perl's file-test operators, which cover many situations you will encounter in practice.
-r File is readable by effective uid/gid.
-w File is writable by effective uid/gid.
-x File is executable by effective uid/gid.
-o File is owned by effective uid.
-R File is readable by real uid/gid.
-W File is writable by real uid/gid.
-X File is executable by real uid/gid.
-O File is owned by real uid.
-e File exists.
-z File has zero size (is empty).
-s File has nonzero size (returns size in bytes).
-f File is a plain file.
-d File is a directory.
-l File is a symbolic link (false if symlinks aren’t supported by the file system).
-p File is a named pipe (FIFO), or Filehandle is a pipe.
-S File is a socket.
-b File is a block special file.
-c File is a character special file.
-t Filehandle is opened to a tty.
-u File has setuid bit set.
-g File has setgid bit set.
-k File has sticky bit set.
-T File is an ASCII or UTF-8 text file (heuristic guess).
-B File is a “binary” file (opposite of -T).
-M Script start time minus file modification time, in days.
-A Same for access time.
-C Same for inode change time (Unix, may differ for other platforms)
You might want a variant of exists ... perldoc -f "-f"
-X FILEHANDLE
-X EXPR
-X DIRHANDLE
-X A file test, where X is one of the letters listed below. This unary operator takes one argument,
either a filename, a filehandle, or a dirhandle, and tests the associated file to see if something is
true about it. If the argument is omitted, tests $_, except for "-t", which tests STDIN. Unless
otherwise documented, it returns 1 for true and '' for false, or the undefined value if the file
doesn’t exist. Despite the funny names, precedence is the same as any other named unary operator.
The operator may be any of:
-r File is readable by effective uid/gid.
-w File is writable by effective uid/gid.
-x File is executable by effective uid/gid.
-o File is owned by effective uid.
-R File is readable by real uid/gid.
-W File is writable by real uid/gid.
-X File is executable by real uid/gid.
-O File is owned by real uid.
-e File exists.
-z File has zero size (is empty).
-s File has nonzero size (returns size in bytes).
-f File is a plain file.
-d File is a directory.
-l File is a symbolic link.
-p File is a named pipe (FIFO), or Filehandle is a pipe.
-S File is a socket.
-b File is a block special file.
-c File is a character special file.
-t Filehandle is opened to a tty.
-u File has setuid bit set.
-g File has setgid bit set.
-k File has sticky bit set.
-T File is an ASCII text file (heuristic guess).
-B File is a "binary" file (opposite of -T).
-M Script start time minus file modification time, in days.
if (-e $base_path)
{
# code
}
-e is the 'existence' operator in Perl.
You can check permissions and other attributes using the code on this page.
Use:
if (-f $filePath)
{
# code
}
-e returns true even if the file is a directory. -f will only return true if it's an actual file.
You can use -e; this would do the trick:
if (-e $base_path)
{
print "Something";
}
#!/usr/bin/perl
use strict;
use warnings;
my $fileToLocate = '/whatever/path/for/file/you/are/searching/MyFile.txt';
if (-e $fileToLocate) {
print "File is present\n";
}
Use the code below. Here -f checks whether it's a plain file:
print "File $base_path exists!\n" if -f $base_path;