Perl System Grep and Paste - perl

I have a file that looks like this:
Dog
Bulldog
Terrier

Cat
Persian

Ape
Gorilla

Dog
Pitbull
Lab
Shepard
Husky
I want to be able to search for each line containing dog and select everything until the next empty line and put it into a new file.
I want an output file like:
Dog
Bulldog
Terrier

Dog
Pitbull
Lab
Shepard
Husky
I know I can use grep to find the word dog, but how can I use it (or what can I combine it with) so that it grabs everything after the match until the next empty line and writes it to another file?
I am writing a script in Perl to do this because there are other things I wish to add on that are made easier with Perl. I was going to use system(grep....) to find the word but I wasn't sure what to do after that.
I will also note that I want to be able to do this recursively. I have many files that look like what I had shown and I would like to extract the Dog block from all of them. So it would be something recursive from the directory.

perl -ne 'print if /^Dog/../^$/' file
The .. and ... range operators in Perl can join two conditionals (in scalar context they act as a flip-flop). From the line where the first conditional evaluates true up to and including the line where the second evaluates true, the joined conditional evaluates true. So you want to print from the time $_ =~ m/^Dog/ is true until $_ =~ m/^$/ is true. The above is shorthand for that.
The distinction between .. vs ... is not important here because in this case, the conditionals cannot both be true on the same line.

If you can use awk, this is straightforward. Setting the record separator RS to an empty string puts awk in paragraph mode, where each blank-line-separated block is one record. Test whether the block starts with Dog and, if it does, take the default action, which prints the block.
awk '/^Dog/' ORS="\n\n" RS="" file
Dog
Bulldog
Terrier

Dog
Pitbull
Lab
Shepard
Husky

Related

Using sed, prepend line only once, if there's a match later in file content

I'd like to add a line on top of my output if my input file has a specific word.
However, if I'm just looking for specific string, then as I understand it, it's too late. The first line is already in the output and I can't prepend to it anymore.
Here's an example of input.
one
two
two
three
If I can find a line with, say, the word two, I'd like to add a new line before the first one, with for example FOUND. I want that line prepended only once, even if there are several matches.
So an input file without any two would remain unchanged, and the example file above would become:
FOUND
one
two
two
three
I know how to prepend with i\, but can't get the context right. From what I understood that would be around:
1{
/two/{ # This searches for "two" only in the first line; how do I look for it in the whole file?
1i\
FOUND
}
}
EDIT:
I know how to do it using other languages/methods, that's not my question.
Sed has advanced features to work on several lines at once, append/prepend lines and is not limited to substitution. I have a sed file already filled with expressions to modify a python source file, which is why I'd prefer to avoid using something else. I want to be able to add an import at the beginning of a file if a certain class is used.
A Perl solution:
perl -i.bak -0777 -pE 'say "FOUND" if /two/;' in_file
The Perl one-liner uses these command line flags:
-p : Loop over the input one record at a time (a line by default; the whole file here, because of -0777), assigning it to $_. Print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
-E : Tells Perl to look for code in-line, instead of in a file. Also enables all optional features. Here, enables say.
-0777 : Slurp files whole.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
sed is for doing s/old/new/ on individual strings. That's not what you're trying to do, so you shouldn't bother trying to use sed. There are lots of ways to do this; this one is efficient, robust and portable to all Unix systems:
$ grep -Fq 'two' file && echo "FOUND"; cat file
FOUND
one
two
two
three
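To operate on a stream instead of (or in addition to) a file, buffering only the lines up to the first match rather than reading the whole input into memory (if nothing matches, the END block prints the buffered input unchanged):
awk 'f{print; next} {buf[NR]=$0} /two/{print "FOUND"; for (i=1;i<=NR;i++) print buf[i]; f=1} END{if (!f) for (i=1;i<=NR;i++) print buf[i]}'
e.g.:
$ cat file | awk 'f{print; next} {buf[NR]=$0} /two/{print "FOUND"; for (i=1;i<=NR;i++) print buf[i]; f=1} END{if (!f) for (i=1;i<=NR;i++) print buf[i]}'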
FOUND
one
two
two
three
That awk script will also work using any awk in any shell on every Unix box.
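Since the question asks for sed specifically, here is one possible sketch; prepend_found is a hypothetical name, and this is not taken from any of the answers above. It collects the whole input in the hold space, so unlike the awk version it reads everything into memory before printing anything:

```shell
# Hypothetical helper: print stdin (or the named files) with a FOUND line
# on top if any line contains "two"; otherwise print the input unchanged.
prepend_found() {
    # 1h   : first line  -> copy it into the hold space
    # 1!H  : other lines -> append them to the hold space
    # $!d  : not the last line yet -> print nothing, start the next cycle
    # x    : last line   -> swap the collected text into the pattern space
    # /two/s/^/FOUND\<newline>/ : put "FOUND" and a newline at the very start
    sed '1h;1!H;$!d;x;/two/s/^/FOUND\
/' "$@"
}

printf 'one\ntwo\ntwo\nthree\n' | prepend_found
```

The backslash followed by a literal newline in the replacement is the portable way to insert a line break; some sed implementations also accept \n there, but that is not guaranteed everywhere.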

Removing text with command line?

I have a huge list of locations in this form in a text file:
ar,casa de piedra,Casa de Piedra,20,,-49.985133,-68.914673
gr,riziani,Ríziani,18,,39.5286111,20.35
mx,tenextepec,Tenextepec,30,,19.466667,-97.266667
Is there any way with command line to remove everything that isn't between the first and second commas? For example, I want my list to look like this:
casa de piedra
riziani
tenextepec
With Perl:
perl -F/,/ -ane 'print $F[1]."\n"' file
Use cut(1):
cut -d, -f2 inputfile
With perl:
perl -pe 's/^.*?,(.*?),.*/$1/' filename
Breakdown of the above code
perl - the command to use the perl programming language.
-pe - flags.
e means "run this as perl code".
p means:
Set $_ to the first line of the file (given by filename)
Run the -e code
Print $_
Repeat from step 1 with the next line of the file
What -p actually does behind the scenes is documented in perldoc perlrun.
s/^.*?,(.*?),.*/$1/ is a substitution built on a regular expression:
s/pattern/replacement/ looks for pattern in $_ and replaces it with replacement
^ anchors the match at the start of the line
.*? means "anything, but as little as possible" (a non-greedy match)
, is a comma (nothing special)
() capture whatever is in them and save it in $1
.* is another "anything", this time greedy (as much as possible), so it's more like "everything"
$1 is what we captured with ()
so the whole thing basically says to search in $_ for:
anything
a comma
anything (save this bit)
another comma
everything
and replace it with the bit it saved. This effectively saves the stuff between the first and second commas, deletes everything, and then puts what it saved into $_.
filename is the name of your text file
To review, the code goes through your file line by line, applies the regular expression to extract your needed bit, and then prints it out.
If you want the result in a file, use this:
perl -pe 's/^.*?,(.*?),.*/$1/' filename > out.txt
and the result goes into a file named out.txt (that will be placed wherever your terminal is pointed to at the moment.) What this pretty much does is tell the terminal to print the command's result to a file instead of on the screen.
Also, if it isn't crucial to use the command line, you can just import into Excel (it's in CSV format) and work with it graphically.
With awk:
$ awk -F ',' '{ print $2 }' file

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of file so everything needs to be done quickly, i.e., no multiple readings of a line, definitely no loops (last time I tried that slowed it down several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Embedded commas are probably just the first obvious problem you have encountered while parsing those CSV files.
Future problems that might pop up are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespaces, empty fields, spaces around commas, undef values
I generally follow the philosophy that if there is a (reputable) module that parses the format you have to parse, you should use it instead of writing a homebrew parser.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Since it is written in C, I expect it to be very fast.
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo "$line" | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
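Before relying on this substitution for real data, it helps to run it on a few sample lines. The sketch below wraps the same sed expression in a hypothetical helper, mask_quoted_comma, and shows one case it handles and two it does not:

```shell
# The substitution from the answer above, wrapped in a hypothetical helper.
mask_quoted_comma() {
    sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g'
}

# One quoted field with one embedded comma: handled.
echo '1,"Smith, John",x' | mask_quoted_comma    # prints: 1,"Smith; John",x

# Two embedded commas in one field: only the first is replaced.
echo '"a, b, c",2' | mask_quoted_comma          # prints: "a; b, c",2

# Two quoted fields on one line: the pattern can start at a closing quote
# and rewrite a real field separator.
echo '"x",y,"z"' | mask_quoted_comma            # prints: "x";y,"z"
```

The last two cases are the reason a real CSV parser is the safer choice for files of this size and variety.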
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

Perl one-liner that works on multiple lines?

I have a file that contains pairs of lines that look like this:
FIRST PIECE OF INFO
PIECE 2 | PIECE 3 | PIECE 4 | ...
I need to output this:
FIRST PIECE OF INFO\tPIECE 2\tPIECE 3 ...
And I also need to do some more regexp magic on the lines themselves.
Can this be done using a perl one-liner? My problem here is that using -p will handle the file one line at a time, whereas I need to deal with two lines at a time. My solution was first running another one-liner that removes all linebreaks from the file (I had another separator between different pairs of lines) but this seems too cumbersome and I'm sure there's a better way.
Well, the simple solution is to turn all the newlines and pipes into tabs. It sounds a bit crazy, but at first glance, it does sound like what you want:
perl -pwe 'tr/\n|/\t\t/' yourfile.txt
But there is something that does not match up with your problem description. You say:
I have a file that contains pairs of lines
Which would mean that your file actually looks something like this:
FIRST PIECE OF INFO
PIECE 2 | PIECE 3 | PIECE 4 | ...
SECOND PIECE OF INFO
PIECE 2a | PIECE 3b | PIECE 4b | ...
THIRD... etc
In which case blindly transliterating the newlines would put everything on a single line. Now, my interpretation of this is that what you want is something like this (with tabs and newlines denoted literally):
FIRST PIECE OF INFO\tPIECE 2\tPIECE 3\tPIECE 4 | ...\n
SECOND PIECE OF INFO\tPIECE 2a\tPIECE 3b\tPIECE 4b | ...\n
This is not achieved with a simple transliteration.
perl -plwe 'next if !/\S/; $_ = join "\t", $_, split /\s*\|\s*/,<>;' file.txt
Note: The next if !/\S/; statement is only to prevent the stream from being paused at the end in case the file contains an odd number of lines. If it does, the file handle <> will try to read from STDIN, and you will need to press Ctrl-D to manually stop it.
The Data::Dumper output looks like this, with $Data::Dumper::Useqq = 1 showing whitespace characters:
$VAR1 = "FIRST PIECE OF INFO\tPIECE 2\tPIECE 3\tPIECE 4\t...\n";
$VAR1 = "SECOND PIECE OF INFO\tPIECE 2a\tPIECE 3b\tPIECE 4b\t...\n";
The one-liner for the above output looks like this, somewhat rewritten:
perl -MData::Dumper -nlwe '
$Data::Dumper::Useqq=1;
next if !/\S/;
$_ = join "\t", $_, split /\s*\|\s*/,<>;
print Dumper $_;' file.txt
I can't help you with the extra regexp magic without knowing what it is, but this will combine the lines as you describe:
perl -lne 'print join "\t", $_, split /\|/, <ARGV>' myfile
Yet another approach:
perl -pe'$"="\t";chomp;$_="@{[$_,split q(\|),<>]}"'

VIM 7.2 Scripting problem with `:perldo` and multiple expressions

Background task
To eliminate X-Y problems I'll say what I'm doing: I'm trying to use :perldo in VIM 7.2 to complete two tasks:
Clear all trailing whitespace, including (clearing not deleting) lines that only have whitespace
s/\s+$//;
Remove non-tab whitespace that exists before the first-non space character
s/^ (\s*) (?=\S) / s#[^\t]##g;$_ /xe;
I'd like to do this all with one pass. Currently, using :perldo, I can get this working with two passes. (by using :perldo twice)
The command should look like this:
:perldo s/\s+$//; s/^ (\s*) (?=\S) / s#[^\t]##g;$_ /xe;
Perl background
In order to understand this problem you must know a little bit about Perl: s/// automagically binds to the default variable $_, which the substitution is free to modify. Most core functions operate on $_ by default.
perl -e'$_="foo"; s/foo/bar/; s/bar/baz/; print' # will print baz
The assumption is that you can chain expressions using :perldo in VIM and that it will work logically.
VIM not being nice
Now my VIM problem is better demonstrated with code -- I've reduced it to a simple test. Open a new buffer and place the following text into it:
aa bb
aa
bb
Now run this :perldo s/a/z/; s/b/z/; The buffer now has:
za zb
aa
zb
Why was the first regex unsuccessful on the second row, and yet the second regex was successful by itself, and on the first row?
It appears the whole Perl expression you pass to :perldo must return a true / defined value, or the results are discarded, per-line.
Try this, nothing happens on any line:
:perldo s/a/z/; s/b/z/; 0
Try this, it works on all 3 lines as expected:
:perldo s/a/z/; s/b/z/; 1
An example in the :perldo documentation hints at this:
:perldo $_ = reverse($_);1
but unfortunately it doesn't say explicitly what's going on.
Don't know what :perldo is doing exactly, but if you run something like
:perldo s/a/z/+s/b/z/
then you get something more like you'd expect.
Seems to me like only the last command is run on all lines in [range].