Printing lines if a specific column contains values >0 - perl

I have a tab-delimited .txt file in this format, containing numerous symbols, numerals and letters:
MUT 124 GET 288478 0 * = 288478 0
MUT 15 GET 514675 0 75MH = 514637 -113
MUT 124 GET 514637 0 75MH = 514675 113
I want to identify all lines that contain a >0 value in the 9th column (i.e. only the 3rd row above would be extracted) and then print column 4 + 9 from any matched lines.
Desired output (two column tab delimited .txt file):
514637 113
Is there a quick way to do this in the terminal/on the command line? If so, how?
I've only just begun to learn awk and perl so all my attempts so far have been nowhere near close. Not sure where to begin!

Easy in Perl
perl -lane 'print "$F[3]\t$F[8]" if $F[8] > 0' < input-file
-l appends a newline to everything you print
-a splits the input into the @F array
-n processes the input line by line

Can be done with the Perl one-liner:
$ perl -anE 'say join "\t", @F[3,8] if $F[8] > 0' data.txt
-n (non-autoprinting) - loop through lines, reading but not printing them
-a (auto-split) - split the input line stored in $_ into the @F array (whitespace is the default separator; change it with -F, e.g. -F:)
-E 'CODE' (execute) - execute 'CODE' enabling feature bundle (like use 5.010) for your version of Perl
See perlrun for more.

awk handles it almost automatically!
awk '$9>0 {print $4,$9}' file
If you need to specify the input and output separator, say:
awk 'BEGIN{FS=OFS="\t"} $9>0 {print $4,$9}' file
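As a quick check of the filter, here is a small sketch that feeds the three sample rows from the question through the tab-aware awk command. The function names `sample_rows` and `positive_col9` are made up for this illustration and are not part of awk.

```shell
# sample_rows and positive_col9 are hypothetical helper names.

# Rebuild the three sample rows from the question with real tab characters.
# printf reuses the format string for each group of nine arguments.
sample_rows() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        MUT 124 GET 288478 0 '*' = 288478 0 \
        MUT 15 GET 514675 0 75MH = 514637 -113 \
        MUT 124 GET 514637 0 75MH = 514675 113
}

# Print columns 4 and 9 (tab-separated) of every row whose 9th column
# is numerically greater than 0.
positive_col9() {
    awk 'BEGIN { FS = OFS = "\t" } $9 > 0 { print $4, $9 }'
}

sample_rows | positive_col9    # prints: 514637<TAB>113
```

Only the third row passes, because 0 and -113 both fail the `$9 > 0` test.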

Related

How to remove empty lines to one empty line between sentences in text files?

I have a text file with many empty lines between sentences. I tried sed, gawk and grep, but none of them worked. How can I do this? Thanks.
Myfile: Desired file:
a a
b b
c c
. .
d d
e e
f f
g g
. .
h
i
h j
i k
j .
k
.
You can use awk for this:
awk 'BEGIN{prev="x"}
/^$/ {if (prev==""){next}}
{prev=$0;print}' inputFile
or the compressed one liner:
awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}' inFl
This is a simple state machine that collapses multi-blank-lines into a single one.
The basic idea is this. First, set the previous line to be non-empty.
Then, for every line in the file, if it and the previous one are blank, just throw it away.
Otherwise, set the previous line to that value, print the line, and carry on.
Sample transcript, the following command:
$ echo '1
2
3
4
5
6
7
8
9
10' | awk 'BEGIN{p="x"}/^$/{if(p==""){next}}{p=$0;print}'
outputs:
1
2
3
4
5
6
7
8
9
10
Keep in mind that this is for truly blank lines (no content). If you're trying to collapse lines that have an arbitrary number of spaces or tabs, that will be a little trickier.
In that case, you could pipe the file through something like:
sed 's/^\s*$//'
to ensure lines with just whitespace become truly empty.
In other words, something like:
sed 's/^\s*$//' infile | awk 'my previous awk command'
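The two steps above (normalise whitespace-only lines, then squeeze runs of empty lines) can be tried on input that really contains blank runs. This is only a sketch: `squeeze_blank` is a hypothetical name, and `[[:space:]]` is used in place of `\s` on the assumption that a POSIX-only sed may be in use.

```shell
# squeeze_blank is a hypothetical helper name for this illustration.
squeeze_blank() {
    # Step 1: turn whitespace-only lines into truly empty lines.
    # Step 2: drop any empty line whose predecessor was also empty.
    sed 's/^[[:space:]]*$//' |
        awk 'BEGIN { prev = "x" } /^$/ && prev == "" { next } { prev = $0; print }'
}

# Three empty lines after "a"; a space+tab line and an empty line after "b".
printf 'a\n\n\n\nb\n \t\n\nc\n' | squeeze_blank
```

With this input the output is `a`, one empty line, `b`, one empty line, `c`.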
To suppress repeated empty output lines with GNU cat:
cat -s file1 > file2
Here's one way using sed:
sed ':a; N; $!ba; s/\n\n\+/\n\n/g' file
Otherwise, if you don't mind a trailing blank line, all you need is:
awk '1' RS= ORS="\n\n" file
The Perl solution is even shorter:
perl -00 -pe '' file
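Both of these rely on "paragraph mode". A minimal sketch of the awk variant follows, with the variables passed through `-v` rather than as trailing assignments (a stylistic choice made here, not something the answer requires). `squeeze_paragraphs` is a hypothetical name.

```shell
# squeeze_paragraphs is a hypothetical helper name for this illustration.
# An empty RS puts awk in paragraph mode: each record is a block of
# consecutive non-empty lines, and any run of blank lines counts as a
# single separator. ORS="\n\n" writes each paragraph back followed by
# exactly one blank line (hence the trailing blank line at the end).
squeeze_paragraphs() {
    awk -v RS= -v ORS='\n\n' '1'
}

printf 'a\nb\n\n\n\nc\n\n\nd\n' | squeeze_paragraphs
```

Here the paragraphs `a`/`b`, `c` and `d` come out separated by single blank lines.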
You could also do it like this:
awk -v RS="\0" '{gsub(/\n\n+/,"\n\n");}1' file
Explanation:
RS="\0" Once we set the null character as the Record Separator value, awk will read the whole file as a single record.
gsub(/\n\n+/,"\n\n"); this replaces one or more blank lines with a single blank line. Note that the \n\n regex matches a blank line along with the previous line's newline character.
Here is another awk
awk -v p=1 'p=="" {p=1;next} 1; {p=$0}' file

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
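To see why greediness matters here, a short sketch: because `.*` inside the group grabs as much as it can, the `-` in the pattern lines up with the last hyphen on the line, so only the final field is dropped. `strip_last_field` is a hypothetical name.

```shell
# strip_last_field is a hypothetical helper name for this illustration.
# \(.*\) is greedy, so the literal "-" matches the LAST hyphen on the line;
# everything from that hyphen onwards is discarded. Lines with no hyphen
# do not match and pass through unchanged.
strip_last_field() {
    sed 's/\(.*\)-.*/\1/'
}

printf '%s\n' \
    New-England-Center-For-Children-L0000392290 \
    Test-L0000392334 | strip_last_field
# prints:
# New-England-Center-For-Children
# Test
```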
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
The Perl code for this would be: perl -nle 'print $1 if m{-.*?/(.*?-.*?)-}' file
We can break the Regex down to matching the following:
- matches the hyphen between the city and the state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the last_field-3 and last_field-2 separated by the delimiter -. Note that $NF stands for last parameter, hence $(NF-1) is the penultimate, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
What you want is every field of the hyphen-separated line except the last one. To get that, reverse the line, take all fields from the 2nd one onwards, and reverse the result back.
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern feature and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
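The same "print only when the regex matches, and print only the captured part" idea can also be sketched with sed instead of Perl: `-n` suppresses automatic printing and the `p` flag prints a line only if the substitution succeeded. This is a swapped-in technique, not the Perl answer itself, and `extract_name` is a hypothetical name.

```shell
# extract_name is a hypothetical helper name for this illustration.
# -n     : do not print lines automatically
# \(..\) : capture the text between the last usable "/" and "-L<digits>.htm"
# p      : print the result only when the substitution matched
extract_name() {
    sed -n 's|.*/\(.*\)-L[0-9]*\.htm.*|\1|p'
}

printf '%s\n' \
    '<urlset>' \
    '<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>' |
    extract_name
# prints: Special-Restaurant
```

The `<urlset>` line is an invented non-matching line, included only to show that unmatched input produces no output.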
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in-place but will keep a backup of the original data in a file called myfile.bak

multiline pattern delete with single line command

I would like to delete all empty segments in my file.
The empty segment can be specified by a pair of consecutive lines starting with START and ending with END. Valid segments will have some contents between lines starting with START and ending with END
Sample Input
Header
START arguments
END
Any contents
START arguments
...
something
...
END
Footer
Desired Output
Header
Any contents
START arguments
...
something
...
END
Footer
Here I'm looking for possible one liners. Any help would be appreciated.
Trials
I tried following awk. It works to some extent but it deletes START lines even in valid segments.
awk '/^START/ && getline && /^END$/ {next} 1' file
perl -00 -pe 's/START .*?\nEND//g' file
This is a better one.
The solution I gave earlier will discard a whole paragraph if the segments are not separated by blank lines.
Earlier response below:
how about this perl one liner ?
perl -00 -ne 'print if not /START .*\nEND/' file
read-in file in paragraph mode and discard lines matching START <string><newline>END
While people were suggesting nice solutions, I came up with an alternative solution using sed
sed '/^START/N;/^START.*END$/d' file
Or as suggested by @jthill
sed '/^START/N; /\nEND$/d' file
gawk only
awk -v RS='START[^\n]*\nEND\n' '{printf "%s", $0}' file.txt
Perhaps the following will be helpful:
perl -ne 'print /^START/?do{$x=<>;$_,$x if $x!~/^END/}:$_' inFile
Output on your dataset:
Header
Any contents
START arguments
...
something
...
END
Footer
$ awk '{rec = rec $0 RS} END{ gsub(/START[^\n]*\nEND\n/,"",rec); printf "%s", rec }' file
Header
Any contents
START arguments
...
something
...
END
Footer
/^START/ {
startline=$0
next
}
/^END$/ && startline {
startline=""
next
}
startline {
print startline
}
startline=""
1
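The buffering idea behind the script above (hold back each START line until the next line shows whether the segment is empty) can be sketched as a self-contained shell function. This is a variation written for illustration, not the answerer's exact script; the names `drop_empty_segments` and `held` are hypothetical.

```shell
# drop_empty_segments and the awk variable "held" are hypothetical names.
drop_empty_segments() {
    awk '
        # A START line is held back instead of printed. If one was already
        # pending (two START lines in a row), flush the older one first.
        /^START/ { if (held != "") print held; held = $0; next }

        # END directly after a held START: empty segment, drop both lines.
        /^END$/ && held != "" { held = ""; next }

        # Any other line after a held START: the segment has content,
        # so emit the START line before the current line.
        held != "" { print held; held = "" }

        { print }

        # A START on the very last line is still pending: emit it.
        END { if (held != "") print held }
    '
}

printf '%s\n' Header 'START arguments' END 'Any contents' \
    'START arguments' ... something ... END Footer | drop_empty_segments
```

On the sample input from the question this prints the desired output: the first START/END pair disappears and the second segment is kept intact.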

Reading line by line in perl

Suppose I have a text file with the following format
#ATDGGSGDTSG
NTCCCCC
+
#nddhdhnadn
#ATDGGSGDTSG
NTCCCCC
+
nddhdhnadn
Now it's a repeating pattern of 4 lines, and each time I want to print only the 2nd line, i.e. the line after the line starting with "#": the 2nd line, the 6th line, etc.
How can I do it?
perl -ne 'print if $b and !/^#/; $b=/^#/' file
With awk:
$ awk 'NR%4==2' a
NTCCCCC
NTCCCCC
NR is the number of records read so far, which here is the current line number. The condition NR%4==2 is true when the line number leaves a remainder of 2 on division by 4, i.e. for lines 2, 6, 10 and so on.
Update on your comment
And wat if I want the output to be > "nextline" NTCCCCC > "nextline"
NTCCCCC i.e. I want to add ">" before that line while redirecting the
output.
This way, for example:
$ awk 'NR%4==2 {print ">"; print $0}' a
>
NTCCCCC
>
NTCCCCC
Another example:
$ seq 30 | awk 'NR%4==2'
2
6
10
14
18
22
26
30
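The modulus test can be checked against the 4-line records from the question. This is a small sketch; `second_of_four` is a hypothetical name, and it assumes every record is exactly four lines long.

```shell
# second_of_four is a hypothetical helper name for this illustration.
# It assumes the input is made of fixed 4-line records and prints the
# 2nd line of each record (line numbers 2, 6, 10, ...).
second_of_four() {
    awk 'NR % 4 == 2'
}

printf '%s\n' \
    '#ATDGGSGDTSG' NTCCCCC + '#nddhdhnadn' \
    '#ATDGGSGDTSG' NTCCCCC + nddhdhnadn | second_of_four
# prints:
# NTCCCCC
# NTCCCCC
```

Because the test depends only on the line number, it is not confused by the 4th line of the first record, which also starts with `#`.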
awk '/^\#/{getline;print}' your_file
You could have a variable like $printNextLine and loop over all your input, setting it to 1 whenever you see a line starting with #. On each line, if the variable is 1, print the current line and set the variable back to 0.
Not as efficient or as short as the other answers, but maybe more intuitive for someone new to Perl.
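The flag idea described above can be sketched in awk (used here instead of Perl to stay with the tools in the other answers). The names `print_after_marker` and `print_next` are hypothetical. One caveat that follows from the logic: any line starting with `#` arms the flag, so a 4th line that happens to start with `#` would cause the following header line to be printed too. The sample below therefore uses 4th lines that do not start with `#`.

```shell
# print_after_marker and the awk variable print_next are hypothetical names.
print_after_marker() {
    awk '
        # Flag is set: this is the line after a marker. Print it, clear
        # the flag, and skip the marker test for this line.
        print_next { print; print_next = 0; next }

        # Marker line (starts with #): arm the flag for the next line.
        /^#/ { print_next = 1 }
    '
}

printf '%s\n' \
    '#ATDGGSGDTSG' NTCCCCC + nddhdhnadn \
    '#ATDGGSGDTSG' NTCCCCC + nddhdhnadn | print_after_marker
# prints:
# NTCCCCC
# NTCCCCC
```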
awk '/^\#/{getline;print}' file

How to specify column separator for `perl -a`?

I'm trying to read a text file (using perl) where each line has several records, like this:
r1c1 & r1c2 & r1c3 \\
r2c1 & r2c2 & r2c3 \\
So, & is the record separator.
The Perl help says this:
$ perl -h
-0[octal] specify record separator (\0, if no argument)
Why you would use octal number is beyond me. But 046 is the octal ASCII of the separator &, so I tried this:
perl -046 -ane 'print join ",", @F; print "\n"' file.txt
where the desired output would be
r1c1,r1c2,r1c3 \\
r2c1,r2c2,r2c3 \\
But it doesn't work. How do you do it right?
I think you are mixing two separate things. The record separator that -0 affects is what divides the input up into "lines". -a makes each "line" then be split into @F, by default on whitespace. To change what -a splits on, use the -F switch, like -F'&'.
When in doubt about the perl command-line options, look at perldoc perlrun on the command line.
Also, if you use the -l option (perl -F'&' -lane ...), it will remove the end-of-line (EOL) character from every line before passing it to your script and will add it back for each print, so you don't need to put "\n" in your code. The fewer characters in a one-liner the better.
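Putting -F, -a and -l together, here is a minimal sketch that produces the output asked for in the question. The pattern `\s*&\s*` is an assumption added here so that the spaces around each `&` are swallowed as well; with a plain -F'&' the fields would keep their surrounding spaces. `split_on_amp` is a hypothetical name.

```shell
# split_on_amp is a hypothetical helper name for this illustration.
split_on_amp() {
    # -F'\s*&\s*' : pattern that -a uses to split each line into @F
    # -l          : strip the newline on input, add it back on print
    # -a -n       : autosplit each input line, without printing it automatically
    perl -F'\s*&\s*' -lane 'print join ",", @F'
}

printf '%s\n' 'r1c1 & r1c2 & r1c3 \\' 'r2c1 & r2c2 & r2c3 \\' | split_on_amp
# prints:
# r1c1,r1c2,r1c3 \\
# r2c1,r2c2,r2c3 \\
```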