SAG challenge (sed, awk, grep): multi-pattern file filtering - sed

So my dear SOers, let me get straight to the point.
Specification: filter a text file using pairs of patterns.
Example: if we have a file:
line 1 blabla
line 2 more blabla
line 3 **PAT1a** blabla
line 4 blabla
line 5 **PAT1b** blabla
line 6 blabla
line 7 **PAT2a** blabla
line 8 blabla
line 9 **PAT2b** blabla
line 10 **PAT3a** blabla
line 11 blabla
line 12 **PAT3b** blabla
more and more blabla
should give:
line 3 **PAT1a** blabla
line 4 blabla
line 5 **PAT1b** blabla
line 7 **PAT2a** blabla
line 8 blabla
line 9 **PAT2b** blabla
line 10 **PAT3a** blabla
line 11 blabla
line 12 **PAT3b** blabla
I know how to filter only one part of it using 'sed':
sed -n -e '/PAT1a/,/PAT1b/{p}'
But how do I filter all the snippets? Do I need to write those pairs of patterns in a configuration file, read a pair from it, apply the sed command above, and then move on to the next pair...?
Note: assume that PAT1, PAT2, PAT3, etc. do not necessarily share a common prefix (such as the 'PAT' in this example), so don't rely on one.
One more thing: how do I make a newline in quoted text in this post without leaving a whole blank line?

I assumed the pattern pairs are given as a separate file. Then, when they appear in order in the input, you could use this awk script:
awk 'NR == FNR { a[NR] = $1; b[NR] = $2; next }
!s && $0 ~ a[i+1] { s = 1 }
s
s && $0 ~ b[i+1] { s = 0; i++ }' patterns.txt input.txt
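For reference, the assumed patterns.txt for the example above would hold one start/end pair per line, whitespace-separated (the surrounding ** markers are not needed, since ~ does a regex substring match):
PAT1a PAT1b
PAT2a PAT2b
PAT3a PAT3b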
And a more complicated version when the patterns can appear out of order:
awk 'NR == FNR { a[++n] = $1; b[n] = $2; next }
{ for (i = 1; !s && i <= n; i++) if ($0 ~ a[i]) s = i; }
s
s && $0 ~ b[s] { s = 0 }' patterns.txt input.txt

Awk.
$ awk '/[0-9]a/{o=$0;getline;$0=o"\n"$0;print;next}/[0-9]b/' file
line 3 PAT1a blabla
line 4 blabla
line 5 PAT1b blabla
line 7 PAT2a blabla
line 8 blabla
line 9 PAT2b blabla
line 10 PAT3a blabla
line 11 blabla
line 12 PAT3b blabla
Note: since you said the markers "share no common prefix", I key the regexes on a digit followed by a or b (/[0-9]a/ and /[0-9]b/) rather than on the prefix.

Use the b command to skip all lines between the patterns and the d command to delete all other lines:
sed -e '/PAT1a/,/PAT1b/b' -e '/PAT2a/,/PAT2b/b' -e '/PAT3a/,/PAT3b/b' -e d
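If you do want to drive this from a configuration file as you suggested, one possible sketch (assuming a whitespace-separated pairs file named patterns.txt as above, and patterns containing no spaces or shell metacharacters) is to build the sed expressions on the fly:
script=$(awk '{ printf "-e /%s/,/%s/b ", $1, $2 }' patterns.txt)
sed $script -e d input.txt   # $script left unquoted on purpose so each -e becomes its own argument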

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar', and because the column holds a ';'-separated string, using awk '$25=="foobar"' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed; use awk, the other tool common on so many machines, and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested, but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
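To see how that field pattern actually splits a line, here is a quick check against one of the test lines below (FPAT is a GNU awk feature):
$ echo "2 'kom ; foobar' 33" | awk -vFPAT="([^ ]*)|('[^']*')" '{ for (i = 1; i <= NF; i++) print i, $i }'
1 2
2 'kom ; foobar'
3 33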
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] allows only a semicolon, a single quote, or a space immediately before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command (using $2 here, since the test file has only three fields):
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines where the 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 the 25th field
== equal to
"foobar" the string to compare against
Since no action is specified, the complete line is printed, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file
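As a reduced illustration (two made-up sample lines, applying the substitution to field 2 instead of 25), only the exact-word match survives:
$ printf '%s\n' "max foobar john" "tom foobar35 ann" | sed -En 's/\S+/\n&\n/2;s/\n(.*\<foobar\>.*)\n/\1/p'
max foobar john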

Delete \n characters from line range in text file

Let's say we have a text file with 1000 lines.
How can we delete the newline characters from lines 20 to 500 (replacing them with a space, for example)?
My try:
sed '20,500p; N; s/\n/ /;' #better not to say anything
All other lines (1-19 && 501-1000) should be preserved as-is.
As I'm only familiar with sed, awk or perl solutions are welcome too, but please give an explanation with them, since I'm a perl and awk newbie.
You could use something like this (my example is on a slightly smaller scale :-)
$ cat file
1
2
3
4
5
6
7
8
9
10
$ awk '{printf "%s%s", $0, (2<=NR&&NR<=5?FS:RS)}' file
1
2 3 4 5 6
7
8
9
10
The second %s in the printf format specifier is replaced by either the Field Separator (a space by default) or the Record Separator (a newline) depending on whether the Record Number is within the range.
Alternatively:
$ awk '{ORS=(2<=NR&&NR<=5?FS:RS)}1' file
1
2 3 4 5 6
7
8
9
10
Change the Output Record Separator depending on the line number and print every line.
You can pass variables for the start and end if you want, using awk -v start=2 -v end=5 '...'
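Spelled out, that parameterised variant of the second one-liner might look like this (same ORS trick, just with variables):
$ awk -v start=2 -v end=5 '{ORS=(start<=NR && NR<=end ? FS : RS)}1' file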
This might work for you (GNU sed):
sed -r '20,500{N;s/^(.*)(\n)/\2\1 /;D}' file
or perhaps more readably:
sed ':a;20,500{N;s/\n/ /;ta}' file
Using a perl one-liner to strip the newline:
perl -i -pe 'chomp if 20..500' file
Or to replace it with a space:
perl -i -pe 's/\R/ / if 20..500' file
Explanation:
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on the command line.
Code:
chomp: Removes the trailing newline
20 .. 500: the range operator ..; it is true while the line number is between 20 and 500
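On a ten-line sample (a hypothetical run using the range 2..5, and without -i so nothing is overwritten), the chomp one-liner behaves like this:
$ seq 10 | perl -pe 'chomp if 2..5'
1
23456
7
8
9
10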
Here's a perl version:
my $min = 5; my $max = 10;
while (<DATA>) {
    if ($. > $min && $. < $max) {
        chomp;
        $_ .= " ";
    }
    print;
}
__DATA__
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Output:
1
2
3
4
5
6 7 8 9 10
11
12
13
14
15
It reads in DATA (which you can set to being a filehandle or whatever your application requires), and checks the line number, $.. While the line number is between $min and $max, the line ending is chomped off and a space added to the end of the line; otherwise, the line is printed as-is.
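To run the same logic directly against a file instead of the built-in DATA block, a one-liner sketch could be (keeping the answer's $min/$max values; the file name is a placeholder):
perl -ne 'BEGIN { ($min, $max) = (5, 10) } if ($. > $min && $. < $max) { chomp; $_ .= " " } print' file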

Get multi-line text in between horizontal delimiter with sed / awk

I would like to get the multi-line text between horizontal delimiters and ignore anything before the first delimiter and after the last one.
An example would be:
Some text here before any delimiter
----------
Line 1
Line 2
Line 3
Line 4
----------
Line 1
Line 2
Line 3
Line 4
----------
Some text here after last delimiter
And I would like to get
Line 1
Line 2
Line 3
Line 4
Line 1
Line 2
Line 3
Line 4
How do I do this with awk / sed with regex? Thanks.
You can try this.
file: a.awk:
BEGIN { RS = "-+" }
{
if ( NR > 1 && RT != "" )
{
print $0
}
}
run: awk -f a.awk data_file
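Note that RT is specific to GNU awk. With gawk the same logic also fits on one line by assigning RS among the arguments:
gawk 'NR > 1 && RT != ""' RS='-+' data_file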
If you can comfortably fit the entire file into memory, and if Perl is acceptable instead of awk or sed,
perl -0777 -pe 's/\A.*?\n-{10}\n//s;
s/(.*\n)-{10}\n.*?\Z/\1/s;
s/\n-{10}\n/\n\n\n/g' file >newfile
The main things to note here are the -0777 option (slurp mode) and the /s (dot matches newline) regex flag.
This might work for you:
sed '1,/^--*$/d;:a;$!{/\(^\|\n\)--*$/!N;//!ba;s///p};d' file

Splitting file based on variable

I have a file with several lines of the following:
DELIMITER ;
I want to create a separate file for each of these sections.
The man page of the split command does not seem to have such an option.
The split command only splits a file into pieces of a fixed byte size or line count (except possibly the last piece).
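For instance, a fixed line-count split would look like this (100 is just an illustrative size):
split -l 100 input.txt part_
which creates part_aa, part_ab, and so on, each 100 lines long; there is no option to split on a pattern.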
However, awk is perfect for your type of problem. Here's a solution example.
Sample input
1
2
3
DELIMITER ;
4
5
6
7
DELIMITER ;
8
9
10
11
awk script split.awk
#!/usr/bin/awk -f
BEGIN {
    n = 1;
    outfile = n;
}
{
    # FILENAME is undefined inside the BEGIN block
    if (outfile == n) {
        outfile = FILENAME n;
    }
    if ($0 ~ /DELIMITER ;/) {
        n++;
        outfile = FILENAME n;
    } else {
        print $0 >> outfile;
    }
}
As pointed out by glenn jackman, the code also can be written as:
#!/usr/bin/awk -f
BEGIN {
    n = 1;
}
$0 ~ /DELIMITER ;/ {
    n++;
    next;
}
{
    print $0 >> FILENAME n;
}
The command-line form is more convenient if you only need this occasionally (give it the input file as an argument so that FILENAME is set):
awk -v x="DELIMITER ;" -v n=1 '$0 ~ x {n++; next} {print > FILENAME n}'
However, you can also save it in a script file, as above.
Test run
$ ls input*
input
$ chmod +x split.awk
$ ./split.awk input
$ ls input*
input input1 input2 input3
$ cat input1
1
2
3
$ cat input2
4
5
6
7
$ cat input3
8
9
10
11
The script is just a starting point. You probably have to adapt it to your personal needs and environment.

Concatenate Lines in Bash

Most command-line programs just operate on one line at a time.
Can I use a common command-line utility (echo, sed, awk, etc) to concatenate every set of two lines, or would I need to write a script/program from scratch to do this?
$ cat myFile
line 1
line 2
line 3
line 4
$ cat myFile | __somecommand__
line 1line 2
line 3line 4
sed 'N;s/\n/ /;'
Grab next line, and substitute newline character with space.
seq 1 6 | sed 'N;s/\n/ /;'
1 2
3 4
5 6
$ awk 'ORS=(NR%2)?" ":"\n"' file
line 1 line 2
line 3 line 4
$ paste - - < file
line 1 line 2
line 3 line 4
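By default paste joins the two lines with a tab; to join them with nothing in between, as in the question's expected output, the POSIX '\0' delimiter (which stands for an empty string, not a NUL byte) should work:
$ paste -d '\0' - - < file
line 1line 2
line 3line 4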
Not a particular command, but this snippet of shell should do the trick:
cat myFile | while read line; do echo -n $line; [ "${i}" ] && echo && i= || i=1 ; done
You can also use Perl as:
$ perl -pe 'chomp;$i++;unless($i%2){$_.="\n"};' < file
line 1line 2
line 3line 4
Here's a shell script version that doesn't need to toggle a flag:
while read line1; do read line2; echo $line1$line2; done < inputfile
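A slightly more defensive variant of the same loop (quoting the expansions and using IFS= with read -r so leading whitespace and backslashes survive) might be:
while IFS= read -r line1; do IFS= read -r line2; printf '%s%s\n' "$line1" "$line2"; done < inputfile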