sed remove lines with any alphabetical order - sed

I'm trying to remove all lines that contain any 3 characters in alphabetical order with sed. Is there an easy way to do this instead of listing a bunch of patterns?
sed -i '/abc/d
/bcd/d
....
/xyz/d' file.txt

Instead of writing out every combination of consecutive letters as in your attempt, please try the following awk code. IMHO awk will be much more efficient than sed here.
awk '
BEGIN{
FS=""
num=split("a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z",arr1,",")
for(i=1;i<=num;i++){ letters[arr1[i]]=i }
}
{
for(i=1;i<=NF;i++){
if(($i in letters) && ($(i+1) in letters) && ($(i+2) in letters)\
&& (letters[$i]+1==letters[$(i+1)]) && (letters[$i]+2==letters[$(i+2)])\
&& (letters[$(i+1)]+1==letters[$(i+2)])){
print $i $(i+1) $(i+2)
}
}
}
' Input_file
Explanation of the whole awk program:
BEGIN block:
Set the field separator (FS) to the empty string so that every character becomes its own field and can be compared with its neighbours (this is supported by GNU awk but not specified by POSIX).
Use split to fill an array named arr1 with the lowercase letters, using , as the delimiter.
Loop from 1 to num (this could also be written as 26, since the alphabet is fixed) and build an array named letters whose index is a letter and whose value is that letter's position in the alphabet (e.g. 1 for a).
Main block:
Loop over every field (character) of the current line, from 1 to NF.
For each position, check that the current field and the next 2 fields are all in the letters array and that their positions are consecutive.
If all conditions are met, print the current field and the next 2 fields (i.e. the 3 letters).
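The program above prints each three-letter run it finds rather than removing the line. As a minimal sketch of the deletion the question asks for, the function below (drop_alpha_runs is a name made up for this illustration, not part of the original answer) skips any line in which some 3-character substring also occurs in the lowercase alphabet. It assumes only lowercase runs count, and it relies on index() and substr(), which are standard awk functions.

```shell
# drop_alpha_runs: hypothetical helper, not part of the original answer.
# Prints only the lines that contain no run of 3 consecutive lowercase
# letters in alphabetical order (abc, bcd, ..., xyz).
drop_alpha_runs() {
  awk '
    BEGIN { alphabet = "abcdefghijklmnopqrstuvwxyz" }
    {
      # Slide a 3-character window over the line; a window that is a
      # substring of the alphabet is a consecutive run, so skip the line.
      for (i = 1; i + 2 <= length($0); i++)
        if (index(alphabet, substr($0, i, 3)) > 0)
          next
      print
    }
  ' "$@"
}

# Demo on made-up sample lines.
printf '%s\n' 'xxabcxx' 'hello' 'wxyz' 'acb' | drop_alpha_runs
```

Called with no arguments it filters standard input; file names can be passed instead, e.g. drop_alpha_runs file.txt.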

This might work for you (GNU sed):
sed -En '1{x;s/^/abcdefghijklmnopqrstuvwxyz/;x};G;/(...).*\n.*\1/!P' file
On the first line, introduce a literal alphabet in the hold space.
On each line, append the alphabet and, using a three-character back reference, compare the line to the alphabet.
If there is a match, delete the line; otherwise, print the first line only.
N.B. The use of the -n turns off implicit printing and thus only when a match fails is the line printed.

GREP Print Blank Lines For Non-Matches

I want to extract strings between two patterns with GREP, but when no match is found, I would like to print a blank line instead.
Input
This is very new
This is quite old
This is not so new
Desired Output
is very
is not so
I've attempted:
grep -o -P '(?<=This).*?(?=new)'
But this does not preserve the blank second line in the above example. I have searched for over an hour and tried a few things, but nothing has worked out.
I will happily use a solution in sed if that's easier!
You can use
#!/bin/bash
s='This is very new
This is quite old
This is not so new'
sed -En 's/.*This(.*)new.*|.*/\1/p' <<< "$s"
See the online demo yielding
is very
is not so
Details:
E - enables POSIX ERE regex syntax
n - suppresses default line output
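s/.*This(.*)new.*|.*/\1/ - finds any text, This, any text (captured into Group 1, \1), and then any text again, or else the whole string (in sed, the line), and replaces the match with the Group 1 value. When only the second alternative matches, Group 1 is empty, so the line becomes blank.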
p - prints the result of the substitution.
And this is what you need for your actual data:
sed -En 's/.*"user_ip":"([^"]*).*|.*/\1/p'
See this online demo. The [^"]* matches zero or more chars other than a " char.
With your shown samples, please try following awk code.
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} NF!=3{print ""}' Input_file
OR
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} {print ""}' Input_file
Explanation: This\\s+ OR \\s+new is set as the field separator for all lines of Input_file (\s is a GNU awk extension). In the main program, if NF (the number of fields) is 3, the 2nd field is printed and next moves on to the next line. Otherwise (NF is not equal to 3), a blank line is printed.
sed:
sed -E '
/This.*new/! s/.*//
s/.*This(.*)new.*/\1/
' file
first line: for lines not matching "This.*new", remove all characters, leaving a blank line
second line: for lines matching the pattern, keep only the "middle" text
Note that this is not the PCRE non-greedy match: the line
This is new but that is not new
will produce the output
is new but that is not
To continue to use PCRE, use perl:
perl -lpe '$_ = /This(.*?)new/ ? $1 : ""' file
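If neither grep -P nor perl is available, the shortest-match behaviour can also be approximated with plain string functions. The sketch below is an illustration under stated assumptions, not one of the original answers: first_between is a hypothetical helper that treats the two markers as fixed strings (not regexes) and prints the text between the first left marker and the first right marker after it, or a blank line when either is missing.

```shell
# first_between LEFT RIGHT: hypothetical helper reading standard input.
first_between() {
  awk -v left="$1" -v right="$2" '
    {
      out = ""
      s = index($0, left)                 # position of the first LEFT marker
      if (s > 0) {
        rest = substr($0, s + length(left))
        e = index(rest, right)            # first RIGHT marker after it
        if (e > 0)
          out = substr(rest, 1, e - 1)
      }
      print out                           # blank line when there is no match
    }
  '
}

printf '%s\n' 'This is very new' 'This is quite old' 'This is new but that is not new' |
  first_between This new
```

Like the lookaround pattern in the question, this keeps the spaces adjacent to the markers; for the third sample line it stops at the first new rather than the last.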
This might work for you:
sed -E 's/.*This(.*)new.*|.*/\1/' file
If the first match is made, the line is replaced by everything between This and new.
Otherwise the second match will remove everything.
N.B. The substitution will always match one of the conditions. The solution was suggested by Wiktor Stribiżew.

remove last delimiter in sed/awk/perl

An input file is given, each line of which contains delimited data with an extra delimiter at the end of the data and/or header lines, with or without enclosures.
The extra delimiter at the end may or may not be followed by spaces.
Scenario 1 : Header & Data contain extra delimiter at the end
eno|ename|address|
A|B|C|
D|E|F|
Scenario 2 : Header doesn't contain extra delimiter at the end
eno|ename|address
A|B|C|
D|E|F|
Scenario 3 : With enclosures
eno|ename|address|
1|2|"A"|
Final output has to be like
Scenario 1 :
eno|ename|address
A|B|C
D|E|F
Scenario 2 :
eno|ename|address
A|B|C
D|E|F
Scenario 3 :
eno|ename|address
1|2|"A"
This is the solution I have tried so far, but it doesn't work for all three scenarios. Is there a single command in sed/awk/Perl that supports all three?
perl -pne 's/(.*)\|/$1/' filename
Could you please try following.
awk '{gsub(/\|$|\| +$/,"")} 1' Input_file
Explanation:
gsub is an awk function which globally substitutes the matched pattern with the given value.
Explanation of regex:
/\|$|\| +$/: There are 2 parts to this regex, separated by |. The first, \|$, removes a | at the end of the line; the second, \| +$, removes a | followed by spaces at the end of the line. So it takes care of both conditions.
perl -lpe 's/\|\s*$//' file
will do it. That only removes a pipe followed by optional whitespace at the end of each line. Note the $ line anchor.
I added the -l since each line's newline would otherwise get removed by the s/// command (\s* matches it), and -l will put it back.
All you need is this:
sed 's/|$//'
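The sed command above handles a bare trailing |, but the question also mentions a delimiter followed by spaces. A minimal sketch covering both cases follows; strip_trailing_pipe is a hypothetical name, and [[:space:]] is the POSIX character class, which also catches tabs and a stray carriage return.

```shell
# strip_trailing_pipe: hypothetical helper; reads files given as
# arguments, or standard input when there are none.
strip_trailing_pipe() {
  # In a basic regular expression | is a literal character, so it needs
  # no escaping here.
  sed 's/|[[:space:]]*$//' "$@"
}

# Demo covering the three scenarios from the question.
printf '%s\n' 'eno|ename|address' 'A|B|C|' 'D|E|F|  ' '1|2|"A"|' | strip_trailing_pipe
```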
A bit more generic. Let's assume you have the same problem, but with different field separators in different files. Some of these field separators are regular expressions (e.g. a sequence of blanks), others are just a single character c. With a tiny little awk program you can get far:
# remove_last_empty_field.awk
# 1. Get the correct `fs`
BEGIN { fs=FS; if(length(FS)==1) fs=(FS==" ") ? "[[:blank:]]+" : "["FS"]" }
# remove the empty field
{ sub(fs"$","") }
# Print the current record
1
Now you can run this on your various files as:
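$ awk -f remove_last_empty_field.awk f1.txt
$ awk -F'|' -f remove_last_empty_field.awk f2.txt
$ awk -F'[|.*]' -f remove_last_empty_field.awk f3.txt
The separator is passed with -F because an FS=... assignment placed among the file operands is only applied after the BEGIN block has run, so the BEGIN block would still see the default FS.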
perl -pi -e 's/\|$//' Your_FIle

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
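For small inputs, the same idea of dropping everything from the last hyphen can be sketched without an external tool, using the shell's suffix-removal expansion ${var%pattern}, which removes the shortest matching suffix. strip_last_field is a hypothetical name, and a read loop like this is likely to be much slower than sed or awk on a huge file.

```shell
# strip_last_field: hypothetical helper reading standard input.
strip_last_field() {
  # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # ${line%-*} removes the shortest suffix matching "-*", i.e. the
    # last hyphen and everything after it; lines without a hyphen are
    # printed unchanged.
    printf '%s\n' "${line%-*}"
  done
}

printf '%s\n' 'New-England-Center-For-Children-L0000392290' 'Test-L0000392334' |
  strip_last_field
```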
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
The perl code for this would be: perl -nle 'print $1 if m{-.*?/(.*?-.*?)-}'
We can break the Regex down to matching the following:
- matches the dash that's between the city and state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the fields at positions NF-3 and NF-2, separated by the delimiter -. Note that $NF is the last field, hence $(NF-1) is the penultimate, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
As what you want is every slice of - separated fields, let's get all of them but last one. How? By reversing the line, getting all of them from the 2nd one and then reversing back.
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern features and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in place but will keep a backup of the original data in a file called myfile.bak.

How to extract a string, number, or word from a line or database and save it to a variable? (script in bash)

My question can be split in 2. First I have a data file (file.dat) that looks like:
Parameter stuff number 1 (1029847) word index 2 (01293487), bla bla
Parameter stuff number 3 (134123) word index 4 (02983457), bla bla
Parameter stuff number 2 (109847) word index 3 (1029473), bla bla
etc...
I want to extract the number in brackets and save it to a variable for example the first one in line one to be 'x1', the second on the same line to be 'y1', for line 2 'x2' and 'y2', and so on... The numbers change randomly line after line, their position (in columns, if you like) stays the same line after line. The number of lines is variable (0 to 'n'). How can I do this? Please.
I have searched for answers and I get lost among the many different commands one can use; however, those answers address particular examples where the word is at the end, or in brackets but only one per line, etc. Anyhow, here is what I have done so far (I am a newbie):
1) I get rid of the characters that are not part of the number in the string
sed -i 's/(//g' file.dat
sed -i 's/),//g' file.dat
2) Out of frustration I decided to output the whole lines to variables (getting closer?)
2.1) Get the number of lines to iterate for:
numlines=$(wc -l < file.dat)
2.2) Loop to numlines (I havent tested this bit yet!)
for i in {1..$numlines}
do
line${!i}=$(sed -n "${numlines}p" file.dat)
done
2.3) I gave up here, any help appreciated.
The second question is similar and merely out of curiosity: imagine a database separated by spaces, or tabs, or comas, any separator; this database has a variable number of lines ('n') and the strings per line may vary too ('k'). How do I extract the value of the 'i'th line on the 'j'th string, and save it to a variable 'x'?
Here is a quick way to store value in bash array variable.
x=("" $(awk -F"[()]" '{printf "%s ",$2}' file))
y=("" $(awk -F"[()]" '{printf "%s ",$4}' file))
echo ${x[2]}
134123
If you are going to use these data for more jobs, I would have done it in awk. Then you can use internal array in awk
awk -F"[()]" '{x[NR]=$2;y[NR]=$4}' file
#!/usr/bin/env bash
x=()
y=()
while IFS= read -r line; do
x+=("$(sed 's/[^(]*(\([0-9]*\)).*/\1/' <<< "$line")")
y+=("$(sed 's/[^(]*([^(]*(\([0-9]*\)).*/\1/' <<< "$line")")
done < "data"
echo "${x[@]}"
echo "${y[@]}"
x and y are declared as arrays. Then you loop over the input file and invoke a sed command to every line in your input file.
x+=(data) appends the value data to the array x. Rather than writing the value literally, we use command substitution, $(command): the command is executed and its output (not its exit status) is what gets stored in the array.
Let's look at the sed commands:
's' is the substitute command, with [^(]* we want to match everything, except (, then we match (. The following characters we want to store in the array, to do that we use \( and \), we can later reference to it again (with \1). The number is matched with [0-9]*. In the end we match the closing bracket ) and everything else with .*. Then we replace everything we matched (the whole line), with \1, which is just what we had between \( and \).
If you are new to sed, this might be highly confusing, since it takes some time to read the sed syntax.
The second sed command is very similar.
How do I extract the value of the 'i'th line on the 'j'th string, and
save it to a variable 'x'?
Try using awk
x=$(awk -v i=$i -v j=$j ' NR==i {print $j; exit}' file.dat)
I want to extract the number in brackets and save it to a variable for
example the first one in line one to be 'x1', the second on the same
line to be 'y1', for line 2 'x2' and 'y2', and so on...
Using awk
x=($(awk -F'[()]' '{print $2}' file.dat))
y=($(awk -F'[()]' '{print $4}' file.dat))
x1 can be accessed as ${x[0]} and y1 as ${y[0]}, likewise for other sequence of variables.
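Both awk approaches above read the file once per array. As a sketch under the same assumption (exactly two parenthesised values per line), both values can be emitted in one pass and consumed by a read loop, which also works in shells that have no arrays. bracket_pairs is a hypothetical helper name.

```shell
# bracket_pairs: hypothetical helper. For each input line it prints the
# line number followed by the 1st and 2nd parenthesised values.
bracket_pairs() {
  awk -F'[()]' '{ print NR, $2, $4 }' "$@"
}

# Demo using two lines from the question; the values are read back as
# strings, so leading zeros are kept.
bracket_pairs <<'EOF' |
Parameter stuff number 1 (1029847) word index 2 (01293487), bla bla
Parameter stuff number 3 (134123) word index 4 (02983457), bla bla
EOF
while read -r n x y; do
  printf 'x%s=%s y%s=%s\n' "$n" "$x" "$n" "$y"
done
```

Because the loop sits on the right-hand side of a pipe, it typically runs in a subshell, so n, x and y are only usable inside the loop body.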

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } 1'
The trailing 1 prints every line, changed or not. Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it makes no difference (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; } 1'
pattern # pattern #replace# pattern
$
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-# characters, a #, another field of non-# characters, and remembers the lot; it then looks for a #, the pattern (which must be in the third field since the first two fields have been matched already), another #, and then the residue of the line. When the line matches, it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_, which has to be reassembled from the auto-split fields in the array @F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", @F; ' "$@"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array @F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil, the @ or $, changes depending on whether you're working with a value from the array or the array as a whole). The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", @F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
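The same kind of in-field edit can be sketched in awk, which rebuilds the line from its fields automatically. edit_field is a hypothetical helper, not part of the original answer; it assumes a single-character separator, no quoting or escaping in the data, and a replacement that contains no & or backslash (both are special in gsub).

```shell
# edit_field SEP N REGEX REPLACEMENT: hypothetical helper reading
# standard input. Replaces every match of REGEX inside field N only.
edit_field() {
  awk -F"$1" -v n="$2" -v re="$3" -v rep="$4" '
    BEGIN { OFS = FS }              # write fields back with the same separator
    n <= NF { gsub(re, rep, $n) }   # edit field n only, if the line has one
    { print }
  '
}

# Change every a to b in the third #-separated field.
printf '%s\n' '0001 # fish # animal # eats worms' | edit_field '#' 3 a b
```

Setting OFS once in BEGIN follows the suggestion made earlier in this answer; lines with fewer than N fields pass through unchanged.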
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#[^#]*\) - gather everything before the pattern, including the 2nd # and any text in the third field before the match - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
This might work for you:
echo '0001 # fish # animal # eats worms' |
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.