Awk show x but not y - sed

Is it possible to combine these into one line to be a little more efficient?
awk '$4 ~/^[x]/' raw.txt > x_raw.txt
awk '!/y/' x_raw.txt > xy_raw.txt
Is this possible with maybe perl?

You don't need to switch to perl to do a logical AND; just use the && operator. The following prints the lines where the fourth field starts with x and the line doesn't contain y:
awk '$4~/^x/&&!/y/' raw.txt > xy_raw.txt
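As a quick check of the combined condition, here is a minimal sketch run against made-up data (the three sample rows are hypothetical and not taken from raw.txt). Only the row whose fourth field starts with x and that contains no y anywhere should be printed:

```shell
# Hypothetical sample rows standing in for raw.txt.
result=$(printf '%s\n' 'a b c xenon' 'a b c xylem' 'a b c zinc' |
  awk '$4 ~ /^x/ && !/y/')
echo "$result"
```

The second row is rejected by !/y/ and the third by $4 ~ /^x/, so a single awk pass replaces the intermediate x_raw.txt file.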

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
    FS=OFS=""
}
{
    j=0
    for(i=0; ++i<=NF;)
        if($i==c && j++>=n) $i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, please try the following, which uses gensub. It is based entirely on the shown samples, where the OP wants to remove : from the 3rd occurrence onwards. gensub splits the line into two parts, and all colons are then removed from the second part (from the 3rd colon onwards), as per the OP's requirement.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
    firstPart=restPart=""
    firstPart=gensub(regex, "\\1\\2", "1", $0)
    restPart=gensub(regex,"\\3","1",$0)
    gsub(/:/,"",restPart)
    print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
See the online demo and the regex demo.
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
The replacement only runs if the current line starts with :account_id: (see if /^:account_id:/).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
See this online demo. Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.
If n is fixed and equal to 2, I would use GNU AWK in the following way. Let the content of file.txt be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator. This by itself removes every :, so I add back the ones that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested it solely on this data, so if you want to use it you should first test it with more possible inputs.
(tested in gawk 4.2.1)
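If n is not fixed, the same idea can be sketched with an explicit loop that re-inserts the separator only for the first n colons. This is a hedged generalization and not part of the answer above; the variable name n and the inline sample line are assumptions for illustration:

```shell
# Keep the first n colons, drop every later one (n is a hypothetical parameter).
result=$(printf '%s\n' ':account_id:12345:6789:Melbourne:Aus' |
  awk -v n=2 'BEGIN { FS = ":" }
  {
    out = $1
    # Field i is preceded by colon number i-1, so keep the colon while i-1 <= n.
    for (i = 2; i <= NF; i++)
      out = out (i - 1 <= n ? FS : "") $i
    print out
  }')
echo "$result"
```

Building the output in a separate variable avoids relying on OFS at all, so only POSIX awk features are used.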
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from third occurrence of the copy to the third occurrence of the amended line.
N.B. A newline is the best delimiter to use in sed, because the lines presented to sed's commands are initially devoid of newlines. The important property of the delimiter, however, is that it is unique, so any character can be used as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Replace a character with # (hash symbol) only in 5th & 6th field

I am trying to replace a character with # (hash symbol) only in the 5th & 6th fields.
E.g. I have to replace 'Z' with '#' only in the 5th & 6th fields (using a perl or awk script). The remaining fields containing the 'Z' symbol should not be affected.
(I'm updating the post to replace the double quote (") instead of Z with #. Can I achieve this? Thanks for the precious help.)
eg: i/p file:
aa",bb,ccc,ddd,eee",ddd",fff
aa1",ba1,ccc1,"ddd1,eee"1,ddd1,fff1
z,aa2,bb2",ccc2,ddd2","eee2",ddd2,fff2"
Expected O/p file:
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,#ddd1,eee#1,ddd1,fff1
aa2,bb2",ccc2,ddd2#,#eee2#,ddd2,fff2"
Thanks.
$ awk 'BEGIN{FS=OFS=","} {for (i=5;i<=6;i++) gsub(/Z/,"#",$i)} 1' file
x,aaZ,bb,ccc,ddd,eee#,dddZ,fff
y,aa1Z,ba1,ccc1,#ddd1,eee#1,ddd1,fff1
z,aa2,bb2Z,ccc2,ddd2#,#eee2,ddd2,fff2Z
Since it's only two fields, the loop can be omitted.
awk -F, -v OFS=, '{gsub(/Z/,"#",$5);gsub(/Z/,"#",$6)} 1' file
x,aaZ,bb,ccc,ddd,eee#,dddZ,fff
y,aa1Z,ba1,ccc1,#ddd1,eee#1,ddd1,fff1
z,aa2,bb2Z,ccc2,ddd2#,#eee2,ddd2,fff2Z
To replace " in fifth and sixth field:
awk -F, -v OFS=, '{gsub(/\"/,"#",$5);gsub(/\"/,"#",$6)} 1' file
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,"ddd1,eee#1,ddd1,fff1
z,aa2,bb2",ccc2,ddd2#,#eee2#,ddd2,fff2"
Here is a Perl way to do the job:
perl -anF, -e '$"=","; s/Z/#/ for @F[4,5];print"@F";' < in1.txt
If you have multiple Z in a field, you could use:
perl -anF, -e '$"=","; s/Z/#/g for @F[4,5];print"@F";' < in1.txt
Output:
aaZ,bb,ccc,ddd,eee#,ddd#,fff
aa1Z,ba1,ccc1,Zddd1,eee#1,ddd1,fff1
aa2,bb2Z,ccc2,ddd2Z,#eee2,ddd2,fff2Z
Edit according to comment:
in1.txt
aa",bb,ccc,ddd,eee",ddd",fff
aa1",ba1,ccc1,"ddd1,eee"1,ddd1,fff1
aa2,bb2",ccc2,ddd2","eee2,ddd2,fff2"
Command:
perl -anF'','' -e '$"=",";s/"/#/ for @F[4,5];print"@F";' < in1.txt
result:
aa",bb,ccc,ddd,eee#,ddd#,fff
aa1",ba1,ccc1,"ddd1,eee#1,ddd1,fff1
aa2,bb2",ccc2,ddd2",#eee2,ddd2,fff2"

How do I match multiple addresses in sed?

I want to execute some sed command for any line that matches either the AND or the OR of multiple commands: e.g., sed '50,70/abc/d' would delete all lines in range 50,70 that match /abc/, or a way to do sed -e '10,20s/complicated/regex/' -e '30,40s/complicated/regex/' without having to retype s/complicated/regex/
Logical-and
The and part can be done with braces:
sed '50,70{/abc/d;}'
Further, braces can be nested for multiple and conditions.
(The above was tested under GNU sed. BSD sed may differ in small but frustrating details.)
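A minimal sketch of nested braces, using made-up input: a line is deleted only if it is within lines 2-3 and matches abc and matches xyz. The newline-separated layout is chosen here because it tends to be the most portable across sed implementations:

```shell
# Delete a line only if it is in lines 2-3 AND matches abc AND matches xyz.
result=$(printf '%s\n' 'abc 1' 'abc xyz' 'xyz' 'abc xyz' |
  sed '2,3{
    /abc/{
      /xyz/d
    }
  }')
echo "$result"
```

Line 2 meets all three conditions and is removed. Line 3 lacks abc and line 4 is outside the range, so both survive.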
Logical-or
The or part can be handled with branching:
sed -e '10,20{b cr;}' -e '30,40{b cr;}' -e b -e :cr -e 's/complicated/regex/' file
10,20{b cr;}
For all lines from 10 through 20, we branch to label cr
30,40{b cr;}
For all lines from 30 through 40, we branch to label cr
b
For all other lines, we skip the rest of the commands.
:cr
This marks the label cr
s/complicated/regex/
This performs the substitution on lines which branched to cr.
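The branching steps above can be tried on toy input. In this sketch the ranges 1,2 and 4,5 and the substitution s/a/X/ are stand-ins chosen for illustration, not the original addresses:

```shell
# Five identical input lines; only lines 1-2 and 4-5 should be changed.
result=$(printf '%s\n' a a a a a |
  sed -e '1,2b cr' -e '4,5b cr' -e b -e :cr -e 's/a/X/')
echo "$result"
```

Line 3 hits the bare b and skips to the end of the script, so it is printed unchanged while the other four lines reach the substitution.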
With GNU sed, the syntax for the above can be shortened a bit to:
sed '10,20{b cr}; 30,40{b cr}; b; :cr; s/complicated/regex/' file
To delete lines from 10 to 20 and 30 to 40 matching your complicated regex with GNU sed:
sed -e '10,20bA;30,40bA;b;:A;/complicated/d' file
or:
sed -e '10,20bA' -e '30,40bA' -e 'b;:A;/complicated/d' file
bA: jump to label :A
b: a jump without label -> jump to end of script
d: delete line
I don't think sed has the facility for multiple selection criteria. My advice would be to step up to awk, where you can do something like:
awk 'NR >= 50 && NR <= 70 && /abc/ {next} {print}' inputFile
awk '(NR >= 10 && NR <= 20) || (NR >= 30 && NR <= 40) {
    sub("from-regex", "to-string", $0); print }' inputFile
sed is excellent for simple substitutions on individual lines but for anything else just use awk for clarity, robustness, portability, maintainability, etc...
awk '
(NR>=50 && NR<=70) && /abc/ { next }
(NR>=10 && NR<=20) || (NR>=30 && NR<=40) { sub(/complicated/,"regex") }
{ print }
' file

Keeping first character in string, in a specific single field

I am trying to remove all but the first character of a specific field in a .tab file. I want to keep only first character in fields 10 and 11.
Normally the fields have 35 characters in them, so I used:
awk '{gsub("..................................$","",$10); print}' file
However, there are some fields which have fewer than 35 characters, and these were ignored by this replace function. I tried using substring, but I cannot figure out how to make it field specific. I believe there is a way to use perl inside awk so that I can use the function
perl -pe 's/(.).*/$1/g'
but I am not sure how to do that and use the field as the input value, so the file comes out identical except for the altered field.
Is there a way to do the perl equivalent with gsub, or the awk equivalent with perl?
Help is appreciated!
One way using awk:
awk '{ for (i=10;i<=11;i++) { $i = substr( $i, 1, 1) } } { print }' infile
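Note that assigning to a field makes awk rebuild the record with OFS, which is a single space by default. If the input really is tab-delimited (an assumption based on the .tab extension in the question), a hedged variant that keeps the tabs might look like this; the 11-column sample row is made up:

```shell
# Hypothetical tab-separated row with 11 fields.
row=$(printf 'f1\tf2\tf3\tf4\tf5\tf6\tf7\tf8\tf9\tABCDEF\tGHIJ')
result=$(printf '%s\n' "$row" |
  awk 'BEGIN { FS = OFS = "\t" }
       { for (i = 10; i <= 11; i++) $i = substr($i, 1, 1) }
       1')
printf '%s\n' "$result"
```

Rows with fewer than 11 fields would be padded with empty fields by the assignment, so such input may need an extra NF check.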
Another way, using the gensub function of gawk:
gawk '{ for (i=10;i<=11;i++) { $i = gensub(/(.).*/ , "\\1", "g" , $i) } }1' infile
The shortest awk version I could figure out:
awk '($10=substr($10,1,1))&&$11=substr($11,1,1)' infile
If the 10th and/or 11th field does not exist, the line is not printed.
Similar version in perl
perl -ane '$F[9]=~s/(.).*/$1/;$F[10]=~s/(.).*/$1/;print "@F\n"' infile
This prints the line even if 10th and/or 11th field is not defined.
Another way with perl:
perl -pe '$c=0; s/(\S+)/(++$c < 10 || $c > 11) ? $1 : substr($1,0,1)/eg' filename

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or awk one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} captures every four characters
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a POSIX-compliant awk you can omit --posix. Older gawk releases need --posix (or --re-interval) to enable interval expressions such as .{5}, and since gawk seems to be the most commonly used implementation I've given the solution in terms of gawk.
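For awks without interval-expression support, the same rounded-up chunk width can be sketched with substr instead of a regex. This is an illustrative alternative rather than the answer's own method, and the variable name parts is an assumption:

```shell
result=$(printf '%s\n' 'ATGCATHLMNPHLNTPLML' |
  awk -v parts=4 '{
    w = int((length($0) + parts - 1) / parts)   # chunk width, rounded up
    out = ""
    for (i = 1; i <= length($0); i += w)
      out = out (i > 1 ? " " : "") substr($0, i, w)
    print out
  }')
echo "$result"
```

For the 19-character sample the width comes out as 5, which reproduces the requested output. Very short lines can yield fewer than parts chunks, because the width is rounded up.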
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop label
/^\n/bb if we reach a newline we are done and branch to the b label
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the line length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile