Replacing all occurrences after the nth occurrence in a line in Perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g flag combined with the occurrence number is not working as expected; instead, it replaces all the occurrences.

Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
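Splitting on an empty FS, as above, is behavior that POSIX leaves unspecified, so not every awk is guaranteed to break the line into single characters. A minimal portable sketch of the same idea, assuming a single-character target; the function name strip_after_nth is made up for illustration:

```shell
# strip_after_nth CHAR N   (hypothetical helper name)
# Reads lines on stdin and drops every CHAR after its Nth occurrence.
strip_after_nth() {
  awk -v c="$1" -v n="$2" '{
    out = ""; seen = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      ch = substr($0, i, 1)
      # keep the first n occurrences of c, skip the later ones
      if (ch == c && ++seen > n) continue
      out = out ch
    }
    print out
  }'
}

printf '%s\n' ':account_id:12345:6789:Melbourne:Aus' | strip_after_nth : 2
```

It walks the line with substr instead of relying on field splitting, so it should behave the same in gawk, mawk and BSD awk.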

With GNU awk, you could try the following, which uses gensub. It is based entirely on the shown samples, where the OP wants to remove : from the 3rd occurrence onwards. gensub splits each line into two parts at the 2nd colon, and gsub then removes all colons from the second part (from the 3rd colon onwards), as per the OP's requirement.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file

I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.

You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
The replacement only runs if the current line starts with :account_id: (see the if /^:account_id:/ condition).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.

I would use GNU AWK in the following way if n is fixed and equal to 2. Let the content of file.txt be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and an empty string as the output field separator. This by itself removes every :, so I add back the ones that have to be preserved: the 1st (before the second field) and the 2nd (after the second field). Beware that I tested this solely on this data, so if you want to use it you should first test it with more possible inputs.
(tested in gawk 4.2.1)
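The same OFS trick can be stretched to other values of n. A rough sketch, assuming : as the delimiter and n of at least 1; keep_first_seps is a hypothetical wrapper name, not something from the answer above:

```shell
# keep_first_seps N   (hypothetical helper name)
# Keeps the first N colons of each stdin line and drops the rest.
keep_first_seps() {
  awk -v n="$1" 'BEGIN { FS = ":"; OFS = "" }
  {
    # put ":" back after each of the first n fields (never after the last field);
    # touching a field makes awk rebuild $0 with the empty OFS
    for (i = 1; i <= n && i < NF; i++) $i = $i FS
    print
  }'
}

printf '%s\n' ':account_id:12345:6789:Melbourne:Aus' | keep_first_seps 2
```

The i < NF guard keeps it from appending a separator to lines that have fewer than n colons.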

This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from the third occurrence in the copy to the third occurrence in the amended line.
N.B. A newline is the best delimiter to use with sed, as the lines presented to sed's commands are initially devoid of newlines. The important property of the delimiter, however, is that it is unique, so any character will do as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
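The loop carries over to other counts if the interval is built from a shell parameter. A sketch, assuming a sed that accepts -E (GNU and BSD sed do); strip_colons_sed is a hypothetical name, and the label is passed in its own -e chunk because some seds do not accept ; after a label:

```shell
# strip_colons_sed N   (hypothetical helper name)
# Keeps the first N colons of each stdin line and removes the later ones.
strip_colons_sed() {
  # each pass removes the first colon that follows N colon-terminated chunks;
  # "ta" loops back to :a for as long as a substitution was made
  sed -E -e ':a' -e "s/^(([^:]*:){$1}[^:]*):/\\1/" -e 'ta'
}

printf '%s\n' ':account_id:98765:43210:Adelaide:Aus' | strip_colons_sed 2
```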

With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Related

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between ### .
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2
Here is a single awk command to achieve this that makes ### field separator and prints each even numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
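If single # characters can appear between the markers, a scan with index and substr avoids that assumption. A minimal sketch in portable awk; extract_between is a made-up name and the delimiter is passed as a plain string, not a regex:

```shell
# extract_between DELIM   (hypothetical helper name)
# Prints every substring enclosed by a pair of DELIM markers, one per line.
extract_between() {
  awk -v d="$1" '{
    s = $0
    while ((a = index(s, d)) > 0) {
      s = substr(s, a + length(d))      # skip past the opening marker
      b = index(s, d)
      if (b == 0) break                 # unpaired marker: ignore the rest
      print substr(s, 1, b - 1)         # text between the two markers
      s = substr(s, b + length(d))      # continue after the closing marker
    }
  }'
}

printf '%s\n' '###interest1 moreinterest1### sometext ###interest2###' | extract_between '###'
```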
With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex ###(.*?)### matches ###, then captures into Group 1 any zero or more chars other than line break chars, as few as possible, and then matches ###.
The -o1 option outputs the Group 1 value only.
sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
This replaces the first "###" with a newline and uses D to discard the text before it, then branches to P (print up to the newline) only if a second replacement of "###" succeeds.
This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ###'s by newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

Join two specific lines with sed

I'm trying to manipulate a dataset with sed so I can do it in a batch because the datasets have the same structure.
I've a dataset with two rows (first line in this example is the 7th row) like this:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-
"",,8-9012
What I want:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-8-9012
So the numbers (8-9012) in the second line have been added at the end of the first line because those numbers belong to each other
What I've tried:
sed '8s/7s/' file.csv
But that one does not work and I think that one will just replace whole row 7. The 8-9012 part is on row 8 of the file and I want that part added to row 7. Any ideas and is this possible?
Note: In the question's current form, a sed solution is feasible - this was not the case originally, where the last ;-separated field of the joined lines needed transforming as a whole, which prompted the awk solution below.
Joining lines 7 and 8 as-is, merely by removing the line break between them, can be achieved with this simple sed command:
sed '7 { N; s/\n//; }' file.csv
awk solution:
awk '
BEGIN { FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv
Judging by the OP's comments, an additional problem is the presence of CRLF line endings in the input. With GNU Awk or Mawk, adding RS = "\r\n" to the BEGIN block is sufficient to deal with this (or RS = ORS = "\r\n", if the output should have CRLF line endings too), but with BSD Awk, which only supports single-character input record separators, more work is needed.
BEGIN { FS = OFS = ";" } tells Awk to split the input lines into fields by ; and to also use ; on output (when rebuilding the line).
Pattern NR==7 matches input line 7, and executes the associated action ({...}) with it.
r = $0; getline stores line 7 ($0 contains the input line at hand) in variable r, then reads the next line (getline), at which point $0 contains line 8.
sub(/^"",,/, "") then removes substring "",, from the start of line 8, leaving just 8-9012.
$0 = r $0 joins line 7 and the modified line 8. Assigning the concatenation back to $0 makes awk split the string into fields by ; anew; the record itself is kept as assigned, because $0 is only rebuilt with OFS, the output field separator, when an individual field is modified.
Pattern 1 is a common shorthand that simply prints the (possibly modified) record at hand.
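For awks limited to a single-character RS, one possible workaround for CRLF input is to strip the trailing carriage return from every record. A sketch along those lines that also avoids getline; join_line_pair is a hypothetical name, and the output here uses plain LF line endings:

```shell
# join_line_pair N   (hypothetical helper name)
# Appends line N+1 (minus its leading "",,) to line N; other lines pass through.
join_line_pair() {
  awk -v n="$1" '
    { sub(/\r$/, "") }                  # tolerate CRLF line endings
    NR == n     { held = $0; next }     # remember line N, print it later
    NR == n + 1 { sub(/^"",,/, ""); print held $0; next }
    { print }
    END { if (NR == n) print held }     # input ended right at line N
  '
}

printf 'Enginenumber; ABX 105;Productionnumber.;01 2345 67-\r\n"",,8-9012\r\n' | join_line_pair 1
```

For the file in the question the call would be join_line_pair 7 < file.csv.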
With sed:
sed '/^[^"]/{N;s/\n.*,//;}' file
/^[^"]/: search for lines not starting with ", and if found:
N: next line is appended to the pattern space
s/\n.*,//: all characters up to last , are removed from second line

Use sed to replace word in 2-line pattern

I try to use sed to replace a word in a 2-line pattern with another word. When in one line the pattern 'MACRO "something"' is found then in the next line replace 'BLOCK' with 'CORE'. The "something" is to be put into a reference and printed out as well.
My input data:
MACRO ABCD
CLASS BLOCK ;
SYMMETRY X Y ;
Desired outcome:
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
My attempt in sed so far:
sed 's/MACRO \([A-Za-z0-9]*\)/,/ CLASS BLOCK ;/MACRO \1\n CLASS CORE ;/g' input.txt
The above did not work giving message:
sed: -e expression #1, char 30: unknown option to `s'
What am I missing?
I'm open to one-liner solutions in perl as well.
Thanks,
Gert
Using a perl one-liner in slurp mode:
perl -0777 -pe 's/MACRO \w+\n CLASS \KBLOCK ;/CORE ;/g' input.txt
Or using a streaming example:
perl -pe '
s/^\s*\bCLASS \KBLOCK ;/CORE ;/ if $prev;
$prev = $_ =~ /^MACRO \w+$/
' input.txt
Explanation:
Switches:
-0777: Slurp files whole
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
When in one line the pattern 'MACRO "something"' is found then in the
next line replace 'BLOCK' with 'CORE'.
sed works on lines of input. If you want to perform substitution on the next line of a specified pattern, then you need to add that to the pattern space before being able to do so.
The following might work for you:
sed '/MACRO/{N;s/\(CLASS \)BLOCK/\1CORE/;}' filename
Quoting from the documentation:
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then sed
exits without processing any more commands.
If you want to make use of address range as in your attempt, then you need:
sed '/MACRO/,/CLASS BLOCK/{s/\(CLASS\) BLOCK/\1 CORE/}' filename
I'm not sure why you need a backreference for substituting the macro name.
You could try this awk command also,
awk '{print}/MACRO/ {getline; sub (/BLOCK/,"CORE");{print}}' file
It prints all lines as they are and performs the replacement on the line that follows one containing the word MACRO.
Since getline has so many pitfalls, I try not to use it, so:
awk '/MACRO/ {a++} a==1 {sub(/BLOCK/,"CORE")}1' file
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
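Note that the a==1 condition stays true for every line after the first MACRO until a second MACRO is seen, so a later BLOCK in the same macro would be replaced too. A small sketch that limits the substitution to the line directly after a MACRO line, assuming MACRO is the first word on its line; core_after_macro is a made-up name:

```shell
# core_after_macro   (hypothetical helper name)
# Replaces BLOCK with CORE only on the line right after a MACRO line.
core_after_macro() {
  awk '
    prev_macro { sub(/BLOCK/, "CORE") }          # previous line was a MACRO line
    { prev_macro = ($1 == "MACRO"); print }      # remember for the next line
  '
}

printf '%s\n' 'MACRO ABCD' 'CLASS BLOCK ;' 'SYMMETRY X Y ;' | core_after_macro
```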
This could do it
#!awk -f
BEGIN {
RS = ";"
}
/MACRO/ {
sub("BLOCK", "CORE")
}
{
printf s++ ? ";" $0 : $0
}
"line" ends with ;
sub BLOCK for CORE in "lines" with MACRO
print ; followed by "line" unless first line

Keeping first character in string, in a specific single field

I am trying to remove all but the first character of a specific field in a .tab file. I want to keep only first character in fields 10 and 11.
Normally the fields have 35 characters in them, so I used:
awk '{gsub("..................................$","",$10); print}' file
However, there are some fields which have fewer than 35 characters, and these were ignored by this replace function. I tried using substring, but I cannot figure out how to make it field specific. I believe there is a way to use perl inside awk so that I can use the function
perl -pe 's/(.).*/$1/g'
but I am not sure how to do that and use the field as the input value, so the file comes out identical except for the altered field.
Is there a way to do the perl equivalent with gsub, or the awk equivalent with perl?
Help is appreciated!
One way using awk:
awk '{ for (i=10;i<=11;i++) { $i = substr( $i, 1, 1) } } { print }' infile
Another way using gensub function of gawk
gawk '{ for (i=10;i<=11;i++) { $i = gensub(/(.).*/ , "\\1", "g" , $i) } }1' infile
The shortest awk version I could come up with:
awk '($10=substr($10,1,1))&&$11=substr($11,1,1)' infile
If the 10th and/or 11th field does not exist, the line is not printed.
Similar version in perl
perl -ane '$F[9]=~s/(.).*/$1/;$F[10]=~s/(.).*/$1/;print "@F\n"' infile
This prints the line even if 10th and/or 11th field is not defined.
Another way with perl:
perl -pe '$c=0; s/(\S+)/(++$c < 10 || $c > 11) ? $1 : substr($1,0,1)/eg' filename
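The awk answers above use the default separators, so rebuilding the line joins the fields with single spaces. If the .tab file is tab-delimited and should stay that way (an assumption based on the file extension), a sketch with explicit tab separators might look like this; first_char_fields is a hypothetical name that takes a comma-separated list of field numbers:

```shell
# first_char_fields LIST   (hypothetical helper name)
# Truncates each field named in LIST (e.g. 10,11) to its first character.
first_char_fields() {
  awk -v cols="$1" 'BEGIN { FS = OFS = "\t"; n = split(cols, c, ",") }
  {
    for (k = 1; k <= n; k++)
      if (c[k] <= NF)                   # leave lines that lack the field alone
        $(c[k]) = substr($(c[k]), 1, 1)
    print
  }'
}

printf 'id\tABCDEF\tGHIJ\n' | first_char_fields 2,3
```

For the file in the question the call would be first_char_fields 10,11 < infile.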

sed/awk/cut/grep - Best way to extract string

I have a results.txt file that is structured in this format:
Uncharted 3: Javithaxx l Rampant l Graveyard l Team Deathmatch HD (D1VpWBaxR8c)
Matt Darey feat. Kate Louise Smith - See The Sun (Toby Hedges Remix) (EQHdC_gGnA0)
The Matrix State (SXP06Oax70o)
Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)
I want to create a new file extracting the youtube URL ID specified in the last characters of each line, like "8aOdRACuXiU"
I'm trying to build a URL like this in a new file:
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Note that I appended &hd=1 to the string I am trying to build. I have tried using Linux reverse and cut, but reverse (rev) munges my data. The hard part here is that each line in my text file will have entries with parentheses and I only care about getting the data between the last set of parentheses. Each line has a variable length, so that isn't helpful either. What about using grep and .$ for the end of the line?
In summary, I want to extract the youtube ID from results.txt and export it to a new file in the following format: http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Using awk:
awk '{
v = substr( $NF, 2, length( $NF ) - 2 )
printf "%s%s%s\n", "http://www.youtube.com/watch?v=", v, "&hd=1"
}' infile
It yields:
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
$ sed 's!.*(\(.*\))!http://www.youtube.com/watch?v=\1\&hd=1!' results.txt
http://www.youtube.com/watch?v=D1VpWBaxR8c&hd=1
http://www.youtube.com/watch?v=EQHdC_gGnA0&hd=1
http://www.youtube.com/watch?v=SXP06Oax70o&hd=1
http://www.youtube.com/watch?v=8aOdRACuXiU&hd=1
Here, .*(\(.*\)) looks for the last occurrence of a pair of parentheses, and captures the characters inside those parentheses. The captured group is then inserted into the URL using \1.
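A slightly stricter variant of the same substitution, sketched under the assumption that the ID is the final parenthesized group on the line and contains no parentheses itself; youtube_urls is a made-up wrapper name. With -n and the p flag, lines without such a group are skipped instead of being echoed unchanged:

```shell
# youtube_urls   (hypothetical helper name)
# Turns the last "(ID)" at the end of each stdin line into a watch URL.
youtube_urls() {
  # [^()]* keeps the capture inside one pair of parentheses;
  # \& is a literal ampersand in the replacement
  sed -n 's!.*(\([^()]*\))[[:space:]]*$!http://www.youtube.com/watch?v=\1\&hd=1!p'
}

printf '%s\n' 'Above & Beyond - Group Therapy Radio 014 (guest Lange) (2013-02-08) (8aOdRACuXiU)' | youtube_urls
```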
Using a perl one-liner :
perl -lne 'printf "http://www.youtube.com/watch?v=%s&hd=1\n", $& if /[^\(]+(?=\)$)/' file.txt
Or multi-line version :
perl -lne '
printf(
"http://www.youtube.com/watch?v=%s&hd=1\n",
$&
) if /[^\(]+(?=\)$)/
' file.txt