How to get the position of the nth occurrence of a string in a file - sed

I have an XML file that contains data on a single line, where the same string is repeated many times.
I want to find the position of the nth occurrence of a string in that file so that I can split the single file into multiple files at that position, which will make processing easier.
Sample data in the file:
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id><id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id><id = 9><\id><id = 10><\id><id = 11><\id>
So I want to split the file based on the id tag. For example, I want to find the position of the 5th occurrence of the id tag and split the file into 3 files in total.
Output:
file_1:
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id>
file_2:
<id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id>
file_3:
<id = 9><\id><id = 10><\id><id = 11><\id>
I tried splitting the single line into multiple lines with a simple sed:
sed 's/></>\n</g' $file > data.txt
Later, with a simple grep, I identified the line numbers and split based on them. This works for smaller files, but some files are 10-20 GB, which causes issues.
Could you help me find an easy way to get the position of the nth occurrence of a string in a file, so that I can split a single file into multiple files based on that position?
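For the literal question (the byte position of the nth occurrence), here is a minimal sketch using GNU grep's -b and -o options; the occurrence number, the 5-byte length of <\id> and the output names are illustrative, and grep still has to buffer the single giant line, so test it on your large files first:
start=$(grep -bo '<\\id>' file | sed -n '4p' | cut -d: -f1)   # byte offset where the 4th <\id> starts
end=$((start + 5))                                            # 5 = length of <\id>
head -c "$end" file > file_1                                  # up to and including the 4th <\id>
tail -c +"$((end + 1))" file > rest                           # remainder, to split further the same way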

This might work for you (GNU sed):
sed 's/<\\id>/&\n/4;P;D' file | sed -ne '1~3w file1' -e '2~3w file2' -e '3~3w file3'
In the first sed invocation, insert a newline after the fourth <\id> (BTW, should this be </id>?); P then prints up to the newline and D deletes it and restarts the cycle on the remainder, so the single line is split into three.
Pipe the result to a second sed invocation.
In the second sed invocation, use GNU sed's first~step addresses to send lines 1, 4, 7, ... to file1, lines 2, 5, 8, ... to file2, and lines 3, 6, 9, ... to file3.
Alternative using split instead of the second invocation of sed:
sed 's/<\\id>/&\n/4;P;D' file | split -a 1 --numeric-suffixes=1 -n r/3 - file
Here -a 1 requests single-character suffixes, --numeric-suffixes=1 makes them numeric starting at 1, and -n r/3 distributes the lines round-robin into three files, giving file1, file2 and file3.

Suggesting a gawk script (gawk is the standard awk on most Linux machines) that can do all the splitting as well:
Parameters:
f : filename prefix
m : number of id elements per output file
gawk script:
gawk 'BEGIN{RS="<\\\\id>"}{a=a$0 RT}NR%m==0{print a > f (NR-m+1);a=""}END{if (a)print a > f (NR-(NR%m-1))}' f="FileName_" m=5 input.txt
Sample run on the given example, with f=file_ and m=4:
gawk 'BEGIN{RS="<\\\\id>"}{a=a$0 RT}NR%m==0{print a > f (NR-m+1);a=""}END{if (a)print a > f (NR-(NR%m-1))}' f="file_" m=4 input.xml
file_1
<id = 1><\id><id = 2><\id><id = 3><\id><id = 4><\id>
file_5
<id = 5><\id><id = 6><\id><id = 7><\id><id = 8><\id>
file_9
<id = 9><\id><id = 10><\id><id = 11><\id>
gawk script explanation (the one-liner's a, f and m appear here as output, filePrefix and chunkSize):
BEGIN{RS="<\\\\id>"}        # set awk record separator to <\id>
{output = output $0 RT}     # accumulate each record in the output variable
(NR % chunkSize) == 0 {     # once chunkSize records have been read
    # save the output variable into a file named filePrefix
    # followed by the record number of the chunk's first record
    print output > filePrefix (NR - chunkSize + 1);
    output = "";            # reset the output variable
}
END {                       # after processing the last record
    if (output) {           # there is leftover output
        lastChunk = NR - ((NR % chunkSize) - 1);  # compute the first record number of the last chunk
        print output > filePrefix lastChunk;      # write the leftover output to the last file
    }
}
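Saved to a file (say split.awk, name hypothetical), the commented version should run the same way as the one-liner, with the variables assigned on the command line:
gawk -f split.awk filePrefix="file_" chunkSize=4 input.xml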

Related

Replacing all occurrences after the nth occurrence in a line in Perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with an occurrence number is not working as expected; instead, it replaces all the occurrences.
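For reference, GNU sed does document combining an occurrence number with g (replace the Nth match and every match after it), so the original attempt should work there; a quick check, assuming GNU sed:
echo ':account_id:12345:6789:Melbourne:Aus' | sed 's/://3g'
This should print :account_id:123456789MelbourneAus; if it does not, the sed in use is probably not GNU sed.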
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
    FS=OFS=""
}
{
    j=0;
    for(i=0; ++i<=NF;)
        if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk, using gensub, please try the following. This is based entirely on your shown samples, where you want to remove : from the 3rd occurrence onwards: gensub separates the matched value into parts, and all colons are removed from the second part (everything from the 3rd colon onwards).
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
    firstPart=restPart=""
    firstPart=gensub(regex, "\\1\\2", "1", $0)
    restPart=gensub(regex, "\\3", "1", $0)
    gsub(/:/, "", restPart)
    print firstPart restPart
}
' Input_file
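With the sample input, this should print:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus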
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use a regex for this job: what you have there are colon-delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ( undef, $first, @rest ) = split /:/;
    print ":$first:", join( "", @rest ), "\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
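A one-liner sketch of the same split-and-join idea, reading the input file directly instead of the __DATA__ section (file name as in the question):
perl -F: -lane 'print ":$F[1]:", join "", @F[2..$#F]' test.txt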
You can use a Perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two repetitions of: zero or more chars other than :, followed by a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (the if /^:account_id:/ condition).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result - iterates over the fields; if the field number is greater than 2, append just the current field value to result, otherwise append the value plus the output field separator; then print the result.
I would use GNU AWK in the following way if n is fixed and equal to 2. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator; this by itself removes all :, so I re-add the two colons that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested this solely on the given data, so test it against more possible inputs before relying on it.
(tested in gawk 4.2.1)
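A hedged generalization of the same idea for arbitrary n (keeps the first n colons; untested beyond the sample data):
awk -v n=2 'BEGIN{FS=":"}{
    out = ""
    for (i = 1; i <= NF; i++)
        out = out (i > 1 && i <= n+1 ? ":" : "") $i
    print out
}' file.txt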
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of : in the pattern space.
Append the amended line to the copy.
Join the two lines by removing everything from the newline in the copy through the newline in the amended line (the match \n.*\n is greedy).
N.B. A newline is the best delimiter to use with sed, as the lines presented to sed's commands are initially devoid of newlines. However, the important property of the delimiter is that it is unique, so it can be any character, as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all colons after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Join two specific lines with sed

I'm trying to manipulate a dataset with sed so that I can do it in a batch, because the datasets all have the same structure.
I have a dataset in which a record spans two rows (the first line in this example is the 7th row of the file):
Enginenumber; ABX 105;Productionnumber.;01 2345 67-
"",,8-9012
What I want:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-8-9012
So the numbers (8-9012) from the second line are appended to the end of the first line, because those numbers belong together.
What I've tried:
sed '8s/7s/' file.csv
But that one does not work, and I think it would just replace the whole of row 7. The 8-9012 part is on row 8 of the file and I want it appended to row 7. Any ideas, and is this possible?
Note: In the question's current form, a sed solution is feasible - this was not the case originally, where the last ;-separated field of the joined lines needed transforming as a whole, which prompted the awk solution below.
Joining lines 7 and 8 as-is, merely by removing the line break between them, can be achieved with this simple sed command:
sed '7 { N; s/\n//; }' file.csv
awk solution:
awk '
BEGIN { FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv
Judging by the OP's comments, an additional problem is the presence of CRLF line endings in the input. With GNU Awk or Mawk, adding RS = "\r\n" to the BEGIN block is sufficient to deal with this (or RS = ORS = "\r\n", if the output should have CRLF line endings too), but with BSD Awk, which only supports single-character input record separators, more work is needed.
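A sketch of that RS tweak, assuming GNU Awk or Mawk (add ORS = "\r\n" as well if the output should keep CRLF endings):
awk '
BEGIN { RS = "\r\n"; FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv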
BEGIN { FS = OFS = ";" } tells Awk to split the input lines into fields by ; and to also use ; on output (when rebuilding the line).
Pattern NR==7 matches input line 7, and executes the associated action ({...}) with it.
r = $0; getline stores line 7 ($0 contains the input line at hand) in variable r, then reads the next line (getline), at which point $0 contains line 8.
sub(/^"",,/, "") then removes substring "",, from the start of line 8, leaving just 8-9012.
$0 = r $0 joins line 7 and modified line 8, and by assigning the concatenation back to $0, the string assigned is split into fields by ; anew, and the resulting fields are joined to form the new $0, separated by OFS, the output field separator.
Pattern 1 is a common shorthand that simply prints the (possibly modified) record at hand.
With sed:
sed '/^[^"]/{N;s/\n.*,//;}' file
/^[^"]/: search for lines not starting with ", and if found:
N: next line is appended to the pattern space
s/\n.*,//: the newline and all characters up to the last , are removed, leaving just 8-9012 from the second line

comparing lines with awk vs while read line

I have two files, one with 17k lines and another with 4k lines. I want to compare positions 115 to 125 of each line in the first file with every line in the second file and, if there is a match, write the entire line from the first file into a new file. I came up with a solution that reads the file using cat $filename | while read LINE, but it takes around 8 minutes to complete. Is there another way, such as using awk, to reduce the processing time?
my code
cat $filename | while read LINE
do
    # read chars 115 to 125, then remove trailing spaces and leading zeroes
    vid=`echo "$LINE" | cut -c 115-125 | sed 's,^ *,,; s, *$,,' | sed 's/^[0]*//'`
    exist=0
    # match vid against whole lines in id.txt
    exist=`grep -x "$vid" $file_dir/id.txt | wc -l`
    if [[ $exist -gt 0 ]]; then
        echo "$LINE" >> $dest_dir/id.txt
    fi
done
How about this:
FNR==NR {                   # FNR == NR is only true in the first file
    s = substr($0,115,11)   # store chars 115-125 of the line (11 characters)
    sub(/^\s*/,"",s)        # remove any leading whitespace
    sub(/\s*$/,"",s)        # remove any trailing whitespace
    lines[s]=$0             # store the line, keyed by the trimmed substring
    next                    # get next line in first file
}
{                           # now in second file
    for(i in lines)         # for each key in the array
        if (i~$0) {         # if it matches the current line in second file
            print lines[i]  # print the matching line from file1
            next            # get next line in second file
        }
}
Save it to a script script.awk and run like:
$ awk -f script.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"
This will still be slow because, for each line in the second file, you need to look at roughly 50% of the unique lines in the first (assuming most lines do in fact match). This can be significantly improved if you can confirm that the lines in the second file are full-line matches against the substrings.
For full line matches this should be faster:
FNR==NR {                   # FNR == NR is only true in the first file
    s = substr($0,115,11)   # store chars 115-125 of the line (11 characters)
    sub(/^\s*/,"",s)        # remove any leading whitespace
    sub(/\s*$/,"",s)        # remove any trailing whitespace
    lines[s]=$0             # store the line, keyed by the trimmed substring
    next                    # get next line in first file
}
($0 in lines) {             # now in second file: exact key lookup
    print lines[$0]         # print the matching line from file1
}
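Saved as, say, script2.awk (name hypothetical), it runs exactly like the first version:
awk -f script2.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"
The ($0 in lines) test is a single hash lookup per line instead of a scan over every stored key, which is where the speedup comes from.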

Match a string from File1 in File2 and replace the string in File1 with corresponding matched string in File2

The title may be confusing; here's what I'm trying to do:
File1
12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
File2
Tom 12 John 921 Mike 813
Output
Tom=John:5,Mike:5
File2 has the name values for the numbers in File1, and I want to match the numbers and replace them with their string values. I tried this with my limited knowledge of awk, but couldn't do it.
Any help appreciated.
Here's one way using GNU awk. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
BEGIN {
    FS="[ =:,]"
}
FNR==NR {
    a[$1]=$0
    next
}
$2 in a {
    split(a[$2],b)
    for (i=3;i<=NF-1;i+=2) {
        for (j=2;j<=length(b)-1;j+=2) {
            if ($(i+1) == b[j]) {
                line = (line ? line "," : "") $i ":" b[j+1]
            }
        }
    }
    print $1 "=" line
    line = ""
}
Results:
Tom=John:5,Mike:5
Alternatively, here's the one-liner:
awk -F "[ =:,]" 'FNR==NR { a[$1]=$0; next } $2 in a { split(a[$2],b); for (i=3;i<=NF-1;i+=2) for (j=2;j<=length(b)-1;j+=2) if ($(i+1) == b[j]) line = (line ? line "," : "") $i ":" b[j+1]; print $1 "=" line; line = "" }' file1 file2
Explanation:
Change awk's field separator to either a space, an equals sign, a colon or a comma.
'FNR==NR { ... }' is only true for the first file in the arguments list.
So when processing file1, awk will use column '1' as an array key and assign the whole line as the value of that array element.
'next' will simply skip processing the rest of the script, and read the next line of input.
When awk has finished reading the input in file1, it will continue reading file2. However, this also resets 'FNR' to '1', so awk will skip the 'FNR==NR' block for file2 because the condition is no longer true.
So for file2: if column '2' can be found in the array mentioned above:
Split the value of the array element into another array. This essentially splits up the whole line in file1.
Now create two loops.
The first will loop through all the names in file2
And the second will loop through all the values in the (second) array (this essentially loops over all the fields in file1).
Now, when a value following a name in file2 is equal to one of the key numbers in file1, create a line construct that looks like: 'name:number_following_key_number_from_file1'.
When more names and values are found during the loops, the ternary construct '( ... ? ... : ... )' appends these elements to the end of the line. It's like an if statement: if there's already a line, add a comma to the end of it, otherwise don't do anything.
When all the loops are complete, print out column '1' and the line. Then empty the line variable so that it can be used again.
HTH. Good luck.
The following may work as a template:
skrynesaver@busybox ~/ perl -e '$values="12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5";
$data = "Tom 12 John 921 Mike 813";
($line,$values)=split/=/,$values;
@values=split/,/,$values;
$values{$line}="=";
map{$_=~/(\d+)(:\d+)/;$values{$1}="$2";}@values;
if ($data=~/\w+\s$line\s/){
    $data=~s/(\w+)\s(\d+)\s?/$1$values{$2}/g;
}
print "$data\n";
'
Tom=John:5Mike:5
skrynesaver@busybox ~/
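A sketch of the same template reworked to read the actual files (file names file1 and file2 as in the question; like the demo above, it assumes every number in file2 appears in file1):
#!/usr/bin/perl
use strict;
use warnings;

# build the number => replacement map from file1
open my $f1, '<', 'file1' or die "file1: $!";
my ($key, $values) = split /=/, scalar <$f1>, 2;
chomp $values;
my %map = ($key => '=');                 # the leading key itself becomes '='
$map{$1} = $2 while $values =~ /(\d+)(:\d+)/g;

# rewrite the matching lines of file2
open my $f2, '<', 'file2' or die "file2: $!";
while (my $data = <$f2>) {
    chomp $data;
    next unless $data =~ /\w+\s$key\s/;  # only lines that contain the key
    $data =~ s/(\w+)\s(\d+)\s?/$1$map{$2}/g;
    print "$data\n";
}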

Join 2 lines only if their first fields are equal, with sed or awk

input file:
$ cat t.txt
id1;value1_1
id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1
id4;value4_2
id5;value5_1
result would be:
id1;value1_1;id1;value1_2
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
using sed or awk. Please give your opinion.
Here's one way to do it:
awk -F';' 'BEGIN { getline; id=$1; line=$0 } { if ($1 != id) { print line; line = $0; } else { line = line ";" $0; } id=$1; } END { print line; }' t.txt
Explanation:
Set field separator to ;:
-F';'
Start by reading the first line of input (getline), save the first field ($1) as id, and the first line ($0) as line:
BEGIN { getline; id=$1; line=$0 }
For each line of input, check if the first field differs from the stored id:
if ($1 != id)
If it does, then print the saved line and store the new one ($0):
print line; line = $0;
Otherwise, append the new line to the stored line(s):
line = line ";" $0;
And save the new id:
id=$1
At the end, print whatever is left in line:
END { print line; }
I guess the id2; line is missing from your example result by mistake, right?
anyway, you could try the awk line below:
awk -F';' '{a[$1]=($1 in a)?a[$1]";"$0:$0}END{for(x in a)print a[x]}' yourFile|sort
output would be:
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
This might work for you:
sed -e '1{h;d};H;${x;:a;s/\(\([^;]*;\)\([^\n]*\)\)\n\2/\1;\2/;ta;p};d' t.txt
Explanation:
Slurp the file into the hold space (HS), then at end-of-file swap to the HS and, using substitution, concatenate lines with duplicate keys, then print. N.B. Lines that would normally be printed are all deleted.
EDIT:
The above solution works (as far as I know), but for large volumes it is not very fast (read: incredibly slow). This solution is better:
# cat -A /tmp/t.txt
id1;value1_1$
id1;value1_2$
id2;value2_1$
id3;value3_1$
id4;value4_1$
id4;value4_2$
id5;value5_1$
# for x in {1..1000};do cat /tmp/t.txt;done |
> sed ':a;$!N;/^\([^;]*;\).*\n\1/s/\n/;/;ta;P;D' | sort | uniq
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1