Combine some data from multiple lines - sed

Trying to combine data into one line where some fields match.
12345,this,is,one,line,1
13567,this,is,another,line,3
14689,and,this,is,another,6
12345,this,is,one,line,4
14689,and,this,is,another,10
Output
12345,this,is,one,line,1,4
13567,this,is,another,line,3
14689,and,this,is,another,6,10
Thanks

awk -F',' '{if($1 in a) {a[$1]=a[$1] "," $NF} else {a[$1]=$0}} END {n=asort(a); for(i=1;i<=n;i++) print a[i]}' < input.txt
Works well with the given example. Note that asort requires GNU awk, and that after asort the array is indexed 1..n, so we loop over that range rather than using for(i in a), whose order is unspecified.
Here is a commented script version of the same awk program, parse.awk. Keep in mind that this version uses only the first field as the row key; a rewritten version that follows the author's comment above (keying on all fields but the last one) comes further down.
#!/usr/bin/awk -f
BEGIN { # the BEGIN section is executed once, before the input is read
FS="," # the input field separator is a comma (can also be set with -F on the command line)
}
{ # the main section is executed on every input line
if($1 in a) { # check whether array 'a' already contains an element indexed by the first field
a[$1]=a[$1] "," $NF # if the entry already exists, just append the last field of the current row
}
else { # if this line contains a new entry
a[$1]=$0 # add it as a new array element
}
}
END { # the END section is executed once, after the last line
n=asort(a) # sort array 'a' by its values (GNU awk only); indices become 1..n
for(i=1;i<=n;++i) print a[i] # walk the sorted array and print its content
}
Use this via
./parse.awk input.txt
Here is another version which takes all but the last field to compare rows:
#!/usr/bin/awk -f
BEGIN { # BEGIN section is executed once before input file's content
FS="," # input field separator is comma (can be set with -F argument on command line)
}
{ # main section is executed on every input line
idx="" # reset the index variable
for(i=1;i<NF;++i) idx=idx $i SUBSEP # join all but the last field to build the key (SUBSEP avoids collisions such as "ab","c" vs "a","bc")
if(idx in a) { # check whether array 'a' already contains an element with this key
a[idx]=a[idx] "," $NF # if the entry already exists, just append the last field of the current row
}
else { # if this line contains a new entry
a[idx]=$0 # add it as a new array element
}
}
END { # the END section is executed once, after the last line
n=asort(a) # sort array 'a' by its values (GNU awk only)
for(i=1;i<=n;++i) print a[i] # walk the sorted array and print its content
}
Feel free to ask for any further explanation.

This might work for you (GNU sed and sort):
sort -nt, -k1,1 -k6,6 file |
sed ':a;$!N;s/^\(\([^,]*,\).*\)\n\2.*,/\1,/;ta;P;D'
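As a sanity check, the pipeline can be run against the sample data from the question (assuming GNU sed, since the script relies on semicolon-separated commands and labels):

```shell
cat > file <<'EOF'
12345,this,is,one,line,1
13567,this,is,another,line,3
14689,and,this,is,another,6
12345,this,is,one,line,4
14689,and,this,is,another,10
EOF
# sort groups matching keys together, then sed folds adjacent lines
# that share the first field into one, appending only the part after
# the last comma of each folded line
sort -nt, -k1,1 -k6,6 file |
sed ':a;$!N;s/^\(\([^,]*,\).*\)\n\2.*,/\1,/;ta;P;D'
# prints:
# 12345,this,is,one,line,1,4
# 13567,this,is,another,line,3
# 14689,and,this,is,another,6,10
```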

get column list using sed/awk/perl

I have different files like below format
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the column list that has more columns, provided the common columns appear in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected Results :
The message "Both file headers and positions are different"
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
close ARGV;
next if @lines < 2;
@lines = sort { length $a <=> length $b } @lines;
if (0 == index "$lines[1],", $lines[0]) {
print $lines[1];
} else {
print "Both file headers and positions are different";
}' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code; it can continue once more than one input line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)

Filter unique lines

I have a file type called fasta which contains a header ("> 12122") followed by a string. I would like to remove duplicated strings in the file and keep only one of each duplicated string (no matter which) and the corresponding header.
In the example below the AGGTTCCGGATAAGTAAGAGCC is duplicated
in:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
out:
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
if order is mandatory
# Field are delimited by new line
awk -F "\n" '
BEGIN {
# Record is delimited by ">"
RS = ">"
}
# skip first "record" due to first ">"
NR > 1 {
# if the string is not yet known, add it to the "order" array O
if ( ! ( $2 in L ) ) O[++a] = $2
# remember the (last) label paired with this string
L[$2] = $1
}
# after reading the file
END{
# display each (last known) pair based on the order
for ( i=1; i<=a; i++ ) printf( ">%s\n%s\n", L[O[i]], O[i])
}
' YourFile
if order is not mandatory
awk -F "\n" 'BEGIN{RS=">"}NR>1{L[$2]=$1}END{for (l in L) printf( ">%s\n%s\n", L[l], l)}' YourFile
$ awk '{if(NR%2) p=$0; else a[$0]=p}END{for(i in a)print a[i] ORS i}' file
>18-41148
TCTTAACCCGGACCAGAAACTA
>32-24116
TAGCATATCGAGCCTGAGAACA
>1-242
AGGTTCCGGATAAGTAAGAGCC
>43-16054
GTCCCACTCCGTAGATCTGTTC
>42-16312
TGATACGGATGTTATACGCAGC
Explained:
{
if(NR%2) # store every odd line (the header) in p
p=$0
else # every even line (the sequence) is the hash key
a[$0]=p
}
END{
for(i in a) # output every unique key and its header
print a[i] ORS i
}
Here's a quick one-line awk solution for you. It streams line by line rather than queuing the data (and looping through it) until the end, so it starts producing output immediately:
awk 'NR % 2 == 0 && !seen[$0]++ { print last; print } { last = $0 }' file
Explanation:
NR % 2 == 0 runs only on even numbered records (lines, NR)
!seen[$0]++ stores and increments values and returns true only when there were no values in the seen[] hash (!0 is 1, !1 is 0, !2 is 0, etc.)
(Skipping to the end) last is set to the value of each line after we're otherwise done with it
{ print last; print } will print last (the header) and then the current line (gene code)
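The !seen[$0]++ idiom is worth knowing on its own: applied to a whole stream it removes duplicates while preserving the order of first appearance.

```shell
printf 'a\nb\na\nb\nc\n' | awk '!seen[$0]++'
# prints:
# a
# b
# c
```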
Note: while this preserves the original order, it shows the first uniquely seen instance while the expected output showed the final uniquely seen instance:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
If you want the final uniquely seen instance, you can reverse the file before passing to awk and then reverse it back afterwards:
tac file |awk … |tac

Join two specific lines with sed

I'm trying to manipulate a dataset with sed so I can do it in a batch because the datasets have the same structure.
I have a dataset with two rows (the first line in this example is the 7th row of the file) like this:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-
"",,8-9012
What I want:
Enginenumber; ABX 105;Productionnumber.;01 2345 67-8-9012
So the numbers (8-9012) in the second line have been added at the end of the first line because those numbers belong to each other
What I've tried:
sed '8s/7s/' file.csv
But that does not work, and I think it would just replace the whole of row 7. The 8-9012 part is on row 8 of the file and I want that part added to row 7. Any ideas, and is this possible?
Note: In the question's current form, a sed solution is feasible - this was not the case originally, where the last ;-separated field of the joined lines needed transforming as a whole, which prompted the awk solution below.
Joining lines 7 and 8 as-is, merely by removing the line break between them, can be achieved with this simple sed command:
sed '7 { N; s/\n//; }' file.csv
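To see the join in isolation, the command can be run over dummy numbered input (seq stands in for file.csv here):

```shell
seq 9 | sed '7 { N; s/\n//; }'
# prints 1 through 6 unchanged, then the joined line 78, then 9
```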
awk solution:
awk '
BEGIN { FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv
Judging by the OP's comments, an additional problem is the presence of CRLF line endings in the input. With GNU Awk or Mawk, adding RS = "\r\n" to the BEGIN block is sufficient to deal with this (or RS = ORS = "\r\n", if the output should have CRLF line endings too), but with BSD Awk, which only supports single-character input record separators, more work is needed.
BEGIN { FS = OFS = ";" } tells Awk to split the input lines into fields by ; and to also use ; on output (when rebuilding the line).
Pattern NR==7 matches input line 7, and executes the associated action ({...}) with it.
r = $0; getline stores line 7 ($0 contains the input line at hand) in variable r, then reads the next line (getline), at which point $0 contains line 8.
sub(/^"",,/, "") then removes substring "",, from the start of line 8, leaving just 8-9012.
$0 = r $0 joins line 7 and modified line 8, and by assigning the concatenation back to $0, the string assigned is split into fields by ; anew, and the resulting fields are joined to form the new $0, separated by OFS, the output field separator.
Pattern 1 is a common shorthand that simply prints the (possibly modified) record at hand.
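Putting it together on a minimal stand-in for the dataset (the filler lines for rows 1-6 are made up for illustration):

```shell
# build an 8-line file: six filler rows, then the two rows to be joined
{ printf 'filler%d\n' 1 2 3 4 5 6
  printf 'Enginenumber; ABX 105;Productionnumber.;01 2345 67-\n'
  printf '"",,8-9012\n'; } > file.csv
awk '
BEGIN { FS = OFS = ";" }
NR==7 { r = $0; getline; sub(/^"",,/, ""); $0 = r $0 }
1
' file.csv
# rows 1-6 pass through unchanged; row 7 of the output becomes:
# Enginenumber; ABX 105;Productionnumber.;01 2345 67-8-9012
```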
With sed:
sed '/^[^"]/{N;s/\n.*,//;}' file
/^[^"]/: search for lines not starting with ", and if found:
N: next line is appended to the pattern space
s/\n.*,//: the newline and all characters up to the last , are removed, leaving just the tail of the second line
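A quick two-line check (the second line starts with ", so only the first line triggers the append and join):

```shell
printf 'Enginenumber;01 2345 67-\n"",,8-9012\n' |
sed '/^[^"]/{N;s/\n.*,//;}'
# prints: Enginenumber;01 2345 67-8-9012
```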

Match a string from File1 in File2 and replace the string in File1 with corresponding matched string in File2

The title may be confusing, here's what I'm trying to do:
File1
12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
File2
Tom 12 John 921 Mike 813
Output
Tom=John:5,Mike:5
File2 has the names for the numbers in file1, and I want to match and replace the numbers with the corresponding string values. I tried this with my limited knowledge of awk, but couldn't do it.
Any help appreciated.
Here's one way using GNU awk. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
BEGIN {
FS="[ =:,]"
}
FNR==NR {
a[$1]=$0
next
}
$2 in a {
split(a[$2],b)
for (i=3;i<=NF-1;i+=2) {
for (j=2;j<=length(b)-1;j+=2) {
if ($(i+1) == b[j]) {
line = (line ? line "," : "") $i ":" b[j+1]
}
}
}
print $1 "=" line
line = ""
}
Results:
Tom=John:5,Mike:5
Alternatively, here's the one-liner:
awk -F "[ =:,]" 'FNR==NR { a[$1]=$0; next } $2 in a { split(a[$2],b); for (i=3;i<=NF-1;i+=2) for (j=2;j<=length(b)-1;j+=2) if ($(i+1) == b[j]) line = (line ? line "," : "") $i ":" b[j+1]; print $1 "=" line; line = "" }' file1 file2
Explanation:
Change awk's field separator to either a space, equals sign, colon or comma.
'FNR==NR { ... }' is only true for the first file in the arguments list.
So when processing file1, awk will add column '1' to an array and we assign the whole line as a value to this array element.
'next' will simply skip processing the rest of the script, and read the next line of input.
When awk has finished reading the input in file1, it will continue reading file2. However, this also resets 'FNR' to '1', so awk will skip the 'FNR==NR' block for file2 because it is no longer true.
So for file2: if column '2' can be found in the array mentioned above:
Split the value of the array element into another array. This essentially splits up the whole line in file1.
Now create two loops.
The first will loop through all the names in file2
And the second will loop through all the values in the (second) array (this essentially loops over all the fields in file1).
Now when a value succeeding a name in file2 is equal to one of the key numbers in file1, create a line construct that looks like: 'name:number_following_key_number_from_file1'.
When more names and values are found during the loops, the ternary construct '( ... ? ... : ... )' appends these elements to the end of the line. It's like an if statement: if there's already a line, add a comma onto the end of it, else don't do anything.
When all the loops are complete, print out column '1' and the line. Then empty the line variable so that it can be used again.
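One portability note: length(b) on an array is a GNU awk extension (also supported by some other awks). A hypothetical variant of the same logic avoids it by using split's return value instead:

```shell
printf '12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5\n' > file1
printf 'Tom 12 John 921 Mike 813\n' > file2
awk -F '[ =:,]' '
FNR==NR { a[$1]=$0; next }
$2 in a {
  n = split(a[$2], b)              # n is the field count; replaces length(b)
  for (i=3; i<NF; i+=2)            # walk name/number pairs in file2
    for (j=2; j<n; j+=2)           # walk number:value pairs from file1
      if ($(i+1) == b[j])
        line = (line ? line "," : "") $i ":" b[j+1]
  print $1 "=" line
  line = ""
}' file1 file2
# prints: Tom=John:5,Mike:5
```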
HTH. Good luck.
The following may work as a template:
skrynesaver@busybox ~/ perl -e '$values="12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5";
$data = "Tom 12 John 921 Mike 813";
($line,$values)=split/=/,$values;
@values=split/,/,$values;
$values{$line}="=";
map{$_=~/(\d+)(:\d+)/;$values{$1}="$2";}@values;
if ($data=~/\w+\s$line\s/){
$data=~s/(\w+)\s(\d+)\s?/$1$values{$2}/g;
}
print "$data\n";
'
Tom=John:5Mike:5
skrynesaver@busybox ~/

how to return the search results in perl

I would like to write a script that returns a result whenever the regex matches. I think I have some difficulties in writing the regex.
Content of My input file is as below:
Number a123;
Number b456789 vit;
alphabet fty;
I would like it to return a123 and b456789, i.e. the string after "Number " and before a whitespace or ";".
I have tried with below cmd line:
my @result=grep /Number/,@input_file;
print "@result\n";
The result I obtained is shown below:
Number a123;
Number b456789 vit;
Whereas the expected result should be like below:
a123
b456789
Can anyone help on this?
Perl's grep function selects/filters all elements from a list that match a certain condition. In your case, you selected all elements that match the regex /Number/ from the @input_file array.
To select the non-whitespace string after Number use this Regex:
my $regex = qr{
Number # Match the literal string 'Number'
\s+ # match any number of whitespace characters
([^\s;]+) # Capture the following non-spaces-or-semicolons into $1
# using a negated character class
}x; # use /x modifier to allow whitespaces in pattern
# for better formatting
My suggestion would be to loop directly over the input file handle:
while(defined(my $line = <$input>)) {
print "Found: $1\n" if $line =~ /$regex/; # checking the match directly avoids acting on a stale $1
}
If you have to use an array, a foreach-loop would be preferable:
foreach my $line (@input_lines) {
print "Found: $1\n" if $line =~ /$regex/; # checking the match directly avoids acting on a stale $1
}
If you don't want to print your matches directly but to store them in an array, push the values into the array inside your loop (both work) or use the map function. The map function replaces each input element by the value of the specified operation:
my @result = map { /$regex/ ? $1 : () } @input_file;
or
my @result = map { /$regex/ ? $1 : () } <$input>;
Inside the map block, we match the regex against the current array element. If we have a match, we return $1; otherwise we return an empty list, which gets flattened into invisibility so no entry is created in @result. This is different from returning undef, which would create an undef element in your array.
If your script is intended as a simple filter, you can use
$ cat FILE | perl -nle 'print $1 if /Number\s+([^\s;]+)/'
or
$ cat FILE | perl -nle 'for (/Number\s+([^\s;]+)/g) { print }'
if there can be multiple occurences on the same line.
perl -lne 'if(/Number/){s/.*\s([a-zA-Z])([\d]+).*$/\1\2/g;print}' your_file
tested below:
> cat temp
Number a123;
Number b456789 vit;
alphabet fty;
> perl -lne 'if(/Number/){s/.*\s([a-zA-Z])([\d]+).*$/\1\2/g;print}' temp
a123
b456789
>