Print Valid words with _ in between them - sed

I have done my research but am not able to find a solution to my problem.
I am trying to extract all valid words (those starting with a letter) from a string and concatenate them with an underscore ("_"). I am looking for a solution with awk, sed or grep, etc.
Something like:
echo "The string under consideration" | (awk/grep/sed) (pattern match)
Example 1
Input:
1.2.3::L2 Traffic-house seen during ABCD from 2.2.4/5.2.3a to 1.2.3.X11
Desired output:
L2_Traffic_house_seen_during_ABCD_from
Example 2
Input:
XYZ-2-VRECYY_FAIL: Verify failed - Client 0x880016, Reason: Object exi
Desired Output:
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
Example 3
Input:
ABCMGR-2-SERVICE_CRASHED: Service "abcmgr" (PID 7582) during UPGRADE
Desired Output:
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE

This might work for you (GNU sed):
sed 's/[[:punct:]]/ /g;s/\<[[:alpha:]]/\n&/g;s/[^\n]*\n//;s/ [^\n]*//g;y/\n/_/' file
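In case the one-liner is hard to follow, the same commands can be laid out one per line in a commented script (a sketch of one way to read it; the file name words.sed is arbitrary), run as sed -f words.sed file:
# turn every punctuation character into a space
s/[[:punct:]]/ /g
# put a newline in front of each word that starts with a letter
s/\<[[:alpha:]]/\n&/g
# drop everything up to the first such word
s/[^\n]*\n//
# after each kept word, drop the space and anything after it up to the next newline
s/ [^\n]*//g
# turn the remaining embedded newlines into underscores
y/\n/_/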

A Perl one-liner. It searches for an alphabetic character followed by any number of word characters, enclosed in word boundaries. The /g flag makes it try several matches on each line.
Content of infile:
1.2.3::L2 Traffic-house seen during ABCD from 2.2.4/5.2.3a to 1.2.3.X11
XYZ-2-VRECYY_FAIL: Verify failed - Client 0x880016, Reason: Object exi
ABCMGR-2-SERVICE_CRASHED: Service "abcmgr" (PID 7582) during UPGRADE
Perl command:
perl -ne 'printf qq|%s\n|, join qq|_|, (m/\b([[:alpha:]]\w*)\b/g)' infile
Output:
L2_Traffic_house_seen_during_ABCD_from_to_X11
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE

One way using awk, with the contents of script.awk:
BEGIN {
    FS="[^[:alnum:]_]"
}

{
    for (i=1; i<=NF; i++) {
        if ($i !~ /^[0-9]/ && $i != "") {
            if (i < NF) {
                printf "%s_", $i
            }
            else {
                print $i
            }
        }
    }
}
Run like:
awk -f script.awk file.txt
Alternatively, here is the one-liner:
awk -F "[^[:alnum:]_]" '{ for (i=1; i<=NF; i++) { if ($i !~ /^[0-9]/ && $i != "") { if (i < NF) printf "%s_", $i; else print $i; } } }' file.txt
Results:
L2_Traffic_house_seen_during_ABCD_from_to_X11
XYZ_VRECYY_FAIL_Verify_failed_Client_Reason_Object_exi
ABCMGR_SERVICE_CRASHED_Service_abcmgr_PID_during_UPGRADE

This solution requires some tuning, and I think one needs gawk in order to use a regexp as the record separator:
http://www.gnu.org/software/gawk/manual/html_node/Records.html#Records
gawk -v ORS='_' -v RS='[-: \"()]' '/^[a-zA-Z]/' file.dat
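One possible direction for that tuning (an untested sketch): treat any run of non-word characters as the separator and print the underscore before each word rather than after it, so there is no trailing _. Note that, because the newline is part of the separator set here, words from all input lines end up joined on a single output line:
gawk -v RS='[^[:alnum:]_]+' '/^[[:alpha:]]/{printf "%s%s", sep, $0; sep="_"} END{print ""}' file.dat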

Large volume of queries on a large CSV file

I am trying to query a large CSV file (100 GB, roughly 1.1 billion records) for partial matches in its url column. I aim to query for about 23000 possible matches.
Example input:
url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1
Queries are of the following pattern: /^.*[someurl].*$/. Each of these [someurl] values comes from a different file and can be assumed to be an array of size 23000.
Matching Queries:
awk -F, '$1 ~ /^.*google\.com\.$/' > file1.out
awk -F, '$1 ~ /^.*nokiantires\.com\.$/' > file2.out
awk -F, '$1 ~ /^.*woodpapersilk\.com\.$/' > file3.out
awk -F, '$1 ~ /^.*xn--.*$/' > file4.out
Queries that match nothing:
awk -F, '$1 ~ /^.*seasonvintage\.com\.$/' > file5.out
awk -F, '$1 ~ /^.*java2s\.com\.$/' > file6.out
file1.out:
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
file2.out:
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
file3.out:
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
file4.out:
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
file5.out and file6.out are both empty, as nothing matches.
I have also uploaded these inputs and outputs as a gist.
Essentially each query extracts a partial match in the url column.
Currently I use the following code with awk to search for possible matches:
awk -F, '$1 ~ /^.*xn--.*$/' file.out > filter.csv
This solution returns a valid result, but it takes 14 minutes to run a single query. Unfortunately I am looking to query for 23000 possible matches.
As such I am looking for a more workable and efficient solution.
I have thought of / tried the following:
Can I include all the queries in one huge regex, or does that just make it even less efficient?
I have tried using MongoDB but it does not work well with just a single machine.
I have an AWS voucher that has about $30 left. Is there any particular AWS solution that could help here?
What would be a more workable solution to process these queries on said csv file?
Many thanks
Given what we know so far, and guessing at the answers to a couple of questions, I'd approach this by separating the queries into "queries that can be matched by a hash lookup" (which is all but 1 of the queries in your posted example) and "queries that need a regexp comparison to match" (just xn--.*$ in your example). Then, while reading your records, any $1 that can be matched by an almost instantaneous hash lookup against all of the hashable queries is handled that way, and only the few queries that need a regexp match are handled sequentially in a loop:
$ cat ../queries
google.com.$
nokiantires.com.$
woodpapersilk.com.$
xn--.*$
seasonvintage.com.$
java2s.com.$
$ cat ../records
url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1
$ cat ../tst.awk
BEGIN { FS="," }
NR==FNR {
    query = $0
    outFile = "file" ++numQueries ".out"
    printf "" > outFile; close(outFile)
    if ( query ~ /^[^.]+[.][^.]+[.][$]$/ ) {
        # simple end of field string, can be hash matched
        queriesHash[query] = outFile
    }
    else {
        # not a simple end of field string, must be regexp matched
        queriesRes[query] = outFile
    }
    next
}
FNR>1 {
    matchQuery = ""
    if ( match($1,/[^.]+[.][^.]+[.]$/) ) {
        fldKey = substr($1,RSTART,RLENGTH) "$"
        if ( fldKey in queriesHash ) {
            matchType = "hash"
            matchQuery = fldKey
            outFile = queriesHash[matchQuery]
        }
    }
    if ( matchQuery == "" ) {
        for ( query in queriesRes ) {
            if ( $1 ~ query ) {
                matchType = "regexp"
                matchQuery = query
                outFile = queriesRes[matchQuery]
                break
            }
        }
    }
    if ( matchQuery != "" ) {
        print "matched:", matchType, matchQuery, $0, ">>", outFile | "cat>&2"
        print >> outFile; close(outFile)
    }
}
$ ls
$
$ tail -n +1 *
tail: cannot open '*' for reading: No such file or directory
$ awk -f ../tst.awk ../queries ../records
matched: hash google.com.$ maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash google.com.$ drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash nokiantires.com.$ nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1 >> file2.out
matched: regexp xn--.*$ xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38 >> file4.out
$ ls
file1.out file2.out file3.out file4.out file5.out file6.out
$
$ tail -n +1 *
==> file1.out <==
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
==> file2.out <==
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
==> file3.out <==
==> file4.out <==
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
==> file5.out <==
==> file6.out <==
$
The initial printf "" > outFile; close(outFile) is just to ensure you get an output file per query even if that query doesn't match, just like you asked for in your example.
If you're using GNU awk, then it can manage the multiple open output files for you, and you can make these changes:
printf "" > outFile; close(outFile) -> printf "" > outFile
print >> outFile; close(outFile) -> print > outFile
which will be more efficient because then the output file isn't being opened+closed on every print.
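Applied to the script, those two lines end up looking like this (just a sketch of the GNU awk variant):
printf "" > outFile    # the file stays open; gawk keeps track of the handle
print > outFile        # ">" on an already-open file keeps appending to it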

SED code for removing newline

I am looking for a sed command which will transform the following input:
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGT
CATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAG
TGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
into
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAGTGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
which means the newline after a header line (one starting with the > character) remains unchanged, while all other newlines are removed so that the sequence lines are joined.
I have tried the following, but it is not working:
sed s/^!>\n$// <in.fasta>out.fasta
I have a 28MB fasta file which I need to transform.
sed is not a particularly good tool for this.
awk '/^>/ { if(prev) printf "\n"; print; next }
     { printf "%s", $0; prev = 1; }
     END { if(prev) printf "\n" }' in.fasta >out.fasta
Using awk:
awk '/^>/{print (l?l ORS:"") $0;l="";next}{l=l $0}END{print l}' file
The buffered sequence in the variable l is printed whenever a line starting with > or the end of the file is reached; otherwise the current line is appended to l.
The following awk may also help you here. It is a solution that does not buffer lines in any array or variable.
awk 'BEGIN{ORS=""} /^>/{if(FNR==1){print $0 RS} else {print RS $0 RS};next}1' Input_file
OR
awk 'BEGIN{ORS=""} /^>/{printf("%s",FNR==1?$0 RS:RS $0 RS);next}1' Input_file

hash using sha1sum using awk

I have a pipe-separated file that has about 20 columns. I want to hash just the first column, which is a number (an account number), using sha1sum, and return the rest of the columns as is.
What's the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is, is up for debate. One way to do it with awk is
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
    OFS = FS    # set output field separator to field separator; we will use
                # it because we meddle with the fields.
}
NR == 1 {       # first line: just print headers.
    print
}
NR != 1 {       # from there on do the hash/replace
    # this constructs a shell command (and runs it) that echoes the field
    # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
    # and gets it back into awk with getline (into the variable hash)
    # the gsub bit is to prevent the shell from barfing if there's an apostrophe
    # in one of the fields.
    gsub(/'/, "'\\''", $1);
    command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
    command | getline hash
    close(command)

    # then replace the field and print the result.
    $1 = hash
    print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow breakdown:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function breakdown:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after sha1sum, so I discard it by splitting the output (see the example after this list)
Return the first field of the sha1sum output.
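For reference, sha1sum prints the digest followed by the name of the input (- for standard input); that trailing name is the extra info the split throws away. With the first account number from the sample data it looks like this:
$ echo 8238438 | sha1sum
104a1f34b26ae47a67273fe06456be1fe97f75ba  -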
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data if you use bash).
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }
function encodeData( fld, cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed. The additional arguments for your function make cmd and output local to it, so they are always null/zero on each invocation. The if on the getline means you won't return the previous success value when it fails (see http://awk.info/?tip/getline). The rest is maybe more style/preference, with a bit of a performance improvement.
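As an aside, for anyone not used to that awk idiom: extra parameters that the caller never passes behave as variables local to the function, which is what the cmd and output arguments above rely on. A throwaway illustration (not part of the script):
awk 'function double(x,   tmp) { tmp = x * 2; return tmp }      # tmp is local to double()
     BEGIN { print double(21); print "global tmp is: \"" tmp "\"" }'
42
global tmp is: ""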

print field number and field

I want to print the field number and field like this... Is awk the best way? If so, how?
The # of fields in the input line may vary.
input_line ="a|b|c|d"
expected result:
1 a
2 b
3 c
4 d
I'm able to print the fields, but need help printing the field numbers. Here's what I have
echo "a|b|c|d" |awk -F"|" '{for (i=1; i<=NF; i++) print $i}'
a
b
c
d
You can use an awk command like:
echo "a|b|c|d" | awk -F"|" '{for(i=1; i<=NF; i++) print i, $i}'
awk with a while loop should do the trick:
awk -F '|' '{ i = 1; while (i <= NF) { print i " " $i; i++; } }' <<< "a|b|c|d"

SED or AWK to make url querystring readable

I need to split a query string into an unbounded number of variables for debugging purposes:
The output comes from tshark, and the purpose is to live-debug Google Analytics events. The output from tshark looks like this:
82.387501 hampus -> domain.net 1261 GET /__utm.gif?utmwv=5.3.7&utms=22&utmn=1234&utmhn=domain.com&utmt=event&utme=5(x*y*z%2Fstart%2Fklipp%2F166_SS%20example)(10)&utmcs=UTF-8~ HTTP/1.1
What I want is a more human-readable version:
utmhn: domain.com
utmt: event
utme: 5(x*y*z/start/klipp/166_SS/example)(10)
utmcs: UTF-8
or even better:
utmhn: domain.com
utmt: event
utme: 5(
x
y
z/start/klipp/166_SS/example
)(10)
utmcs: UTF-8
But I can't get my head around sed (or awk) for this purpose...
file
82.387501 hampus -> domain.net 1261 GET /__utm.gif?utmwv=5.3.7&utms=22&utmn=1234&utmhn=domain.com&utmt=event&utme=5(x*y*z%2Fstart%2Fklipp%2F166_SS%20example)(10)&utmcs=UTF-8~ HTTP/1.1
command
sed 's/.*utmhn=/utmhn: /
s/&utmt=/\nutmt: /
s/&utme=/\nutme: /
s/utmcs=/\nutmcs: /
s:[%]2F:/:g
s:[%]20: :g
s:[\(]:(\n\t :
s:\*:\n\t :g
s:[\)]:\n\t ):
s/[~].*$//' samp1.txt
output
utmhn: domain.com
utmt: event
utme: 5(
x
y
z/start/klipp/166_SS example
)(10)&
utmcs: UTF-8
I'm not sure what to say about the %20 in your input versus the expected / character in your desired output. Did you manually type some of this in?
Another way using Perl:
#!/usr/bin/perl -l
use strict; use warnings;
while (<>) {
    my @arr;
    my ($qs) = m/.*?GET.*?\?(\S+)\s/;
    my @pairs = split(/[&~]/, $qs);
    foreach my $pair (@pairs) {
        my ($name, $value) = split(/=/, $pair);
        if ($name eq 'utme') {
            $value =~ s!(%2F|%20)!/!g;
            $value =~ s!\*!\n\t\t!g;
            $value =~ s!\(!(\n\t\t!;
            $value =~ s/\)\(/\n\t)(/;
        }
        # let's URI unescape stuff
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        if ($name eq 'utmhn') {
            print "$name: $value";
        }
        else {
            push @arr, "$name: $value";
        }
    }
    print join "\n", @arr;
    print "\n";
}
OUTPUT
utmhn: domain.com
utmwv: 5.3.7
utms: 22
utmn: 1234
utmt: event
utme: 5(
x
y
z/start/klipp/166_SS/example
)(10)
utmcs: UTF-8
USAGE
tshark ... | ./script.pl
ADVANTAGES
I take care to display utmhn: domain.com on the first line
I run a URI unescape on the values
It's not limited to "utmhn", "utmt", "utme", and "utmcs" only
Here's one way using GNU awk. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    FS="[ \t=&~]+"
    OFS="\t"
}

{
    for (i=1; i<=NF; i++) {
        if ($i ~ /^utmhn$|^utmt$|^utme$|^utmcs$/) {
            if ($i == "utme") {
                sub(/\(/,"(\n\t ", $(i+1))
                gsub(/\*/,"\n\t ", $(i+1))
                sub(/\)/,"\n\t )", $(i+1))
            }
            print $i":", $(i+1)
        }
    }
}
Results:
utmhn: domain.com
utmt: event
utme: 5(
x
y
z%2Fstart%2Fklipp%2F166_SS%20example
)(10)
utmcs: UTF-8
Alternatively, here's the one-liner:
awk 'BEGIN { FS="[ \t=&~]+"; OFS="\t" } { for (i=1; i<=NF; i++) { if ($i ~ /^utmhn$|^utmt$|^utme$|^utmcs$/) { if ($i == "utme") { sub(/\(/,"(\n\t ", $(i+1)); gsub(/\*/,"\n\t ", $(i+1)); sub(/\)/,"\n\t )", $(i+1)) } print $i":", $(i+1) } } }' file.txt
assuming your data is in a file called "tst":
awk -F "&" '{ for ( i=2;i<=NF;i++ ){sub(/=/,":\t",$i);sub(/[~].*$/,"",$i);gsub(/\%2F/,"/",$i);gsub(/\%20/," ",$i);print $i} }' tst
produces output:
utms: 22
utmn: 1234
utmhn: domain.com
utmt: event
utme: 5(x*y*z/start/klipp/166_SS example)(10)
utmcs: UTF-8
it's a bit dirty, but it works.
$ cat tst.awk
BEGIN { FS="[&=~]"; OFS=":\t" }
{
    for (i=1;i<=NF;i++) {
        map[$i]=$(i+1)
    }
    sub(/\(/,"&\n\t ", map["utme"])
    gsub(/\*/,"\n\t ", map["utme"])
    gsub(/%2./,"/", map["utme"])
    sub(/\)/,"\n\t&", map["utme"])
    print "utmhn", map["utmhn"]
    print "utmt", map["utmt"]
    print "utme", map["utme"]
    print "utmcs", map["utmcs"]
}
$
$ awk -f tst.awk file
utmhn: domain.com
utmt: event
utme: 5(
x
y
z/start/klipp/166_SS/example
)(10)
utmcs: UTF-8
This might work for you (GNU sed):
sed 's/.*\(utmhn.*=\S*\).*/\1/;s/&/\n/g;s/=/:\t/g;s/(/&\n\t/;s/*/\n\t/g;s/%2F/\//g;s/%20/ /g;s/)/\n\t&/' file