make column headers unique sed or awk


I have a large space-separated text file with non-unique column headers. I would like to make the column headers unique using sed or awk, as shown below. Each new set of names begins with the column "names".
input
index type colx...names paul peter sarah... names paul peter sarah.... names paul peter sarah
output
index type colx...0names 0paul 0peter 0sarah... 1names 1paul 1peter 1sarah.... 2names 2paul 2peter 2sarah
Can you please help me with this?

This awk one-liner may help:
awk '{for(i=1;i<=NF;i++)printf "%s"(i==NF?"\n":" "),a[$i]++$i}'
test:
kent$ awk '{for(i=1;i<=NF;i++)printf "%s"(i==NF?"\n":" "),a[$i]++$i}'<<<"names paul peter sarah names paul peter sarah names paul peter sarah"
0names 0paul 0peter 0sarah 1names 1paul 1peter 1sarah 2names 2paul 2peter 2sarah
EDIT for the new requirement:
awk '{for(i=1;i<=NF;i++)c[$i]++; for(i=1;i<=NF;i++)if(c[$i]>1)$i=n[$i]++ $i}7'
test: (I shortened your example, but it is the same problem)
kent$ awk '{for(i=1;i<=NF;i++)c[$i]++; for(i=1;i<=NF;i++)if(c[$i]>1)$i=n[$i]++ $i}7'<<<"a b c x y z x y z"
a b c 0x 0y 0z 1x 1y 1z

I'm guessing your actual file looks something more like this:
names paul peter sarah names paul peter sarah names paul peter sarah
data1 ...
data2 ...
data3 ...
If that is the case this will do the trick:
$ awk 'NR==1{for(i=1;i<=NF;i++)$i=a[$i]++ $i}1' file
0names 0paul 0peter 0sarah 1names 1paul 1peter 1sarah 2names 2paul 2peter 2sarah
data1 ...
data2 ...
data3 ...
EDIT:
To skip the first 3 columns just start at column 4:
$ awk 'NR==1{for(i=4;i<=NF;i++)$i=a[$i]++ $i}1' file
index type colx 0names 0paul 0peter 0sarah 1names 1paul 1peter 1sarah 2names ...
data1 ...
data2 ...
data3 ...
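
For readability, the header-only approach can also be written out as a commented script. This is only a sketch combining the ideas from the answers above, not either answerer's code: uniq_headers and its skip argument (the number of leading columns to leave alone) are hypothetical names, and it assumes that only names occurring more than once should get a prefix.

```shell
# Sketch: make repeated header names unique on the first line only.
# uniq_headers and "skip" are hypothetical names used for illustration.
uniq_headers() {
  awk -v skip="${1:-0}" '
    NR == 1 {
      # First pass: count how often each header name occurs.
      for (i = skip + 1; i <= NF; i++) count[$i]++
      # Second pass: prefix repeated names with a 0-based occurrence index.
      for (i = skip + 1; i <= NF; i++)
        if (count[$i] > 1) $i = seen[$i]++ $i
    }
    { print }   # data lines pass through unchanged
  '
}

printf '%s\n' 'index type colx names paul names paul names paul' 'r1 a b 1 2 3 4 5 6' |
  uniq_headers 3
```

Keeping the occurrence counter (seen) separate from the total count means the prefixes start at 0 no matter how many times a name repeats.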

Related

Understanding ibase and obase used

I'm trying to solve the following exercise:
Write a command line that takes numbers from variables FT_NBR1, in ’\"?! base, and FT_NBR2, in mrdoc base, and displays the sum of both in gtaio luSnemf base.
I know the solution is:
echo $FT_NBR1 + $FT_NBR2 | sed 's/\\/1/g' | sed 's/?/3/g' | sed 's/!/4/g' | sed "s/\'/0/g" | sed "s/\"/2/g" | tr "mrdoc" "01234" | xargs echo "ibase=5; obase=23;" | bc | tr "0123456789ABC" "gtaio luSnemf"
I don't understand why ibase=5 and obase=23.
I read about ibase and obase, and I understand this is a base conversion from base 5 to base 23. Can anyone explain why 5 and 23? Thank you.

The exercise description is a bit weird. A better one would be
Write a command line that takes numbers from variables FT_NBR1, with numbers represented by the letters "’\"?!", and FT_NBR2, represented by "mrdoc", and displays the sum of both with numbers represented by "gtaio luSnemf".
A shorter answer would be
echo $FT_NBR1 + $FT_NBR2 | tr \''\\"?!' "01234" | tr "mrdoc" "01234" | xargs echo "ibase=5; obase=23;" | bc | tr "0123456789ABC" "gtaio luSnemf"
Let's take it from the beginning:
echo $FT_NBR1 + $FT_NBR2 creates the expression using the input strings
tr \''\\"?!' "01234" translates the first input alphabet (', \, ", ? and !) into the digits 0 to 4
tr "mrdoc" "01234" translates the second input alphabet into the digits 0 to 4
xargs echo "ibase=5; obase=23;" prepends the number base settings; the input base is 5 and the output base is 13. Because ibase=5 is set first, the obase value is itself read in base 5, and 13 written in base 5 is 23 (2*5 + 3).
bc does the actual calculation
tr "0123456789ABC" "gtaio luSnemf" translates the result into the output alphabet.
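
The point about obase being read in the current input base can be checked directly with bc. A small sketch (the function names are made up for illustration): both commands add 10 + 10 in base 5, that is 5 + 5 = 10 in decimal, and print the result in base 13, where ten is written A.

```shell
# ibase is set first, so the following "23" is parsed in base 5 and means 13.
sum_ibase_first() { echo "ibase=5; obase=23; 10 + 10" | bc; }

# obase is set first, while input is still decimal, so "13" means 13.
sum_obase_first() { echo "obase=13; ibase=5; 10 + 10" | bc; }

sum_ibase_first   # prints A
sum_obase_first   # prints A
```

Setting obase before ibase is a common way to avoid having to convert the output base by hand.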

Why is \t not being applied to this string?

Almost trivial to ask, but I'm confused and curious. Why is the "\t" special character not applying a tab for:
It looks like the "\t" character only applied a single space rather than a tab. However if I move the "\t" character over, it applies it like so
Any ideas?

The tabs are being applied. A tab does not have a fixed width: it advances to the next tab stop, and the spacing of the stops is determined by whatever displays the text, usually your terminal. In Stack Overflow's web rendering the tab stops are 4 characters apart, so the output goes like this:
    b
a   b
aa  b
aaa b
aaaa    b
aaaaa   b
aaaaaa  b
aaaaaaa b
aaaaaaaa    b
aaaaaaaaa   b
aaaaaaaaaa  b
aaaaaaaaaaa b
aaaaaaaaaaaa    b
aaaaaaaaaaaaa   b
aaaaaaaaaaaaaa  b
aaaaaaaaaaaaaaa b
Not really an answer but explains the problem well.
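
The effect is easy to reproduce with expand, which converts tabs to spaces for a given tab-stop spacing. A minimal sketch (show_tabs is a made-up helper name):

```shell
# Print prefixes of growing length followed by a tab and "b",
# with tabs expanded to spaces using the tab-stop spacing given in $1.
show_tabs() {
  for prefix in "" a aa aaa aaaa aaaaa; do
    printf '%s\tb\n' "$prefix"
  done | expand -t "$1"
}

show_tabs 4
show_tabs 8
```

With a spacing of 4, the prefix "aaa" leaves room for only one space before the next stop, while "aaaa" pushes b out to the following stop. With a spacing of 8, all six lines put b in the same column.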

Add leading 0 in sed substitution

I have input data:
foo 24
foobar 5 bar
bar foo 125
and I'd like to have output:
foo 024
foobar 005 bar
bar foo 125
I can use these sed substitutions:
s,\([a-z ]\+\)\([0-9]\)\([a-z ]*\),\100\2\3,
s,\([a-z ]\+\)\([0-9][0-9]\)\([a-z ]*\),\10\2\3,
But can I write a single substitution that does the same? Something like:
if (one digit) then two leading 0
elif (two digits) then one leading 0
Regards.

I doubt that the "if - else" logic can be incorporated in one substitution command without saving the intermediate data (length of the match for instance). It doesn't mean you can't do it easily, though. For instance:
$ N=5
$ sed -r ":r;s/\b[0-9]{1,$(($N-1))}\b/0&/g;tr" infile
foo 00024
foobar 00005 bar
bar foo 00125
It works as a loop: each pass adds one zero to every number that is shorter than $N digits, and the loop ends when no more substitutions can be made. The r label together with the final tr says: try the substitution, then go back to r if something was substituted. The GNU sed manual covers this under branching and flow control.
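
Written out over several lines with comments, the same loop might look as follows. This is a sketch that assumes GNU sed (for \b and -E); pad_to is a hypothetical helper name, and the width 3 is chosen to match the output asked for in the question.

```shell
# Sketch: left-pad every standalone run of digits to a given width (GNU sed).
# pad_to is a hypothetical helper; $1 is the target width (2 or more).
pad_to() {
  sed -E "
    :again
    # Prepend one zero to each number that is still shorter than the width.
    s/\b[0-9]{1,$(( $1 - 1 ))}\b/0&/g
    # If anything was substituted, go around again.
    t again
  "
}

printf '%s\n' 'foo 24' 'foobar 5 bar' 'bar foo 125' | pad_to 3
```

Numbers that already have three or more digits never match the pattern, so they are left alone.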

Use two substitute commands: the first searches for a single digit and inserts two zeroes just before it, and the second searches for a two-digit number and inserts one zero just before it. GNU sed is needed because I use the word boundary escape (\b) to delimit the numbers.
sed -e 's/\b[0-9]\b/00&/g; s/\b[0-9]\{2\}\b/0&/g' infile
EDIT to add a test:
Content of infile:
foo 24 9
foo 645 bar 5 bar
bar foo 125
Run previous command with following output:
foo 024 009
foo 645 bar 005 bar
bar foo 125

Add the max number of leading zeros first, then take this number of characters from the end:
echo 55 | sed -e 's:^:0000000:' -e 's:0\+\(.\{8\}\)$:\1:'
00000055

You seem to have the sed options covered; here's one way with GNU awk (it relies on RT):
BEGIN { RS="[ \n]"; ORS=OFS="" }
/^[0-9]+$/ { $0 = sprintf("%03d", $0) }
{ print $0, RT }
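
RT is a gawk extension, so the script above will not work in other awks. A field-based variant that avoids it could look like the sketch below (pad3 is a made-up name). One assumption to be aware of: assigning to a field makes awk rebuild the line, so runs of whitespace between fields collapse to single spaces.

```shell
# Sketch: zero-pad purely numeric fields to 3 digits without gawk's RT.
pad3() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+$/) $i = sprintf("%03d", $i)
    print
  }'
}

printf '%s\n' 'foo 24' 'foobar 5 bar' 'bar foo 125' | pad3
```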

I find the following sed approach to pad an integer with zeroes to 5 (n) digits quite straightforward:
sed -e "s/\<\([0-9]\{1,4\}\)\>/0000\1/; s/\<0*\([0-9]\{5\}\)\>/\1/"
If there is at least one and at most 4 (n-1) digits, add 4 (n-1) zeroes in front.
If there is any number of zeroes followed by 5 (n) digits after the first transformation, keep just these last 5 (n) digits.
When there happen to be more than 5 (n) digits, this approach behaves the usual way -- nothing is padded or trimmed.
Input:
0
1
12
123
1234
12345
123456
1234567
Output:
00000
00001
00012
00123
01234
12345
123456
1234567

This might work for you (GNU sed):
echo '1.23 12,345 1 12 123 1234 1' |
sed 's/\(^\|\s\)\([0-9]\(\s\|$\)\)/\100\2/g;s/\(^\|\s\)\([0-9][0-9]\(\s\|$\)\)/\10\2/g'
1.23 12,345 001 012 123 1234 001
or perhaps a little easier on the eye:
sed -r 's/(^|\s)([0-9](\s|$))/\100\2/g;s/(^|\s)([0-9][0-9](\s|$))/\10\2/g'

Removing whitespaces before and after specified character

LIST.txt
ark = 1 bark= 2 car =3 dorm =4
ark=8 bark = 25 car = 33 dorm =5
I have a file named LIST.txt as shown above. I want the output to look like OUTPUT.txt:
OUTPUT.txt
ark=1 bark=2 car=3 dorm=4
ark=8 bark=25 car=33 dorm=5
I was not successful in removing the spaces before and after the "=". The method I tried removed all the whitespace and gave me something like this:
ark=1bark=2car=3dorm=4
ark=8bark=25car=33dorm=5
Can anyone help me out with this?

perl -pe 's/\s*=\s*/=/g' LIST.txt
outputs
ark=1 bark=2 car=3 dorm=4
ark=8 bark=25 car=33 dorm=5
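
The same substitution can be done with sed if Perl is not at hand. A minimal sketch (trim_equals is a made-up wrapper name), using the POSIX character class in place of \s:

```shell
# Sketch: remove whitespace on both sides of every "=".
# Reads the files given as arguments, or standard input if there are none.
trim_equals() { sed 's/[[:space:]]*=[[:space:]]*/=/g' "$@"; }

printf '%s\n' 'ark = 1 bark= 2 car =3 dorm =4' 'ark=8 bark = 25 car = 33 dorm =5' |
  trim_equals
```

Only the whitespace touching an "=" is removed, so the spaces that separate one pair from the next are kept.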

Wordnet synsets using perl

I installed WordNet::Similarity and WordNet::QueryData as an easy way to calculate the information content scores and probabilities that come with these modules. But I'm stuck on this basic problem: given a word, print n words similar to it, which should be no more difficult than iterating through the synsets and doing a join.
Using the wn command and piping it through a whole lot of tr, then sort | uniq, I can get all the words:
wn cat -synsn | grep -v Sense | tr '=' ' ' | tr '>' ' ' | tr '\t' ' ' | tr ',' '\n' | sort | uniq
OUTPUT
8 senses of cat
adult female
adult male
African tea
Arabian tea
big cat
bozo
cat
cat
CAT
Caterpillar
cat-o'-nine-tails
computed axial tomography
computed tomography
computerized axial tomography
computerized tomography
CT
excitant
felid
feline
gossip
gossiper
gossipmonger
guy
hombre
kat
khat
man
newsmonger
qat
quat
rumormonger
rumourmonger
stimulant
stimulant drug
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun cat
tracked vehicle
true cat
whip
woman
X-radiation
X-raying
but it's kind of nasty and needs further cleanup.
My script is below; what I want to get is all the words in cat#n1...8.
SCRIPT
use WordNet::QueryData;
my $wn = WordNet::QueryData->new( noload => 1);
print "Senses: ", join(", ", $wn->querySense("cat#n")), "\n";
print "Synset: ", join(", ", $wn->querySense("cat", "syns")), "\n";
print "Hyponyms: ", join(", ", $wn->querySense("cat#n#1", "hypo")), "\n";
OUTPUT:
Senses: cat#n#1, cat#n#2, cat#n#3, cat#n#4, cat#n#5, cat#n#6, cat#n#7, cat#n#8
Synset: cat#n, cat#v
Hyponyms: domestic_cat#n#1, wildcat#n#3
SCRIPT
use WordNet::QueryData;
my $wn = WordNet::QueryData->new;
foreach $word (qw/cat#n/) {
    @senses = $wn->querySense($word);
    foreach $wps (@senses) {
        @gloss = $wn->querySense($wps, "syns");
        print "$wps : @gloss\n";
    }
}
OUTPUT:
cat#n#1 : cat#n#1 true_cat#n#1
cat#n#2 : guy#n#1 cat#n#2 hombre#n#1 bozo#n#2
cat#n#3 : cat#n#3
cat#n#4 : kat#n#1 khat#n#1 qat#n#1 quat#n#1 cat#n#4 Arabian_tea#n#1 African_tea#n#1
cat#n#5 : cat-o'-nine-tails#n#1 cat#n#5
cat#n#6 : Caterpillar#n#2 cat#n#6
cat#n#7 : big_cat#n#1 cat#n#7
cat#n#8 : computerized_tomography#n#1 computed_tomography#n#1 CT#n#2 computerized_axial_tomography#n#1 computed_axial_tomography#n#1 CAT#n#8
P.S.
I have never written perl before, but have been looking into perl scripts since morning - and can now understand the basic stuff. Just need to know if there is cleaner way to do this using the api docs - couldn't figure out from the api or usergroup archives.
Update:
I think I'll settle with:
wn cat -synsn | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d'
sed rocks!

I think you'll find the following helpful...
http://marimba.d.umn.edu/WordNet-Pairs/
What are the N most similar words to X, according to WordNet?
This data seeks to answer that question, where similarity is based on
measures from WordNet::Similarity. http://wn-similarity.sourceforge.net
-------------- verb data
These files were created with WordNet::Similarity version 2.05 using
WordNet 3.0. They show all the pairwise verb-verb similarities found
in WordNet according to the path, wup, lch, lin, res, and jcn measures.
The path, wup, and lch are path-based, while res, lin, and jcn are based
on information content.
As of March 15, 2011 pairwise measures for all verbs using the six
measures above are available, each in their own .tar file. Each *.tar
file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx
2.0 - 2.4 GB compressed. In each of these .tar files you will find
25,047 files, one for each verb sense. Each file consists of 25,048 lines,
where each line (except the first) contains a WordNet verb sense and the
similarity to the sense featured in that particular file. Doing
the math here, you find that each .tar file contains about 625,000,000
pairwise similarity values. Note that these are symmetric (sim (A,B)
= sim (B,A)) so you have a bit more than 300 million unique values.
-------------- noun data
As of August 19, 2011 pairwise measures for all nouns using the path
measure are available. This file is named WordNet-noun-noun-path-pairs.tar.
It is approximately 120 GB compressed. In this file you will find
146,312 files, one for each noun sense. Each file consists of
146,313 lines, where each line (except the first) contains a WordNet
noun sense and the similarity to the sense featured in that particular
file. Doing the math here, you find that each .tar file contains
about 21,000,000,000 pairwise similarity values. Note that these
are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion
unique values.
We are currently running wup, res, and lesk, but do not have an
estimated date of availability yet.

Put this in a script, say synonym.sh
wn "$1" -synsn | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d' | sed 's/ //g' | grep -iv "$1" | tr '\n' ','
wn "$1" -synsv | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d' | sed 's/ //g' | grep -iv "$1" | tr '\n' ',';echo
From your perl script
system("/path/synonym.sh","kittens");
system("/path/synonym.sh","cats");
