I hope you're having a great day,
I want to remove two patterns from some text: the parts that contain the word "images".
In the file test1 I have this:
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
I need to remove APP:Server2:images and APP:Server8:images-v2; I want this output:
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs
I'm trying this:
cat test1 | sed 's/ .*images.* / /g'
You need to make sure that your wildcards do not allow spaces. Matching the leading space and the token, with no trailing space required, also handles a match at the end of the line:
sed 's/ [^ ]*image[^ ]*//g' test1
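With the sample test1 from the question this yields the desired output:
$ sed 's/ [^ ]*image[^ ]*//g' test1
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs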
This should work for you:
sed 's/\w\{1,\}:Server[28]:\w\{1,\} //g'
\w matches word characters (letters, digits, _)
\{1,\} matches one or more of the preceding item (\w); in basic regular expressions the braces must be escaped
[28] matches either 2 or 8 (inside brackets, | is a literal character, so [2|8] would also match a literal |)
Note that \w does not match -, so the demo below appends .*$ to the pattern to catch suffixes like -v2.
cat test.file
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
The command below empties the matching lines, leaving blanks in their place:
tr ' ' '\n' < test.file | sed 's/\w\{1,\}:Server[28]:\w\{1,\}.*$//'
APP:Server1:files

APP:Server3:misc
APP:Server4:xml
APP:Server5:json
APP:Server6:stats
APP:Server7:graphs

To remove the blank lines, just add a second command to the sed expression, and paste the contents back together:
tr ' ' '\n' < test.file | sed 's/\w\{1,\}:Server[28]:\w\{1,\}.*$//;/^$/d' | paste -sd ' ' -
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs
GNU awk alternative:
awk 'BEGIN { RS="APP:" } $0=="" { next } { split($0,map,":"); if (map[2] ~ /images/) { next } OFS=RS; printf " %s%s",OFS,$0 }' test.file
Set the record separator to "APP:" and process the text in between as separate records. If the record is blank, skip to the next record. Otherwise split the record into the array map on ":" as the delimiter and check whether "images" appears in the second field; if it does, skip to the next record, otherwise print the record along with the record separator.
I need to generate a file.sql file from a file.csv, so I use this command:
cat file.csv |sed "s/\(.*\),\(.*\)/insert into table(value1, value2)
values\('\1','\2'\);/g" > file.sql
It works perfectly, but when the backreferences go past 9 (for example \10, \11, etc.), sed takes only the first digit into account, treating \10 as \1 followed by a literal 0, and ignores the rest.
I want to know if I missed something or if there is another way to do it.
Thank you!
EDIT:
The failing example:
My file.csv looks like:
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
What I get:
insert into table
val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12,val13,val14,val15,val16
values
('2013-04-01 07:39:43',
2,37,74,36526530,3877,0,0,6080,
2013-04-01 07:39:430,2013-04-01 07:39:431,
2013-04-01 07:39:432,2013-04-01 07:39:433,
2013-04-01 07:39:434,2013-04-01 07:39:435,
2013-04-01 07:39:436);
After the ninth element I get the first field again (with a trailing digit) instead of the 10th, 11th, etc.
As far as I know, sed supports only 9 backreferences, \1 through \9. That limit might have been lifted in newer versions (though I'm not sure). You are better off using perl or awk for this.
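You can see the behavior directly; GNU sed parses \10 as backreference \1 followed by a literal 0:
$ echo 'a,b' | sed 's/\(a\),\(b\)/\10/'
a0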
Here is how you'd do it in awk:
$ cat csv
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
$ awk 'BEGIN{FS=OFS=","}{print "insert into table values (\x27"$1"\x27",$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"}' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
This is how you can do it in perl:
$ perl -ple 's/([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)/insert into table values (\x27$1\x27,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16);/' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
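If the number of columns varies, a split-based variant avoids counting capture groups altogether. A minimal sketch, assuming (as in the sample) that no field contains an embedded comma and only the first field needs quoting:
$ perl -ple 'my @f = split /,/, $_, -1; $_ = "insert into table values (\x27$f[0]\x27," . join(",", @f[1..$#f]) . ");"' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);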
Try an awk script (based on @JS웃's solution):
script.awk
#!/usr/bin/env awk
# before looping the file
BEGIN {
    FS = ","    # input separator
    OFS = FS    # output separator
    q = "\047"  # single quote as a variable
}
# on each line (no pattern)
{
    # printf, unlike print, adds no newline, so the statement stays on one line
    printf "insert into table values ("
    printf "%s%s%s, ", q, $1, q
    print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"
}
Run with
awk -f script.awk file.csv
One-liner
awk 'BEGIN{OFS=FS=","; q="\047" } { print "insert into table values (" q $1 q ", " $2","$3","$4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16 ");" }' file.csv
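Either way, the sample line yields:
insert into table values ('2013-04-01 04:00:52', 2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);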
I wanted to grep a string at its first occurrence ONLY in a file (file.dat) and replace it by reading from another file (output). As an example, the file "output" contains "AAA T 0001".
#!/bin/bash
procdir=`pwd`
cat output | while read lin1 lin2 lin3
do
srt2=$(echo $lin1 $lin2 $lin3 | awk '{print $1,$2,$3}')
grep -m 1 $lin1 $procdir/file.dat | xargs -r0 perl -pi -e 's/$lin1/$srt2/g'
done
Basically what I want is: whenever the string "AAA" is grep'ed from file.dat at its first instance, replace the second and third columns next to "AAA" with "T 0001", but keep the first column "AAA" as it is. The above script does not work: the variables $lin1 and $srt2 are not expanded inside the single-quoted 's/$lin1/$srt2/g'.
Example:
In my file.dat I have the row:
AAA D ---- CITY COUNTRY
What I want is:
AAA T 0001 CITY COUNTRY
Any comments are much appreciated.
If you have an output file like this:
$ cat output
AAA T 0001
and your file.dat contains information like:
$ cat file.dat
AAA D ---- CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY
You can try something like this with awk:
$ awk '
# first file (output): remember each replacement line, keyed by its first field
NR==FNR {
    a[$1]=$0
    next
}
# second file (file.dat): if field 1 has a stored replacement, print it,
# then the remaining fields from $4 on; deleting the entry ensures only
# the first match is rewritten
$1 in a {
    printf "%s ", a[$1]
    delete a[$1]
    for (i=4;i<=NF;i++) {
        printf "%s ", $i
    }
    print ""
    next
}
# print all other lines unchanged
1' output file.dat
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY
Say you place the string to search for in $s and the replacement string in $r; wouldn't the following do?
perl -i -pe'
BEGIN { ($s,$r)=splice(@ARGV,0,2) }
$done ||= s/\Q$s/$r/;
' "$s" "$r" file.dat
(Replaces the first instance if present)
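For the sample data the two variables could be set like this (values taken from file.dat and the output file; hypothetical, adjust to your data):
s='AAA D ----'   # text to search for, as it appears in file.dat
r='AAA T 0001'   # replacement, taken from the output file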
This will only change the first match in the file:
#!/bin/bash
procdir=`pwd`
while read line; do
set -- $line
sed '0,/^'"$1"' /s/^\('"$1"' \)[^ ]* [^ ]*/\1'"$2 $3"'/' $procdir/file.dat
done < output
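With the sample output and file.dat shown above, this prints:
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY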
To change all matching lines:
sed '/^'"$1"' /s/^\('"$1"' \)[^ ]* [^ ]*/\1'"$2 $3"'/' $procdir/file.dat
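which, for the same input, prints:
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA T 0001 CITY COUNTRY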
I posted a question a week ago and the answer was simply (use join):
join <(sort file1) <(sort file2) >output
to join files that have something common which is usually the first field.
I have the following two files:
genes.txt
ENSG001 ENSG002
ENSG002 ENSG001
ENSG003 ENSG004
features.txt
ENSG001 400
ENSG002 350
ENSG003 210
ENSG004 100
I need to join these two files to be like this:
output.txt
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100
I know the answer lies in the join command, but I can't figure out how to join based on two fields. I tried
join -j 1 <(sort genes.txt) <(sort features.txt) >attempt1.txt
but the result looks like this:
attempt1.txt
ENSG001 ENSG002 400
ENSG002 ENSG001 350
ENSG003 ENSG004 210
I then tried
join -j 2 <(sort -k 2 genes.txt) <(sort -k 2 features.txt) >attempt2.txt
attempt2.txt is empty
Does join have the ability to join two files based on two fields? If not, how can I do it?
my %features;
open my $fd, '<', 'features.txt' or die $!;
while (<$fd>) {
    my ($k, $v) = split;
    $features{$k} = $v;
}
close $fd or die $!;
open $fd, '<', 'genes.txt' or die $!;
while (<$fd>) {
    s/(\w+)/$1 $features{$1}/g;
    print;
}
close $fd or die $!;
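Given the sample genes.txt and features.txt, this prints:
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100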
Thank you all; I managed to solve it by working around the problem.
First I joined the files normally, then I swapped the first and second fields, joined the modified output with features.txt a second time, and finally swapped the fields back.
join <(sort genes.txt) <(sort features.txt) >tmp
cat tmp | awk '{ print $2, $1, $3 }' >tmp2
join <(sort tmp2) <(sort features.txt) >tmp3
cat tmp3 | awk '{ print $2, $3, $1, $4 }' >output.txt
To the best of my knowledge, join does NOT support this. See the join manpage.
However, you can accomplish this in 2 ways:
Turn the first space/tab in each file into a character you will never see in the file (here #), then use join as before, which will treat the first 2 fields as 1 field:
perl -pi -e 's/^(\S+)\s+/$1#/' file1
perl -pi -e 's/^(\S+)\s+/$1#/' file2
join <(sort file1) <(sort file2) >output
tr "#" " " output > output.final
Do it in Perl. You can do:
the blunt approach (perreal's answer: slurp in both files at once); this takes a lot of memory if both files are large
the more memory-conserving approach (cdtits's answer: slurp the smaller file into a hash, then apply the lookups to a line-by-line read of the second file)
for really ginormous files, a linear approach: sort both files, then read 1 line of each file; if the IDs match, print the match; if not, skip 1 line in the file whose ID was smaller (see the sketch below)
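A minimal sketch of that linear merge in Perl, assuming both inputs are already sorted on their first field and the IDs are unique (sorted1 and sorted2 are placeholder file names):
perl -e '
open my $fa, "<", "sorted1" or die $!;
open my $fb, "<", "sorted2" or die $!;
my $x = <$fa>; my $y = <$fb>;
while (defined $x and defined $y) {
    my ($ka) = split " ", $x;
    my ($kb) = split " ", $y;
    if    ($ka lt $kb) { $x = <$fa> }  # file 1 is behind: advance it
    elsif ($ka gt $kb) { $y = <$fb> }  # file 2 is behind: advance it
    else {                             # keys match: print the joined line
        chomp(my $joined = $x);
        print $joined, " ", (split " ", $y, 2)[1];
        $x = <$fa>; $y = <$fb>;
    }
}'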
Here is an awk solution that works well on the given example:
awk 'BEGIN {while(getline <"features.txt") f[$1]=$2} {print $1,f[$1],$2,f[$2]}' < genes.txt
I can explain in detail if you need me to.
Using perl:
use strict;
use warnings;
open GIN, "<genes.txt" or die("genes");
open FIN, "<features.txt" or die("features");
my %relations;
my %values;
while (<GIN>) {
my ($r1, $r2) = split;
$relations{$r1} = $r2;
}
while (<FIN>) {
my ($k, $v) = split;
$values{$k} = $v;
}
for my $r1 (sort keys %relations) {
my $r2 = $relations{$r1};
print "$r1 $values{$r1} $r2 $values{$r2}\n";
}
close FIN; close GIN;
Your approach is generally right. It should be achievable by something like
join -o '1.1 2.2 1.2 1.3' <(
join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort
) <(sort features.txt)
With ENSG004 in features.txt (as shown above), I get exactly what you are looking for:
$ join -o '1.1 2.2 1.2 1.3' <(
join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort
) <(sort features.txt)
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100
There is a less verbose version, but it is harder to keep track of the fields:
join -o '1.2 2.2 1.1 1.3' -1 2 <(
join -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort -k 2
) <(sort features.txt)
If you are going to process really big data, this should work efficiently up to tens of GB (and should also beat most RDBMSs if features.txt and genes.txt are comparable in size):
TMP=`mktemp`
sort features.txt > "$TMP"
sort -k 2 genes.txt | join -o '1.1 1.2 2.2' -1 2 - "$TMP" | sort |
join -o '1.1 2.2 1.2 1.3' - "$TMP"
rm "$TMP"