Is there an awk or Perl one-liner that can remove all rows that do not have complete data? I find myself needing something like this frequently. For example, I currently have a tab-delimited file that looks like this:
1 asdf 2
9 asdx 4
3 ddaf 6
5 4
2 awer 4
How can a line that does not have a value in field 2 be removed?
How can a line that is missing a value in ANY field be removed?
I was trying to do something like this:
awk -F"\t" '{if ($2 != ''){print $0}}' 1.txt > 2.txt
Thanks.
In awk, if you know that every row should have exactly 3 elements:
awk -F'\t+' 'NF == 3' INFILE > OUTFILE
I would just look for consecutive tabs, or a leading or trailing tab:
perl -ne 'next if /\t\t/ or /^\t/ or /\t$/; print' tabfile
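For example, on a small file built to match the question's sample (the file name and the exact tab layout here are assumptions; the middle row has an empty second field), the filter keeps only the complete rows:

```shell
# Build a sample like the question's (tab-delimited; row 2 has an empty field 2)
printf '1\tasdf\t2\n5\t\t4\n2\tawer\t4\n' > tabfile

# Keep only rows with no empty fields; prints the first and third rows
perl -ne 'next if /\t\t/ or /^\t/ or /\t$/; print' tabfile
```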
perl -lane 'print if $#F == 2' INFILE
For the specific solution:
awk -F'\t' '$2 != ""' input.txt > output.txt
For a solution that is generic:
awk -F'\t' -vCOLS=3 '{
valid=1;
for (i=1; i<=COLS; ++i) {
if ($i == "") {
valid=0;
break;
}
}
if (valid) {
print;
}
}' input.txt > output.txt
perl -F/\t/ -nle 'print if @F == 3' 1.txt > 2.txt
Just use awk 'NF == 3' INFILE > OUTFILE and awk will use whitespace (tabs, spaces, etc.) as the field separator.
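A quick sketch of why this works on the sample data (the input here is made up to match the question): under the default field separator, awk collapses runs of whitespace, so a row whose empty field sits between two tabs splits into only two fields and is dropped by the NF test:

```shell
# "5<tab><tab>4" splits into only 2 fields under the default FS,
# so NF == 3 keeps only the complete row
printf '1\tasdf\t2\n5\t\t4\n' | awk 'NF == 3'   # prints only: 1  asdf  2
```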
I have to add a field showing the difference in percentage between 2 fields in a file like:
BI,1266,908
BIL,494,414
BKC,597,380
BOOM,2638,654
BRER,1453,1525
BRIG,1080,763
DCLE,0,775
The output should be:
BI,1266,908,-28.3%
BIL,494,414,-16.2%
BKC,597,380,-36.35%
BOOM,2638,654,-75.2%
BRER,1453,1525,5%
BRIG,1080,763,-29.4%
DCLE,0,775,-
Note the zero in the last row. Either of these fields could be zero. If a zero is present in either field, N/A or - is acceptable.
What I'm trying --
Perl:
perl -F, -ane 'if ($F[2] > 0 || $F[3] > 0){print $F[0],",",$F[1],",",$F[2],100*($F[2]/$F[3])}' file
I get Illegal division by zero at -e line 1, <> line 2. If I change the || to && it prints nothing.
In awk:
awk '$2>0{$4=sprintf("%d(%.2f%)", $3, ($3/$2)*100)}1' file
Just prints the file.
$ awk -F, '$2 == 0 || $3 == 0 { printf("%s,-\n", $0); next }
{ printf("%s,%.2f%%\n", $0, 100 * ($3 / $2) - 100) }' input.csv
BI,1266,908,-28.28%
BIL,494,414,-16.19%
BKC,597,380,-36.35%
BOOM,2638,654,-75.21%
BRER,1453,1525,4.96%
BRIG,1080,763,-29.35%
DCLE,0,775,-
How it works: if the second or third column is equal to 0, append a - field to the line. Otherwise, calculate the percentage difference and append that.
Your perl's main issue was confusing awk's 1-based column indexes with perl's 0-based column indexes.
perl -F, -ane 'print "$1," if /(.+)/; if ($F[1] > 0 && $F[2] > 0) { printf("%.2f%%", (100*$F[2]/$F[1])-100) } else { print "-" }; print "\n"' file
The $1 here refers to the capture group (.+), which matches the whole line except the line feed. The rest is probably self-explanatory if you understand the awk.
You're not telling awk that the fields are separated by commas so it's assuming the default, spaces, and so $2 is never greater than zero because it's null as there's only 1 space-separated field per line. Change it to:
$ awk 'BEGIN{FS=OFS=","} $2>0{$4=sprintf("%d(%.2f%%)", $3, ($3/$2)*100)}1' file
BI,1266,908,908(71.72%)
BIL,494,414,414(83.81%)
BKC,597,380,380(63.65%)
BOOM,2638,654,654(24.79%)
BRER,1453,1525,1525(104.96%)
BRIG,1080,763,763(70.65%)
DCLE,0,775
and then tweak it for your desired output:
$ awk 'BEGIN{FS=OFS=","} {$4=($2 && $3 ? sprintf("%.2f%%", (($3/$2)-1)*100) : "N/A")} 1' file
BI,1266,908,-28.28%
BIL,494,414,-16.19%
BKC,597,380,-36.35%
BOOM,2638,654,-75.21%
BRER,1453,1525,4.96%
BRIG,1080,763,-29.35%
DCLE,0,775,N/A
I have a file that has around 500 rows and 480K columns, and I need to move columns 2, 3 and 4 to the end. My file is comma-separated; is there a quick way to do this using awk or sed?
You can try the solution below:
perl -F, -lane '$" = ","; print "@F[0],@F[4..$#F],@F[1..3]"' input.file
You can copy the columns easily; moving them would take too long with 480K columns.
$ awk 'BEGIN{FS=OFS=","} {print $0,$2,$3,$4}' input.file > output.file
Another technique, just bash:
while IFS=, read -r a b c d e; do
echo "$a,$e,$b,$c,$d"
done < file
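One note on this loop: the last variable passed to read absorbs the remainder of the line, so even with more than five comma-separated columns, $e carries columns 5..N and the output still moves columns 2-4 to the end. A quick sketch with made-up data:

```shell
# e holds "5,6,7", so this prints 1,5,6,7,2,3,4
printf '1,2,3,4,5,6,7\n' | while IFS=, read -r a b c d e; do
  echo "$a,$e,$b,$c,$d"
done
```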
Testing with 5 fields:
$ cat foo
1,2,3,4,5
a,b,c,d,e
$ cat program.awk
{
$6=$2 OFS $3 OFS $4 OFS $1 # copy fields 2-4 (and $1) to the end
sub(/^([^,]*,){4}/,"") # remove the first 4 columns
$1=$5 OFS $1 # prepend the current $5 (the original $1) to $1
NF=4 # reduce NF
} 1 # print
Run it:
$ awk -f program.awk FS=, OFS=, foo
1,5,2,3,4
a,e,b,c,d
So theoretically this should work:
{
$480001=$2 OFS $3 OFS $4 OFS $1
sub(/^([^,]*,){4}/,"")
$1=$480000 OFS $1
NF=479999
} 1
EDIT: It did work.
Perhaps perl:
perl -F, -lane 'print join(",", @F[0,4..$#F,1,2,3])' file
or
perl -F, -lane '@x = splice @F, 1, 3; print join(",", @F, @x)' file
Another approach: regular expressions
perl -lpe 's/^([^,]+)(,[^,]+,[^,]+,[^,]+)(.*)/$1$3$2/' file
Timing it with a 500-line file, each line containing 480,000 fields:
$ time perl -F, -lane 'print join(",", @F[0,4..$#F,1,2,3])' file.csv > file2.csv
40.13user 1.11system 0:43.92elapsed 93%CPU (0avgtext+0avgdata 67960maxresident)k
0inputs+3172752outputs (0major+16088minor)pagefaults 0swaps
$ time perl -F, -lane '@x = splice @F, 1, 3; print join(",", @F, @x)' file.csv > file2.csv
34.82user 1.18system 0:38.47elapsed 93%CPU (0avgtext+0avgdata 52900maxresident)k
0inputs+3172752outputs (0major+12301minor)pagefaults 0swaps
And pure text manipulation is the winner
$ time perl -lpe 's/^([^,]+)(,[^,]+,[^,]+,[^,]+)(.*)/$1$3$2/' file.csv > file2.csv
4.63user 1.36system 0:20.81elapsed 28%CPU (0avgtext+0avgdata 20612maxresident)k
0inputs+3172752outputs (0major+149866minor)pagefaults 0swaps
I would like to remove lines that have less than 2 columns from a file:
awk '{ if (NF < 2) print}' test
one two
Is there a way to store these lines in a variable and then remove them with xargs and sed? Something like:
awk '{ if (NF < 2) VARIABLE}' test | xargs sed -i /VARIABLE/d
GNU sed
I would like to remove lines that have less than 2 columns
less than 2 = remove lines with only one column
sed -r '/^\s*\S+\s+\S+/!d' file
If you would like to split the input into two files (named "pass" and "fail"), based on condition:
awk '{if (NF > 1 ) print > "pass"; else print > "fail"}' input
If you simply want to filter/remove lines with NF < 2:
awk '(NF > 1){print}' input
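For instance, on a small made-up sample, the split version writes the lines with 2+ columns to "pass" and the rest to "fail":

```shell
# Split a made-up sample: lines with 2+ columns go to "pass", others to "fail"
printf 'one two\nsolo\nthree four five\n' > input
awk '{if (NF > 1) print > "pass"; else print > "fail"}' input
```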
I'm trying to add a column (with the content '0') to the middle of a pre-existing tab-delimited text file. I imagine sed or awk will do what I want. I've seen various solutions online that do approximately this but they're not explained simply enough for me to modify!
I currently have this content:
Affx-11749850 1 555296 CC
I need this content
Affx-11749850 1 0 555296 CC
Using the command awk '{$3=0}1' filename messes up my formatting AND replaces column 3 with a 0, rather than adding a third column with a 0.
Any help (with explanation!) so I can solve this problem, and future similar problems, much appreciated.
Using the implicit { print } rule and appending the 0 to the second column:
awk '$2 = $2 FS "0"' file
Or with sed, assuming single space delimiters:
sed 's/ / 0 /2' file
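The trailing 2 on the s command replaces only the second occurrence of a space, so the 0 lands between the second and third columns. Testing on the question's sample line:

```shell
# Replace the 2nd space with " 0 ", inserting a new third column
echo 'Affx-11749850 1 555296 CC' | sed 's/ / 0 /2'
# prints: Affx-11749850 1 0 555296 CC
```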
Or perl:
perl -lane '$, = " "; $F[1] .= " 0"; print @F'
awk '{$2=$2" "0; print }' your_file
tested below:
> echo "Affx-11749850 1 555296 CC"|awk '{$2=$2" "0;print}'
Affx-11749850 1 0 555296 CC
I have a pipe-delimited file (sample below) and I need to delete records which have a null value in fields 2 (email), 4 (mailing-id) or 6 (comm_id). In this sample, rows 2, 3 and 4 should be deleted. The output should be saved to another file. If awk is the best option, please show me a way to achieve this.
id|email|date|mailing-id|seg_id|comm_id|oyb_id|method
|-fabianz-@yahoo.com|2010-06-23 11:47:00|0|1234|INCLO|1000002|unknown
||2010-06-23 11:47:00|0|3984|INCLO|1000002|unknown
|-maddog-@web.md|2010-06-23 11:47:00|0||INCLO|1000002|unknown
|-mse-@hanmail.net|2010-06-23 11:47:00|0||INCLO|1000002|unknown
|-maine-mei@web.md.net|2010-06-23 11:47:00|0|454|INCLO|1000002|unknown
Here is an awk solution that may help. However, to remove rows 2, 3 and 4, it is necessary to check for null values in fields 2 and 5 only (i.e. not fields 2, 4 and 6 as you have stated). Am I understanding things correctly? Here is the awk to do what you want:
awk -F "|" '{ if ($2 == "" || $5 == "") next; print $0 }' file.txt > results.txt
cat results.txt:
id|email|date|mailing-id|seg_id|comm_id|oyb_id|method
|-fabianz-@yahoo.com|2010-06-23 11:47:00|0|1234|INCLO|1000002|unknown
|-maine-mei@web.md.net|2010-06-23 11:47:00|0|454|INCLO|1000002|unknown
HTH
Steve is right, it is fields 2 and 5 that are missing in the sample given: the email is missing in line two and the seg_id is missing in lines three and four.
This is a slightly simplified version of steve's solution:
awk -F "|" ' $2!="" && $5!=""' file.txt > results.txt
If column 2,4 and 6 are the important one, the solution would be:
awk -F "|" ' $2!="" && $4!="" && $6!=""' file.txt > results.txt
This might work for you:
sed 'h;s/[^|]*/\n&/2;s/[^|]*/\n&/4;s/[^|]*/\n&/6;/\n|/d;x' file.txt > results.txt
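Roughly, how this works: h saves the original line, the three s commands mark fields 2, 4 and 6 by inserting a newline in front of each, /\n|/d deletes any line where a marked field is empty (the newline is followed directly by |), and x restores the saved original for printing. A reduced sketch with made-up single-character fields (GNU sed; the \n in the replacement is a GNU extension):

```shell
# Second line has an empty 4th field, so it is deleted; first line survives
printf 'a|b|c|d|e|f\na|b|c||e|f\n' |
  sed 'h;s/[^|]*/\n&/2;s/[^|]*/\n&/4;s/[^|]*/\n&/6;/\n|/d;x'
# prints: a|b|c|d|e|f
```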