Character count (length) within specific column - perl

Is there a one-line method to obtain the character length of strings held within a specific column of a tab-delimited .txt file and then append these counts as a final column (the number of columns may be variable)?
Sample Data:
1 AA
2 BBB
3 CCCCC
4 EE
5 DDD
6 AAA
7 FFFFF
8 AA
9 BBB
10 NNN
To get the counts, I have attempted to use:
perl -lane 'print length $F[2]' in > out
perl -F, -Mopen=:locale -lane 'print length $F[2]' in > out
However, the results are empty.
I have also tried:
perl -lane '$_.=$F[2]; print length $_'
But this, as I now realise, prints the number of characters for the entire line rather than a specific column.
I am not sure how I would then append the final column.
Desired Output (when counting column 2):
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3

It seems that you were close. Perl array indices start at zero, so how about using the length of $F[1]? You will also need some sort of separator:
perl -lape '$_ .= "\t". length($F[1])' input
output
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3
If you want the output exactly as you show, then you will need to use printf, like this:
perl -lane 'printf qq{%-4d%-8s%d\n}, @F, length($F[1])' input
output
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3
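If the column to count needs to vary (the question mentions a variable number of columns), the zero-based column index could be passed as an argument instead of being hard-coded. A possible sketch, assuming tab-delimited input; the index 1 and the file names are placeholders:
perl -F'\t' -lane 'BEGIN { $col = shift } print join "\t", @F, length $F[$col]' 1 in > out
Here the BEGIN block removes the leading argument from @ARGV before the file is read, so only the remaining name is treated as input.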


How to update kdb table transversely

I have these two tables:
tab:([]col1:`abc`def`ghe`abc;val_00:`a`b`c`e;val_01:`d`e`f`t;val_02:`g`h`e`g;val_03:`r`t`y`o)
tab2:([]col1:`abc`abc`abc`abc`def`def`def`ghe`ghe`ghe;col2:0 1 2 3 4 5 6 7 8 9;col3:`Ashley`Peter`John`Molly`Apple`Orange`Banana`Robin`Tony`Bob)
and this is the result I am looking for:
tabResult:([]col1:`abc`def`ghe`abc;val_00:`Ashley`b`c`Ashley;val_01:`Peter`e`f`Peter;val_02:`John`h`e`John;val_03:`Molly`t`y`Molly)
col1 val_00 val_01 val_02 val_03
abc Ashley Peter John Molly
def b e h t
ghe c f e y
abc Ashley Peter John Molly
I would like to update tab depending on tab2. If col1=`abc and col2=1 in tab2, I would like to update val_01 to `Peter in tab; if col1=`abc and col2=2 in tab2, I would like to update the val_02 field to `John in tab, and so on.
This is what I have so far:
{![tab;enlist(=;`col1;enlist x);0b;(enlist y)!enlist z]} . (`abc;`val_01;)
The function above works if the field is numerical and I use a number as the last arg. However, I am not sure how to update symbols and how to generalise this function for all tables.
If I'm understanding your request correctly, you're trying to update a field that has a long type with values that are of symbol type. This is going to fail with a 'type error as column values are expected to be uniform in type. What you can alternatively do is create new columns for the symbol entries, and after that select the columns you want.
Is something like this what you had in mind? I've assumed that the column name is determined by its col2 value in tab. Also, it looks like you have two val_01 columns in your tab input; I assumed one of these was supposed to be val_02.
q)(uj/){![tab;enlist(=;`col1;enlist x);0b;(enlist`$"val_0",string[y],"_sym")!enlist enlist z]}.'flip tab2`col1`col2`col3
col1 val_00 val_01 val_02 val_03 val_01_sym val_02_sym val_03_sym val_04_sym val_05_sym val_06_sym val_07_sym val_08_sym val_09_sym
-----------------------------------------------------------------------------------------------------------------------------------
abc 1 2 2 3 Peter
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3 John
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3 Molly
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Apple
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Orange
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Banana
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Robin
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Tony
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Bob
EDIT:
Based on your comments, I've amended my solution:
q)cols[tab]#{![x;enlist(=;`col1;enlist y`col1);0b;(enlist`$"val_0",string y`col2)!enlist enlist y`col3]}/[tab;tab2]
col1 val_00 val_01 val_02 val_03
--------------------------------
abc Ashley Peter John Molly
def b e h t
ghe c f e y
abc Ashley Peter John Molly

prepend text to every n:th line in a textfile

This sed command-line script prepends text to every line in a file:
sed -i 's/^/to be prepended/g' text.txt
How can I make it so that it only does that on every nth line?
I am working with sequencing data; in the "normal" multiple FASTA format there is first an identifier line starting with a > followed by additional text.
The next line is the DNA sequence, e.g. "AATTGCC", and when that sequence ends there is a new line with a new identifier. How can I prepend text (additional bases) to the beginning of the sequence lines?
Just use the following GNU sed syntax:
sed '0~Ns/^/to be prepended/'
# ^^^
# set N to the number you want!
For example, to prepend HA to line numbers that are multiples of 4:
$ seq 10 | sed '0~4s/^/HA/'
1
2
3
HA4
5
6
7
HA8
9
10
Or to those of the form 4N+1:
$ seq 10 | sed '1~4s/^/HA/'
HA1
2
3
4
HA5
6
7
8
HA9
10
From the sed manual → 3.2. Selecting lines with sed:
first~step
This GNU extension matches every stepth line starting with line first. In particular, lines will be selected when there exists a non-negative n such that the current line-number equals first + (n * step). Thus, to select the odd-numbered lines, one would use ‘1~2’; to pick every third line starting with the second, ‘2~3’ would be used; to pick every fifth line starting with the tenth, use ‘10~5’; and ‘50~0’ is just an obscure way of saying 50.
By the way, there is no need to use /g for global replacement, since ^ can only match once on each line.
$ seq 10 | perl -pe's/^/to be prepended / unless $. % 3'
1
2
to be prepended 3
4
5
to be prepended 6
7
8
to be prepended 9
10
$ seq 10 | perl -pe's/^/to be prepended / unless $. % 3 - 1'
to be prepended 1
2
3
to be prepended 4
5
6
to be prepended 7
8
9
to be prepended 10
$ seq 10 | perl -pe's/^/to be prepended / unless $. % 3 - 2'
1
to be prepended 2
3
4
to be prepended 5
6
7
to be prepended 8
9
10
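For the FASTA case described in the question, where the extra bases should go only on the sequence lines and never on the > identifier lines, matching on the header marker avoids line counting altogether. A possible sketch; the bases AATT and the file names are placeholders:
$ perl -pe 's/^/AATT/ unless /^>/' seqs.fasta > prepended.fasta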
Here is another idea, using awk:
seq 15|awk -v line=4 'NR%line==0{$0="Prepend this text : " $0}1'
1
2
3
Prepend this text : 4
5
6
7
Prepend this text : 8
9
10
11
Prepend this text : 12
13
14
15

Percentage of each value in a row respective to first value

I have a big file consisting of data in the following format.
11 6 2 3
19 5 1 13
9 3 0 6
15 7 1 7
7 6 0 1
9 3 4 2
I want to calculate the percentage of each value in a row, starting from the 2nd column, relative to the value in the first column. Something like (6/11)*100; (2/11)*100; (3/11)*100 for every row in the file.
Expected output
54.5 18.1 27.2
26.3 5.2 68.4
...
...
I've tried this in awk:
awk '{a=($2/$1)*100; b=($3/$1)*100; c=($4/$1)*100}END{print a, b, c}'
and the result is awk: cmd. line:1: (FILENAME=try FNR=27) fatal: division by zero attempted. Is that only due to the presence of 0 in some of the rows, or is something wrong with the awk one-liner?
Yes, you have zeros in some $1s. You need something like:
$ cat file
11 6 2 3
0 5 1 13
9 3 0 6
15 7 1 7
0 6 0 1
9 3 4 2
$ awk 'BEGIN{CONVFMT="%.1f"} {for (i=2;i<=NF;i++) $i=($1==0?"NaN":$i*100/$1)} 1' file
11 54.5 18.2 27.3
0 NaN NaN NaN
9 33.3 0 66.7
15 46.7 6.7 46.7
0 NaN NaN NaN
9 33.3 44.4 22.2
Replace "NaN" with whatever else you want displayed when $1 is zero if you don't like "NaN" (you should have included that case in your sample input).
Using perl from command line,
perl -lane 'print join "\t", map $F[0] ? $_*100/$F[0] : "NaN", @F[1..$#F]' file
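If output rounded to one decimal place is wanted, as in the expected output above, sprintf can be folded into the map. A possible variant; the %.1f format and the NaN placeholder are assumptions:
perl -lane 'print join "\t", map { $F[0] ? sprintf("%.1f", $_*100/$F[0]) : "NaN" } @F[1..$#F]' file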

print every 4 columns to one row in perl or awk

Would you please help me convert every 4 sequential rows into one tab-separated row?
convert:
A
1
2
3
3
3
4
1
to :
A 1 2 3
3 3 4 1
A simple way to do this is to use xargs:
$ xargs -n4 < file
A 1 2 3
3 3 4 1
With awk you would do:
$ awk '{printf "%s%s",$0,(NR%4?FS:RS)}' file
A 1 2 3
3 3 4 1
Another flexible approach is to use pr:
$ pr -tas' ' --columns 4 file
A 1 2 3
3 3 4 1
Both the awk and pr solutions can easily be modified to change the output separator to a TAB:
$ pr -at --columns 4 file
A 1 2 3
3 3 4 1
$ awk '{printf "%s%s",$0,(NR%4?OFS:RS)}' OFS='\t' file
A 1 2 3
3 3 4 1
$ perl -pe 's{\n$}{\t} if $. % 4' old.file > new.file
or simply (thanks to mpapec's comment):
$ perl -pe 'tr_\n_\t_ if $. % 4' old.file > new.file
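If the group size should not be hard-coded, it can be passed as an argument instead. A possible sketch along the same lines; the 4 and the file names are placeholders:
$ perl -pe 'BEGIN { $n = shift } tr/\n/\t/ if $. % $n' 4 old.file > new.file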

Conditionally replacing cell values with column names

I have a 165 x 165 rank matrix such that each row has values ranging from 1-165. I want to parse each row, delete all values greater than 5, sort each row in increasing order, then replace the values 1-5 with the names of the columns from the original matrix.
For example, for row k the values 1, 2, 3, 4, 5 would result after the first two transformations and would be replaced by p, d, m, n, a.
I am assuming that your array consists of an array of arrays...
Neither Awk, Sed, nor Perl has true multi-dimensional arrays. However, they can be emulated in Perl by using arrays of arrays.
$a[0]->[0] = 'xx';
$a[0]->[1] = 'yy';
[...]
$a[0]->[164] = 'zz';
$a[1]->[0] = 'qq';
$a[1]->[1] = 'rr';
[...]
$a[164]->[164] = 'vv';
Does this make sense?
I'm calling the row $x and columns $y, so an element in your array will be $array[$x]->[$y]. Is that good?
Okay, your column names will be in row $array[0], so if we find a value less than five in $array[$x]->[$y], we know the column name is in $array[0]->[$y]. Is that good?
for my $x (1..164) {    # first row holds the column names
    for my $y (0..164) {
        if ($array[$x]->[$y] <= 5) {
            $array[$x]->[$y] = $array[0]->[$y];
        }
    }
}
I'm simply going through all the rows, and for each row, all the columns, and checking the value. If the value is less than or equal to five, I replace it with the column name.
I hope I'm not doing your homework for you.
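For completeness, here is a fuller sketch of the whole transformation the question describes (keep the ranks 1 through 5 in each row, sort them in increasing order, then print the corresponding column names). The file name matrix.txt and the whitespace-separated layout with the column names on the first line are assumptions:
use strict;
use warnings;

open my $fh, '<', 'matrix.txt' or die "matrix.txt: $!";
chomp(my $header = <$fh>);
my @names = split ' ', $header;    # column names from the first row

while (my $line = <$fh>) {
    my @vals = split ' ', $line;
    # keep the column indices whose rank is 1-5, sorted by rank
    my @picked = sort { $vals[$a] <=> $vals[$b] }
                 grep { $vals[$_] <= 5 } 0 .. $#vals;
    print join(' ', @names[@picked]), "\n";
}
close $fh;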
This GNU sed solution might work although it will need scaling up as I only used a 10x10 matrix for testing purposes:
# { echo {a..j};for x in {1..10};do seq 1 10 | shuf |sed 'N;N;N;N;N;N;N;N;N;s/\n/ /g';done; }> test_data
# cat test_data
a b c d e f g h i j
4 5 9 3 6 2 10 8 7 1
3 7 4 2 1 6 10 5 8 9
10 9 3 1 2 7 8 5 6 4
5 10 4 9 7 8 1 3 6 2
8 6 5 9 1 4 3 2 7 10
2 8 9 3 5 6 10 1 4 7
3 9 8 2 1 4 10 6 7 5
3 7 2 1 8 6 10 4 5 9
1 10 8 3 6 5 4 2 7 9
7 2 3 5 6 1 10 4 8 9
# cat test_data |
sed -rn '1{h;d};s/[0-9]{2,}|[6-9]/0/g;G;s/\n|$/ &/g;s/$/&1 2 3 4 5 /;:a;s/^(\S*) (.*\n)(\S* )(.*)/\2\4\1\3/;ta;s/\n//;s/0[^ ]? //g;:b;s/([1-5])(.*)\1(.)/\3\2/;tb;p'
j f d a b
e d a c h
d e c j h
g j h c a
e h g f c
h a d i e
e d a f j
d c a h i
a h d g f
f b c h d
The sed command works as follows.
The first line of the data file, which contains the column headings, is stored in the hold space, then the pattern space (the current line) is deleted. For all subsequent data lines, all two-or-more-digit numbers and the values 6 to 9 are converted to 0. The column names are appended to the data values, along with a newline. Spaces are inserted before the newline and the end of the string. The data is transformed into a lookup, and the sorted values, i.e. 1 2 3 4 5, are prepended to it. The newline is removed along with any 0 values and their associated lookups. Finally, the values 1 to 5 are replaced by the column names from the lookup.
EDIT:
I may have misunderstood the problem regarding sorting columns or rows; if so, it's a minimal fix: replace 1 2 3 4 5 with the original values and perform a numeric sort prior to replacing the numeric data with the column names from the lookup.