Transpose using AWK or Perl - perl

Hi I would like to use AWK or Perl to get an output file in the format below. My input file is a space separated text file. This is similar to an earlier question of mine, but in this case the input and output has no formatting. My column positions may change so would appreciate a technique which does not reference column number
Input File
id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 12 pink circle 20
2 12 yellow pentagon 3 orange rectangle 4 purple oval 6
Desired Output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6
I am using this code by Dennis Williamson. Only problem is the output I get has no space separation in the transposed fields. I require one space separation
#!/usr/bin/awk -f
BEGIN {
col_list = "quantity colour shape"
# Use a B ("blank") to add spaces in the output before or
# after a format string (e.g. %6dB), but generally use the numeric argument
# columns to be repeated on multiple lines may appear anywhere in
# the input, but they will be output together at the beginning of the line
repeat_fields["id"]
# since these are individually set we won't use B
repeat_fmt["id"] = "%-1s "
# additional fields to repeat on each line
ncols = split(col_list, cols)
for (i = 1; i <= ncols; i++) {
col_names[cols[i]]
forms[cols[i]] = "%-1s"
}
}
# save the positions of the columns using the header line
FNR == 1 {
for (i = 1; i <= NF; i++) {
if ($i in repeat_fields) {
repeat[++nrepeats] = i
repeat_look[i] = i
rformats[i] = repeat_fmt[$i]
}
if ($i in col_names) {
col_nums[++n] = i
col_look[i] = i
formats[i] = forms[$i]
}
}
# print the header line
for (i = 1; i <= nrepeats; i++) {
f = rformats[repeat[i]]
sub("d", "s", f)
gsub("B", " ", f)
printf f, $repeat[i]
}
for (i = 1; i <= ncols; i++) {
f = formats[col_nums[i]]
sub("d", "s", f)
gsub("B", " ", f)
printf f, $col_nums[i]
}
printf "\n"
next
}
{
for (i = 1; i <= NF; i++) {
if (i in repeat_look) {
f = rformats[i]
gsub("B", " ", f)
repeat_out = repeat_out sprintf(f, $i)
}
if (i in col_look) {
f = formats[i]
gsub("B", " ", f)
out = out sprintf(f, $i)
coln++
}
if (coln == ncols) {
print repeat_out out
out = ""
coln = 0
}
}
repeat_out = ""
}
Output
id quantitycolourshape
1 10bluesquare
1 redtrianglepink
2 circle12yellow
2 pentagonorangerectangle
My apologies for not including all info about the actual file earlier. I did this only for simplicity, but it did not capture all my requirements.
In my actual file I am looking to transpose fields n_cell and n_bsc for NODE SITE CHILD
NODE SITE CHILD n_cell n_bsc
Here is a link to the actual file I am working on

<>;
print("id colour shape size\n");
while (<>) {
my #combined_fields = split;
my $id = shift(#combined_fields);
while (#combined_fields) {
my #fields = ( $id, splice(#combined_fields, 0, 3) );
print(join(' ', #fields), "\n");
}
}

You have told us that your real data consists of over 5,000 columns and that its column positions may change and I'm afraid that really isn't enough.
So in the absence of any proper information I have written this, which uses the header line to calculate the number and size of the sets of data, where the id column is, and in which column the first set starts.
It works fine on your example data, but I can only guess whether it will work on your live file.
use strict;
use warnings;
my #headers = split ' ', <>;
my %headers;
$headers{$_}++ for #headers;
die "Expected exactly one 'id' column" unless $headers{id} // 0 == 1;
my $id_index = 0;
$id_index++ while $headers[$id_index] ne 'id';
my #labels = grep $headers{$_} > 1, keys %headers;
my $set_size = #labels;
my $num_sets = $headers{$labels[0]};
my $start_index = 0;
$start_index++ while $headers[$start_index] ne $labels[0];
my #reformat;
while (<>) {
my #fields = split;
next unless #fields;
my $id = $fields[$id_index];
for (my $i = $start_index; $i < #fields; $i+=$set_size) {
push #reformat, [ $id, #fields[$i..$i + $set_size - 1] ];
}
}
unshift #labels, 'id';
print "#labels\n";
print "#$_\n" for #reformat;
output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6

Related

Format of the date field gets changed through Spreadsheet::ParseExcel

I have an Excel Sheet (A.xls) which has following content:
Date,Value
10/1/2020,36.91
10/2/2020,36.060001
I got following output using same script with Perl v5.6.1 on solaris 5.8
>>./a4_test.pl
INFO>Excel File=A.xls,#WorkSheet=1,AuthorID=Sahoo, Ashish
DEBUG>row 2 - col 0:10-2-20
DEBUG>row 2 - col 1:36.060001
And I got different output for date field using same script with perl v5.26.3 on solaris 5.11
>>./a4_test.pl
INFO>Excel File=A.xls,#WorkSheet=1,AuthorID=Sahoo, Ashish
DEBUG>row 2 - col 0:2020-10-02
DEBUG>row 2 - col 1:36.060001
I used 0.2602 version of Spreadsheet::ParseExcel on Solaris 8 machine and 0.65 version on Solaris 11 machine.
Why am I getting different output while reading date field from
excel sheet through Spreadsheet::ParseExcel module?
#!/usr/perl/5.12/bin/perl -w
use Spreadsheet::ParseExcel;
my $srce_file = "a.xls";
my $oExcel = new Spreadsheet::ParseExcel;
my $oBook = $oExcel->Parse($srce_file);
my %hah_sheet = ();
my $header_row = 1;
my($iR, $iC, $oWkS, $oWkC);
my $book = $oBook->{File};
my $nsheet= $oBook->{SheetCount};
my $author= $oBook->{Author};
unless($nsheet){
print "ERR>No worksheet found for source file:$srce_file\n";
return 0;
}
else{
print "INFO>Excel
File=$srce_file,#WorkSheet=$nsheet,AuthorID=$author\n";
}
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++) {
next if($iSheet >0);
$oWkS = $oBook->{Worksheet}[$iSheet];
my $rows = 0;
for(my $iR = $oWkS->{MinRow}; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++) {
$rows++;
my $str_len = 0;
for(my $iC = $oWkS->{MinCol}; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol}; $iC++) {
$oWkC = $oWkS->{Cells}[$iR][$iC];
next if ($iR <$header_row);
if (defined($oWkC)){
my $cell_value = $oWkC->Value;
$cell_value =~s/\n+//g; #removed newline inside the value
#
##if the first column at header row is null then skip. Column might be shifted
if($iR==$header_row && $iC == 0){
last unless($cell_value);
}
if($iR == $header_row){
$hah_sheet{$iR}{$iC} = uc($cell_value);
}else {
$hah_sheet{$iR}{$iC} = $cell_value;
$str_len += length($cell_value);
##View cell value by row/column
print "DEBUG>row ${iR} - col ${iC}:$cell_value\n";
}
}else{
$hah_sheet{$iR}{$iC} = ""; #keep position for NULL value
}
} # END of Column loop
} # END of Row loop
} # END of Worksheet
If you search for "date" in Changes, you see this:
0.33 2008.09.07
- Default format for formatted dates changed from 'm-d-yy' to 'yyyy-mm-dd'
This explains why you see different date formats between versions 0.2602 and 0.65 of Spreadsheet::ParseExcel.
If you always want your code to print the same format regardless of which version you are using, you could transform the date in your code. For example, if you always want to see yyyy-mm-dd:
$cell_value =~ s/^(\d+)-(\d+)-(\d+)$/sprintf '%04d-%02d-%02d', 2000+$3, $1, $2/e;
Or, vice versa:
$cell_value =~ s/^(\d+)-(\d+)-(\d+)$/sprintf '%0d-%0d-%02d', $2, $3, $1-2000/e;

Perl - Converting special characters to corresponding numerical values in quality score

I am trying to convert the following set of characters into their corresponding values for a quality score that accompanies a fasta file:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
They should have the values 0-93. So when I input a fastq file that uses these symbols I want to output the numerical values for each in a quality score file.
I have tried putting them into an array using split // and then making into a hash where each key is the symbol and the value is its position in the array:
for (my $i = 0; $i<length(#qual); $i++) {
print "i is $i, elem is $qual[$i]\n";
$hash{$qual[$i]} = $i;
I have tried hard coding the hash:
my %hash = {"!"=>"0", "\""=>"1", "#"=>"2", "\$"=>"3"...
With and without escapes for the special characters that require them but cannot seem to get this to work.
This merely outputs:
.
.
.
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
" 1
Use of uninitialized value $hash{"HASH(0x100804ed0)"} in concatenation (.) or string at convert_fastq.pl line 24, <> line 40.
HASH(0x100804ed0)
! 0
Does anyone have any ideas? I appreciate the help.
Perhaps subtracting 33 from the character's ord to yield the value you want would be helpful:
use strict;
use warnings;
my $string = q{!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
for ( split //, $string ) {
print "$_ = ", ord($_) - 33, "\n";
}
Partial output:
! = 0
" = 1
# = 2
$ = 3
% = 4
& = 5
' = 6
( = 7
) = 8
* = 9
+ = 10
...
This way, you don't need to build a hash with character/value pairs, but just use $val = ord ($char) - 33; to get the value.
{ ... }
is similar to
do { my %anon; %anon = ( ... ); \%anon }
So when you did
my %hash = { ... };
you assigned a single item to the hash (a reference to a hash) rather than a list of key-values as you should. Perl warned you about that with the following:
Reference found where even-sized list expected
(Why didn't you mention this?!)
You should be using
my %decode_map = ( ... );
For example,
my %decode_map;
{
my $encoded = q{!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
my #encoded = split //, $encoded;
$decode_map{$encoded[$_]} = $_ for 0..$#encoded;
}
Given that those are basically the non-whitespace printable ASCII characters, so you could simply use
my %decode_map = map { chr($_ + 0x21) => $_ } 0x21..0x7E;
Which means you could avoid building the hash at all, replacing
my %decode_map = map { chr($_ + 0x21) => $_ } 0x21..0x7E;
die if !exists($decode_map{$c});
my $num = $decode_map{$c};
with just
die if ord($c) < 0x21 || ord($c) > 0x7E;
my $num = ord($c) - 0x21;
From a language-agnostic point of view: Use an array with 256 entries, one for each ASCII character. You can then store 0 at ['!'], 1 at ['"'] and so on. When parsing the input, you can lookup the index of a char in that array directly. Fore careful error handling, you could store -1 at all invalid chars and check that while parsing the file.

How to perform operation on specific bits in perl

Lets assume I have a hexadecimal value 0x78. I need to add 1 to first 4 bits ie 3:0 and add 2 to last 4 bits ie. [7:4]. Further when I add 1 to 0xF it should not roll over to the next value and should stay as 0xF. Same applies for subtraction. The approach I have tried so far is:
$byte=0x78;
$byte2 = unpack('b4', $byte);
print "byte2 = $byte2 \n";
--> Here the output is 1000 so I have tried to extract the first 4 bits, and similarly I can right shift and extract last 4 bits and perform the operation.
But to perform addition or subtraction, I wanted to convert 1000 back to hex format so that I can do 0x8 +/- 1. For that I tried:
$hex2 = sprintf('%02x', $byte2);
print "hex2 = $hex2 \n";
--> Output is 3e8. I do not understand why I get 3e8 instead of just 8 or 08, since it is supposed to print only 2 values in hex format.
In the above command when I manually enter
$hex2 = sprintf('%02x', 0b1000); I get the correct result. So perl is taking it as a string rather than a numeric value. Is there some way I can convert that string to a binary number? Any other easier method or approach would be helpful.
We can get each byte by ANDing and shifting:
$byte1 = $byte & 0xf;
$byte2 = ($byte & 0xf0) >> 4;
printf "byte1: 0x%x\n", $byte1;
printf "byte2: 0x%x\n", $byte2;
# prints
byte1: 0x8
byte2: 0x7
Addition/subtraction with special conditions you listed can be done on these bytes and the new value can be reconstructed with shifts and addition:
($byte1 < 0xf) ? ($byte1 += 1) : ($byte1 = 0xf);
($byte2 < 0xe) ? ($byte2 += 2) : ($byte2 = 0xf);
# or do subtraction stuff.
$new_val = ($byte2 << 4) + $byte1;
printf "new val: 0x%x\n", $new_val;
# prints
new val: 0x99
You're getting '3e8' because $byte2 is '1000', which, when translated into hex is '0x3e8'.
I think you're better off with something like:
sub byte_to_two_nibbles($) {
my $byte = shift;
return int($byte / 16), ($byte % 16);
}
sub two_nibbles_to_byte($$) {
return $_[0] * 16 + $_[1];
}
my ($msn, $lsn) = byte_to_two_nibbles 0x78;
$msn += 1; $msn = 15 if $msn > 15;
$lsn += 2; $lsn = 15 if $lsn > 15;
my $result = two_nibbles_to_byte $msn, $lsn;
You can use oct function:
$byte2 = oct("0b$byte2");
my $hex2 = sprintf('%02x', $byte2);
print "hex2 = $hex2 \n";
Prints:
hex2 = 08

Perl: test for an arbitrary bit in a bit string

I'm trying to parse CPU node affinity+cache sibling info in Linyx sysfs.
I can get a string of bits, just for example:
0000111100001111
Now I need a function where I have a decimal number (e.g. 4 or 5) and I need to test whether the nth bit is set or not. So it would return true for 4 and false for 5. I could create a string by shifting 1 n number of times, but I'm not sure about the syntax, and is there an easier way? Also, there's no limit on how long the string could be, so I want to avoid decimal <-> binary conversoins.
Assuming that you have the string of bits "0000111100001111" in $str, if you do the precomputation step:
my $bit_vector = pack "b*", $str;
you can then use vec like so:
$is_set = vec $bit_vector, $offset, 1;
so for example, this code
for (0..15) {
print "$_\n" if vec $bit_vector, $_, 1;
}
will output
4
5
6
7
12
13
14
15
Note that the offsets are zero-based, so if you want the first bit to be bit 1, you'll need to add/subtract 1 yourself.
Well, this seems to work, and I'm not going for efficiency:
sub is_bit_set
{
my $bitstring = shift;
my $bit = shift;
my $index = length($bitstring) - $bit - 1;
if (substr($bitstring, $index, 1) == "1") {
return 1;
}
else {
return 0;
}
}
Simpler variant without bit vector, but for sure vector would be more efficient way to deal.
sub is_bit_set
{
my $bitstring = shift;
my $bit = shift;
return int substr($bitstring, -$bit, 1);
}

How do I turn a table into a matrix?

If I got a table in a text file such like
A B 1
A C 2
A D 1
B A 3
C D 2
A E 1
E D 2
C B 2
. . .
. . .
. . .
And I got another symbol list in another text file. I want to transform this table into a Perl data structure like:
_ A D E . . .
A 0 1 1 . . .
D 1 0 2 . . .
E 1 2 0 . . .
. . . . . . .
But I only need some selected symbol, for example A, D and E are selected in the symbol text but B and C are not.
Use an array for the first one and a 2-dimentional hash for the second one. The first one should look roughly like:
$list[0] # row 1 - the value is "A B 1"
And the hash like:
$hash{A}{A} # the intersection of A and A - the value is 0
Figuring out how to implement a problem is about 75% of the mental battle for me. I'm not going to go into specifics about how to print the hash or the array, because that's easy and I'm also not entirely clear on how you want it printed or how much you want printed. But converting the array to the hash should look a bit like this:
foreach (#list) {
my ($letter1, $letter2, $value) = split(/ /);
$hash{$letter1}{$letter2} = $value;
}
At least, I think that's what you're looking for. If you really want you could use a regular expression, but that's probably overkill for just extracting 3 values out of a string.
EDIT: Of course, you could forgo the #list and just assemble the hash straight from the file. But that's your job to figure out, not mine.
you can try this with awk:
awk -f matrix.awk yourfile.txt > newfile.matrix.txt
where matrix.awk is :
BEGIN {
OFS="\t"
}
{
row[$1,$2]=$3
if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
if (col1[c]!=$1)
col1[++c]=$1
}
END {
printf("%*s%s\n", length(col1[1])+2, " ",header)
ncol=split(header,colA,OFS)
for(i=1;i<=c;i++) {
printf("%s", col1[i])
for(j=1;j<=ncol;j++)
printf("%s%s%c", OFS, row[col1[i],colA[j]], (j==ncol)?ORS:"")
}
}
Another way to do this would be to make a two-dimensional array -
my #fArray = ();
## Set the 0,0th element to "_"
push #{$fArray[0]}, '_';
## Assuming that the first line is the range of characters to skip, e.g. BC
chomp(my $skipExpr = <>);
while(<>) {
my ($xVar, $yVar, $val) = split;
## Skip this line if expression matches
next if (/$skipExpr/);
## Check if these elements have already been added in your array
checkExists($xVar);
checkExists($yVar);
## Find their position
for my $i (1..$#fArray) {
$xPos = $i if ($fArray[0][$i] eq $xVar);
$yPos = $i if ($fArray[0][$i] eq $yVar);
}
## Set the value
$fArray[$xPos][$yPos] = $fArray[$yPos][$xPos] = $val;
}
## Print array
for my $i (0..$#fArray) {
for my $j (0..$#{$fArray[$i]}) {
print "$fArray[$i][$j]", " ";
}
print "\n";
}
sub checkExists {
## Checks if the corresponding array element exists,
## else creates and initialises it.
my $nElem = shift;
my $found;
$found = ($_ eq $nElem ? 1 : 0) for ( #{fArray[0]} );
if( $found == 0 ) {
## Create its corresponding column
push #{fArray[0]}, $nElem;
## and row entry.
push #fArray, [$nElem];
## Get its array index
my $newIndex = $#fArray;
## Initialise its corresponding column and rows with '_'
## this is done to enable easy output when printing the array
for my $i (1..$#fArray) {
$fArray[$newIndex][$i] = $fArray[$i][$newIndex] = '_';
}
## Set the intersection cell value to 0
$fArray[$newIndex][$newIndex] = 0;
}
}
I am not too proud regarding the way I have handled references but bear with a beginner here (please leave your suggestions/changes in comments). The above mentioned hash method by Chris sounds a lot easier (not to mention a lot less typing).
CPAN has many potentially useful suff. I use Data::Table for many purposes. Data::Pivot also looks promising, but I have never used it.