Tokenizing with Perl and unstructured data

I have the following data (from a text file). I would like to split each line and get every element, including the ones that are blank (some grades, as you can see, are not listed, which means they are 0, so I want to get those too).
CRN SUB CRSE SECT COURSE TITLE INSTRUCTOR A A- B+ B B- C+ C C- D+ D D- F I CR NC W WN INV TOTAL
----- -- ---- ---- ----------------- ----------------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -----
33450 XX 9950 AIP OVERSEAS-AIP SPAI NOT FOUND 1 1 2
33092 XX 9950 ALB ddddddd, SPN. vi NOT FOUND 1 1
33494 XX 9950 W16 OVERSEAS Univ.Wes NOT FOUND 1 1
INSTRUCTOR TOTALS NOT FOUND 2 1 18 1 2 24
PERCENTAGE DISTRI NOT FOUND 8 4 75 4 8 ******
33271 PE 3600 001 Global Geography sfnfbg,dsdassaas 2 2 1 1 2 3 6 5 3 3 1 29
INSTRUCTOR TOTALS snakdi,plid 2 2 1 1 2 3 6 5 3 3 1 29
PERCENTAGE DISTRI krapsta,lalalal 7 7 3 3 7 10 21 17 10 10 3 ***
The problem, as you can see, is that I don't have a specific delimiter, because some grades are missing. If they weren't, I could take everything from the start of the line up to the first grade ('A') and then split the grades on /\s+/, but that's not the case.
Any suggestions (if there are any) would be awesome.
Thanks,

There are irregularities in some columns (note that the first total values, 18 and 75, spill partially into the next column), but if you don't need those lines, you can try something like this:
my @data;
# skip header
my $hdr = <DATA>;
my $sep = <DATA>;
while (<DATA>) {
    chomp;
    # skip empty and total lines
    next if /^\s*$/ || /^[ ]{5}/;
    push @data, [
        map { s/^\s+//; s/\s+$//; $_ } # trim each column
        unpack 'A6A7A7A7 A18A20 A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4A4 A10', $_
    ];
}
use Data::Dump;
dd \@data;
__DATA__
CRN SUB CRSE ...
----- -- ---- ...
You might need to tweak the column boundaries in the unpack template for real data, but this should get you started.
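The same fixed-width idea can be sketched outside Perl. The following is a minimal Python illustration, not the answer's code: the widths in WIDTHS and the sample row are hypothetical (the real column boundaries have to be measured on the actual file), and blank grade columns are mapped to 0 as the question asks.

```python
from itertools import accumulate


def split_fixed(line, widths):
    """Cut `line` into fields of the given widths and trim each field."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


# Hypothetical layout: 6-wide id, 18-wide title, then three 4-wide grade columns.
WIDTHS = [6, 18, 4, 4, 4]

# Sample row built to that layout; the middle grade column is left blank.
row = (
    "33450".ljust(6)
    + "OVERSEAS-AIP".ljust(18)
    + "  1 "
    + "    "
    + "  2 "
)

fields = split_fixed(row, WIDTHS)
# A blank grade column means the grade count is 0.
grades = [int(f) if f else 0 for f in fields[2:]]

print(fields)  # ['33450', 'OVERSEAS-AIP', '1', '', '2']
print(grades)  # [1, 0, 2]
```

Slicing by position rather than splitting on whitespace is what keeps the empty columns: a blank field still occupies its slot in the result.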

This looks like it would be best to write or find a column-based text parser? I have found DataExtract-FixedWidth on CPAN, but have no personal experience with it. The format looks pretty messy, especially with the numbers on the column borders. You would have to do some kind of pre-processing or heuristics anyway…

Related

Create a Boolean column displaying comparison between 2 other columns in kdb+

I'm currently learning kdb+/q.
I have a table of data. I want to take 2 columns of data (just numbers), compare them and create a new Boolean column that will display whether the value in column 1 is greater than or equal to the value in column 2.
I am comfortable using the update command to create a new column, but I don't know how to ensure that it is Boolean, how to compare the values, or how to display the "greater-than-or-equal-to-ness". Is it possible to do a simple Y/N output for that?
Thanks.
/ dummy data
q) show t:([] a:1 2 3; b: 0 2 4)
a b
---
1 0
2 2
3 4
/ add column name 'ge' with value from b>=a
q) update ge:b>=a from t
a b ge
------
1 0 0
2 2 1
3 4 1
Use a vector conditional:
http://code.kx.com/q/ref/lists/#vector-conditional
q)t:([]c1:1 10 7 5 9;c2:8 5 3 4 9)
q)r:update goe:?[c1>=c2;1b;0b] from t
c1 c2 goe
-------------
1 8 0
10 5 1
7 3 1
5 4 1
9 9 1
Use meta to confirm the goe column is of boolean type:
q)meta r
c | t f a
-------| -----
c1 | j
c2 | j
goe | b
The operation <= works well with vectors, but when a function needs atoms as input, you might want to use ' (the each-both operator).
For example, to compare the length of a symbol's string with the value in another column:
q)f:{x<=count string y}
q)f[3;`ab]
0b
q)t:([] l:1 2 3; s: `a`bc`de)
q)update r:f'[l;s] from t
l s r
------
1 a 1
2 bc 1
3 de 0

Matlab (textscan), read characters from specified column and row

I have a number of text files with data, and want to read a specific part of each file (time information), which is always located at the end of the first row of each file. Here's an example:
%termo2, 30-Jan-2016 12:27:20
I.e. I would like to get "12:27:20".
I've tried using textscan, which I have used before for similar problems. I figured there are 3 columns of this row, with single white space as delimiter.
I first tried to specify these as strings (%s):
fid = fopen(fname);
time = textscan(fid,'%s %s %s');
I also tried to specify the date and time using datetime format:
time = textscan(fid,'%s %{dd-MMM-yyyy}D %{HH:mm:ss}D')
Both of these just produce a blank cell. (I've also tried a number of variations, such as defining the delimiter as ' ', with the same result)
Thanks for any help!
Here's the entire file (not sure pasting here is the right way to do this - i'm new to both matlab and stackoverflow..):
%termo2, 30-Jan-2016 12:27:20
%
%102
%
%stimkod stimtyp
% 1 Next:Pain
% 2 Next:Brush
% vaskod text
% 1 Obeh -> Beh
% 2 Inte alls intensiv -> Mycket intensiv
% stimnr starttid stimkod vaskod VASstart VASmark VAS
1 78.470 2 1 96.470 100.708 6.912
1 78.470 2 2 96.470 104.739 2.763
2 138.822 1 2 156.821 162.619 7.615
2 138.822 1 1 156.821 166.659 2.496
3 199.117 2 2 217.116 222.978 2.897
3 199.117 2 1 217.116 224.795 5.773
4 258.612 2 1 276.612 280.419 5.395
4 258.612 2 2 276.612 284.145 4.622
5 320.068 1 1 338.068 340.689 4.396
5 320.068 1 2 338.068 346.090 2.722
6 377.348 1 2 395.347 398.809 6.336
6 377.348 1 1 395.347 404.465 3.391
7 443.707 2 1 461.707 464.840 6.604
7 443.707 2 2 461.707 473.703 3.652
8 503.122 1 2 521.122 526.009 4.285
8 503.122 1 1 521.122 529.808 3.646
9 568.546 2 2 586.546 586.546 5.000
9 568.546 2 1 586.546 595.496 6.412
10 629.953 2 1 647.953 650.304 7.034
10 629.953 2 2 647.953 655.600 6.615
11 694.305 1 1 712.305 714.416 4.669
11 694.305 1 2 712.305 721.079 2.478
12 751.537 2 2 769.537 773.511 7.307
12 751.537 2 1 769.537 777.423 8.225
13 813.944 1 2 831.944 834.958 7.731
13 813.944 1 1 831.944 839.255 1.363
14 872.448 2 1 890.448 893.829 6.813
14 872.448 2 2 890.448 899.439 2.600
15 939.880 1 2 957.880 963.811 4.332
15 939.880 1 1 957.880 966.603 2.786
16 998.328 2 1 1016.327 1020.707 5.837
16 998.328 2 2 1016.327 1025.275 2.664
17 1062.911 1 2 1080.910 1082.967 2.792
17 1062.911 1 1 1080.910 1088.674 4.094
18 1125.182 1 1 1143.182 1144.379 0.619
18 1125.182 1 2 1143.182 1151.786 8.992
If you're not reading in the entire file, you could just read the first line using fgetl, split it on whitespace (using regexp) and then grab the last element.
parts = regexp(fgetl(fid), '\s+', 'split');
last = parts{end};
That being said, there doesn't seem to be anything wrong with the way you're using textscan if your file is actually how you say. You could alternately do something like:
parts = textscan(fid, '%s', 3);
last = parts{1}{end}  % textscan returns the tokens wrapped in an outer 1x1 cell
Update
Also, be sure to rewind the file pointer using frewind before trying to parse the file to ensure that it starts at the top of the file.
frewind(fid)

Character count (length) within specific column

Is there a one-line method to obtain character length for strings held within a specific column of a tab-delimited .txt file and then append these counts onto the final column (number of columns may be variable)?
Sample Data:
1 AA
2 BBB
3 CCCCC
4 EE
5 DDD
6 AAA
7 FFFFF
8 AA
9 BBB
10 NNN
To get the counts, I have attempted to use:
perl -lane 'print length $F[2]' in > out
perl -F, -Mopen=:locale -lane 'print length $F[2]' in > out
However, the results are empty.
I have also tried:
perl -lane '$_.=$F[2]; print length $_'
But this, as I now realise, prints the number of characters for the entire line rather than a specific column.
I am not sure how I would then append the final column.
Desired Output (when counting column 2):
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3
It seems that you were close. Perl array indices start at zero, so how about using the length of $F[1]? You will also need some sort of separator:
perl -lape '$_ .= "\t". length($F[1])' input
output
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3
If you want the output exactly as you show, then you will need to use printf like this:
perl -lane 'printf qq{%-4d%-8s%d\n}, @F, length($F[1])' input
output
1 AA 2
2 BBB 3
3 CCCCC 5
4 EE 2
5 DDD 3
6 AAA 3
7 FFFFF 5
8 AA 2
9 BBB 3
10 NNN 3
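For comparison, the same zero-based indexing can be sketched in Python. This is only an illustration under the assumption that the fields are separated by a single tab, as the question states; append_length is a hypothetical helper name, not part of any library.

```python
def append_length(line, col, sep="\t"):
    """Return `line` with the character count of field `col` appended.

    `col` is zero-based, so the second column is col=1 (like $F[1] in Perl).
    """
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields + [str(len(fields[col]))])


sample = ["1\tAA\n", "2\tBBB\n", "3\tCCCCC\n"]
for line in sample:
    print(append_length(line, 1))
```

Because the count is always appended after the last field, the approach does not depend on how many columns a line has.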

Modifying a script to include a count of each time a name appears in a table

I have a script below that takes my FILE1 and parses out FILE2 only if the first column of FILE1 matches column number 10 of FILE2, so it prints out the rows I need. This part works great. The part I am having a bit of difficulty with is adding a count to the output. The goal of the script is to take column 10 at the end and produce a count for each name. In my list there are 12 names and I want to get the count of each name. For the example below, I have used four names.
FILE1:
name1 15
name2 15
name2 30
name5 15
name4 10
name2 5
name2 5
FILE2:
23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1
23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1
23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1
23 20 5.4 1.3 7 55 128 7998 + 14 name5 1 77 2 1
23 20 5.4 1.3 6 55 128 994 + 14 name1 1 77 3
23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1
23 20 5.4 1.3 5 55 128 99 + 5 name8 1 77 5 1
Expected Output
$VAR1 = {
'name1' => 1,
'name2' => 4,
'name4' => 1,
'name5' => 1,
};
5 55 128 21799 32 name2 77 0 1
5 55 128 7998 18 name4 77 0 1
6 55 128 9984 13 name4 77 1 1
7 55 128 7998 14 name5 77 2 1
6 55 128 994 14 name1 77 3 1
name1 1
name2 1
name4 2
name5 1
You can test the script; it works. The part I am having difficulty with is inserting the count of each name based on the output. The print \%x is a way of checking that my original list was truly used, as I am working with a much larger set of data. If someone could point me in the right direction on how to modify my script without changing it drastically, that would be great. I feel like this script fulfills the majority of my needs even if it is not the most efficient way of doing it.
use strict;
use Data::Dumper;
my %x;
open(FILE1, $ARGV[0]) or die "Cannot open the file: $!";
while (my $line = <FILE1>) {
    my @array = split(" ", $line);
    $x{$array[0]}++;
}
close FILE1;
print Dumper( \%x );
my %count;
open(FILE2, $ARGV[1]) or die "Cannot open the file: $!";
while (my $line = <FILE2>) {
    my @name = split(" ", $line);
    my $y = $name[9];
    if ( $x{ $y } ) {
        print join(" ", @name[4,5,6,7,9,11,12,13]), "\n";
        $count{$name[9]}++;
    }
}
print Dumper (\%count);
close FILE2;
exit;
Script now counts. Just need to debug.
The "minimal" change would be to set the elements of %x to 0 in the FILE1 loop, then check for exists $x{$y} in the FILE2 loop and do ++$x{$y} inside the condition body. At the end, %x holds the number of occurrences of each name.
The usual way (as mentioned in the comments on the question) would be to declare an additional %count and perform the same ++$count{$y} inside the if block.
The first approach has the advantage or disadvantage (depending on your needs) of reporting a count even when a name has zero occurrences.
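The difference between the two approaches can be sketched as follows. This is a Python illustration rather than the poster's script: the function names are hypothetical, the rows are taken from the FILE2 sample in the question, and name_col=10 is an assumption that fits that sample when splitting on whitespace and counting from zero.

```python
from collections import Counter


def count_matches(wanted, rows, name_col):
    """Separate counter: only names that actually occur get an entry."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if len(fields) > name_col and fields[name_col] in wanted:
            counts[fields[name_col]] += 1
    return counts


def count_matches_with_zeros(wanted, rows, name_col):
    """Reuse the lookup table: every wanted name starts at 0."""
    counts = dict.fromkeys(wanted, 0)
    for row in rows:
        fields = row.split()
        if len(fields) > name_col and fields[name_col] in counts:
            counts[fields[name_col]] += 1
    return counts


rows = [
    "23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1",
    "23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1",
    "23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1",
    "23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1",
]
wanted = {"name1", "name2", "name4"}

print(dict(count_matches(wanted, rows, 10)))
print(count_matches_with_zeros(wanted, rows, 10))
```

In this sample name1 never occurs, so it is missing from the first result and present with a count of 0 in the second.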

Find common elements in a file

The program that I would like to write has the same aim as the one in "File row confrontation". This time the file I have is laid out differently:
1 2
1 3
1 4
2 1
2 3
2 4
2 5
3 1
...
8 6
8 7
8 9
9 8
I want to find:
whether the first element of a row appears in the second position of another row, and whether the first element of that other row appears in the second position of the row being examined;
if so, I want to print "I have found the link x y";
if the "link" exists, I want to count how many "neighbours" the two share, where by neighbours I mean how many elements in the second column they have in common, and print "I found z triangles".
The file is sorted.
In this case the program starts by looking for the first "couple", 1 2, reversed elsewhere in the file, and finds it at the 4th row (2 1). Then it checks whether 3 (from the second row, a neighbour of 1) is also a neighbour of 2 (it is, because 2 3 exists), and so on. At the end it finds that "there is the link 1 2" and that it "found 2 triangles" (1 - 2 - 3 and 1 - 2 - 4). I think the answer should not be very different from the answer to the question linked above, but I don't know how to arrange the data from a file laid out like this.
The first part of the problem is to find only the indices of the inverted matching pairs? While reading this problem yesterday I had the feeling that grep might be of use:
#!/usr/bin/perl
use warnings;
use strict;
my @parry;
while (<DATA>) {
    push @parry, [split(' ', $_)];
}
# @remind holds the indices of rows that have a reversed match
my @remind = grep {
    my $ind = $_;
    grep { # reverse @{$parry[$_]} == @{$parry[$ind]} did not appear to work.
        $parry[$_][0] == $parry[$ind][1] &&
        $parry[$_][1] == $parry[$ind][0];
    } 0..$#parry
} 0..$#parry;
print $_, ': ', @{$parry[$_]}, $/ for @remind;
__END__
1 2
1 3
1 4
2 1
2 3
2 4
2 5
3 1
8 6
8 7
8 9
9 8
output is
0: 12
1: 13
3: 21
7: 31
10: 89
11: 98
From here you then want to find, say for
7[0] 7[1] (3 1) with neighbouring rows 6 and 8, using column 2, whether
6[1]
7[1] (1 5) and/or
7[1] (1 6) exist in the original set (in @parry)?
8[1]
They do not, so there is no triangle.
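The whole task from the question (find reciprocal links, then count shared neighbours) can also be sketched with sets. This Python version is an illustration under stated assumptions, not the answer's method: a "triangle" is taken, as in the question, to be a second-column value shared by both ends of a link, each link is reported once with the smaller element first, and links_and_triangles is a hypothetical name.

```python
from collections import defaultdict


def links_and_triangles(pairs):
    """Map each reciprocal link (a, b), a < b, to its count of shared neighbours."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)

    result = {}
    for a, b in pairs:
        # A link exists only if the reversed pair (b, a) is also present.
        if a < b and a in neighbours[b]:
            result[(a, b)] = len(neighbours[a] & neighbours[b])
    return result


data = """1 2
1 3
1 4
2 1
2 3
2 4
2 5
3 1
8 6
8 7
8 9
9 8"""
pairs = [tuple(map(int, line.split())) for line in data.splitlines()]

for (a, b), n in links_and_triangles(pairs).items():
    print(f"I have found the link {a} {b}; I found {n} triangles")
```

On the data used in the answer this reports the links 1 2, 1 3 and 8 9, with 2 triangles for the link 1 2 (through 3 and 4) and none for the other two.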