Big text file processing - import

I need to implement lazy loading in Mathematica. I have a 600 MB CSV text file which I need to process. This file contains a lot of duplicated records:
1;0;0;13;6
1;0;0;13;6
..........
2;0;0;13;6
2;0;0;13;6
..........
etc.
So instead of loading them all into memory, I'd like to create a list containing each distinct record and the number of times it was encountered in the file:
{{10000,{1,0,0,13,6}}, {20000,{2,0,0,13,6}}, ...}
I couldn't find a way to do it with the Import function. I'm looking for something like
Import["my_file.csv", "CSV", myProcessingFunction]
where myProcessingFunction will take one record at a time and create a dataset. Is it possible to do this with Import or any other Mathematica function?

If it were me, I'd probably do this using Unix sort and uniq, but since you ask about Mathematica... I'd use ReadList[] to read blocks of lines, and define downvalues to find the unique strings and keep track of how many we've seen before.
(* Create some test data *)
Export["/tmp/test.txt", Flatten[{Range[1000], Range[1000]}], "Lines"];
countUniqueLines[file_String, blockSize_Integer] :=
 Module[{stream, map, block, keys, out},
  map[_] := 0;
  stream = OpenRead[file];
  (* count lines block by block; close the stream and clean up if aborted *)
  CheckAbort[
   While[(block = ReadList[stream, String, blockSize]) =!= {},
    (map[#] = map[#] + 1) & /@ block],
   Close[stream]; Clear[map]];
  Close[stream];
  (* the distinct strings are stored as downvalue keys; drop the _ default rule *)
  keys = Cases[DownValues[map][[All, 1, 1, 1]], _String];
  out = {#, map[#]} & /@ keys;
  Clear[map];
  out]
countUniqueLines["/tmp/test.txt", 500]
(* Alternative implementation if you have a little more memory *)
Tally[Import["/tmp/test.txt", "Lines"]]

I think you want the Read[] function.
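A minimal sketch of that streaming approach, assuming associations (Mathematica 10+) and the my_file.csv from the question:
(* read one record per line and tally counts without loading the whole file *)
counts = <||>;
stream = OpenRead["my_file.csv"];
While[(rec = Read[stream, String]) =!= EndOfFile,
 counts[rec] = Lookup[counts, rec, 0] + 1];
Close[stream];
counts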

Perhaps there are better alternatives than Mathematica for doing this.
A small awk script:
{a[$0]++}
END { ... print loop ... }
will accumulate the repeated records. Of course you may run out of memory, depending on the number of distinct records.
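A complete version of that accumulator might look like this (the output format is just one possible choice):
{ a[$0]++ }
END { for (r in a) print a[r], r }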
Or sort the file first, and then only one record's count needs to be held in memory at a time. In awk, the low-memory program might look like:
BEGIN { p = ""; i = 0 }
{ if (($0 != p) && (i != 0)) { print p, i; i = 0 } }
{ i++; p = $0 }
END { if (i != 0) print p, i }
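Assuming that script is saved as count.awk (a name picked here for illustration), it would run over pre-sorted input like:
sort my_file.csv | awk -f count.awk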
Perhaps Perl is better, but I'm old fashioned.
HTH!

I would recommend loading it first into a database system like MySQL; then you can access it from Mathematica using DatabaseLink.
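A minimal sketch of that route; the database mydb, table records, and column line are hypothetical names, and the duplicate counting is pushed into SQL:
Needs["DatabaseLink`"];
conn = OpenSQLConnection[JDBC["MySQL(Connector/J)", "localhost/mydb"],
  "Username" -> "user", "Password" -> "pass"];
(* results come back as {line, count} pairs; all names above are placeholders *)
counts = SQLExecute[conn, "SELECT line, COUNT(*) FROM records GROUP BY line"];
CloseSQLConnection[conn];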

Related

How to make a two-dimensional matrix w/o for?

I'm coding the matrix for a 2-dimensional graph now.
Although it's such a simple equation, it takes a lot of time to run. I think it could be made faster.
In particular, I think the for loop could be simplified.
How can I simplify this?
q=1:1:30
x(q)=330+q*0.3
F=1:30:8970
T=x(1)-0.3:0.001:x(30)+0.3
n=size(T,2)
k=1:1:n
for a=1:1:30
I(a,k)=F(a)*exp(-2.*(T(:,k)))
end
happy=sum(I)
plot(k,I)
I would say that the time is spent printing results. Try using ; at the end of each line; it will speed up the computation.
You can also replace the for loop with the following element-by-element computation (note the loop multiplies by F(a), so the replicated column must hold the F values, not the bare indices 1:30):
a = F(1:30).';                               % 30x1 column of the F values the loop uses
aux = repmat(exp(-2.*T(:,k)), length(a), 1); % 30 x n matrix of exponentials
a = repmat(a, 1, length(k));                 % replicate the F values across the n columns
I = a.*aux;                                  % same 30 x n result as the loop
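On R2016b or later, implicit expansion removes the need for repmat entirely; a minimal equivalent sketch:
I = F(1:30).' .* exp(-2.*T);   % 30x1 .* 1xn expands to the 30xn matrix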

How can I convert this select statement to functional form?

I am having a couple of issues putting this into functional form.
select from tableName where i=fby[(last;i);([]column_one;column_two)]
This is what I got:
?[tableName;fby;enlist(=;`i;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two))));0b;()]
but I get a type error.
Any suggestions?
Consider using the following function, adjusted from the buildQuery function given in the whitepaper on Parse Trees. This is a pretty useful tool for quickly developing in q. This version is an improvement on the one given in the linked whitepaper, having been extended to handle updates by reference (i.e., update x:3 from `tab).
\c 30 200
tidy:{ssr/[;("\"~~";"~~\"");("";"")] $[","=first x;1_x;x]};
strBrk:{y,(";" sv x),z};
//replace k representation with equivalent q keyword
kreplace:{[x] $[`=qval:.q?x;x;"~~",string[qval],"~~"]};
funcK:{$[0=t:type x;.z.s each x;t<100h;x;kreplace x]};
//replace eg ,`FD`ABC`DEF with "enlist`FD`ABC`DEF"
ereplace:{"~~enlist",(.Q.s1 first x),"~~"};
ereptest:{((0=type x) & (1=count x) & (11=type first x)) | ((11=type x)&(1=count x))};
funcEn:{$[ereptest x;ereplace x;0=type x;.z.s each x;x]};
basic:{tidy .Q.s1 funcK funcEn x};
addbraks:{"(",x,")"};
//the where clause needs to be a list of where clauses, so if there is only one, enlist it
stringify:{$[(0=type x) & 1=count x;"enlist ";""],basic x};
//if a dictionary, apply to both keys and values
ab:{$[(0=count x) | -1=type x;.Q.s1 x;99=type x;(addbraks stringify key x),"!",stringify value x;stringify x]};
inner:{[x]
idxs:2 3 4 5 6 inter ainds:til count x;
x:#[x;idxs;'[ab;eval]];
if[6 in idxs;x[6]:ssr/[;("hopen";"hclose");("iasc";"idesc")] x[6]];
//for select statements within select statements
//This line has been adjusted
x[1]:$[-11=type x 1;x 1;$[11h=type x 1;[idxs,:1;"`",string first x 1];[idxs,:1;.z.s x 1]]];
x:#[x;ainds except idxs;string];
x[0],strBrk[1_x;"[";"]"]
};
buildSelect:{[x]
inner parse x
};
We can use this to create a functional query that works:
q)n:1000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?10.0)
q)buildSelect "select from tab where i=fby[(last;i);([]col1;col2)]"
"?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]"
So we have the following as the functional form
?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
// Applying this
q)?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
sym col1 col2
----------------------
bah 18.70281 3.927524
jjb 35.95293 5.170911
ihm 48.09078 5.159796
...
Glad you were able to fix your problem with converting your query to functional form.
Generally, when you use parse with an fby in your statement, q converts that function into its k definition. Usually you can just replace this k code with the q function itself (i.e. change (k){stuff} to fby) and the query will run properly when turned into functional form.
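For instance, taking the parse output and swapping the expanded k lambda for the q keyword by hand yields a functional form that runs directly; a minimal single-column sketch against the tab defined above:
q)?[tab;enlist(=;`i;(fby;(enlist;last;`i);`col1));0b;()]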
Additionally, https://code.kx.com/v2/wp/parse-trees/ goes into more detail about parse trees and functional form. It also contains a script called buildQuery which returns the functional form of a query as a string, which can be quite handy and save time when a functional form is complex.
I actually got it myself:
?[tableName;((=;`i;(fby;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two)))));(in;`venue;enlist`venueone`venuetwo));0b;()]
The issue was a missing () in the statement. It works fine now.
If someone wants to add a more detailed explanation of how manual parse trees are built and how the generic (k){} function can be replaced with the actual q function, feel free to add your answer and I'll accept and upvote it.

Fastest type to use for comparing hashes in matlab

I have a table in MATLAB with some columns representing 128-bit hashes.
I would like to match each row to one or more other rows based on these hashes.
Currently, the hashes are represented as hexadecimal strings, and compared with strcmp(). Still, it takes many seconds to process the table.
What is the fastest way to compare two hashes in matlab?
I have tried turning them into categorical variables, but that is much slower. MATLAB, as far as I know, does not have a 128-bit numeric type, and the nominal and ordinal types are deprecated.
Are there any others that could work?
The code below is analogous to what I am doing:
nodetype = { 'type1'; 'type2'; 'type1'; 'type2' };
hash = {'d285e87940fb9383ec5e983041f8d7a6'; 'd285e87940fb9383ec5e983041f8d7a6'; 'ec9add3cf0f67f443d5820708adc0485'; '5dbdfa232b5b61c8b1e8c698a64e1cc9' };
entries = table(categorical(nodetype),hash,'VariableNames',{'type','hash'});
%nodes to match. filter by type or some other way so rows don't match to
%themselves.
A = entries(entries.type=='type1',:);
B = entries(entries.type=='type2',:);
%pick a node/row with a hash to find all counterparts of
row_to_match_in_A = A(1,:);
matching_rows_in_B = B(strcmp(B.hash,row_to_match_in_A.hash),:);
% do stuff with matching rows...
disp(matching_rows_in_B);
The hash strings are faithful representations of what I am using, but they are not necessarily read or stored as strings in the original source. They are just converted for this purpose because it's the fastest way to do the comparison.
Optimization is nice, if you need it. Try it out yourself and measure the performance gain for relevant test cases.
Some suggestions:
Sorted arrays are easier/faster to search
MATLAB's default numeric type is double, but you can also construct integers. Why not use two uint64 values instead of the 128-bit column? First search for the upper 64 bits, then for the lower; or even better, use ismember with the 'rows' option and put your hashes in rows:
A = uint64([0 0;
0 1;
1 0;
1 1;
2 0;
2 1]);
srch = uint64([1 1;
0 1]);
[ismatch, loc] = ismember(srch, A, 'rows')
> loc =
4
2
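To get the hex hashes into that two-column uint64 layout, a conversion along these lines could work; hash2uint64 is a hypothetical helper, and the 8-character chunks are there because hex2dec returns doubles, which are exact only up to 2^53:
function u = hash2uint64(h)
% convert a 32-hex-char hash into a 1x2 uint64 row [hi lo]
hi = uint64(hex2dec(h(1:8)))*2^32  + uint64(hex2dec(h(9:16)));
lo = uint64(hex2dec(h(17:24)))*2^32 + uint64(hex2dec(h(25:32)));
u = [hi lo];
end
% building an Nx2 matrix from the cell array of hashes above:
% hashes64 = cell2mat(cellfun(@hash2uint64, hash, 'UniformOutput', false));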
Look into the compare functions you use (e.g. edit ismember) and strip out unnecessary operations (e.g. sort) and safety checks that you know in advance won't pose a problem, like this solution does. Or, if you intend to call a search function multiple times, sort in advance and skip the check/sort in the search function later on.

Seed based randomization perl

I have an array defined, say @Array = (1..31). Now I have code where I randomly select a number a certain number of times and store the results in another array. Example below:
$a1 = $Array[rand(@Array)];
push (@a2, $a1);
Now when I execute this script multiple times, I see that the new array contains a very different pattern every time. But I do not want that; I want to generate the same pattern every time, which is where a seed comes into the picture.
Can someone please help me with how to incorporate a seed so that randomly selecting elements from the array becomes predictable?
You do not replace rand with srand: you use srand to initialize the seed for rand. So call srand(0) once and then use rand as you had been.
From your comment, you can use:
srand(0);
sub random {
    my $random_select = $_[rand(@_)];
    print "The random number selected is $random_select\n";
    return $random_select;
}
or, back to your original code, just add the first line to it:
BEGIN { srand(0) }
$a1 = $Array[rand(@Array)];
push (@a2, $a1);
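A quick way to see the effect: run the same seeded one-liner twice; the exact numbers depend on your Perl build, but they will be identical between runs:
perl -e 'srand(0); print int(rand(31)), " " for 1..5; print "\n"'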

kdb c++ interface: create byte list from std::string

The following is very slow for long strings:
std::string s = "long string";
K klist = DBVec::CreateList(KG, s.length());
for (int i = 0; i < s.length(); i++)
{
    kG(klist)[i] = s.c_str()[i];
}
It works acceptably fast (<100 ms) for strings up to 100k characters, but slows to a crawl (tens of minutes, possibly hours) for strings of a few million characters. I don't see anything other than kG that could introduce nonlinearity. I see no reason for the accessor function kG to be non-constant-time, but there is just nothing else in this loop. Unfortunately, I don't know how kG works, due to lack of documentation.
Question: given a blob of binary data as std::string, what's the efficient way to construct a byte list?
kG is a macro defined in k.h which expands to ((x)->G0), i.e. it follows the G0 pointer of the K object.
http://kx.com/q/d/a/c.htm#Strings documents kp, which creates a K string object directly from a string, so presumably you could do K klist = kp(s.c_str()), which is probably faster.
This works:
memcpy(kG(klist), s.c_str(), s.length());
Still wonder why that loop is not O(N).
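For reference, a minimal self-contained version assuming the standard k.h C API: ktn allocates the byte vector in one call and memcpy fills it in O(N); for char lists, kpn(s, n) takes an explicit length, so unlike kp it also copes with embedded null bytes:
#include <string>
#include <cstring>
#include "k.h"

// build a KG (byte) list from arbitrary binary data with one bulk copy
K bytesFromString(const std::string& s)
{
    K klist = ktn(KG, s.length());                 // allocate byte vector
    std::memcpy(kG(klist), s.data(), s.length());  // O(N) copy
    return klist;
}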