How to save a matrix whose every element is a cell containing a 17-column MATLAB table?

Let's say I have data from a certain activity over 500 days. The duration of the activity varies across those days, but every day's activity is a table 17 columns wide.
I created a (500 x 1) MAT-file of zeros called 'activity_database.mat' and then tried this in MATLAB:
clear
load 'activity_database.mat'
for v = 1:500
    % DO SOMETHING TO GET A TABLE
    activity_data{v} = merged_table;
    save('activity_database.mat', 'activity_data')
end
Now, after running the code, when I try to load activity_database.mat I receive the following error:
Error using load
Unable to read MAT-file C:\Users\jackryan\activity_database.mat. File might be corrupt.
What am I doing wrong here? Also, the database actually has 50,000 elements, so I'm expecting an out-of-space error too (about 30 GB). Is there a way to store all this data within reasonable space bounds?

Instead of accumulating the entire dataset in a single file, you could save one file per day, named in a predictable order. Something like:
first_date = datenum(2012, 12, 20);
db_folder = '//somewhere/over/the/rainbow/';
for v = 1:500
    % DO SOMETHING TO GET A TABLE
    mat_name = sprintf('activity_day_%s.mat', datestr(first_date+v-1, 'yyyymmdd'));
    save(fullfile(db_folder, mat_name), 'merged_table');
end
You should not have problems with over-sized .mat files this way (note that a single ~30 GB file would in any case exceed the 2 GB limit of the default MAT format, requiring save(...,'-v7.3')), and you can selectively load the data for the days you need.
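For instance, loading a single day back might look like this (a sketch, assuming the naming scheme above; day_idx is a hypothetical index of the day you want):
first_date = datenum(2012, 12, 20);
day_idx    = 42;   % hypothetical day of interest
mat_name   = sprintf('activity_day_%s.mat', datestr(first_date+day_idx-1, 'yyyymmdd'));
S = load(fullfile(db_folder, mat_name));   % S.merged_table holds that day's table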

Related

Talend - How to get tFlowToIterate size and tFileInputRegex size?

Good day,
I have a tFileInputRegex component and a tFlowToIterate reading data from a text file, and I can see the number of rows being processed: 3412 rows. May I know how I can get this value in tJava_2?
Currently I am using _NB_LINE but getting null:
System.out.println("total is " + (Integer)globalMap.get("tFileInputRegex_1_NB_LINE"));
System.out.println("total2 is " + (Integer)globalMap.get("tFlowToIterate_2_NB_LINE"));
In your example, tJava_2 executes within the iteration, i.e. once for each row. In that component, you can use globalMap.get("tFlowToIterate_2_CURRENT_ITERATION").toString() to get the number of rows processed so far. Please note that instead of casting it to Integer you need to convert it to a string as shown above in order to output it the way you do.
If you need the total number of rows, you can use globalMap.get("tFileInputRegex_1_NB_LINE").toString() - but it is only available after the end of the loop, which means the component where you access it needs to be connected to tFileInputRegex_1 via OnSubjobOk trigger.
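Put together (a sketch; component names as in your job):
// In tJava_2, inside the iteration: number of rows processed so far
System.out.println("so far: " + globalMap.get("tFlowToIterate_2_CURRENT_ITERATION").toString());

// In a tJava connected to the subjob via OnSubjobOk: the final total
System.out.println("total is " + globalMap.get("tFileInputRegex_1_NB_LINE").toString());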

How to check if the stream of rows has ended

Is there a way for me to know if the stream of rows has ended, i.e. if the job is on the last row?
What I'm trying to do is, for every 10 rows, do something. My problem is the last rows: for example with 115 rows, the last 5 won't trigger it, but I need them to.
There is no built-in functionality in Talend which tells you if you're on the last row. You can work around this using one of the following:
Get the row count beforehand. For instance, if you have a file, you can use tFileRowCount to count its rows; then, while processing the file, keep a variable with the current row number so you can tell when you've reached the last row (see the sketch after this list). If your data comes from a database, you could either issue a query that returns the total number of rows beforehand, or modify your main query to return the total in an additional column (using ranking functions) and use that.
Do some processing after the subjob has ended. If the last row needs special processing, you can get the last row handled by the previous subjob, which you have already saved, for instance by putting a tSetGlobalVar after your target; when the subjob is done, the variable contains the last written value.
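A sketch of the first approach in a tJavaRow, assuming the total was counted beforehand by a tFileRowCount named tFileRowCount_1 and a global "counter" was initialized to 0 in a preceding tJava:
// running row number, kept in globalMap (initialized to 0 beforehand)
int total   = (Integer) globalMap.get("tFileRowCount_1_COUNT");
int counter = (Integer) globalMap.get("counter") + 1;
globalMap.put("counter", counter);

// fire every 10 rows, and also on the last row so the final partial
// batch (e.g. rows 111-115 of 115) is not skipped
if (counter % 10 == 0 || counter == total) {
    System.out.println("flush at row " + counter + " of " + total);
}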
Edit
For your use case, you could first store the result of the API call in memory using tHashOutput, then read it back with a tHashInput in order to process it; you'll then know how many rows you retrieved via tHashOutput's global variable tHashOutput_X_NB_LINE.

tJavaFlex behaviour when changing loop position

Having some problems in a job, and I suspect it is due to a lack of understanding of tJavaFlex. I am generating 10 rows in this test job, and generating a loop inside a tJavaFlex:
So there are 10 rows coming in, and a loop in the Start and End sections. I was expecting that for each row coming in, it would generate 10 identical rows going out, and that I would see iterations 0,1,2,...,9 for each row.
What I got instead looks like the entire job running 10 times, giving me 100 random values coming through the flow from the tRowGenerator.
If I move the for loop into the Main Code section, I get closer to the behaviour I was expecting: each incoming row repeated 10 times, so 1 row in produces 10 rows out.
But even then my tLogRow only seems to generate one row for every 10 iterations (looking at the tLogRow output after iteration 9: why not 10 items?). I had thought I would get 10 rows for each single row coming in, and see this in the tLogRow.
What I need to do is take a value from an incoming field, do some regexp parsing, split it into an array, and then create one line in the output flow for each item in the array; i.e. 1 row coming in can be turned into x rows coming out using a String.split().
Can someone explain the behaviour above, and also advise on the best approach to get one value coming in, do some java manipulation and then generate multiple rows coming out?
Any advice appreciated.
You are not using it correctly.
The Start code is for initializing variables (it executes once, before the first row).
The Main code is where the per-row work goes (it executes once for each row).
The End code executes once after the last row, e.g. to store a result in a global variable.
Because the Main code already runs once per row in a tJavaFlex, don't put your own for loop inside it; you can use it as in the example below.
Your tJavaFlex's behaviour is normal: you have ten rows, so for each row the for loop executes 10 times (i < 10). By putting the for loop in the Start code, your Main code is driven both by the loop and by the incoming rows, so it executes n*r times.
The behaviour of a subjob containing a tJavaFlex suggests that the component before the tJavaFlex is included in its starting code and the component after it in its ending code, though this may depend on conditions such as data propagation and the trigger type.
Start code:
System.out.print("tJavaFlex is starting...");
int i = 0;

Main code:
i++;
System.out.print("tJavaFlex inside Main Code...iteration:" + i);
row8.ITEM_NAME = row7.ITEM_NAME;
row8.ITEM_COUNT = row7.ITEM_COUNT;

End code:
System.out.print("tJavaFlex is ending...");
System.out.print(row7.ITEM_NAME);
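To see why a loop opened in the Start code multiplies executions, here is a simplified plain-Java sketch (not actual Talend-generated code) of how the three sections are assembled around Talend's own row loop:
public class FlexOrderDemo {
    public static void main(String[] args) {
        // Start code is emitted before the row loop, Main code inside it,
        // End code after it; a "for (...) {" opened in Start and closed in
        // End therefore wraps the whole row loop.
        for (int i = 0; i < 10; i++) {            // from Start code
            for (int row = 0; row < 10; row++) {  // Talend's row loop
                // Main code lands here: 10 * 10 = 100 executions
                System.out.println("outer=" + i + " row=" + row);
            }
        }                                          // "}" from End code
    }
}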
Instead of a main flow in row5, try using an iterate flow to connect to the tJavaFlex.
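For the 1-row-in, x-rows-out requirement, one common pattern (a sketch with a hypothetical field name and globalMap key) is to let the tJavaFlex generate the rows itself: open a loop over the split array in the Start code and close it in the End code, so the Main code emits one output row per array element. This assumes the tJavaFlex starts its own flow and receives the value via the iterate link, which stores it in globalMap:

Start code:
// value stored in globalMap by the upstream iterate link (hypothetical key)
String[] items = ((String) globalMap.get("row5.ITEM_LIST")).split(";");
for (String item : items) {

Main code:
row8.ITEM_NAME = item;   // one output row per array element

End code:
}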

Loading multiple non-CSV tables into R and performing a function on each file

First day on R. I may be expecting too much from it, but here is what I'm looking for:
I have multiple files (140 tables), and each table has two columns (V1 = values, V2 = frequencies). I use the following code to get the average from each table:
sum(V1 * V2) / sum(V2)
I was wondering if it's possible to do this once instead of 140 times, i.e. to load all the files and get an exported file that shows the average of each table next to the original file name.
I use read.table to load the files, as read.csv doesn't work well for some reason.
I'll appreciate any input!
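To run this over all 140 files at once, a sketch (assuming the tables sit in one folder; adjust the path, pattern, and read.table arguments to match the real files):

files <- list.files("path/to/tables", pattern = "\\.txt$", full.names = TRUE)

avgs <- sapply(files, function(f) {
  d <- read.table(f)               # two columns: V1 = values, V2 = frequencies
  sum(d$V1 * d$V2) / sum(d$V2)     # frequency-weighted average
})

result <- data.frame(file = basename(files), avg = avgs, row.names = NULL)
write.csv(result, "averages.csv", row.names = FALSE)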

KDB+/Q query too heavy to handle

I want to grab data from a KDB database for a list of roughly 200 days within the last two years. The 200 days follow no particular pattern.
I only need the data from 09:29:00.000 to 09:31:00.000 each day.
My first approach was to query all of the last two years' data with a timestamp between 09:29:00.000 and 09:31:00.000, because I didn't see a way to query just the particular 200 days I need.
However, this proved to be too much for my server to handle.
Then I tried summarizing the 2-minute window for each date into an average and printing only that, so the output would be just 200 rows. But somehow this still turns out to be too much; I'm not sure if I'm selecting the data correctly.
My other suspicion is that the query grabs all the data first and then averages each date, which means the averaging doesn't make it any easier to handle.
Here's the code that I have:
select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty) by date from dms where date within (2015.01.01;2016.06.10), time within (09:29:00.000;09:31:00.000), sym=`ZF6
dms is the table that the data is in
`ZF6 is the symbol that I'm looking for
I tried adding the keyword distinct after select.
I want to know if there's any way to break up the query, or make it lighter for the server to handle.
Thank you!
If you use 32-bit kdb+ and get the infamous 'wsfull error, then you may try processing one day at a time, like this:
raze {select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty)
  from dms where date=x, sym=`ZF6, time within 09:29:00.000 09:31:00.000
  } each 2015.01.01 + til 1 + 2016.06.10 - 2015.01.01
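Since only about 200 specific days are needed, the same per-day function can be mapped over just that list instead of the whole range (a sketch; days stands for your list of 200 dates):
days: 2015.01.05 2015.01.12 2015.02.03   / ...your 200 dates here
raze {select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty)
  from dms where date=x, sym=`ZF6, time within 09:29:00.000 09:31:00.000} each days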