Aggregate/sum function of a table in MATLAB

In MATLAB I have read a table in from a CSV file, then moved the two columns I am interested in into a new table. These columns are "ID" (of a person, 1-400) and another ID representing their occupation (1-12).
What I want to do is create a simple table with 12 records and 2 columns: one record for each job, with a count of the user IDs who hold that job. Such a table could easily be bar charted. At the moment I have 400 user records, each with a user ID and one of the 12 possible job IDs.
So, much like an SQL aggregate/sum function, but I want to do it in MATLAB with a table object. The problem I am having is working out how to do this without resorting to a cell array or something similar.
Thanks!

I know that you found an answer yourself, but I would like to mention the histc function, which avoids the loop (and is faster for larger matrices):
JobCounts = histc(OccupationTable(:,2), 1:NumberOfJobs);
Combining this with the job number gives the desired result:
result = [(1:NumberOfJobs)' JobCounts];
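Note that histc expects numeric input; if OccupationTable is an actual table object rather than a numeric matrix, extract the column with curly braces (OccupationTable{:,2}). In newer MATLAB releases there is also groupsummary, which can produce this count directly from a table; a minimal sketch, assuming the occupation column is named JobID:
counts = groupsummary(OccupationTable, 'JobID');   % table with JobID and GroupCount columns
bar(counts.JobID, counts.GroupCount)               % bar chart of users per job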

Never mind, solved it. I just looped through the job numbers and ran "sum" where the ID was equal to what I wanted:
JobCounts = zeros(NumberOfJobs, 1);   % preallocate
for i = 1:NumberOfJobs
    JobCounts(i) = sum(OccupationTable(:,2) == i);
end
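For what it's worth, the same counts can be computed without the loop using accumarray, which tallies how many times each job ID occurs (a sketch, assuming the second column is a numeric vector of integer job IDs 1..NumberOfJobs):
JobCounts = accumarray(OccupationTable(:,2), 1, [NumberOfJobs 1]);   % count occurrences of each job ID
result = [(1:NumberOfJobs)' JobCounts];                              % job ID alongside its count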

Related

Split a MATLAB table into several tables dynamically

I am working in MATLAB and have not yet found a way to split a table T into different tables {T1,T2,T3,...} dynamically. What I mean by dynamic is that it must be done based on some conditions of the table T that are not known a priori. For now, I do it in a non-dynamic way with the following code (I hard-code the number of tables I want to have).
%% Separate data of table T in tables T1,T2,T3
starting_index = 1;
T1 = T(1:counter_simulations(1),:);
starting_index = counter_simulations(1)+1;
T2 = T(starting_index:starting_index+counter_simulations(2)-1,:);
starting_index = starting_index + counter_simulations(2);
T3 = T(starting_index:starting_index+counter_simulations(3)-1,:);
Any ideas on how to do it dynamically? I would like to do something like this:
for (i=1:number_of_tables_to_create)
T{i} = ...
end
EDIT: the variable counter_simulations is an array containing the number of rows I want to extract for each table. For example, counter_simulations(1)=200 means that the first table will be T1 = T(1:200, :). If counter_simulations(2)=300, the second table will be T2 = T(201:500, :), and so on.
I hope I was clear.
Should I use cell arrays instead of tables maybe?
Thanks!
For the example you give, where counter_simulations contains a list of the number of rows to take from T in each of the output tables, MATLAB's mat2cell function actually implements this behaviour directly:
T = mat2cell(T,counter_simulations);
While you haven't specified the contents of counter_simulations, it's clear that if sum(counter_simulations) > height(T) the example would fail. If sum(counter_simulations) < height(T) (and so your desired output doesn't contain the last row(s) of T) then you would need to add a final element to counter_simulations and then discard the resulting output table:
counter_simulations(end+1) = height(T) - sum(counter_simulations);
T = mat2cell(T,counter_simulations);
T(end) = [];
Whether this solution applies to all examples of "some conditions of the table T that are not known a priori" that you ask about in the question depends on the range of conditions you actually mean; for a broad enough interpretation there will be no general solution, but you might be able to narrow things down if mat2cell performs too specific a job for your actual problem.
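If you prefer the explicit loop from your question, an equivalent sketch (assuming sum(counter_simulations) does not exceed height(T)) builds each row range from a cumulative sum:
edges = [0; cumsum(counter_simulations(:))];      % cumulative row boundaries
Tsplit = cell(numel(counter_simulations), 1);     % preallocate the output cell array
for i = 1:numel(counter_simulations)
    Tsplit{i} = T(edges(i)+1:edges(i+1), :);      % rows belonging to the i-th table
end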

Transpose table with an unknown number of variables

I would like to transpose table a to table b without knowing exactly how many procedures there are. Is there a way to include a loop inside a query?
Thank you in advance!
So far I am just checking what the maximum number of 'procedures' is, putting all the procedures in an array, and then querying every element of this array. However, I would like a query that always works without first defining the maximum number of procedures.

How to generate line segments from survey records using PostGIS

I have a table of survey locations as
id,from,to,azimuth,x,y
'L1',0,5,120,508776,7098873
'L1',5,10,141,null,null
'L1',10,24,121,null,null
'L2',0,12,135,507882,8020098
'L2',12,15,121,null,null
'L2',15,25,null,null
Each line "id" can have 2 or more records defining their geometry.
Using a postgis query, how can I create line segments for each of these records, assuming the x and y values for the line starts are in EPSG:3578?
I've tried LAG and LEAD OVER (Partition BY "id" order by "from_m"), but I get lost in the recursion needed. Is what I'm attempting possible?
If you select the whole table, the database will evaluate the LAG/LEAD window function for every selected row, and you can return its result as a new column. I think that already gives you the recursion you need.
I once did this with time-based positions and created a line from the temporally latest point to its predecessor. I ordered by time; in your case you would order by 'from' or 'to'. Sometimes running an independent query for every id is easier than doing the whole thing at once.
Have a look at this thread; they are trying pretty much the opposite of what you want, but maybe it will help you clarify things.
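To illustrate the windowing pattern (not a complete solution, since the rows with null x/y would first need their coordinates computed from the azimuth and the from/to distances), here is a minimal sketch that pairs each record with the previous one per id and builds a segment; the table name survey is hypothetical and it assumes x/y are populated on every row:
SELECT id, "from", "to",
       ST_MakeLine(
           ST_SetSRID(ST_MakePoint(prev_x, prev_y), 3578),
           ST_SetSRID(ST_MakePoint(x, y), 3578)
       ) AS segment
FROM (
    SELECT id, "from", "to", x, y,
           LAG(x) OVER (PARTITION BY id ORDER BY "from") AS prev_x,
           LAG(y) OVER (PARTITION BY id ORDER BY "from") AS prev_y
    FROM survey
) s
WHERE prev_x IS NOT NULL;   -- skip the first record of each line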

kdb+/q optimize union function

To give you a bit of background: I have a process which does a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the internal q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns; the first column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]sym:n?`4;val:n?100)
small:([]sym:500?`4;val:500?100)
\ts big union small
I tried keying both columns and upserting, join followed by distinct, and "big,small where not small in big", but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
/ big ids are unique numbers from 0 to 4999999
/ the table is keyed on id with the 1! operator
big:1!([]id:(neg n)?n;val:n?100)
/ small ids are unique: 250 from the 0-4999999 range and 250 from the 5000000-9999999 range
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert into it in place:
`big upsert small
If big is local:
big:big upsert small
As a result, big will have 500250 rows, because there are 250 common keys (the id column) between the big and small tables.
This may not be relevant, but just a quick thought: if your big table has a column of type sym and that column does not really show up much throughout your program, why not cast it to string or another type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema of the table to minimise the amount of rehashing (not sure if that is the right term!) of your sym file, or, as the above person mentioned, trying to assign an attribute to your table; that may reduce the time too.
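For reference, assigning an attribute is usually done with an update (or a functional amend); a minimal sketch, assuming an unkeyed copy of big with an id column, where the grouped attribute `g# speeds up lookups and joins on that column:
t:0!big                 / take an unkeyed copy of the keyed table
t:update `g#id from t   / apply the grouped attribute to the id column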

Stored procedure in PostgreSQL, multiple queries with aggregates

I'm trying to write a stored procedure that can take some input parameters (obviously), run multiple queries against those, take the output from those queries and do calculations on it, and from those calculations and the original queries output a formatted text string like:
Number of Rows for max(Z) matching condition x and y of total rows matching x (x&y/x*100).
To explain the max(Z) bit: this will be the username field, and it won't matter which actual entry is picked, because the WHERE clause will filter the results by user id. Is there a saner way to do this?
For starters, break the code up into multiple procedures. Don't create one procedure that does all of these things.
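As an illustration of that split, here is a rough sketch using hypothetical names (a table mytable with boolean condition columns x_cond and y_cond and a user_id column): one small function per count, plus a formatter that combines them.
-- Hypothetical schema: mytable(user_id integer, x_cond boolean, y_cond boolean, z text, ...)
CREATE OR REPLACE FUNCTION count_matching_x(p_user_id integer) RETURNS bigint AS $$
    SELECT count(*) FROM mytable WHERE x_cond AND user_id = p_user_id;
$$ LANGUAGE sql;

CREATE OR REPLACE FUNCTION count_matching_xy(p_user_id integer) RETURNS bigint AS $$
    SELECT count(*) FROM mytable WHERE x_cond AND y_cond AND user_id = p_user_id;
$$ LANGUAGE sql;

CREATE OR REPLACE FUNCTION report_summary(p_user_id integer) RETURNS text AS $$
DECLARE
    n_x  bigint := count_matching_x(p_user_id);
    n_xy bigint := count_matching_xy(p_user_id);
BEGIN
    RETURN format('%s of %s rows match both conditions (%s%%)',
                  n_xy, n_x, round(100.0 * n_xy / nullif(n_x, 0), 1));
END;
$$ LANGUAGE plpgsql;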