Matlab: data from cell matrix to struct. How can I organize my data with keys? - matlab

Suppose I have some data obtained by an SQL database query, as below (of course my real case is bigger: thousands of rows and many columns).
key_names   header1   header2   header3
---------------------------------------
key1        a         1         bar
key2        b         2         foo
key3        c         3         bla
My goal is to organize the data in Matlab (which I must use at work) in a smart and efficient way that supports the following:
Access data by key, obtaining the whole row, like dataset(key, :)
Access data by key plus header, getting back a single value, like dataset.header(key)
If possible, get a whole column (for all keys).
First of all, I used the dataset class provided by the Statistics Toolbox because it has all these features, but I decided to move away from it because it is really slow (from what I gathered, it is basically a wrapper around cell arrays): the bottleneck of my code was getting the data rather than performing computations. In fact, I read that it is better to avoid it as much as possible.
The newer table class looks more efficient, but still not by much: from what I have understood, it is the new version of dataset, as explained in the official documentation.
I also considered using containers.Map, but it does not seem to offer access by both key and column.
Therefore, struct seems to be the best choice, as it is really fast and has all the features I'm looking for.
So here are my questions:
Has anyone faced the same problem? Which way of organizing the data is the best?
Suppose struct is the best. How can I efficiently create and fill a structure like this: mystruct.key.header?
I'd like to get something like this:
mystruct.key1.header1
ans = a
Of course I could loop, but there must be a better way. I found this good starting point, but the struct it creates is empty:
fn1 = {'a', 'b', 'c'}; %first level
fn2 = {'d', 'e', 'f'}; %second level
s2 = cell2struct(cell(size(fn2(:))),fn2(:));
s = cell2struct(repmat({s2},size(fn1(:))),fn1(:))
None of the examples in the cell2struct documentation name all the levels. The help for deal shows a good way to fill in the data (depending on the Matlab version, since from 7.0 it was replaced by a new coding style), but I'm still missing how to combine creating the structure with filling it.
Any suggestion or code example would be really appreciated.

If you think, or are sure, that structs are the best option for you, you can use table2struct. First, import all the data into Matlab as a table, and then convert it to a structure:
mystruct = table2struct(data);
To access your data you would use the following syntax:
mystruct(key).header
If key is an array, then you need to collect all the values into a list, using either a cell array:
values = {mystruct(key).header}
or separate variables:
[v1, v2, v3] = mystruct(key).header
but the latter option is problematic if you are not sure how many outputs to expect.
I'm not sure which will be more convenient for you, but you can also convert to a scalar structure by setting the 'ToScalar' argument to true.

Related

Handle missing data

I was wondering how people typically handle missing data problems?
I read some articles about imputing missing data, where basically the idea is to replace the missing data by some value calculated in some way.
For example, suppose I have a table with some missing cells, and I want to fill these cells using some imputation technique. I imagine I should first choose some function f carefully and apply f to some existing data in the table to compute the value that replaces a specific missing value. Is this true?
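To make the idea concrete, here is a minimal Python sketch in which f is simply the mean of the observed values in the same column. The table contents, the column names and the choice of the mean are all assumptions made up for this example; a median or a model-based prediction would slot into the same place.

```python
from statistics import mean

# Hypothetical table: each row is a dict, and None marks a missing cell.
rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": None},
    {"age": 50, "income": 56000},
]

def impute_column_mean(rows, column):
    """Return a copy of rows where missing values in `column` are
    replaced by the mean of the observed values in that column."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)  # this is f applied to the existing data
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in rows
    ]

filled = impute_column_mean(rows, "age")
filled = impute_column_mean(filled, "income")
```

Whether the mean is a sensible f depends on why the values are missing, so treat this only as the mechanical part of the answer.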

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (it could be a database or any other kind of input).
tDenormalize: denormalizes your value column (column 2) based on the value in the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in 2 rows as shown.
tMap: splits the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String in Object format. Then cast that object to an Array on the right side of the Map.
That approach should give you the expected result.
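For readers who want to see the logic outside Talend, the two steps can be sketched in plain Python. This is not Talend code: the sample records, the function names and the use of Python are assumptions for illustration only, and the sketch just mirrors the group-and-join step followed by the split step.

```python
from collections import defaultdict

# Hypothetical input: (id, value) pairs, one value per line of the CSV.
records = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "d"), (2, "e"), (2, "f"),
]

DELIMITER = ";"

def denormalize(records):
    """Group values by id and join them with the delimiter
    (the role tDenormalize plays in the job)."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return {key: DELIMITER.join(values) for key, values in grouped.items()}

def split_columns(denormalized):
    """Split each joined string into separate columns
    (the role tMap plays in the job), sorted by id."""
    return [
        [key, *joined.split(DELIMITER)]
        for key, joined in sorted(denormalized.items())
    ]

table = split_columns(denormalize(records))
```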
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you will probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name, like "Identifier, field name, value".
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

iOS : Storing a table of rows and columns

I'm just mulling over what's the best way, i.e. data structure, to store data that has several rows and columns. Should I store it as:
1. an array of arrays?
2. NSDictionary?
or is there any grid-like data structure in iOS from which I can easily fetch any row/column? For example, I must be able to fetch the value in the 3rd column in row 5. Currently, say, I store each row as an array and then store these arrays in another array (so an array of arrays); then to fetch the value in column 3 in row 5, I need to fetch the 5th row in the array of arrays, and then in the resulting array, I need to fetch the 3rd object. Is there a better way to do this? Thoughts please?
then to fetch the value in column 3 in row 5, I need to fetch the 5th
row in the array of arrays, and then in the resulting array, I need to
fetch the 3rd object. Is there a better way to do this?
An array of arrays is fine for the implementation, and the collection subscripting that was recently added to Objective-C makes this easier -- you can use an expression like
NSString *s = myData[m][n];
to get the string at the nth column of the mth row.
That said, it may still be a good idea to create a separate class for your data structure, so that the rest of your code is protected from needing to know about how the data is stored. That would also simplify the process of changing the implementation from, say, an array of arrays to a SQLite table or something else.
Your data storage class doesn't need to be fancy or complicated. Here's a first pass:
// Minimal interface for a grid of objects addressed by row and column.
@interface DataTable : NSObject
- (id)objectAtRow:(NSInteger)row column:(NSInteger)column;
- (void)setObject:(id)object atRow:(NSInteger)row column:(NSInteger)column;
@end
I'm sure you can see how to implement those in terms of an array of arrays. You'll have to do a little work to add rows and/or columns when the caller tries to set a value outside the current bounds. You might also want to add support for things like fast enumeration and writing to and reading from property lists, but that can come later.
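To show the bookkeeping involved when a caller sets a value outside the current bounds, here is a rough sketch of the same idea. It is written in Python rather than Objective-C purely to keep it short, and the class name, method names and padding-with-None behaviour are assumptions for illustration, not part of any iOS API.

```python
class DataTable:
    """Grid backed by a list of lists; grows when a cell outside
    the current bounds is set."""

    def __init__(self):
        self._rows = []
        self._column_count = 0

    def get(self, row, column):
        """Return the value at (row, column), or None if out of bounds or unset."""
        if 0 <= row < len(self._rows) and 0 <= column < self._column_count:
            return self._rows[row][column]
        return None

    def set(self, row, column, value):
        """Store value at (row, column), padding rows and columns with None."""
        if row < 0 or column < 0:
            raise IndexError("row and column must be non-negative")
        if column >= self._column_count:
            # Widen every existing row so the grid stays rectangular.
            self._column_count = column + 1
            for existing in self._rows:
                existing.extend([None] * (self._column_count - len(existing)))
        while len(self._rows) <= row:
            self._rows.append([None] * self._column_count)
        self._rows[row][column] = value

table = DataTable()
table.set(4, 2, "hello")  # row 5, column 3 in one-based terms
table.set(0, 5, "wide")   # forces all rows to grow to six columns
```

The rest of the code only ever calls get and set, so swapping the list of lists for another store later would not affect callers.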
There are other ways of doing it, but there's nothing wrong with the method you are using. You could use an NSDictionary with a key of type NSIndexPath, for example, or even a string key of the form "row,col", but I don't see any advantage in those except for sparse matrices.
You can either use an array of arrays, as you're doing, or an array of dictionaries. Either is fine, and I don't think there's any preference for one over the other. It all depends on which way is most convenient for you to set up the data structure in the first place. Accessing the data for the table view is equally easy using either method.

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name set content_type = (case when old_content_type = 'a' then 1
when old_content_type = 'b' then 2 else 3 end);
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
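If you want to try the add-column-then-update pattern before touching the real table, something like the following works as a dry run. It uses Python's built-in sqlite3 module as a stand-in for Postgres, and the table name entries and the three type names are made up for the example; the UPDATE with CASE is standard SQL, so the statement itself should carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, old_content_type TEXT)"
)
conn.executemany(
    "INSERT INTO entries (old_content_type) VALUES (?)",
    [("article",), ("video",), ("podcast",)],
)

# Add the new integer column; it starts out NULL for every row.
conn.execute("ALTER TABLE entries ADD COLUMN content_type INTEGER")

# Fill it from the old text column in a single UPDATE.
conn.execute(
    """
    UPDATE entries
    SET content_type = CASE old_content_type
        WHEN 'article' THEN 1
        WHEN 'video' THEN 2
        ELSE 3
    END
    """
)
conn.commit()

result = conn.execute(
    "SELECT old_content_type, content_type FROM entries ORDER BY id"
).fetchall()
```

Note the ELSE branch: any value you forgot to list silently becomes 3, so it may be safer to list every type explicitly and check for NULLs afterwards.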
Ideally you'd have these fields referring to a table containing the definitions of type. This should be via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
1. Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious, but it breaks down when you have a table that requires many attributes.
2. You can use the Entity-attribute-value model, but beware that it is too easy to abuse and causes problems when things grow.
3. You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 & 2.

What's a real world example of something you would represent with a hash?

I'm just trying to get a grip on when you would need to use a hash and when it might be better to use an array. What kind of real-world object would a hash represent, say, in the case of strings?
I believe sometimes a hash is referred to as a "dictionary", and I think that's a good example in itself. If you want to look up the definition of a word, it's nice to just do something like:
definition['pernicious']
Instead of trying to figure out the correct numeric index that the definition would be stored at.
This answer assumes that by "hash" you're basically just referring to an associative array.
I think you're looking at things in the wrong direction. It is not the object which determines if you should use a hash but the manner in which you are accessing it. A common use of a hash is as a lookup table. If your objects are strings and you want to check if they exist in a Dictionary, looking them up will (assuming the hash works properly) be O(1). With sorting, the time would instead be O(log n), which may not be acceptable.
Thus, hashes are ideal for use with Dictionaries (hashmaps), sets (hashsets), etc.
They are also a useful way of representing an object without storing the object itself (for passwords).
The phone book - key = name, value = phone number.
I also think of the old World Book Encyclopedias (actual books). Each article is "hashed" into a single book (cat goes in the "C" volume).
Any time you have data that is well served by a 1-to-1 map.
For example, grades in a class:
"John Smith" => "B+"
"Jacob Jenkens" => "C"
etc
In general, hashes are used to find things fast: a hash map can be used to associate one thing with another fast, and a hash set will just store things "fast".
Please also consider the complexity and cost of the hash function when deciding whether it's better to use a hash container or a normal less-than container: the additional size of the hash value, the time needed to compute a "perfect" hash, and the time needed to make a 1:1 comparison at the end in case of a hash collision may in fact be a lot higher than just going through a tree structure with logarithmic complexity using the less-than operators.
When you need to associate one variable with another. There isn't a "type limit" to what can be a key/value in a hash.
Hashes have many uses. Aside from cryptographic uses, they are commonly used for quick lookups of information. To get similarly quick lookups using an array you would need to keep the array sorted and then use a binary search. With a hash you get the fast lookup without having to sort. This is the reason most scripting languages implement hashing under one name or another (dictionaries, et al).
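The two approaches can be put side by side in a few lines of Python. The words and definitions are made up for the example, and the function names are hypothetical; the point is only that the array version has to be sorted first and searched with bisect, while the dict version needs neither.

```python
from bisect import bisect_left

# Hypothetical data: words mapped to short definitions.
entries = [
    ("pernicious", "harmful"),
    ("array", "ordered list"),
    ("hash", "key-value map"),
]

# Array approach: keep the pairs sorted by key, then binary-search.
sorted_entries = sorted(entries)
sorted_keys = [key for key, _ in sorted_entries]

def lookup_sorted(key):
    """O(log n) lookup in the sorted array; returns None if the key is absent."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_entries[i][1]
    return None

# Hash approach: no sorting needed, average O(1) lookup.
table = dict(entries)

def lookup_hash(key):
    """Average O(1) lookup in the hash table; returns None if the key is absent."""
    return table.get(key)
```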
I use one often for a "dictionary" of settings for my app.
Setting | Value
I load them from the database or config file, into hashtable for use by my app.
Works well, and is simple.
One example could be zip code associated with an area, city or any postal address.
A good example is a cache with lots of elements in it. You have some identifier by which you want to look up a value (say a URL, and you want to find the corresponding cached webpage). You want these lookups to be as fast as possible and don't want to search through all the stored pages every time some URL is requested. A hash table is a great data structure for a problem like this.
One real-world example I just wrote: I was adding up the amount people spent on meals when filing expense reports. I needed to get a daily total with no idea how many items would exist on a particular day and no idea what the date range for the expense report would be. There are restrictions on how much a person can expense, with many variables (which city, weekend, etc.).
The hash table was the perfect tool to handle this. The key was the date and the value was the receipt amount (converted to USD). The receipts could come in any order; I just kept getting the value for that date and adding to it until the job was done. Displaying was easy as well.
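Roughly, the pattern looks like this in Python. The dates and amounts are invented for the example and the original code was not written this way; it is just a sketch of keying a running total by date.

```python
from collections import defaultdict

# Hypothetical receipts: (date, amount in USD), arriving in no particular order.
receipts = [
    ("2024-03-02", 12.50),
    ("2024-03-01", 8.00),
    ("2024-03-02", 30.25),
    ("2024-03-01", 4.50),
]

daily_totals = defaultdict(float)
for date, amount in receipts:
    # Look up the running total for that date and add to it.
    daily_totals[date] += amount

# Display one line per day, in date order.
for date in sorted(daily_totals):
    print(date, f"{daily_totals[date]:.2f}")
```

No date range or item count has to be known up front; a new key simply appears the first time a date is seen.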
(php code)
$david = new stdclass();
$david->name = "david";
$david->age = 12;
$david->id = 1;
$david->title = "manager";
$joe = new stdclass();
$joe->name = "joe";
$joe->age = 17;
$joe->id = 2;
$joe->title = "employee";
// option 1: lets put users by index
$users[] = $david;
$users[] = $joe;
// option 2: lets put users by title
$users[$david->title] = $david;
$users[$joe->title] = $joe;
now the question: who is the manager?
answer:
$users["manager"]