How do people typically handle missing data problems?
I have read some articles about imputing missing data, where the basic idea is to replace each missing value with a value computed from the data that is present.
For example, suppose I have a table with some missing cells, and I want to fill those cells using some imputation technique. I imagine I should first choose a suitable function f and apply f to some of the existing data in the table to compute the value that replaces a specific missing value. Is this correct?
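For instance, I imagine one simple choice of f could be the mean of the observed values in the same column; a minimal MATLAB sketch of what I have in mind (with made-up data):
% Column with missing entries (NaN marks a missing value)
x = [2; NaN; 4; 6; NaN];
% Mean imputation: f is the mean of the observed (non-missing) values
xFilled = fillmissing(x, 'constant', mean(x, 'omitnan'));
% xFilled is now [2; 4; 4; 6; 4]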
After reading an Excel spreadsheet into Matlab, I unfortunately have NaNs in my resulting table. So for example this Excel table:
would result in this table:
where an additional column of NaNs occurs. I tried to remove the NaNs with the following code snippet:
measurementCells = readtable('MWE.xlsx','ReadVariableNames',false,'ReadRowNames',true);
measurementCells = measurementCells(any(isstruct(measurementCells('TIME',1)),1),:);
However this results in a 0x6 table, without any values present anymore. How can I properly remove the NaNs without removing any data from the table?
Either this:
tab = tab(~any(ismissing(tab),2),:);
or:
tab = rmmissing(tab);
if you want to remove rows that contain one or more missing values.
If you want instead to replace missing values with other values, read about how the fillmissing (https://mathworks.com/help/matlab/ref/fillmissing.html) and standardizeMissing (https://mathworks.com/help/matlab/ref/standardizemissing.html) functions work. The documentation examples are thorough and should help you find the solution that best fits your needs.
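For instance, a minimal sketch of both functions on a small all-numeric table (the values are made up):
% Small table with NaNs and a -99 placeholder standing in for missing data
T = table([1; NaN; 3], [10; -99; 30], 'VariableNames', {'a', 'b'});
T1 = standardizeMissing(T, -99);      % turn -99 into the standard missing value (NaN)
T2 = fillmissing(T1, 'constant', 0);  % replace every missing value with 0
T3 = fillmissing(T1, 'previous');     % or carry the previous row's value forward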
One last option is to handle empty values directly in the call to readtable using the EmptyValue parameter, but this works only for numeric data.
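A minimal sketch, assuming the data comes from a delimited text file (the EmptyValue parameter applies to numeric fields of text files; the file name is hypothetical):
% Empty numeric fields are returned as 0 instead of the default NaN
T = readtable('measurements.csv', 'EmptyValue', 0);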
Let me suppose I'm facing some data obtained by a SQL database query, as below (of course my real case is bigger: thousands of rows and many columns).
key_names  header1  header2  header3
------------------------------------
key1       a        1        bar
key2       b        2        foo
key3       c        3        bla
My goal is to organize the data in Matlab (at work I must use it) in a smart and efficient way to get the following results:
Access data by key obtaining the whole row, like dataset(key, :)
Access data by key plus header getting back a single value dataset.header(key)
If possible, getting a whole column (for all keys).
First of all, I used the dataset class provided by the Statistics Toolbox because it has all these features, but I decided to move away from it because it is really slow (from what I understand, it is basically a wrapper around cell arrays): the bottleneck of my code was getting the data rather than performing computations. In fact, I have read that it is better to avoid it as much as possible.
The newer table class looks more efficient, but still not by much: from what I have understood, it is the new version of dataset, as explained in the official documentation.
I also considered containers.Map, but it does not seem to support access by both key and column.
Therefore, struct seems to be the best choice, as it is really fast and it has all the features I'm looking for.
So here are my questions:
Has anyone faced the same problem? Which way of organizing the data is the best?
Suppose struct is the best choice. How can I efficiently create and fill a structure like mystruct.key.header?
I'd like to get something like this:
mystruct.key1.header1
ans = a
Of course I could loop, but there must be a better way. I found a good starting point, but the struct is created empty:
fn1 = {'a', 'b', 'c'}; %first level
fn2 = {'d', 'e', 'f'}; %second level
s2 = cell2struct(cell(size(fn2(:))), fn2(:), 1);        % inner struct with fields d, e, f (all empty)
s = cell2struct(repmat({s2}, size(fn1(:))), fn1(:), 1)  % outer struct with fields a, b, c, each holding s2
None of the examples in the cell2struct documentation name all the levels. The deal function is a good way to fill in the data (depending on the Matlab version, since from 7.0 onward it was superseded by a newer coding style), but I am still missing how to combine the structure-creation part with the filling part.
Any suggestion or code example is really appreciated.
If you think, or are sure, that structs are the best option for you, you can use table2struct. First, import all the data into Matlab as a table, and then convert it to a structure:
mystruct = table2struct(data);
To access your data, you would use the following syntax:
mystruct(key).header
If key is an array, then you need to collect all the values into a list, using either a cell array:
values = {mystruct(key).header}
or different variables:
[v1, v2, v3] = mystruct(key).header
but the latter option is problematic if you are not sure how many outputs to expect.
I'm not sure which will be more convenient for you, but you can also convert to a scalar structure by setting the 'ToScalar' argument to true.
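For instance, a minimal sketch of both conversions (the table below is a hypothetical stand-in for your query result):
% Hypothetical table standing in for the SQL query result
T = table({'key1'; 'key2'; 'key3'}, {'a'; 'b'; 'c'}, [1; 2; 3], {'bar'; 'foo'; 'bla'}, ...
    'VariableNames', {'key_names', 'header1', 'header2', 'header3'});
% Non-scalar conversion: one struct element per row, indexed by row number
s = table2struct(T);
s(2).header1           % -> 'b'
values = {s.header1};  % all header1 values collected in a cell array
% Scalar conversion: one field per column, each field holding all the rows
ss = table2struct(T, 'ToScalar', true);
ss.header2(2)          % -> 2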
I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which does not require changing the input format and is not too complicated to implement: the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component denormalizes your value column (column 2), grouped by the id column (column 1), separating fields with a specific delimiter (";" in my case), which results in 2 rows.
tMap: split the aggregated column into multiple columns, using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend does not allow storing Array objects, make sure to store the split String as an Object. Then cast that object to an array on the right-hand side of the map.
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might get unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize behaves like an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). Since 6 columns are expected, you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInoutDelimited .
See the docs and an example at https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
I have a column with data like: a,b,c,d,e
I need to display it (in the worksheet) as:
a
b
c
d
e
Note: the values need to be split on ','.
Do I need to use a calculated field, or is there another approach?
I went through the SPLIT function, but it is used to generate new columns; I want the values stored in a single column.
Is this something that could work? (You said it's just a matter of visualization, without altering the data, right?)
I just created a calculated field (CF) like this:
REPLACE(value,",","
")
EDIT: since it seems that your need involves data manipulation (you want multiple rows instead of one), I think the best way is to use the SPLIT function even though, as you noticed, it will create new columns.
Otherwise, if it is just a visualization need, you could use the solution posted above, which shows your data ("a,b,c,d,e") in the same cell with the same horizontal alignment, just replacing the commas with carriage returns.
This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has, among its contents, a text column named content_type, which stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been displaying these values directly).
It struck me as funny that my view is being dictated by my database model, so I decided to convert the types stored in my database from strings into integers, and to enumerate the possible types in my application with constants that map them to their display names. That way, if I ever get the urge to change a category name again, I can do it by altering a single constant. I also have a hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based on that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in the database. Integers are also more efficient to store and search, but I really wouldn't worry about that unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = (case when old_content_type = 'a' then 1
                         when old_content_type = 'b' then 2
                         else 3 end);
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
Ideally you'd have these fields refer to a table containing the type definitions, via a foreign key constraint. That way you know your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
Having a lookup table for each field that can take a limited set of values (i.e. like an enum) is the most obvious approach, but it breaks down when you have a table that requires many such attributes.
You can use the Entity-Attribute-Value model, but beware that this is easy to abuse and causes problems when things grow.
You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 and 2.