Working with the output from recfromcsv - matlab

I'm porting a Matlab script to Python. Below is an extract:
%// Create a list of unique trade dates
DateList = unique(AllData(:,1));
%// Loop through the dates
for DateIndex = 1:size(DateList,1)
CalibrationDate = DateList(DateIndex);
%// Extract the data for a single calibration date (but all expiries)
SubsetIndices = ismember(AllData(:,1) , DateList(DateIndex)) == 1;
SubsetAllExpiries = AllData(SubsetIndices, :);
AllData is an N-by-6 cell matrix; the first 2 columns are dates (strings) and the other 4 are numbers. In Python I will be getting this data out of a CSV file, so something like this:
import numpy as np
AllData = np.recfromcsv(open("MyCSV.csv", "rb"))
So now, if I'm not mistaken, AllData is a numpy array of ordinary tuples. Is this the best format to have this data in? The goal is to extract a list of unique dates from column one and, for each date, extract the rows with that date in column one (column one is ordered). Then, for each of those rows, do some maths on the numbers and date in the remaining 5 columns.
In Matlab I can get the list of dates with unique(AllData(:,1)), and then I can get the records (rows) corresponding to a given date (i.e. with that date in column one) like this:
SubsetIndices = ismember(AllData(:,1) , MyDate) == 1;
SubsetAllExpiries = AllData(SubsetIndices, :);
How can I best achieve the same results in Python?

To put things in context, np.recfromcsv is just a modified version of np.genfromtxt which outputs record arrays instead of structured arrays.
A structured array lets you access the individual fields (here, your columns) by their names, like in my_array["field_one"] while a record array gives you the same plus the possibility to access the fields as attributes, like in my_array.field_one. I'm not fond of "access-as-attributes", so I usually stick to structured arrays.
For your information, structured/record arrays are not arrays of tuples, but arrays of a numpy object called np.void: each element is a block of memory composed of as many sub-blocks as you have fields, the size of each sub-block depending on its datatype.
That said, yes, what you seem to have in mind is exactly the kind of usage for a structured array. The approach would then be:
to take your dates array and filter them to find the unique elements.
to find the indices of these unique elements, as an array of integers we'll call, say, matching;
to use matching to access the corresponding records (i.e., rows of your array) using fancy indexing, as my_array[matching];
to perform your computations on the records, as you want.
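The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the CSV content and the column names (trade_date, expiry, strike, price) are made up, and np.genfromtxt is used to get a structured array rather than a record array.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for MyCSV.csv (column names are assumptions).
csv_text = """trade_date,expiry,strike,price
2020-01-02,2020-03-20,100.0,5.1
2020-01-02,2020-06-19,100.0,7.4
2020-01-03,2020-03-20,100.0,4.9
2020-01-03,2020-06-19,105.0,5.2
"""

# names=True takes the field names from the header line;
# dtype=None lets numpy infer one type per column.
all_data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                         dtype=None, encoding="utf-8")

# Rough equivalent of Matlab's unique(AllData(:,1)).
date_list = np.unique(all_data["trade_date"])

subsets = {}
for calibration_date in date_list:
    # Integer indices of the rows whose first field equals this date.
    matching = np.flatnonzero(all_data["trade_date"] == calibration_date)
    # Fancy indexing returns the matching records (all fields).
    subsets[str(calibration_date)] = all_data[matching]
```

Each entry of subsets is itself a structured array, so something like subsets["2020-01-02"]["price"] gives the prices for that date. Indexing directly with the boolean comparison, without np.flatnonzero, would select the same rows.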
Note that you can keep your dates as strings or transform them into datetime objects using a user-defined converter, as described in the documentation. For example, you could transform a YYYY-MM-DD string into a datetime object with lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"). That way, instead of having, say, an N array where each row (a record) consists of two dates as strings and 4 floats, you would have an N array where each row consists of two datetime objects and 4 floats.
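As a small sketch of such a converter, assuming the dates really are in YYYY-MM-DD form: the function name parse_date and the sample values are hypothetical, and here the function is applied to an already loaded string column rather than handed to the reader. The same function could be passed through the converters argument of np.genfromtxt instead.

```python
from datetime import datetime
import numpy as np

def parse_date(s):
    """Turn a 'YYYY-MM-DD' string into a datetime object."""
    return datetime.strptime(s, "%Y-%m-%d")

# Hypothetical string dates, e.g. a trade date and an expiry date.
dates_as_strings = np.array(["2020-01-02", "2020-03-20"])

# Object array holding datetime instances instead of strings.
dates = np.array([parse_date(s) for s in dates_as_strings], dtype=object)

# datetime objects support arithmetic, which plain strings do not.
days_to_expiry = (dates[1] - dates[0]).days
```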
Note the shape of your array (via my_array.shape): it says (N,), meaning it's a 1D array, even if it looks like a 2D table with multiple columns. You can access an individual field (each "column") by using its name. For example, if we create an array consisting of one string field called first and one int field called second, like this:
x = np.array([('a',1),('b',2)], dtype=[('first',"|S10"),('second',int)])
you could access the first column with
>>> x['first']
array(['a', 'b'],
dtype='|S10')
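To make the points about shape, np.void and attribute access concrete, here is a minimal sketch along the same lines. It assumes a unicode field ("U10") instead of "|S10" so that it behaves the same under Python 3; the variable names are only for illustration.

```python
import numpy as np

# Same layout as the example above, with a unicode string field.
x = np.array([("a", 1), ("b", 2)], dtype=[("first", "U10"), ("second", int)])

shape = x.shape        # one dimension only, despite the two "columns"
second = x["second"]   # access a field by name: a plain integer array
row = x[0]             # a single record is a np.void, not a tuple

# Viewing the same memory as a record array adds attribute access.
rec = x.view(np.recarray)
first_via_attribute = rec.first
```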

Related

How do I Map and Remap string values to Int or double in scala

I have a data file with several columns. I am performing some mathematical computations on the values, and for that purpose I want to map the non-integer columns to Int and then, after the operations, map them back.
Following are my columns values
atom_id,molecule_id,element,type,charge
d100_1,d100,c,22,-0.128
d100_10,d100,h,3,0.132
d100_11,d100,c,29,0.002
d100_12,d100,c,22,-0.128
d100_13,d100,c,22,-0.128
Suppose I want to map only 2 columns and then remap only those columns' values. I have searched for methods and found StringIndexer, but it maps all of the columns of the DF; I need to map only specific columns and then remap the values of those specific columns. Any help will be appreciated.
//edited Part
I have the following columns in my DataFrame
ind1,inda,logp,lumo,mutagenic,element
1,0,4.23,-1.246,yes,c
1,0,4.62,-1.387,yes,b
0,0,2.68,-1.034,no,h
1,0,6.26,-1.598,yes,c
1,0,2.4,-3.172,yes,a
Basically, I am writing code for synthetic data generation based on the given input data, so I want to use the column values (ind1, inda, logp, lumo, mutagenic, element) one row at a time. After applying some math functions to a row, I will get a new row of 6 values, each representing the corresponding column.
The problem is that all columns are of type double except mutagenic and element. I want to map the mutagenic and element columns to double values (for example, yes to 0 and no to 1) so that I can use them in the computation. Then, when I receive the output row, I want to map the generated mutagenic value back to the corresponding string using the same mapping.
I hope this is clearer now.

How to sort an array while keeping the order of the index row matching the sorted row?

Easiest way is to show you through excel:
Unsorted:
Sorted:
This example is with excel, but I would need to do the same thing in matlab with thousands of entries (with 2 rows if possible).
Here is my code so far:
%At are random numbers between 0 and 2, 6000 entries.
[sorted]=sort(At);
max=sorted(end);
min=sorted(1);
%need the position of the min and max
But this is only 1 row that's being sorted and it has no numbers in the second row, and no index. How would I add one and keep it following my first row?
Thank you!
I don't have access to Matlab, but try
[sorted, I] = sort(At);
Where I will be a corresponding vector of indices of At. See the Matlab Documentation for details.
You have a couple of options here. For the simple case where you just need the indices, the fourth form of sort listed in the docs already does this for you:
[sorted, indices] = sort(At);
In this case, At(indices) is the same as sorted.
If your "indices" are actually another distinct array, you can use sortrows:
toSort = [At(:) some_other_array(:)];
sorted = sortrows(toSort);
In this case sorted(:, 1) will be the sorted array from the first example and sorted(:, 2) will be the other array sorted according to At.
sortrows accepts a second parameter which specifies the column to sort by. This can be a single column or a list of columns, like in Excel. It can also return a second output argument, the indices, just like regular sort.

Search for an exact match in string

Given a table with the following format in MATLAB:
itemids keywords
1 3D,children,anim,pixar,3D,3D pixar
2 3D,4D pixar,3D car
... ...
I want to count the number of times each keyword is repeated in each item. The full list of unique keywords is available in keywords = {'3D';'Children';'anim';'pixar' ...}. The output is a matrix TF with rows equal to the number of items and columns equal to length(keywords).
One of the difficulties here is to search for an exact match for each string. I am currently using strcmp() which seems to be giving all the entries with a given word, not exact match. In my case I would need to differentiate between 3D and 3D pixar.
This can be done using the ismember function in MATLAB. I am assuming that keywords for each item is actually a single string in which case you will need to split the keywords before doing ismember.
relevantKeywords = {'3D','Children','anim','pixar'};
keywordsInItem = strtrim(strsplit(keywordsStr,',')); % Split the words and trim each word
tmp = ismember(relevantKeywords,keywordsInItem);
tmp will be of size 1 x length(relevantKeywords) indicating if the relevant keyword was found.

When to use a cell, matrix, or table in Matlab

I am fairly new to matlab and I am trying to figure out when it is best to use cells, tables, or matrixes to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example a line would look like
'string 1' , time, number1, number 2
I know a matrix works best if all elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running matlab 2012 so maybe that is a part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables; I don't think they are offered by the language itself. They might be a UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices, cell arrays and structs as elements (because they're heterogeneous containers). In your case, you might have 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; the first field would be a cell array of strings for the first column, the second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store your dates, and the rest of the fields would be 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I had to do database-related things such as joining datasets by a unique key; the only way I found to do this was by using tables. It takes a while to get used to them, and it's a bit annoying that some functions that work on cell arrays don't work on tables and vice versa. MATLAB could have done a better job explaining when to use one or the other, because it's not super clear from the documentation.
The situation that you describe, seems to be as follows:
You have several columns. Entire columns consist of 1 datatype each, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a table:
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).

Time series generation matlab

I have an Excel worksheet with several entries of column data. The data is arranged in pairs such that the first column contains dates and the second contains time series data corresponding to that date. So for example time series 1 will be in columns A and B, where A is the dates and B is the data. Column C is blank, then columns D and E contain the entries for time series 2, and so on.
How do I merge these into a single file in Matlab where the dates match up? Specifically I would want the first column to contain the dates and the other columns to contain the data. I have tried to do this with fts and merge functions but so far failed..
You could grab the dates like this: dates = [raw{:,1}]' and the data like this data = reshape([raw{:, 2:3:end}]', size(raw,1), []); to get normal matlab matrices in case you want to manipulate them in matlab.
Otherwise if you just want to send them straight back to excel then:
data = [raw(:,1) reshape(raw(:, 2:3:end), size(raw,1), [])];
xlswrite(...blablafilename_etc..., data);
But in this case you should have used a VBA macro :/