Generate subset of data with known mean - matlab

I have a dataset of n observations (an n x 1 vector) and would like to create a subset of this data whose mean is known in advance, by selecting only n/3 observations at random (or within some constraint, i.e. where the mean of the subset falls within a range around the known mean).
Can someone please help me with the code to do this in MATLAB?
Note that I don't want to use the rand function to create random data, as I have already collected my data.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
and then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5 and 4.5, for example):
Then the answer might be datasubset=[7;3;2;4], but it could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions. I just need one of them, though I'd also like to know the alternative solutions.
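One way to read the request is as rejection sampling: keep drawing random subsets of the existing data until one has a mean in the allowed range. The following is only a minimal sketch of that idea, written in Python rather than MATLAB; the function names and the `max_tries` limit are assumptions made for illustration. The second function lists every qualifying subset by brute force, which is only practical for small datasets like the 12-observation example.

```python
import random
from itertools import combinations

def random_subset_with_mean(data, k, lo, hi, max_tries=100_000, seed=None):
    """Draw k observations without replacement until their mean lies in [lo, hi]."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        subset = rng.sample(data, k)  # picks k distinct positions from data
        if lo <= sum(subset) / k <= hi:
            return subset
    return None  # nothing acceptable found within max_tries draws

def all_subsets_with_mean(data, k, lo, hi):
    """List every size-k combination whose mean lies in [lo, hi] (small n only)."""
    # Repeated values sit at different positions, so they yield separate combinations.
    return [combo for combo in combinations(data, k) if lo <= sum(combo) / k <= hi]

data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
subset = random_subset_with_mean(data, k=4, lo=3.5, hi=4.5, seed=0)
solutions = all_subsets_with_mean(data, k=4, lo=3.5, hi=4.5)
```

Setting `lo` and `hi` to the same value asks for an exact mean, at the cost of more rejected draws.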

Related

How to pass a vector from tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am running IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are stored as columns in my table (instead of rows/measures). So I want to collect all those cash flows into a vector and pass it on to Rserve. Passing them one at a time slows down IO.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The above code works well when there are only a few cash flows but takes a long time when I have a few hundred. Further, the time is spent not in the calculation but in IO when communicating with the remote Rserve. With a local Rserve the calculation finishes in a few seconds, while on a remote one it takes well over a minute.
I also want to point out that Tableau/Rserve sets one argument after another, and that takes time. My expectation is that once I have a vector, there would be just one transfer and one setting of arguments, which should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau's SCRIPT_XXX() functions are table calculations, which means that you invoke them on a vector of aggregate query results, and the corresponding R or Python code needs to return a vector, usually of the same size. (I think you may be able to return a scalar or a smaller vector, which gets replicated to look like a vector of the same size as the argument, but I'm not certain.)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, then you're ready to do more complex tasks - such as talking to R.
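To make the partitioning/addressing idea concrete outside Tableau, here is a toy model in Python. It is not how Tableau is implemented; the row data, field names and the `index_and_size` helper are invented for illustration. Rows are split into vectors by the partitioning fields, each vector is ordered by the addressing fields, and every row then gets a 1-based position (like INDEX()) and the length of its vector (like SIZE()).

```python
from itertools import groupby

def index_and_size(rows, partition_by, address_by):
    """Toy model: attach a 1-based INDEX and the partition SIZE to every row."""
    part_key = lambda r: tuple(r[d] for d in partition_by)
    addr_key = lambda r: tuple(r[d] for d in address_by)
    result = []
    # groupby needs its input sorted by the same key it groups on.
    for _, group in groupby(sorted(rows, key=part_key), key=part_key):
        vector = sorted(group, key=addr_key)  # addressing fixes the order in the vector
        for position, row in enumerate(vector, start=1):
            result.append({**row, "INDEX": position, "SIZE": len(vector)})
    return result

rows = [
    {"region": "East", "year": 2015, "sales": 12},
    {"region": "West", "year": 2014, "sales": 7},
    {"region": "East", "year": 2014, "sales": 10},
    {"region": "West", "year": 2015, "sales": 9},
    {"region": "West", "year": 2016, "sales": 11},
]

# One vector per region, ordered by year within each region.
by_region = index_and_size(rows, partition_by=["region"], address_by=["year"])
# A single vector covering the whole table.
whole_table = index_and_size(rows, partition_by=[], address_by=["region", "year"])
```

Changing `partition_by` and `address_by` and predicting the resulting INDEX and SIZE values is the same exercise as the one described above, just without the Tableau UI.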
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM() etc., and finally to dig into PREVIOUS_VALUE(). Warning: PREVIOUS_VALUE() is a bit odd and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and it is about as close to a for loop as Tableau gets.

Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to the other and I would like to link these two together.
There is an index in both dataframes though this index indicates locality of the survey (i.e region) and not individual IDs.
Merging is not possible because in most cases different participants share identical index values.
Given that merging on the index value is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, in order to find the most likely match. I can then (with some margin of error) match the rows with the most similar values for those variables and merge them together.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James
That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.
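As a rough illustration of what an approximate JOIN could look like here, the Python sketch below only compares rows that share the same region index and picks, for each row of the first survey, the row of the second survey that differs in the fewest binary variables (Hamming distance). The field names, the sample rows and the `max_distance` threshold are all assumptions for the example.

```python
def approximate_join(survey_a, survey_b, key, variables, max_distance):
    """Match each row of survey_a to its closest row of survey_b within the same key.

    Returns (index_in_a, index_in_b, distance) tuples; rows with no candidate
    closer than max_distance stay unmatched.
    """
    matches = []
    for i, a in enumerate(survey_a):
        best_j, best_d = None, None
        for j, b in enumerate(survey_b):
            if b[key] != a[key]:
                continue  # only compare participants from the same locality
            d = sum(a[v] != b[v] for v in variables)  # Hamming distance
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        if best_d is not None and best_d <= max_distance:
            matches.append((i, best_j, best_d))
    return matches

survey_a = [
    {"region": 1, "smoker": 1, "employed": 0, "owns_home": 1},
    {"region": 1, "smoker": 0, "employed": 1, "owns_home": 0},
    {"region": 2, "smoker": 1, "employed": 1, "owns_home": 1},
]
survey_b = [
    {"region": 1, "smoker": 0, "employed": 1, "owns_home": 0},
    {"region": 1, "smoker": 1, "employed": 0, "owns_home": 0},
    {"region": 2, "smoker": 0, "employed": 0, "owns_home": 0},
]
matches = approximate_join(
    survey_a, survey_b, key="region",
    variables=["smoker", "employed", "owns_home"], max_distance=1,
)
```

This sketch does not stop two rows of the first survey from claiming the same row of the second; a one-to-one match would need an extra assignment step.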

running NN software with my own data

New with Matlab.
When I try to load my own data using the NN pattern recognition app window, I can load the source data, but not the target (it is never in the drop-down list). Both source and target are in the same directory. The source is 5000 observations with 400 variables per observation, and the target can take on 10 different values (recognizing digits). Any ideas?
Before you do anything with your own data you might want to try out the example data sets available in the toolbox. That should make many problems easier to find later on because they definitely work, so you can see what's wrong with your code.
Regarding your actual question: Without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. In your case some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool, your targets have to be a matrix with exactly one 1 (for the correct class) in each row or each column (depending on the orientation of the input matrix), so make sure that's the case.
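The target layout described above is usually called one-hot encoding. The sketch below shows the conversion in Python, not MATLAB, and assumes the ten digit classes are stored as labels 1 to 10; that labeling and the helper names are assumptions, not something stated in the question.

```python
def one_hot_targets(labels, num_classes):
    """Return one row per observation with a single 1 in the column of its class.

    Assumes labels are integers from 1 to num_classes.
    """
    targets = []
    for label in labels:
        if not 1 <= label <= num_classes:
            raise ValueError(f"label {label} is outside 1..{num_classes}")
        row = [0] * num_classes
        row[label - 1] = 1
        targets.append(row)
    return targets

def transpose(matrix):
    """Swap rows and columns, for tools that expect one column per observation."""
    return [list(column) for column in zip(*matrix)]

labels = [3, 10, 1]                       # three observations
targets = one_hot_targets(labels, 10)     # 3 rows x 10 columns
targets_by_column = transpose(targets)    # 10 rows x 3 columns
```

Whether observations belong in rows or columns depends on how the input matrix is oriented, so the number of target rows (or columns) should match the 5000 observations in the source data.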

Different Aggregation calculations of a measure using two dimensions in Tableau

It is a Tableau 8.3 Desktop Edition question.
I am trying to aggregate data using two different dimensions. So, I want to aggregate twice: first I want to sum over all the rows and then multiply the results in a cumulative manner (so I can build a graph). How do I do that? OK, that is too vague, so here are some more details:
I have a set of historical data. The columns are the date, the rows are the categories.
Easy part: I would like to sum all the rows.
Hard part: given those summations, I want to build a graph that shows, for each date, the product of all the summations from the earliest date up to that date.
In other words:
Take the sum of all rows, call it x_i, where i is the date.
For each date i find y_i such that y_i = x_0 * x_1 * ... * x_i (if there is missing data, consider it to be one)
Then show a line graph for the y values versus the date.
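For reference, the calculation itself (outside Tableau) is a column sum followed by a running product. The Python sketch below uses made-up dates and values, and it assumes that a date with no values at all counts as "missing" and therefore contributes a factor of 1.

```python
from itertools import accumulate
from operator import mul

# Hypothetical data: one list per category, one entry per date (None = missing).
dates = ["2015-01", "2015-02", "2015-03", "2015-04"]
values_by_category = {
    "A": [1.0, 0.5, None, 2.0],
    "B": [1.0, 1.5, None, 1.0],
}

def column_sums(values_by_category, n_dates):
    """x_i: sum over all categories for date i, or None if every value is missing."""
    sums = []
    for i in range(n_dates):
        present = [row[i] for row in values_by_category.values() if row[i] is not None]
        sums.append(sum(present) if present else None)
    return sums

def running_product(xs):
    """y_i = x_0 * x_1 * ... * x_i, with missing x values treated as 1."""
    return list(accumulate((1 if x is None else x for x in xs), mul))

x = column_sums(values_by_category, len(dates))
y = running_product(x)
points = list(zip(dates, y))  # (date, y_i) pairs for the line graph
```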
I have searched for a solution for this and tried to figure it out by myself, but failed.
Thank you very much for your time and help :)
You need n calculated fields (one for each column you have), and you have to do the calculation manually:
y_i = sum(field0)*sum(field1)
This is basically because you cannot iterate over columns. For Tableau, each column represents a different dimension or measure, so it won't consider that there is a logical order among them. That is, it won't assume that column A comes before column B; it will assume A and B are different things.
Tableau works better with tables organized the way they are in databases. So if you have year columns, you should reorganize your data: eliminate all those columns and create a single field called 'Date', which identifies the date each value of your measure belongs to. Yes, you will have fewer columns but far more rows, but Tableau works better this way (for very good reasons).
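The reorganization described above (often called going from "wide" to "long" format) can be sketched as follows. This is a plain Python illustration with invented field names and values, not a Tableau feature: every date column becomes one row with a 'Date' field and a 'Value' field.

```python
def wide_to_long(wide_rows, id_field, date_fields):
    """Turn one row per category with many date columns into one row per (category, date)."""
    long_rows = []
    for row in wide_rows:
        for date in date_fields:
            long_rows.append({id_field: row[id_field], "Date": date, "Value": row[date]})
    return long_rows

# Hypothetical wide table: one column per year.
wide = [
    {"Category": "A", "2013": 1.0, "2014": 2.0},
    {"Category": "B", "2013": 3.0, "2014": 4.0},
]
long_table = wide_to_long(wide, id_field="Category", date_fields=["2013", "2014"])
```

Two rows with two date columns become four rows with three fields, which is the "fewer columns, more rows" trade-off mentioned above.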
Tableau 9.0 allows you to do that directly. I have only watched a demo (it was launched yesterday), but I understand that there is now an option to select those columns (in the Data Connection tab) and convert them to a database format.
With that done, you can use the PREVIOUS_VALUE function to help you. I don't have Tableau at hand right now. As soon as I get to it I'll update this with the final answer, unless you take the lead and work it out yourself before that ;)

Merge sensor data for clustering/neural net usage

I have several datasets, i.e. matrices that have 2 columns: one with a MATLAB date number and a second one with a double value. Here is an example from one of them:
>> S20_EavesN0x2DEAir(1:20,:)
ans =
1.0e+05 *
7.345016409722222 0.000189375000000
7.345016618055555 0.000181875000000
7.345016833333333 0.000177500000000
7.345017041666667 0.000172500000000
7.345017256944445 0.000168750000000
7.345017465277778 0.000166875000000
7.345017680555555 0.000164375000000
7.345017888888889 0.000162500000000
7.345018104166667 0.000161250000000
7.345018312500001 0.000160625000000
7.345018527777778 0.000158750000000
7.345018736111110 0.000160000000000
7.345018951388888 0.000159375000000
7.345019159722222 0.000159375000000
7.345019375000000 0.000160625000000
7.345019583333333 0.000161875000000
7.345019798611111 0.000162500000000
7.345020006944444 0.000161875000000
7.345020222222222 0.000160625000000
7.345020430555556 0.000160000000000
Now that I have those different sensor values, I need to combine them into one matrix so that I can perform clustering, train a neural net and so on. The only problem is that the sensor data was recorded with slightly different timings or timestamps, and there is nothing I can do about that from a data collection point of view.
My first thought was interpolation, to make one sensor data set fit another, but that seems like a messy approach, and I was wondering whether I am missing something: a toolbox or function that would let me do this more quickly without fiddling around. To complicate things even more, the number of sensors grew over time, so I am looking at different start dates as well.
Does anyone have a good idea on how to go about this? Thanks
I think your first thought about interpolation was the correct one, at least if you plan to use NNs. Another option would be to use approaches which are designed to deal with missing data, like http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory for example.
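To show what the interpolation approach amounts to, here is a minimal Python sketch (not MATLAB) that linearly interpolates every sensor onto one shared time grid. The sensor names, sample values and grid step are invented, and the choice to restrict the grid to the interval covered by all sensors is an assumption made to sidestep the different start dates.

```python
from bisect import bisect_right

def interpolate(times, values, t):
    """Linear interpolation at t; times must be strictly increasing and cover t."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the sampled range")
    j = bisect_right(times, t)
    if j == len(times):  # t is exactly the last timestamp
        return values[-1]
    t0, t1 = times[j - 1], times[j]
    v0, v1 = values[j - 1], values[j]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def align(sensors, step):
    """Resample all sensors onto one grid; returns rows of [t, sensor_1(t), sensor_2(t), ...]."""
    start = max(times[0] for times, _ in sensors.values())   # latest start
    stop = min(times[-1] for times, _ in sensors.values())   # earliest end
    n_points = int((stop - start) / step) + 1
    grid = [min(start + k * step, stop) for k in range(n_points)]
    return [[t] + [interpolate(times, values, t) for times, values in sensors.values()]
            for t in grid]

# Hypothetical sensors sampled at different times: name -> (timestamps, values).
sensors = {
    "s1": ([0.0, 1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]),
    "s2": ([0.5, 1.5, 2.5], [1.0, 3.0, 5.0]),
}
matrix = align(sensors, step=0.5)
```

Each row of `matrix` then has one timestamp and one value per sensor, which is the shape a clustering routine or neural net would expect.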
It's hard to give an answer for the clustering part, because I have no idea what you're looking for in the data.
For the neural network, besides interpolating, there are at least two other methods that come to mind:
training separate networks for each matrix
feeding them all together to the same network, with a flag specifying which matrix the data is coming from, i.e. something like: input (timestamp, flag_m1, flag_m2, ..., flag_mN) => target (value) where the flag_m* columns are mutually exclusive boolean values - i.e. flag_mK is 1 iff the line comes from matrix K, 0 otherwise.
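The second option can be sketched as follows: all matrices are stacked into one training set, and each input row carries mutually exclusive flags that say which matrix it came from. This is a Python illustration with made-up sample values, and `stack_with_flags` is a hypothetical helper name.

```python
def stack_with_flags(matrices):
    """Build inputs [timestamp, flag_m1, ..., flag_mN] and targets [value] from N matrices."""
    n = len(matrices)
    inputs, targets = [], []
    for k, matrix in enumerate(matrices):
        flags = [1 if j == k else 0 for j in range(n)]  # flag_mK is 1 only for matrix K
        for timestamp, value in matrix:
            inputs.append([timestamp] + flags)
            targets.append(value)
    return inputs, targets

# Hypothetical sensors as lists of (timestamp, value) rows.
sensor_1 = [(0.0, 18.9), (1.0, 18.2)]
sensor_2 = [(0.4, 21.5)]
inputs, targets = stack_with_flags([sensor_1, sensor_2])
```

No interpolation is needed with this layout, because every original reading becomes its own training example.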
These are the only things I can safely say with the amount of information you provided.