Gaussian mixture model in pyspark - pyspark

I have gone through the link https://spark.apache.org/docs/latest/mllib-clustering.html regarding fitting a GMM in pyspark. I have carried out the same operation successfully in python, but after several iteration, I am unable to run in pyspark.
The questions i have are as follow;
1. The link mentioned above & another example of fitting GMM in pyspark I checked, takes a txt file with no column headings. I have a csv with 17 columns. The code is,
data = sc.textFile("..path/mydata.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
This worked, but when i am trying to fit GaussianMixture.train specifying some components, It is not working.
If the data used in the examples have no column headings, how can I judge which column is coming from which distribution & how the change in pattern is appearing?
How can I get heat-map from here so that whenever a new data comes in, I will use my trained model's heat-map to judge the distribution pattern of my new test data & can point out the mis-matches.
Thanks.

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following closely the Spark documentation. I do get a working model (which I call glm_fare) but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
glm_fare = GeneralizedLinearRegression(
labelCol="Fare",
featuresCol="features",
predictionCol='prediction',
family='gamma',
link='log',
weightCol='wght',
maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
It is documented possibly there could be no summary available for a model in GeneralizedLinearRegressionModel docs.
However you can do an initial check to avoid the error:
glm_fit.hasSummary() which is a public boolean method.
Using it as
if glm_fit.hasSummary():
print(glm_fit.summary)
Here is a direct like to the Pyspark source code
and the GeneralizedLinearRegressionTrainingSummary class source code and where the error is thrown
Make sure your input variables for one hot encoder starts from 0.
One error I made that caused summary not created is, I put quarter(1,2,3,4) directly to one hot encoder, and get a vector of length 4, and one column is 0. I converted quarter to 0,1,2,3 and problem solved.

Target Encoding in Python Model

I made a model in python and this uses target encoding. I used a dataset with 25000 rows and this gets divided into training and test data sets. The model is really working fine. However, I now want to run the model on totally fresh data - say just one row of data in an excel file. I need to know the code for it and will really appreciate it if someone can help. I am somewhat new to python. Here is the part of code I have written to create the training and test data sets from 25000 rows and train the model on training and predict on the test. However, I need the code that runs this model that uses target encoding to predict fresh data. If I need to post more code for greater clarity please let me know.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(train_x.values, train_y.values)
pred_train = rf.predict(train_x.values)
pred = rf.predict(test_x.values)
Thanks
You might want to have have a look the comment section in this notebook-
here
"After we apply target encoding on the train data and target. We can have the result for one category like column A has a,b,c. Then we calculate the mean of each a,b,c in column A and apply it to the test data.We then apply it to test using the pd.merge function."

How to use Decision Tree Classification Matlab?

I have data in form of rows and columns where rows represent a record and column represents its attributes.
I also have the labels (classes) for those records.
I know about decision trees concept and I would like to use matlab for classification of unseen records using decision trees.
How can this be done? I followed this link but its not giving me correct output-
Decision Tree in Matlab
Essentially I want to construct a decision tree based on training data and then predict the labels of my testing data using that tree. Can someone please give me a good and working example for this ?
I used following code to achieve it. And it is working correctly
function DecisionTreeClassifier(trainingFile, testingFile, labelsFile, outputFile)
training = csvread(trainingFile);
labels = csvread(labelsFile);
testing = csvread(testingFile);
tree = ClassificationTree.fit(training,labels)
prediction = predict(tree, testing)
csvwrite(outputFile, prediction)
ClassificationTree.fit will be removed in a future release. Use fitctree instead.

importing data with a very complex structure into matlab

i have some data files that i would like to load into matlab. unfortunatly, they have a quite complex structure - at least compared to what i am used to. you should be able to download an old example of this here, https://www.dropbox.com/s/vbh6kl334c5zg1s/fn1_2.out (it opens fine in notepad or wordpad)
it is data files based on synchrotron data where both the raw data, regularized "raw" data and the (indirect) fourier transformed data+fit to data is listed. there are furthermore some statistics from the fourier transformation.
I just need to quote the results from the statistics in my paper, so while it would be nice to plot some of the results, it is not strictly necessary. I need, however, the raw and regularized data together with the fit, and the fourier transformed data.
My problem
in the data file, the results from the statistical analysis is shown before the data i need. but the size of the columns from the statistical analysis varies from data file to data file. this means that i cannot just include the statistics in the header unless i manually change the number of header lines for each file i import. i need to analysis groups of 5 data files together and i would at least need to analyze around 30 files this time so i would like to avoid it if possible. in the future i would again need to load this kind of data files - so even if changing the number of headerlines 30 times does not sound bad it would be nice to be able do it automatically
Possible solution
both the he raw and regularized data together with the fit as well as the fourier transformed data are preceded by a specific line that tells me that after this and a blank/empty line, the data begins
so i though that maybe i could use regular expressions to tell matlab to ignore everything until you see this specific line, ignore this line and one more, and then import data
i googled and found this topic where regular expressions are used: Trying to parse a fairly complex text file
but i am new to regular expressions and the code suggested is a bit complex for me. i can gather that he uses named capture but i am not quite sure i understand how he uses it and if i can adopt it to me need. i have checked the official matlab documentation but their examples are somewhat simpler :) (http://www.mathworks.se/help/matlab/matlab_prog/regular-expressions.html#bqm94nz-1)
Sorry for writing such a long post. any suggestions on how to proceed with this problem will be greatly appreciated
/Martin
EDIT
the code i have used based on the link in the comment:
fileName = 'data.dat';
inputfile = fopen(fileName);
% Ignore all until we see one that just consists of this:
startString = ' R P(R) ERROR';
mydata = [];
while 1
tline = fgetl(inputfile);
% Break if we hit end of file, or the start marker
if ~ischar(tline) || strcmp(tline, startString)
break
end
data = sscanf(tline, '%f', 3 );
mydata(end+1,:) = data;
end
fclose(inputfile);
When i run the code i get the error:
Subscripted assignment dimension mismatch.
mydata(end+1,:) = data;
any suggestions will be greatly appriciated and my apologize for the strange layout/leaving the link in the comment. i am not allowed to include more than two links in a post and i cannot add a new answer yet - both due to me having to low rep :)
Since the blocks are separated by at least two new lines you can use that to separate the text into blocks and analyse them individually. Try this code
fileH = fopen('fn1_2.out');
content = fscanf(fileH, '%c', inf);
fclose(fileH);
splitstring = regexp(content, '\r\n\r\n', 'split');
blocks = regexp(splitstring, '\d\.\d{4}.*\r\n.*\d\.\d{4}','match');
numericBlocksIdx = find(cellfun(#(x) ~isempty(x), blocks));
numericBlocks = splitstring(numericBlocksIdx);
Now the numericBlocks{1}, numericBlocks{2}, ... contain the tables that you are interested in. Note that for some tables the headers are also included because they are not separated by two new lines. From here you can use functions like textscan to read the data into matrices.

How to export the output of a SOM in MATLAB

Ok, so this question is related to my ongoing task of getting text data categorized, you can refer to this question for more details on how I approached this problem.
I used the standard matlab function "nctool" (neural clustering tool) to get my inputs organized on a plane of 10x10 SOM nodes. I also got the output of this map (i.e. which of my inputs ended up on which node) saved in to the "output" variable in my workspace.
I would now like to get this data out and see if I can write another script. I'm aware of the 'save' and some of the export functions in MATLAB, however it seems that MATLAB does not support ascii export of this variable since it is a sparse matrix.
I am currently writing a script to get this thing exported out, however if someone already has a solution, please post. Otherwise I will do so after I finish testing it.
Update: I found a workaround to this fairly easily:
% convert a sparse matrix to full
output = full(output);
% output this to a file (excel)
xlswrite('test.csv',output);