Splitting Pyspark dataframe into equal parts, record function runtime - pyspark

I have a PySpark dataframe split into 10 partitions. I'm trying to run a linear regression on each partition and store each partition's runtime in a variable (to be able to plot it later; ultimately, I want to compare the efficiency of several models and visualize this).
I tried this:
import timeit
ts = [timeit.timeit(lambda: M1.predict(m1features), number=1)]  # timeit needs a callable, not a result
but it doesn't work, because m1features would need to be a vector, and I want to measure Spark's performance anyway. How would I go about measuring the runtime of the prediction on each partition?
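One way to get a per-partition timing is to measure inside the function passed to mapPartitions, so the clock runs on the executor that actually does the work. Below is a minimal sketch in plain Python; `predict` and the toy inputs are placeholders for your own model, and since mapPartitions hands the function an iterator, the same function can be exercised locally with any iterable:

```python
import time

def timed_partition(iterator, predict):
    """Run `predict` on every row of one partition and record elapsed time.

    Yields a single (elapsed_seconds, n_rows) pair per partition, so the
    driver can collect one timing per partition.
    """
    rows = list(iterator)                      # materialize this partition
    start = time.perf_counter()
    results = [predict(row) for row in rows]   # the work being timed
    elapsed = time.perf_counter() - start
    yield (elapsed, len(results))

# On a real RDD this would look something like (names are placeholders):
#   timings = rdd.mapPartitions(lambda it: timed_partition(it, M1.predict)).collect()
# Locally, the same function works on any iterable:
timings = list(timed_partition(range(1000), predict=lambda x: 2 * x + 1))
```

Collecting one `(elapsed, n_rows)` pair per partition keeps the measurement on the executors and makes it easy to plot runtimes per partition afterwards.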

Related

Multi-label clustering

I have a question regarding a task that I am trying to solve. The data I have are characterisation data,
meaning that I have a label (PASS/FAIL) for every single datapoint.
So my data matrix has n rows and m columns, and the target variables are again an n-by-m matrix
of binary values (0s and 1s).
My task is to apply clustering and partition all these datapoints into two clusters, one for PASS
datapoints and the other for FAIL datapoints. I wasn't able to find an algorithm that can solve
this type of 'multi-label' problem with clustering.
I tried implementing algorithms like k-means, but while tuning the number of clusters to initialise,
I get k=6, which doesn't really make sense. Outliers have already been dropped from the data,
and it has been normalised as well.
My data matrix has a large number of features (e.g. >3000), so I tried applying
dimensionality-reduction methods like PCA to at least drop the features that are less relevant
than the rest. But I am not sure whether this is applicable in my case, where the target
variables form a binary matrix.
Is there a specific algorithm that can solve this type of problem, and if so, what
pre-processing should I do before applying it?
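For what it's worth, the pipeline described above (reduce with PCA, then cluster with k fixed at 2) can be sketched in plain NumPy; in practice scikit-learn's PCA and KMeans would be the usual tools, and the toy data below is an assumption purely to make the sketch runnable:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD (a minimal PCA)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def two_means(X, n_iter=20):
    """A bare-bones 2-cluster k-means returning a 0/1 label per row.

    Initial centers are the extremes of the first coordinate, which after
    PCA is the direction of greatest variance.
    """
    centers = X[[X[:, 0].argmin(), X[:, 0].argmax()]].astype(float)
    for _ in range(n_iter):
        # Distance from every point to both centers, then nearest-center labels.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy stand-in for the real matrix: two well-separated groups, 3000 features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3000)), rng.normal(5, 1, (50, 3000))])
Z = pca_reduce(X, n_components=10)   # reduce before clustering
labels = two_means(Z)
```

Whether the two clusters actually line up with PASS/FAIL is an empirical question; since a label exists for every datapoint, it may also be worth comparing against a supervised baseline.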

Evaluation parameters like accuracy, precision and recall in PySpark 3.0+, confusion matrix in PySpark

What's the best way to create a confusion matrix, and from it evaluation parameters like accuracy, precision and recall, in PySpark 3.0+? I have seen other answers, but they are too slow even for a small PySpark dataframe with just 800K rows of labels and predictions.
The solution in "Confusion Matrix to get precision, recall, f1score"
sorts the rows, which is very slow when the dataframe is huge. Is there no better way? I also tried the collect function (also suggested in the same link), but that's slow as well.
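For reference, one pattern that avoids both sorting and collecting rows is a single aggregation: in PySpark 3.x, `df.groupBy("label", "prediction").count()` returns the confusion-matrix cells in one pass over the data. The arithmetic that follows is just counting, shown here as a plain-Python sketch with made-up (label, prediction) pairs:

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (label, prediction) pairs: the cells of a confusion matrix."""
    return Counter(pairs)

def precision_recall(counts, positive=1):
    """Derive precision and recall for the `positive` class from the cells."""
    tp = counts[(positive, positive)]
    fp = sum(c for (lab, pred), c in counts.items()
             if pred == positive and lab != positive)
    fn = sum(c for (lab, pred), c in counts.items()
             if lab == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pairs = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0)]   # (label, prediction)
counts = confusion_counts(pairs)
p, r = precision_recall(counts)   # both 2/3 for this toy data
```

On the Spark side the same derivation runs over the handful of `groupBy(...).count()` rows on the driver, so the cost is one shuffle over the dataframe rather than a sort or a full collect.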

Latent Dirichlet Allocation and Analyzing Two Data Sets using MALLET

I am currently analyzing two datasets. Dataset A has about 600,000+ documents, whereas Dataset B has about 7,000+ documents. Does this mean that the topic outputs will be dominated by Dataset A because it has a larger N? The output of MALLET in RapidMiner still records which documents fall under each topic. I wonder if there is a way to make the two datasets be interpreted with equal weight.
I am assuming you're mixing the two datasets together in the training corpus and performing the training. Under this assumption, it is very likely that the topic outputs will mostly reflect dataset A rather than B, as Gibbs sampling constructs topics according to the co-occurrence of tokens, most of which come from A as well. However, overlap between topics, i.e. similar topics across the two datasets, is also possible.
You can downsample dataset A so that it has the same number of documents as B, assuming their topic structures are not that different. Or you can check the log output from the --output-state parameter to see exactly the assigned topic (z) for each token.
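If you go the downsampling route, the step itself is a one-liner; here is a small Python sketch, where the corpus lists are stand-ins for the real document collections:

```python
import random

def downsample(docs, target_size, seed=42):
    """Randomly downsample the larger corpus to `target_size` documents,
    so both corpora contribute equally to topic training."""
    rng = random.Random(seed)   # fixed seed keeps the sample reproducible
    return rng.sample(docs, target_size)

corpus_a = [f"doc_a_{i}" for i in range(600)]   # stand-in for the 600k docs
corpus_b = [f"doc_b_{i}" for i in range(70)]    # stand-in for the 7k docs
balanced_a = downsample(corpus_a, len(corpus_b))
training_corpus = balanced_a + corpus_b          # equal-sized halves
```

The balanced corpus would then be fed to MALLET as usual; repeating the sample with different seeds gives a cheap check that the resulting topics are stable.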

Calculating integral using Spark and DataFrames

I have a Spark DataFrame representing the power consumption (in kW) of a particular device at particular moments (timestamped). I would like to calculate the energy consumption in kWh, i.e. compute the integral over this dataset for a given time interval. How can I accomplish this using Spark?
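One standard approach is the trapezoidal rule: for each pair of consecutive readings, the energy is the time gap in hours times the average of the two power values, and the total is the sum over all pairs. In Spark this maps naturally onto a Window ordered by timestamp plus pyspark.sql.functions.lag to pair each row with its predecessor, followed by a sum. The arithmetic itself, sketched in plain Python with made-up readings:

```python
def energy_kwh(readings):
    """Trapezoidal integral of power over time.

    `readings` is a list of (timestamp_seconds, power_kw) tuples sorted by
    timestamp. Returns total energy in kWh.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(readings, readings[1:]):
        hours = (t1 - t0) / 3600.0       # seconds -> hours
        total += hours * (p0 + p1) / 2.0  # trapezoid area for this interval
    return total

# A constant 2 kW load measured every 15 minutes for one hour -> 2 kWh.
readings = [(0, 2.0), (900, 2.0), (1800, 2.0), (2700, 2.0), (3600, 2.0)]
```

The same per-pair formula becomes the column expression after the `lag` step; restricting to a time interval is then just a filter on the timestamp column before aggregating.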

How to generate a 'clusterable' dataset in MATLAB

I need to test my gap statistics algorithm (which should tell me the optimal k for a dataset), and in order to do so I need to generate a big dataset that is easily clusterable, so that I know the optimal number of clusters a priori. Do you know any fast way to do this?
It very much depends on what kind of dataset you expect: 1D, 2D, 3D, normal distribution, sparse, etc. And how big is "big"? Thousands, millions, billions of observations?
Anyway, my general approach to creating easy-to-identify clusters is vertically concatenating vectors of random numbers with different offsets and spreads:
DataSet = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];
Groups = [1*ones(1000,1);2*ones(1000,1);3*ones(1000,1)];
This can be extended to N features by using e.g.
randn(1000,5)
or concatenating horizontally
DataSet1 = [5*randn(1000,1); 20+3*randn(1000,1); 120+25*randn(1000,1)];
DataSet2 = [-100+7*randn(1000,1); 1+0.1*randn(1000,1); 20+3*randn(1000,1)];
DataSet = [DataSet1 DataSet2];
and so on.
randn also takes multidimensional size arguments, like
randn(1000,10,3);
for looking at higher-dimensional clusters.
If you don't have details on what kind of datasets this is going to be applied to, you should find that out first.
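If the gap-statistic code happens to live in Python rather than MATLAB, the same recipe translates directly to NumPy; the offsets and spreads below mirror the MATLAB snippet above, so the true k is 3 by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three clusters of 1000 points each, with distinct means and spreads,
# stacked into one long vector.
data = np.concatenate([
    5 * rng.standard_normal(1000),           # cluster 1: mean 0,   sd 5
    20 + 3 * rng.standard_normal(1000),      # cluster 2: mean 20,  sd 3
    120 + 25 * rng.standard_normal(1000),    # cluster 3: mean 120, sd 25
])
groups = np.repeat([1, 2, 3], 1000)          # ground-truth cluster labels

# More features: draw each cluster as an (n, d) block and stack vertically.
data_5d = np.vstack([
    5 * rng.standard_normal((1000, 5)),
    20 + 3 * rng.standard_normal((1000, 5)),
    120 + 25 * rng.standard_normal((1000, 5)),
])
```

Running the gap statistic on `data` or `data_5d` should then recover k=3, which gives a direct sanity check of the algorithm.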