How to use Spark MlLib/Pipelines to build 1 model per each user [duplicate] - scala

This question already has an answer here: Run ML algorithm inside map function in Spark (1 answer). Closed 4 years ago.
I want to train a different model for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines?
If not, what's the easiest/cleanest way to train multiple, separate models, one per user?

Unfortunately, Spark ML has no built-in notion of "one model per user", but you can implement that logic yourself. I see two ways to approach it.
The first is to follow these steps (the details are only an example; your steps will differ, but the overall algorithm will be similar):
Obtain the training data for a specific user (e.g. read a CSV file from HDFS, S3, etc.).
Train a model on that user's Dataset. For example, suppose the dataset has two columns, a criterion X and the user's productivity Y, where Y varies between user groups; you could fit a LinearRegression model to predict whether the user can finish the work in time.
Save the trained model to disk under a name derived from the user's id, group, etc.
The second approach is to train a single model that applies to every user: choose the features and algorithm options so that the model does not depend on the user group, i.e. generalize training across all user groups. In that case there is no "single model -> single user" separation at all. If the second variant is too complicated to implement for your dataset, follow the first approach.
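As a rough illustration of the first approach, here is a minimal sketch in Scala. It assumes a CSV with header columns user_id, x and y; the input path, the output directory and the modelPath helper are hypothetical names chosen for the example, not anything Spark prescribes. Models are trained one after another on the driver, which is simple but only reasonable for a moderate number of users.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PerUserModels {
  // Hypothetical naming scheme: one saved model directory per user id.
  def modelPath(baseDir: String, userId: Long): String = s"$baseDir/user_id=$userId"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-user-models").getOrCreate()

    // Assumed input: a CSV with header columns user_id, x, y (path is hypothetical).
    val data = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/training.csv")

    // Spark ML estimators expect the features packed into one vector column.
    val assembled = new VectorAssembler()
      .setInputCols(Array("x"))
      .setOutputCol("features")
      .transform(data)
      .cache()

    // Collect the distinct user ids on the driver.
    val userIds = assembled.select("user_id").distinct().collect()
      .map(_.getAs[Number]("user_id").longValue())

    val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("y")

    // Step 1: select one user's rows; step 2: fit; step 3: save under the user's id.
    userIds.foreach { userId =>
      val userData = assembled.filter(col("user_id") === userId)
      val model = lr.fit(userData)
      model.write.overwrite().save(modelPath("hdfs:///models", userId))
    }

    spark.stop()
  }
}
```

A saved model can later be loaded again with LinearRegressionModel.load and the same path helper.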

Related

Literature for Classification Problem with Changing Classes

I am currently looking at a text classification problem (say N classes) for which labeled training data exists. Occasionally, a new class is created, and some of the labels in the "old" training data become wrong because those examples should now carry the new class label. In other words, the new class recruits from the old classes.
We can assume that we have some new labeled data for the new class, or even that from an input stream of new data we eventually obtain the correct labels by human verification (the goal, however, is to require as few manual corrections as possible).
How to set up a classifier that may face new "recruiting" classes from time to time? Are you aware of approaches/literature for the specific setting described above?
Perhaps, basic strategies may include
trying to relabel the training data and re-train,
using incremental classifiers (e.g., KNN)

DeepLearning4J - Acquiring Data and Train Model

I am trying to create the simplest possible neural network and train it with some data.
For that I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I am trying to do a linear regression with the network...
But I cannot find a way to load the data; DataSetIterator does not work.
How do I fit the model on the data, and how do I test it?
In our examples, we encourage people to use datavec + recordreaderdatasetiterator.
Datavec has all of the various data loading components.
I'm not sure what you mean by "DataSetIterator does not work" without seeing any code, but it seems like you didn't really look at our examples.
There are multiple examples of a CSV record reader that you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
The examples can always be found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator's constructors let you specify common options, such as whether the task is regression and which column holds the label.
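As a hedged illustration of that pattern (written in Scala against the Java API; the official examples remain the reference), the sketch below assumes test.csv contains plain "x,y" rows with no header and no trailing semicolons, and that column 1 is the regression target. The csvIterator helper, the batch size of 32 and the file name are choices made for the example.

```scala
import java.io.File
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.api.split.FileSplit
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator

object CsvRegressionData {
  // Hypothetical helper: record reader for the CSV -> RecordReaderDataSetIterator.
  def csvIterator(csv: File, batchSize: Int, labelIndex: Int): RecordReaderDataSetIterator = {
    val reader = new CSVRecordReader() // defaults: comma-delimited, no skipped lines
    reader.initialize(new FileSplit(csv))
    // Using the same index for labelIndexFrom and labelIndexTo, with regression = true,
    // treats that single column as a continuous target.
    new RecordReaderDataSetIterator(reader, batchSize, labelIndex, labelIndex, true)
  }

  def main(args: Array[String]): Unit = {
    val iterator = csvIterator(new File("test.csv"), 32, 1)
    while (iterator.hasNext) {
      val batch = iterator.next()
      println(s"batch of ${batch.numExamples()} examples")
    }
  }
}
```

An iterator built this way can be passed to the fit method of a configured MultiLayerNetwork; a second iterator over held-out rows can be used in the same way for testing.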

User Classification in RapidMiner - output should be the user based on a fed test data

How can I use RapidMiner to run the classifier on test data and classify a user based on that data? I need it to actually output who the classified user is, not the classifier's performance. Any help would be greatly appreciated.
I found the answer to my question!
You just have to take an example (row) with its attributes (column headers) and feed it to the Apply Model operator. Make sure you remove the label (or whatever you want to be predicted) from that example.
The results will give you a row with an added attribute called Prediction.

Spark/Mllib Train many GaussianMixture models in a distributed way

I've been playing around with the Gaussian Mixture Models provided for spark/mllib.
I found it works really well for fitting a single GaussianMixture to an enormous number of vectors/points. However, this is not always the situation in ML. Very often you do not need one model built from countless vectors, but countless models, each built from a few vectors (e.g. building a GMM for each user of a database with hundreds of users).
At this point I do not know how to proceed with MLlib, as I cannot see an easy way to distribute the work both by user and by data.
Example:
Let featuresByUser: RDD[(User, List[Vector])],
the natural way to train a GMM for each user might be something like
featuresByUser.mapValues(
  feats => new GaussianMixture().setK(nGaussians).run(sc.parallelize(feats))
)
However, it is well known that this is not allowed in Spark: the inner sc.parallelize would run on an executor rather than on the driver, so it leads to an error.
So my questions are:
Should the MLlib methods accept Seq[Vector] as input in addition to RDD[Vector], so that the programmer could choose one or the other depending on the problem?
Is there any other workaround that I'm missing to deal with this case (using MLlib)?
Unfortunately, MLlib is currently not meant to create many models, only one at a time, which was confirmed at a recent Spark meetup in London.
What you can do is launch a separate job for each model, each from its own thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, triggering the action that starts each job from a separate thread.
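A minimal sketch of the thread-per-job idea, assuming the input is an RDD of (userId, featureVector) pairs. The trainAll helper, its parameters and the fixed-size thread pool are assumptions made for this illustration, not part of MLlib.

```scala
import java.util.concurrent.Executors
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object PerUserGmm {
  // Hypothetical helper: trains one GMM per user, submitting the jobs from driver threads.
  def trainAll(points: RDD[(Long, Vector)], k: Int, parallelism: Int): Map[Long, GaussianMixtureModel] = {
    points.cache() // every per-user job rescans this RDD
    val users = points.keys.distinct().collect().toSeq

    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = users.map { user =>
        Future {
          // This code runs on the driver, so creating and using RDDs here is allowed.
          val userPoints = points.filter(_._1 == user).values
          user -> new GaussianMixture().setK(k).run(userPoints)
        }
      }
      Await.result(Future.sequence(futures), Duration.Inf).toMap
    } finally {
      pool.shutdown()
    }
  }
}
```

By default Spark schedules concurrent jobs in FIFO order; setting spark.scheduler.mode to FAIR, as the job scheduling documentation describes, lets small per-user jobs share the cluster more evenly. Filtering the full RDD once per user is wasteful for very many users, which is where the alternative of fitting models locally inside a map becomes attractive.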
Another option, if each user's data fits on a single instance: fit the Gaussian mixture for each user with something other than MLlib. This approach was described at the meetup, in a case where sklearn was used within PySpark to create multiple models. You'd do something like:
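// getUsers, getDataForUser and buildGM are placeholders for your own code.
val users: List[Long] = getUsers
val models = sc.parallelize(users).map { user =>
  // Runs on an executor: load this user's data and fit a model locally, without MLlib.
  val userData = getDataForUser(user)
  buildGM(userData)
}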

data mining project Dilemma

I am researching a data set consisting of two data files:
The first contains user IDs, artist IDs, and the ratings users gave to the artists they chose to rate.
The second data file contains artist IDs and artist names.
The research question I have chosen is:
Is an artist popular or not?
In other words, given a new singer who does not appear in the data file, we want to use an algorithm to classify that artist as popular or not.
For the prediction step I chose to use logistic regression.
But my problem comes earlier.
I do not know how, technically, to determine which artists in the existing data should be labeled as successful and which as unsuccessful.
I have thought of some methods, for example: k-means with k=2 (but with this method I have a problem with the distance function), kNN with k=2, etc.
I need guidance on how to cluster the existing data,
and general tips for the project.
Thank you.