How to treat a date column so as to perform kmeans clustering - cluster-analysis

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
This error is caused by the date column in my dataset.
May I please get clarity on how I should treat my date column if I also want to include it in my dataset when performing a k-means clustering algorithm.
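The error above comes from R's kmeans(), which only accepts numeric input, but the treatment is the same in any tool: convert the date into a numeric feature (for example, days elapsed since a reference date) and scale it together with the other columns. A minimal sketch of the idea in Python/pandas with scikit-learn (file and column names are made up):
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("mydata.csv")                          # hypothetical dataset
df["date"] = pd.to_datetime(df["date"])                 # parse the date column
# Encode the date as a number: days elapsed since the earliest date in the column.
df["days_since_start"] = (df["date"] - df["date"].min()).dt.days
X = df.drop(columns=["date"])                           # assumes the remaining columns are numeric
X_scaled = StandardScaler().fit_transform(X)            # k-means is sensitive to feature scale
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_scaled)   # n_clusters=3 is arbitrary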

Related

PySpark: Finding the mean of a variable excluding the top 1 percentile of data

I have a dataset which is grouped by multiple variables and for which we are finding aggregates like mean, std dev, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
    mean(when(col('value') <= col('99_percentile'), col('value')))
)
But it seems Spark cannot reference an aggregate alias that is defined in the same agg statement.
I even tried this,
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value')))
)
But it throws below error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone will be able to answer this.
Update:
I tried doing it the other way.
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
# First pass: compute the per-group 99th percentile and join it back onto df.
# Second pass: aggregate again, using the joined percentile to filter the mean.
df_final = (
    df.join(df.groupby(groupbyElement)
              .agg(expr('percentile(value, array(0.99))')[0].alias('99_percentile')),
            on=groupbyElement, how="left")
      .groupby(groupbyElement)
      .agg(mean('value').alias('Mean'),
           stddev('value').alias('Stddev'),
           mean(when(col('value') <= col('99_percentile'), col('value')))
               .alias('Mean_excl_top_1pct'))
)
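An alternative sketch that avoids the double aggregation, assuming your Spark version allows the percentile aggregate over a window (worth verifying), is to attach the per-group percentile as a window column first:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupbyElement is assumed to be a list of column names, as above.
w = Window.partitionBy(*groupbyElement)
df_final = (
    df.withColumn('99_percentile', F.expr('percentile(value, 0.99)').over(w))
      .groupby(groupbyElement)
      .agg(F.mean('value').alias('Mean'),
           F.stddev('value').alias('Stddev'),
           F.mean(F.when(F.col('value') <= F.col('99_percentile'), F.col('value')))
            .alias('Mean_excl_top_1pct'))
)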

IllegalArgumentException: 'Field "label" does not exist' in Spark MLlib

I'm trying to model some data with a logistic regression, part of spark MLlib. For the model creation I've got the following columns:
ID,
features,
label
I can split it into train and test data via
(trainsample,testsample) = sample.randomSplit([0.7, 0.3], seed)
Also, I can define my model:
lr = LogisticRegression(featuresCol="features", labelCol="label",
predictionCol="prediction")
Then I can train and test it with:
lrmodel = lr.fit(trainsample)
result = lrmodel.transform(testsample)
All fine. But now I want to use my model and predict unlabeled data. I am always getting
the following Error:
IllegalArgumentException: 'Field "label" does not exist
I tried to create a dummy label column (all values 999). But then all my predictions belong to one class (class 6 out of 7 classes). So the label seems to influence my predictions, even with a pretrained model.
Maybe lrmodel.transform is just for testing and there is another syntax for using the model, but I didn't find anything on this topic. Any help would be appreciated.
Found the issue... I had the label within my feature set x_x... Thanks for your help.
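For anyone hitting the same error: make sure the label is not included in the assembled feature vector, and note that transform() on unlabeled data only needs the features column. A rough PySpark sketch (raw_df and new_unlabeled_df are hypothetical DataFrames):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# raw_df is assumed to hold the raw numeric columns plus "ID" and "label".
feature_cols = [c for c in raw_df.columns if c not in ("ID", "label")]   # keep the label out of the features
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
sample = assembler.transform(raw_df)

(trainsample, testsample) = sample.randomSplit([0.7, 0.3], seed)
lr = LogisticRegression(featuresCol="features", labelCol="label", predictionCol="prediction")
lrmodel = lr.fit(trainsample)

# Predicting on unlabeled data: only a "features" column is needed, no "label" column.
unlabeled = assembler.transform(new_unlabeled_df)
predictions = lrmodel.transform(unlabeled)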

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, and some of them expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert an mllib.linalg.Vector to an ml.linalg.Vector and append it as a new column to the dataset in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new dataset, doing an inner join, and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

How to handle missing numerical features when using Spark MLlib Decision Trees?

How do I handle a missing numerical feature when using Decision Trees in Spark MLlib?
I am considering replacing the missing feature with the mean of the other values, but I'm not sure what the impact on model quality would be. Does Spark MLlib provide any support for this common issue?
Every DataFrame can take advantage of DataFrameNaFunctions: drop removes the offending records (not the whole column), fill fills the offending datum with static "dummy data", and replace substitutes specified values for the offending datum.
https://spark.apache.org/docs/2.1.1/api/scala/#org.apache.spark.sql.DataFrameNaFunctions
scala> df.na
res20: org.apache.spark.sql.DataFrameNaFunctions = org.apache.spark.sql.DataFrameNaFunctions@e7e9006
scala> df.na.
drop fill replace
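As a concrete illustration of the mean-imputation idea from the question (the column name featureX is made up), a short PySpark sketch:
from pyspark.sql import functions as F

# Compute the column mean (nulls are ignored by the aggregate) and fill missing values with it.
mean_val = df.select(F.mean("featureX")).first()[0]
df_filled = df.na.fill({"featureX": mean_val})
Later Spark versions also ship an Imputer transformer (pyspark.ml.feature.Imputer) that performs mean or median imputation for you.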

How to do pandas groupby([multiple columns]) so its result can be looked up

I have two dataframes: tr is a training-set, ts is a test-set.
They contain columns uid (a user_id), categ (a categorical), and response.
response is the dependent variable I'm trying to predict in ts.
I am trying to compute the mean of response in tr, broken out by columns uid and categ:
avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()
This gives the result, but (unwantedly) the DataFrame's index is a MultiIndex (this is the default groupby(..., as_index=True) behavior):
MultiIndex[--5hzxWLz5ozIg6OMo6tpQ SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew AnotherValueofCateg, ...
But instead I want the result to keep 'uid' and 'categ' as two separate columns.
Should I use aggregate() instead of groupby()?
Trying groupby(as_index=False) is useless.
The result seems to differ depending on whether you do:
tr.groupby(['uid','categ']).response.mean()
or:
tr.groupby(['uid','categ'])['response'].mean() # RIGHT
i.e. whether you slice a single Series, or a DataFrame containing a single Series. Related: Pandas selecting by label sometimes return Series, sometimes returns DataFrame
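For the "so its result can be looked up" part of the title: you can either index directly into the MultiIndex result with .loc, or flatten it back into ordinary columns with reset_index():
avg = tr.groupby(['uid', 'categ'])['response'].mean()

# Look up a single (uid, categ) pair directly on the MultiIndex:
value = avg.loc[('--5hzxWLz5ozIg6OMo6tpQ', 'SomeValueOfCateg')]

# Or turn the MultiIndex levels back into regular columns:
avg_flat = avg.reset_index().rename(columns={'response': 'avg_response'})
# avg_flat now has the columns: uid, categ, avg_response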