PCA() got an unexpected keyword argument 'k' - pyspark

I am trying to perform PCA from a Spark application using the PySpark API in a Python script. I am doing it this way:
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
PCAmodel = pca.fit(data)
When I run those two lines of code in the PySpark shell, it works and returns good results, but in an application script I get this type of error:
PCA() got an unexpected keyword argument 'k'
PS: In both cases I am using Spark 2.2.0.
Where is the problem? Why does it work in the PySpark shell and not in the application?

You probably imported from ml in one case:
from pyspark.ml.feature import PCA
and mllib in the other:
from pyspark.mllib.feature import PCA
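The two classes also have different APIs, so the mix-up shows up quickly: the ml version is a DataFrame Estimator configured with keyword params such as inputCol/outputCol, while the mllib version is fitted on an RDD of vectors. A rough sketch of the mllib style, for contrast, assuming a SparkContext named sc as in the shell (the toy data is only for illustration):
from pyspark.mllib.feature import PCA
from pyspark.mllib.linalg import Vectors

rdd = sc.parallelize([Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),
                      Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),
                      Vectors.dense([6.0, 7.0, 0.0, 8.0, 0.0])])
model = PCA(2).fit(rdd)           # mllib: configured with k alone, no inputCol/outputCol
projected = model.transform(rdd)  # returns an RDD of projected vectors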

Are you sure you have not also imported PCA from scikit-learn, after you imported it from PySpark in your application script?
spark.version
# u'2.2.0'
from pyspark.ml.feature import PCA
from sklearn.decomposition import PCA
# PySpark syntax with scikit-learn PCA function
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
# Error:
TypeError: __init__() got an unexpected keyword argument 'k'
Reversing the order of imports will not produce the error (not shown).

Try renaming your classes:
from pyspark.ml.feature import PCA as PCAML
from sklearn.decomposition import PCA as PCASK
pca_ml = PCAML(k=3, inputCol="features", outputCol="pcaFeatures")
There should be no confusion, then, about which one you call.
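For completeness, a minimal end-to-end sketch with the aliased import, on a toy DataFrame with a vector column named features (the data here is only for illustration):
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in the shell

df = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([6.0, 7.0, 0.0, 8.0, 0.0]),)],
    ["features"])

pca_ml = PCAML(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca_ml.fit(df)
model.transform(df).select("pcaFeatures").show(truncate=False)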

Related

How do I calculate a simple one-sample t-statistic in Scala-Spark in an AWS EMR cluster?

I'm a data scientist and still relatively new to Scala. I'm trying to understand the Scala documentation and run a t-test from any existing package. I am looking for sample Scala code on a dummy data set that will work, and for some insight into how to read the documentation.
I'm working in an EMR Notebook (basically a Jupyter notebook) in an AWS EMR cluster environment. I tried referring to this documentation, but apparently I am not able to understand it: https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/inference/TTest.html#TTest()
Here's what I've tried, using multiple import statements for two different packages that have t-test functions. I have multiple lines for the math3.stat.inference package since I'm not entirely certain of the differences between each and wanted to make sure this part wasn't the problem.
import org.apache.commons.math3.stat.inference
import org.apache.commons.math3.stat.inference._ // not sure if this means: import all classes/methods/functions
import org.apache.commons.math3.stat.inference.TTest._
import org.apache.commons.math3.stat.inference.TTest
import org.apache.spark.mllib.stat.test
No errors there.
import org.apache.asdf
Returns an error, as expected.
The documentation for math3.stat.inference says there is a TTest() constructor and then lists a bunch of methods. How does this tell me how to use these functions/methods/classes? I see the following "method" does what I'm looking for:
t(double m, double mu, double v, double n)
Computes t test statistic for 1-sample t-test.
but I don't know how to use it. Here are just several things I've tried:
inference.t
inference.StudentTTest
test.student
test.TTest
TTest.t
etc.
But I get errors like the following:
An error was encountered:
<console>:42: error: object t is not a member of package org.apache.spark.mllib.stat.test
test.t
An error was encountered:
<console>:42: error: object TTest is not a member of package org.apache.spark.mllib.stat.test
test.TTest
...etc.
So how do I fix these issues and calculate a simple one-sample t-statistic in Scala with a Spark kernel? Any instructions/guidance on how to understand the documentation will be helpful for the long term as well.
The formula for computing a one-sample t-test is quite straightforward to implement as a UDF (user-defined function).
UDFs are how we write custom functions to apply to different rows of a DataFrame. I assume you are okay with generating the aggregated values using the standard groupBy and agg functions.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
val data = Seq((310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)).toDF("mu", "sample_size", "sample_mean", "sample_sd")
+---+-----------+-----------+---------+
| mu|sample_size|sample_mean|sample_sd|
+---+-----------+-----------+---------+
|310| 40| 300.0| 18.5|
|310| 41| 320.0| 14.5|
+---+-----------+-----------+---------+
val testStatisticUdf: UserDefinedFunction = udf {
  (sample_mean: Double, mu: Double, sample_sd: Double, sample_size: Int) =>
    (sample_mean - mu) / (sample_sd / math.sqrt(sample_size.toDouble))
}
val result = data.withColumn("testStatistic", testStatisticUdf(col("sample_mean"), col("mu"), col("sample_sd"), col("sample_size")))
+---+-----------+-----------+---------+-------------------+
| mu|sample_size|sample_mean|sample_sd| testStatistic|
+---+-----------+-----------+---------+-------------------+
|310| 40| 300.0| 18.5|-3.4186785515333833|
|310| 41| 320.0| 14.5| 4.4159477499536886|
+---+-----------+-----------+---------+-------------------+
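If you are doing the same thing from PySpark rather than Scala, a rough equivalent sketch using only built-in column functions (no UDF needed); the column names simply mirror the Scala example above:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)],
    ["mu", "sample_size", "sample_mean", "sample_sd"])

# t = (sample_mean - mu) / (sample_sd / sqrt(sample_size))
result = data.withColumn(
    "testStatistic",
    (F.col("sample_mean") - F.col("mu")) / (F.col("sample_sd") / F.sqrt(F.col("sample_size"))))
result.show()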

Why is there a problem with the code? I am connected to the clusters

I was trying to apply a UDF to round those pct values; maybe there are better ways, and I am open to them because I am new to PySpark.
When I removed the UDF and gave up on rounding the numbers, it worked, so I am confident the DataFrame itself is fine.
So, geniuses, please help me. Love & peace.
I tried spark.sql in Databricks to get this DataFrame and it looked good.
Here is the code:
from pyspark.sql.types import IntegerType
round_func = udf(lambda x:round(x,2), IntegerType())
q2_res = q2_res.withColumn('pct_DISREGARD', round_func(col('pct')))
display(q2_res)
ERROR:
AttributeError: 'NoneType' object has no attribute '_jvm'
Apparently we can't use any of the pyspark.sql.functions inside a udf. A detailed explanation is given in this thread. You're trying to use that module's round function, so it's not going to work, as it only operates on Columns. We can achieve the same functionality in a much easier way:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as f
q2_res = q2_res.withColumn('pct_DISREGARD', f.round('pct', 2).astype(IntegerType()))
It's usually advisable to avoid UDFs as much as possible, because they tend to be slower than native DataFrame operations.
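If a UDF really is needed for something the built-in functions can't express, one way to sidestep the name clash described above is to call Python's built-in round explicitly. A rough sketch, reusing the q2_res and pct names from the question and keeping the result as a double (rounding to two decimals only makes sense for a non-integer type):
import builtins  # Python 3: the plain built-in round, not pyspark.sql.functions.round
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

round_udf = udf(lambda x: builtins.round(x, 2) if x is not None else None, DoubleType())
q2_res = q2_res.withColumn('pct_DISREGARD', round_udf(col('pct')))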

pickling networkx graph: can't pickle generator objects

I am trying to pickle my networkx graph, but I am getting the following error:
can't pickle generator objects
I read in TypeError: can't pickle generator objects that you can't pickle a generator. How can I find where the generator is in my graph object? Is there a way to traverse the object recursively and find anything whose type == generator?
This is likely an issue with the networkx version on Python 2.x, stemming from the change in networkx functions (e.g., for calculating shortest path lengths) that return a generator in recent versions of the package, as opposed to a dictionary in the 1.x versions.
A workaround would be to check whether the object returned by networkx is a generator and, if it is, convert it to a picklable object. For example, the following code was tested using Python 2.7.16 and networkx 2.2:
import networkx, types, cPickle
G = networkx.cycle_graph(5)
val = networkx.shortest_path_length(G)
if isinstance(val, types.GeneratorType):  # newer versions of networkx return a generator
    val_new = {source: target_dict for source, target_dict in val}
    val = val_new
cPickle.dump(val, open("test.pkl", 'w'))
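As for the second part of the question (finding where a generator is hiding inside the graph object), here is a small sketch that scans the graph-level, node, and edge attribute dictionaries for generator values; the "paths" attribute is planted deliberately, just to show what the scan reports:
import types
import networkx as nx

def find_generator_attrs(G):
    """Yield (location, key) pairs for every attribute value that is a generator."""
    for key, value in G.graph.items():            # graph-level attributes
        if isinstance(value, types.GeneratorType):
            yield ("graph", key)
    for node, attrs in G.nodes(data=True):        # node attributes
        for key, value in attrs.items():
            if isinstance(value, types.GeneratorType):
                yield ("node %r" % (node,), key)
    for u, v, attrs in G.edges(data=True):        # edge attributes
        for key, value in attrs.items():
            if isinstance(value, types.GeneratorType):
                yield ("edge (%r, %r)" % (u, v), key)

G = nx.cycle_graph(5)
G.graph["paths"] = nx.shortest_path_length(G)     # deliberately store a generator
print(list(find_generator_attrs(G)))              # [('graph', 'paths')]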

Can't convert Dataframe to Labeled Point

My program uses Spark ML; I use logistic regression on DataFrames. However, I would like to use LogisticRegressionWithLBFGS too, so I want to convert my DataFrame into LabeledPoints.
The following code gives me an error:
val model = new LogisticRegressionWithLBFGS().run(
  dff3.rdd.map(row => LabeledPoint(
    row.getAs[Double]("label"),
    org.apache.spark.mllib.linalg.SparseVector.fromML(
      row.getAs[org.apache.spark.ml.linalg.SparseVector]("features")))))
Error :
org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
So I changed SparseVector to DenseVector, but it doesn't work either:
org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
Have you tried using org.apache.spark.mllib.linalg.Vectors.fromML instead? It accepts any org.apache.spark.ml.linalg.Vector, so you don't have to guess whether a given row holds a SparseVector or a DenseVector.
Note: this answer is a copy-paste from the comments, to allow the question to be closed.

Aggregate function in spark-sql not found

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project, though, it is not working and throws an error message:
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?
You need this import:
import org.apache.spark.sql.functions._
Alternatively, you can use the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()