I am trying to pickle my networkx graph, but I am getting the following error:
can't pickle generator objects
I read in TypeError: can't pickle generator objects that generators cannot be pickled. How can I find where the generator is in my graph object? Is there a way to traverse the object recursively and find anything of type generator?
This is likely a networkx versioning issue on Python 2.x: several networkx functions (e.g., those for calculating shortest path lengths) return a generator in recent versions of the package, whereas they returned a dictionary in the 1.x versions.
A workaround is to check whether the object returned by networkx is a generator and, if it is, convert it to a picklable object. For example, the following code was tested with Python 2.7.16 and networkx 2.2:
import networkx, types, cPickle

G = networkx.cycle_graph(5)
val = networkx.shortest_path_length(G)

if isinstance(val, types.GeneratorType):  # newer versions of networkx return a generator
    val_new = {source: target_dict for source, target_dict in val}
    val = val_new

cPickle.dump(val, open("test.pkl", 'w'))
The abs method of the purescript-numbers package is documented here:
https://pursuit.purescript.org/packages/purescript-numbers/9.0.0/docs/Data.Number#v:abs
However, consider this simple code example, which uses this function:
import Data.Number (abs, toNumber, isFinite, sqrt)
j :: Number
j = abs -32.5
This produces the following error:
Cannot import value abs from module Data.Number
It either does not exist or the module does not export it.
Is this a bug, or intended behavior?
What is the correct way to import / use the abs method of the Data.Number library?
My suspicion is that you're probably using PureScript compiler v0.14.x and a corresponding package set 0.14.x, but at the same time are looking at the latest versions of the libraries on Pursuit.
Just about a week ago, the new version of the PureScript compiler, 0.15.0, came out, and with it came many breaking changes (most notably, ES-format foreign modules); as is tradition in such cases, the community took the opportunity to do some refactoring along the way.
One instance of such refactoring was moving the abs function (as well as many other functions) from the Math module (in the math library) to the Data.Number module (in the numbers library).
This means that if you're using PureScript 0.15 and higher, abs is in Data.Number, but if your version of PureScript is lower, the function is in the Math module.
I'm a data scientist and still relatively new to Scala. I'm trying to understand the Scala documentation and run a t-test from any existing package. I am looking for sample Scala code on a dummy data set that works, and for some insight into how to read the documentation.
I'm working in an EMR Notebook (basically Jupyter notebook) in an AWS EMR cluster environment. I tried referring to this documentation but apparently I am not able to understand it: https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/inference/TTest.html#TTest()
Here's what I've tried, using multiple import statements for two different packages that have t-test functions. I have multiple lines for the math3.stat.inference package since I'm not entirely certain of the differences between them and wanted to make sure this part wasn't the problem.
import org.apache.commons.math3.stat.inference
import org.apache.commons.math3.stat.inference._ // not sure if this means: import all classes/methods/functions
import org.apache.commons.math3.stat.inference.TTest._
import org.apache.commons.math3.stat.inference.TTest
import org.apache.spark.mllib.stat.test
No errors there.
import org.apache.asdf
Returns an error, as expected.
The documentation for math3.stat.inference says there is a TTest() constructor and then shows a bunch of methods. How does this tell me how to use these functions/methods/classes? I see that the following "method" does what I'm looking for:
t(double m, double mu, double v, double n)
Computes t test statistic for 1-sample t-test.
but I don't know how to use it. Here are just a few of the things I've tried:
inference.t
inference.StudentTTest
test.student
test.TTest
TTest.t
etc.
But I get errors like the following:
An error was encountered:
<console>:42: error: object t is not a member of package org.apache.spark.mllib.stat.test
test.t
An error was encountered:
<console>:42: error: object TTest is not a member of package org.apache.spark.mllib.stat.test
test.TTest
...etc.
So how do I fix these issues/calculate a simple, one-sample t-statistic in Scala with a Spark kernel? Any instructions/guidance on how to understand the documentation will be helpful for the long-term as well.
The formula for computing a one-sample t test is quite straightforward to implement as a UDF (user-defined function).
UDFs are how we write custom functions to apply to the rows of a DataFrame. I assume you are okay with generating the aggregated values using the standard groupBy and agg functions.
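If you start from raw per-observation data, a minimal sketch of producing those aggregates could look like the following; the column names, the grouping column, and the hypothesized mean are assumptions for illustration, and spark.implicits._ is assumed to be in scope for toDF:
import org.apache.spark.sql.functions.{avg, count, lit, stddev_samp}

// hypothetical raw observations: one row per measurement, tagged by group
val observations = Seq(
  ("a", 295.0), ("a", 305.0), ("a", 300.0),
  ("b", 318.0), ("b", 322.0)
).toDF("group", "value")

// derive sample_size, sample_mean and sample_sd per group, then attach the hypothesized mean mu
val aggregated = observations
  .groupBy("group")
  .agg(
    count("value").as("sample_size"),
    avg("value").as("sample_mean"),
    stddev_samp("value").as("sample_sd")
  )
  .withColumn("mu", lit(310.0))
With aggregates like these in hand (the example below simply hardcodes them), the t statistic itself is a one-line UDF: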
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
val data = Seq((310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)).toDF("mu", "sample_size", "sample_mean", "sample_sd")
+---+-----------+-----------+---------+
| mu|sample_size|sample_mean|sample_sd|
+---+-----------+-----------+---------+
|310| 40| 300.0| 18.5|
|310| 41| 320.0| 14.5|
+---+-----------+-----------+---------+
val testStatisticUdf: UserDefinedFunction = udf {
(sample_mean: Double, mu:Double, sample_sd:Double, sample_size: Int) =>
(sample_mean - mu) / (sample_sd / math.sqrt(sample_size.toDouble))
}
val result = data.withColumn("testStatistic", testStatisticUdf(col("sample_mean"), col("mu"), col("sample_sd"), col("sample_size")))
+---+-----------+-----------+---------+-------------------+
| mu|sample_size|sample_mean|sample_sd| testStatistic|
+---+-----------+-----------+---------+-------------------+
|310| 40| 300.0| 18.5|-3.4186785515333833|
|310| 41| 320.0| 14.5| 4.4159477499536886|
+---+-----------+-----------+---------+-------------------+
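If you would rather call commons-math3 directly, as in the documentation linked in the question, here is a minimal sketch that is not tied to Spark; it assumes the raw observations fit in a local Array[Double] on the driver, and the values are made up for illustration. The four-argument t(m, mu, v, n) overload shown in the Javadoc is, as far as I can tell, a protected helper, so the usual entry points are the public overloads that take the raw sample (or a StatisticalSummary):
import org.apache.commons.math3.stat.inference.TTest

val observed: Array[Double] = Array(300.0, 305.0, 298.0, 310.0, 302.0)
val mu = 310.0 // hypothesized population mean

val tTest = new TTest()                // the TTest() constructor from the Javadoc; methods are called on the instance
val tStat = tTest.t(mu, observed)      // t statistic for a one-sample t-test
val pValue = tTest.tTest(mu, observed) // two-sided p-value for the same one-sample test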
I am trying to perform PCA from a Spark application using the PySpark API in a Python script. I am doing it this way:
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
PCAmodel = pca.fit(data)
When I run those two lines of code in the PySpark shell, they work and return good results, but in an application script I get the following error:
PCA() got an unexpected keyword argument 'k'
PS: In both cases I am using Spark 2.2.0.
Where is the problem? Why does it work in the PySpark shell and not in the application?
You probably imported from ml in one case:
from pyspark.ml.feature import PCA
and mllib in the other:
from pyspark.mllib.feature import PCA
Are you sure you have not also imported PCA from scikit-learn, after you imported it from PySpark in your application script?
spark.version
# u'2.2.0'
from pyspark.ml.feature import PCA
from sklearn.decomposition import PCA
# PySpark syntax with scikit-learn PCA function
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
# Error:
TypeError: __init__() got an unexpected keyword argument 'k'
Reversing the order of imports will not produce the error (not shown).
Try renaming your classes:
from pyspark.ml.feature import PCA as PCAML
from sklearn.decomposition import PCA as PCASK
pca_ml = PCAML(k=3, inputCol="features", outputCol="pcaFeatures")
That way there should be no confusion about which one you are calling.
I am trying to run connected components algorithm on a graph using the Scala API as shown in the programming guide and other examples.
val graph = Graph.fromDataSet(vertices, edges, env).getUndirected
val maxIterations = 10
val components = graph.run(new ConnectedComponents(maxIterations))
And I get the following error:
Type mismatch, expected: GraphAlgorithm[Long,String,Long,NotInferedT], actual: ConnectedComponents[Nothing,Nothing]
And even if I add
val components = graph.run(new ConnectedComponents[Long,String,Long](maxIterations))
I get:
Type mismatch, expected: GraphAlgorithm[Long,String,Long,NotInferedT], actual: ConnectedComponents[Long,String]
My imports are these:
import org.apache.flink.api.scala._
import org.apache.flink.graph.library.ConnectedComponents
import org.apache.flink.graph.{Vertex, Edge}
import org.apache.flink.graph.scala.Graph
Can someone please explain why this is happening?
The ConnectedComponents Gelly library algorithm takes two type parameters, the vertex ID type and the edge value type, so you need to call it like this, for example: graph.run(new ConnectedComponents[Long, NullValue](maxIterations)). Also, since it is a Java implementation, make sure to import java.lang.Long.
You can also look at the org.apache.flink.graph.scala.example.ConnectedComponents which uses the GSA version of the library algorithm.
The problem is that the ConnectedComponents implementation expects vertices to have a java.lang.Long vertex value. Unfortunately, scala.Long and java.lang.Long are not type compatible. Thus, in order to use the algorithm, your vertex data set must be of type DataSet[Vertex[K, java.lang.Long]], with K being an arbitrary key type.
This looks like a typical java/scala type mismatch. Please check again whether you use java.lang.Long or scala.Long in this case.
vasia and Till Rohrmann were right, but in my case the whole problem traced back to the creation of the vertices and edges, not just the usage of the connected components algorithm: I was creating vertices and edges using scala.Long instead of java.lang.Long.
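For reference, a minimal sketch of what the fixed setup could look like, based on the answers above; the vertex/edge values and the use of NullValue as the edge value type are assumptions for illustration, and exact signatures may differ between Flink/Gelly versions:
import java.lang.{Long => JLong} // Gelly's ConnectedComponents expects java.lang.Long values

import org.apache.flink.api.scala._
import org.apache.flink.graph.library.ConnectedComponents
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.{Edge, Vertex}
import org.apache.flink.types.NullValue

val env = ExecutionEnvironment.getExecutionEnvironment

// vertex IDs and vertex values as java.lang.Long (scala.Long literals are boxed via Predef.long2Long)
val vertices: DataSet[Vertex[JLong, JLong]] = env.fromElements(
  new Vertex[JLong, JLong](1L, 1L),
  new Vertex[JLong, JLong](2L, 2L),
  new Vertex[JLong, JLong](3L, 3L)
)

val edges: DataSet[Edge[JLong, NullValue]] = env.fromElements(
  new Edge[JLong, NullValue](1L, 2L, NullValue.getInstance()),
  new Edge[JLong, NullValue](2L, 3L, NullValue.getInstance())
)

val graph = Graph.fromDataSet(vertices, edges, env).getUndirected
val maxIterations = 10
val components = graph.run(new ConnectedComponents[JLong, NullValue](maxIterations))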
I am trying to implement an ML pipeline in Spark using Scala, and I used the sample code available on the Spark website. I am converting my RDD[LabeledPoint] into a DataFrame using the functions available in the SQLContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter, such as Vector[Double], Vector[String], etc. You just need to specify the type of data your vector will store.
As a side note: the single-argument overloaded version of createDataFrame() that you use seems to be experimental; keep that in mind if you are planning to use it for some long-term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.
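For illustration, here is a minimal sketch of a non-empty pipeline; the stages and column names are assumptions (the original code snippet is not shown), and training_df stands for the DataFrame from the question:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler

// two hypothetical stages: a feature scaler followed by a classifier
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("label")

// the pipeline must be given its stages before fit() is called; they run in order
val pipeline = new Pipeline().setStages(Array[PipelineStage](scaler, lr))

val model = pipeline.fit(training_df)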