I have a collection (parquetFiles) holding three DataFrames of the same type (same Parquet schema). They only differ in the content/values they contain:
I want to flatten the structure, so that the three DataFrames are merged into one single DataFrame containing all of the content/values.
I tried it with flatten and flatMap, but with that I always get the error:
Error: No implicit view available from org.apache.spark.sql.DataFrame => Traversable[U]. parquetFiles.flatten
Error: not enough arguments for method flatten: (implicit asTrav: org.apache.spark.sql.DataFrame => Traversable[U], implicit m: scala.reflect.ClassTag[U]). Unspecified value parameters asTrav, m. parquetFiles.flatten
I also converted it to a List and then tried to flatten it, and that produces the same error.
Do you have any idea how to solve this, or what the problem is here?
Thanks, Alex
The Scala compiler is looking for a way to convert the DataFrames to a Traversable so it can apply flatten, but a DataFrame is not Traversable, so it fails. There is also no ClassTag available, because DataFrames are not statically typed.
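To illustrate with a minimal example of my own (not from the original post), flatten compiles only when such an implicit view exists, as it does for ordinary nested collections:
// flatten requires an implicit view Element => Traversable[U]:
List(List(1, 2), List(3)).flatten   // List(1, 2, 3) -- compiles
// List(df1, df2, df3).flatten      // does not compile: no view from DataFrame to Traversable[U]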
The code you're looking for is
parquetFiles.reduce(_ unionAll _)
which can be optimized by the DataFrame execution engine.
So it seems like you want to combine these three DataFrames into one; the unionAll function works really well for that. You could do parquetFiles.reduce((x, y) => x.unionAll(y)) (note that this will throw on an empty list, so if that is possible, use one of the folds instead of reduce).
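As a rough sketch of the fold variant (assuming a Spark 1.x SQLContext named sqlContext, a SparkContext named sc, and the shared schema in a StructType called schema, all hypothetical names here):
import org.apache.spark.sql.Row

// Start the fold from an empty DataFrame with the shared schema, so an
// empty parquetFiles collection simply yields an empty result.
val emptyDF = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
val merged = parquetFiles.foldLeft(emptyDF)(_ unionAll _)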
Related
I'm trying to split a tuple of ints into two rows in an RDD.
vertices=edges.map(lambda x:(x[0],)).union(edges.map(lambda x:(x[1],))).distinct()
I tried this code and it works, but I want code that runs faster, without using the GraphFrames package.
You can use flatMap:
edges.flatMap(lambda x: x).distinct()
In Scala you can't pass identity directly, because a tuple is not Traversable; flatMap over the tuple's elements instead.
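A minimal Scala sketch of that idea, assuming edges is an RDD[(Int, Int)]:
// Flatten each (src, dst) edge into two vertex ids, then deduplicate.
val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct()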
If you use the DataFrame API, you can just use explode on your only column, e.g. df.select(explode("edge")).
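And a hedged sketch of that DataFrame variant in Scala, assuming the single column edge is array-typed (explode only applies to array or map columns):
import org.apache.spark.sql.functions.{col, explode}

// One output row per element of the "edge" array, then deduplicate.
val vertices = df.select(explode(col("edge")).as("vertex")).distinct()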
What I'm trying to achieve is to execute Scala code, convert the resulting Scala RDD[Row] to a PySpark RDD of Rows, perform some Python operations, and then convert the RDD of PySpark Rows back to a Scala RDD[Row].
To get the Scala RDD to a PySpark RDD I'm doing this:
In Scala I have this method:
import org.apache.spark.sql.execution.python.EvaluatePython.{javaToPython, toJava}
def toPythonRDD(rdd: RDD[Row]): JavaRDD[Array[Byte]] = {
  javaToPython(rdd.map(r => toJava(r, r.schema)))
}
Later in PySpark I create a new RDD by calling
RDD(jrdd, sc, BatchedSerializer(PickleSerializer()))
I end up with an RDD of PySpark Rows. I'd like to reverse that process.
I can easily get the Scala JavaRDD[Array[Byte]] by accessing rdd._jrdd. My main problem is that I don't know how to convert/unpickle it back to an RDD[Row].
I've tried
sc._jvm.SerDe.pythonToJava(rdd._to_java_object_rdd(), True)
and
sc._jvm.SerDe.pythonToJava(rdd._jrdd, True)
Both crash with a similar exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
I know that I can easily pass a DataFrame back and forth between Scala and Python, but my records don't have a uniform schema. I'm using an RDD of Rows because I thought there would already be a pickler I'd be able to reuse, and there is, but so far it works in only one direction.
I understand that one can convert an RDD to a Dataset using rdd.toDS. However, rdd.toDF also exists. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I found that almost any operation takes me back out to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something brought me back to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example:
spark
.read
.schema (...)
.json (...)
.rdd
.zipWithUniqueId
.map[(Integer,String,Double)] { case (row,id) => ... }
.toDS // now with a Dataset API (should use toDF here?)
.withColumnRenamed ("_1", "id" ) // now back to a DataFrame, not type safe :(
.withColumnRenamed ("_2", "text")
.withColumnRenamed ("_2", "overall")
.as[ParsedReview] // back to a Dataset
Michael Armbrust nicely explained the shift to Datasets and DataFrames and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were converged into one, with slight differences.
"A DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer."
I have a large DataFrame (Spark 1.6 Scala) which looks like this:
Type,Value1,Value2,Value3,...
--------------------------
A,11.4,2,3
A,82.0,1,2
A,53.8,3,4
B,31.0,4,5
B,22.6,5,6
B,43.1,6,7
B,11.0,7,8
C,22.1,8,9
C,3.2,9,1
C,13.1,2,3
From this I want to group by Type and apply machine learning algorithms and/or perform complex functions on each group.
My objective is to perform complex functions on each group in parallel.
I have tried the following approaches:
Approach 1) Convert the DataFrame to a Dataset and then use the ds.mapGroups() API. But this gives me an Iterator over each group's values.
If I want to apply RandomForestClassificationModel.transform(dataset: DataFrame), I need a DataFrame with only a particular group's values.
I was not sure that converting the Iterator to a DataFrame within mapGroups is a good idea.
Approach 2) Distinct on Type, then map over the types, and filter for each Type within the map:
val types = df.select("Type").distinct()
val ff = types.map(row => {
  val t = row.getString(0)   // `type` is a reserved word in Scala, so use another name
  val thisGroupDF = df.filter(col("Type") === t)
  // Apply complex functions on thisGroupDF
  (t, predictedValue)
})
For some reason, the above never completes (it seems to get into some kind of infinite loop).
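One likely culprit (my reading, not from the original post) is that df is referenced inside a transformation running on the executors; nested use of a DataFrame inside another distributed map is not supported. A minimal sketch of the same idea with the distinct types collected to the driver first, where complexFunction is a hypothetical placeholder for the per-group work:
import org.apache.spark.sql.functions.col

// Collect the (small) set of distinct types to the driver, then filter per type.
val types = df.select("Type").distinct().collect().map(_.getString(0))
val results = types.map { t =>
  val thisGroupDF = df.filter(col("Type") === t)
  (t, complexFunction(thisGroupDF)) // hypothetical per-group computation
}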
Approach 3) Exploring window functions, but I did not find a method that can provide a DataFrame of a particular group's values.
Please help.
I'm trying to extend my RDD table by one column (with string values) using the answers to this question, but I cannot add a column name this way... I'm using Scala.
Is there any easy way to add a column to an RDD?
Apache Spark takes a functional approach to processing data. Fundamentally, an RDD[T] is some sort of distributed collection of objects of type T (RDD stands for Resilient Distributed Dataset).
Following the functional approach, you process the objects inside the RDD using transformations. A transformation constructs a new RDD from a previous one.
One example of a transformation is the map method. Using map, you can transform each object of your RDD into any other type of object you need. So, if you have a data structure that represents a row, you can transform it into a new one with an added column.
For example, take the following piece of code.
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, String)] = sc.parallelize(List(("Hello", "World"), ("Such", "Wow")))

// This new RDD will have one more "column",
// which is the concatenation of the previous two
val rddWithOneMoreColumn =
  rdd.map {
    case (a, b) =>
      (a, b, a + b)
  }
In this example, an RDD of Tuple2 (a.k.a. a pair) is transformed into an RDD of Tuple3, simply by applying a function to each RDD element.
Clearly, you have to apply an action to the object rddWithOneMoreColumn to make the computation happen. In fact, Apache Spark computes the result of all of your transformations lazily.
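For instance (a minimal illustration, with the output shown as comments), collect() is one such action:
// Running an action forces Spark to evaluate the lazy transformations.
rddWithOneMoreColumn.collect().foreach(println)
// (Hello,World,HelloWorld)
// (Such,Wow,SuchWow)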