Split a tuple row into two rows in an RDD - pyspark

I am trying to split a tuple of ints into two rows in an RDD.
vertices=edges.map(lambda x:(x[0],)).union(edges.map(lambda x:(x[1],))).distinct()
This code works, but I want something with a shorter runtime, without using the GraphFrames package.

You can use flatMap:
edges.flatMap(lambda x: x).distinct()
In Scala, you would simply call .flatMap(identity) instead.
If you use the DataFrame API, you can just use explode on your only column, e.g. df.select(explode("edge")).
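For example, a minimal runnable sketch of both variants (the sample edge values and the column names src/dst are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, explode

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# hypothetical sample data: an RDD of (src, dst) int tuples
edges = sc.parallelize([(1, 2), (2, 3), (1, 3)])

# flatMap emits both endpoints of every edge as separate records
vertices = edges.flatMap(lambda x: x).distinct()
print(sorted(vertices.collect()))  # [1, 2, 3]

# DataFrame variant: pack the two endpoints into one array column, then explode it
df = edges.toDF(["src", "dst"])
df.select(explode(array("src", "dst")).alias("vertex")).distinct().show()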

Related

Pyspark apply multiple groupBy UDF's

I am trying to call 2 UDFs within the same groupBy function.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way to run both of them in the same groupBy? I run the first UDF with applyInPandas, but I can't find a way to run any other function alongside it.
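One hedged sketch of a possible approach, assuming both computations can live inside a single applyInPandas function (the column names group_key/feature and the aggregations are made up for illustration):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical input: one grouping key and one numeric feature
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group_key", "feature"])

def combined(pdf: pd.DataFrame) -> pd.DataFrame:
    # first "UDF": one row with multiple columns per group
    out = pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]],
                        "feature_mean": [pdf["feature"].mean()],
                        "feature_max": [pdf["feature"].max()]})
    # second "UDF": a single value per group, attached as one more column
    out["feature_sum"] = pdf["feature"].sum()
    return out

result = df.groupBy("group_key").applyInPandas(
    combined,
    schema="group_key string, feature_mean double, feature_max double, feature_sum double")
result.show()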

Unpickle/convert pyspark RDD of Rows to Scala RDD[Row]

What I'm trying to achieve is to execute Scala code, convert the resulting Scala RDD[Row] to a PySpark RDD of Rows, perform some Python operations, and then convert the RDD of PySpark Rows back to a Scala RDD[Row].
To get the RDD into a PySpark RDD I'm doing this. In Scala I have this method:
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.execution.python.EvaluatePython.{javaToPython, toJava}

def toPythonRDD(rdd: RDD[Row]): JavaRDD[Array[Byte]] = {
  javaToPython(rdd.map(r => toJava(r, r.schema)))
}
Later, in PySpark, I create a new RDD by calling
RDD(jrdd, sc, BatchedSerializer(PickleSerializer()))
I end up with an RDD of PySpark Rows. I'd like to reverse that process.
I can easily get the Scala JavaRDD[Array[Byte]] by accessing rdd._jrdd. My main problem is that I don't know how to convert/unpickle it back to RDD[Row].
I've tried
sc._jvm.SerDe.pythonToJava(rdd._to_java_object_rdd(), True)
and
sc._jvm.SerDe.pythonToJava(rdd._jrdd, True)
Both crash with the same exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
I know that I can easily pass a DF back and forth between Scala and Python, but my records don't have a uniform schema. I'm using an RDD of Rows because I thought there would already be a pickler I could reuse, and it does work, but so far only in one direction.
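One possible workaround, purely as an assumption on my part: the exception suggests the JVM-side unpickler cannot reconstruct pyspark.sql.types._create_row, so stripping the Row wrapper on the Python side before handing the RDD back may avoid it, at the cost of receiving generic objects (not Rows) in Scala:

# hypothetical: rdd is the PySpark RDD of Row objects received from Scala
# convert each Row to a plain tuple so only ordinary Python objects get pickled
plain_rdd = rdd.map(lambda row: tuple(row))

# pass the underlying Java RDD back to the JVM, as in the attempts above
jrdd = plain_rdd._to_java_object_rdd()

On the Scala side you would then need to rebuild Row objects from the generic values yourself.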

Add list as column to Dataframe in pyspark

I have a list of integers and a SQLContext DataFrame with the number of rows equal to the length of the list. I want to add the list as a column to this DataFrame while maintaining the order. I feel like this should be really simple, but I can't find an elegant solution.
You cannot simply add a list as a DataFrame column, since a list is a local object and a DataFrame is distributed. You can try one of the following approaches:
convert the DataFrame to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a DataFrame with an extra key column (matching keys from the original DataFrame) and then join the two (see the sketch after this list)
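A minimal sketch of the second approach, assuming a DataFrame df and a list lst of matching length (df, lst and the column names are placeholders); zipWithIndex supplies the key on the DataFrame side and enumerate on the list side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical inputs of equal length
df = spark.createDataFrame([("x",), ("y",), ("z",)], ["value"])
lst = [10, 20, 30]

# give each dataframe row an index reflecting its current order
indexed_df = df.rdd.zipWithIndex().map(
    lambda pair: pair[0] + (pair[1],)).toDF(df.columns + ["_idx"])

# turn the list into a dataframe carrying the same kind of index
list_df = spark.createDataFrame(
    [(i, v) for i, v in enumerate(lst)], ["_idx", "new_col"])

# join on the index and drop the helper column
result = indexed_df.join(list_df, on="_idx").drop("_idx")
result.show()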

Applying function to Spark Dataframe Column

Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in Scala
def round_tenths_place(un_rounded: Double): Double = {
  val rounded = BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
  return rounded
}
and apply it to one column of a dataframe? Kind of like what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to an RDD, selecting the right field from each row, and mapping the function across all of the values, right? And also something more succinct than creating a SQL table and then doing this with a Spark SQL UDF?
You can define a UDF as follows:
val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
  "bid_price_bucket", round_tenths_place_udf($"bid_price"))
although the built-in round expression uses exactly the same logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
See also the following:
Updating a dataframe column in spark
How to apply a function to a column of a Spark DataFrame?

Spark/Scala flatten and flatMap is not working on DataFrame

I have a collection containing three DataFrames of the same type (same Parquet schema). They only differ in the content/values they contain.
I want to flatten the structure so that the three DataFrames are merged into one single Parquet DataFrame containing all of the content/values.
I tried it with flatten and flatMap, but I always get the error:
Error: No implicit view available from org.apache.spark.sql.DataFrame => Traversable[U]. parquetFiles.flatten
Error: not enough arguments for method flatten: (implicit asTrav: org.apache.spark.sql.DataFrame => Traversable[U], implicit m: scala.reflect.ClassTag[U]). Unspecified value parameters asTrav, m. parquetFiles.flatten
I also converted it to a List and then tried to flatten it, which produced the same error.
Do you have any idea how to solve it or what is the problem here?
Thanks, Alex
The Scala compiler is looking for a way to convert the DataFrames to a Traversable so it can apply flatten, but a DataFrame is not Traversable, so this fails. There is also no ClassTag available, because DataFrames are not statically typed.
The code you're looking for is
parquetFiles.reduce(_ unionAll _)
which can be optimized by the DataFrame execution engine.
So it seems like you want to combine these three DataFrames together; to do that, the unionAll function works really well. You could do parquetFiles.reduce((x, y) => x.unionAll(y)) (note this will throw on an empty collection, but if that's a possibility, just use one of the folds instead of reduce).
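For reference, a rough PySpark-flavored sketch of the same idea (the parquet_files list here is just illustrative data):

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical: three DataFrames sharing one schema
parquet_files = [spark.createDataFrame([(i,)], ["id"]) for i in range(3)]

# reduce fails on an empty list ...
merged = reduce(DataFrame.unionAll, parquet_files)

# ... so when the list might be empty, fold over an explicit empty DataFrame instead
empty = spark.createDataFrame([], merged.schema)
merged_safe = reduce(DataFrame.unionAll, parquet_files, empty)
merged_safe.show()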