I need to create a Scala Spark DataFrame as shown below. This question may be silly, but I'd like to know the best approach for creating small structures for testing purposes:
For creating a minimal DF.
For creating a minimal RDD.
I've tried the following code so far without success:
val rdd2 = sc.parallelize(Seq("7","8","9"))
and then converting it to a DF with
val dfSchema = Seq("col1", "col2", "col3")
and
rdd2.toDF(dfSchema: _*)
Here's a sample DataFrame I'd like to obtain:
c1 c2 c3
1 2 3
4 5 6
abc_spark, here's a sample you can use to easily create DataFrames and RDDs for testing:
import spark.implicits._
val df = Seq(
(1, 2, 3),
(4, 5, 6)
).toDF("c1", "c2", "c3")
df.show(false)
+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1 |2 |3 |
|4 |5 |6 |
+---+---+---+
val rdd: RDD[Row] = df.rdd
rdd.map{_.getAs[Int]("c2")}.foreach{println}
Gives (ordering may vary across runs):
5
2
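If your tests need explicit column types or nullability, here is a minimal sketch using createDataFrame with an explicit schema (assuming the same spark session is in scope):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Explicit schema: useful when the test data must have exact types
val schema = StructType(Seq(
  StructField("c1", IntegerType),
  StructField("c2", IntegerType),
  StructField("c3", IntegerType)
))
val rowRdd = spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(4, 5, 6)))
val df2 = spark.createDataFrame(rowRdd, schema)
df2.show(false)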
You are missing a pair of parentheses in the Seq; each row must be a tuple. Use it as below:
scala> val df = sc.parallelize(Seq(("7","8","9"))).toDF("col1", "col2", "col3")
scala> df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   7|   8|   9|
+----+----+----+
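For contrast, without the extra parentheses each String becomes its own row in a single column, which is why three column names cannot be applied to it; a quick sketch from the same spark-shell session:
scala> sc.parallelize(Seq("7", "8", "9")).toDF("value").show
+-----+
|value|
+-----+
|    7|
|    8|
|    9|
+-----+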
Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is how the input looks:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is the expected output:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I get errors; the compiler seems to expect a :Unit somewhere.
Another way to create a DataFrame (this one in PySpark):
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
Output:
+----+---------+
|col2|     col4|
+----+---------+
|   1|[4, 2, 1]|
|   4|   [3, 2]|
+----+---------+
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ArrayToCol extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")

  val d = inptDf
    .withColumn("0", col("value").getItem(0))
    .withColumn("1", col("value").getItem(1))
    .withColumn("2", col("value").getItem(2))
    .drop("value")

  d.show(false)
}
// Variant 2 (also inside the App body, so inptDf and the implicits are in scope)
val res = inptDf.select(
  $"value".getItem(0).as("col0"),
  $"value".getItem(1).as("col1"),
  $"value".getItem(2).as("col2")
)

// Variant 3
val res1 = inptDf
  .select(col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*)
  .drop("value")
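One reason the original attempt fails is that the array column is named value, not arr; a generic sketch of that select-based approach (assuming the same inptDf and spark.implicits._ import):
// Variant 4: generic over the number of columns
val ncols = 3
val res2 = inptDf.select((0 until ncols).map(i => $"value".getItem(i).as(s"$i")): _*)
res2.show(false)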
I want to perform a lookup on myMap. When the col2 value is "0000", I want to update it with the value associated with the col1 key; otherwise I want to keep the existing col2 value.
myDF:
+-----+-----+
|col1 |col2 |
+-----+-----+
|1 |a |
|2 |0000 |
|3 |c |
|4 |0000 |
+-----+-----+
val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup = udf((key:String) => broadcastMyMap.value.get(key))
myDF.withColumn("col2", when ($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
I've used the code above in spark-shell and it works fine, but when I build the application jar and submit it to Spark with spark-submit, it throws an error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$5: (string) => string)
Caused by: java.lang.NullPointerException
Is there a way to perform the lookup without using a UDF (which isn't the best option in terms of performance), or to fix the error?
I don't think I can simply use a join, because some values of myDF.col2 that have to be kept could be overwritten in the operation.
I can't reproduce your NullPointerException; the code as shown is valid. I verified it with the sample program below, which works fine. Try executing it:
package com.example

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction

object MapLookupDF {
  Logger.getLogger("org").setLevel(Level.OFF)

  def main(args: Array[String]) {
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder
      .master("local[*]")
      .appName("MapLookupDF")
      .getOrCreate()
    import spark.implicits._

    val mydf = Seq((1, "a"), (2, "0000"), (3, "c"), (4, "0000")).toDF("col1", "col2")
    mydf.show

    val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
    println(myMap.toString)
    val broadcastMyMap = spark.sparkContext.broadcast(myMap)

    def lookup: UserDefinedFunction = udf((key: String) => {
      println("getting the value for the key " + key)
      broadcastMyMap.value.get(key)
    })

    val finaldf = mydf.withColumn("col2", when($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
    finaldf.show
  }
}
Result:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2|0000|
| 3| c|
| 4|0000|
+----+----+
Map(2 -> b, 4 -> d)
getting the value for the key 2
getting the value for the key 4
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+----+----+
Note: there won't be significant performance degradation for broadcasting a small map.
If you want to go the DataFrame route instead, you can convert the map to a DataFrame:
val df = myMap.toSeq.toDF("key", "val")
Map(2 -> b, 4 -> d) in DataFrame form looks like:
+---+---+
|key|val|
+---+---+
|  2|  b|
|  4|  d|
+---+---+
and then join like this
DIY from here, or use the sketch below.
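For completeness, a sketch of that join-based lookup (assuming the mydf and the key/val DataFrame df from above; the cast just makes the join keys comparable):
import org.apache.spark.sql.functions.{broadcast, col, when}

val joined = mydf
  .join(broadcast(df), mydf("col1").cast("string") === df("key"), "left")
  // when col2 is the "0000" placeholder take the looked-up value, otherwise keep col2
  .withColumn("col2", when(col("col2") === "0000", col("val")).otherwise(col("col2")))
  .select("col1", "col2")
joined.show()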
I created a DataFrame as follows:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(1, List(1,2,3)),
(1, List(5,7,9)),
(2, List(4,5,6)),
(2, List(7,8,9)),
(2, List(10,11,12))
).toDF("id", "list")
val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))
df1.show(false)
Then I tried to convert the WrappedArray row value to string as follows:
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
val d = df1.withColumn("col1", arrayToString($"col1"))
d: org.apache.spark.sql.DataFrame = [id: int, col1: string]
scala> d.show(false)
+---+----------------------------+
|id |col1 |
+---+----------------------------+
|1 |1, 2, 3, 5, 7, 9 |
|2 |4, 5, 6, 7, 8, 9, 10, 11, 12|
+---+----------------------------+
What I really want is to generate an output like the following:
+---+----------------------------+
|id |col1                        |
+---+----------------------------+
|1  |1$2$3, 5$7$9                |
|2  |4$5$6, 7$8$9, 10$11$12      |
+---+----------------------------+
How can I achieve this?
You don't need a udf function; a simple concat_ws should do the trick for you:
import org.apache.spark.sql.functions._
val df1 = df.withColumn("list", concat_ws("$", col("list")))
.groupBy("id")
.agg(concat_ws(", ", collect_set($"list")).as("col1"))
df1.show(false)
which should give you
+---+----------------------+
|id |col1 |
+---+----------------------+
|1 |1$2$3, 5$7$9 |
|2 |7$8$9, 4$5$6, 10$11$12|
+---+----------------------+
As usual, udf functions should be avoided when built-in functions are available, since a udf requires serializing column data to primitive types for the calculation and deserializing the results back into columns.
Even more concisely, you can avoid the withColumn step:
val df1 = df.groupBy("id")
.agg(concat_ws(", ", collect_set(concat_ws("$", col("list")))).as("col1"))
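Note that collect_set drops duplicates and does not guarantee ordering, which is why the grouped values above appear shuffled; if duplicates matter, collect_list is the drop-in alternative (a sketch using the same df; ordering within a group is still not strictly guaranteed):
val df2 = df.groupBy("id")
  .agg(concat_ws(", ", collect_list(concat_ws("$", col("list")))).as("col1"))
df2.show(false)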
I hope the answer is helpful
I want to add a new column to a DataFrame using an expression.
For example, I have a DataFrame:
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with the expression "C2/C3+C4". Assume there are several new columns to add, and the expressions may differ and come from a database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.
Normally I use df.withColumn to add a new column.
It seems I need to create a UDF, but how can I pass column values as parameters to the UDF, especially when multiple expressions need different columns for their calculations?
This can be done using expr to create a Column from an expression:
val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
import org.apache.spark.sql.functions.expr
df.withColumn("z",expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
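Since the question mentions several expressions coming from a database, one way to apply them in bulk is to fold over (column name, expression string) pairs; a sketch against the same df, with illustrative expression strings:
// each pair is (new column name, SQL expression string), e.g. loaded from a database
val newCols = Seq("z" -> "x + y", "w" -> "x * 2")
val result = newCols.foldLeft(df) { case (acc, (name, e)) =>
  acc.withColumn(name, expr(e))
}
result.show()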
Two approaches:
import spark.implicits._ // so that you can use .toDF
val df = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
import org.apache.spark.sql.functions._
// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()
// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
In Spark 2.x, you can create a new column C5 with the expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._:
val currentDf = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
val requiredDf = currentDf
.withColumn("C5", (col("C2")/col("C3")+col("C4")))
Also, you can do the same using org.apache.spark.sql.Column.
(But the space overhead is a bit higher in this approach than using org.apache.spark.sql.functions._, due to the extra Column object creation.)
val requiredDf = currentDf
.withColumn("C5", (new Column("C2")/new Column("C3")+new Column("C4")))
This worked perfectly for me. I am using Spark 2.0.2.
Is there a way to pass a DataFrame as an optional input parameter to a function in Scala?
Ex:
def test(sampleDF: DataFrame = df.sqlContext.emptyDataFrame): DataFrame = {
}
df.test(sampleDF)
Though I am passing a valid DataFrame here, it always gets assigned the empty DataFrame. How can I avoid this?
Yes, you can pass a DataFrame as a parameter to a function.
Let's say you have a DataFrame:
import sqlContext.implicits._
val df = Seq(
(1, 2, 3),
(1, 2, 3)
).toDF("col1", "col2", "col3")
which is
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |2 |3 |
|1 |2 |3 |
+----+----+----+
You can pass it to a function as below:
import org.apache.spark.sql.DataFrame
def test(sampleDF: DataFrame): DataFrame = {
sampleDF.select("col1", "col2") //doing some operation in dataframe
}
val testdf = test(df)
testdf would be
+----+----+
|col1|col2|
+----+----+
|1 |2 |
|1 |2 |
+----+----+
Edit:
As eliasah pointed out, @Garipaso wanted an optional argument. This can be done by defining the function as:
def test(sampleDF: DataFrame = sqlContext.emptyDataFrame): DataFrame = {
if(sampleDF.count() > 0) sampleDF.select("col1", "col2") //doing some operation in dataframe
else sqlContext.emptyDataFrame
}
If we pass a valid DataFrame, as in
test(df).show(false)
It will give output as
+----+----+
|col1|col2|
+----+----+
|1 |2 |
|1 |2 |
+----+----+
But if we don't pass an argument, as in
test().show(false)
we get an empty DataFrame:
++
||
++
++
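One caveat with the emptiness check above: count() scans the whole DataFrame just to decide whether it is empty. A cheaper variant of the same idea (a sketch, assuming the same sqlContext):
def test(sampleDF: DataFrame = sqlContext.emptyDataFrame): DataFrame = {
  // head(1) stops after the first row instead of counting everything
  if (sampleDF.head(1).nonEmpty) sampleDF.select("col1", "col2")
  else sqlContext.emptyDataFrame
}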
I hope the answer is helpful