How to write code that creates a Dataset whose columns hold the elements of an array column and are named by position? - scala

Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is what the input looks like:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is the expected output:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried to use code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I get errors, because I need some :Unit

Another way to create a DataFrame (this snippet is PySpark, not Scala):
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
Output:
+----+---------+
|col2| col4|
+----+---------+
| 1|[4, 2, 1]|
| 4| [3, 2]|
+----+---------+

package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object ArrayToCol extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")
val d = inptDf
.withColumn("0", col("value").getItem(0))
.withColumn("1", col("value").getItem(1))
.withColumn("2", col("value").getItem(2))
.drop("value")
d.show(false)
}
// Variant 2
val res = inptDf.select(
$"value".getItem(0).as("col0"),
$"value".getItem(1).as("col1"),
$"value".getItem(2).as("col2")
)
// Variant 3
val res1 = inptDf.select(
col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*
)
.drop("value")
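The variants above hard-code three columns. As a minimal sketch, assuming each array should be spread over as many columns as the longest array in the data, the column count can be derived with size and max. The object name ArrayToColsDynamic and the helper spread are hypothetical names for illustration. Note also that toDF without arguments names the column value, so the $"arr" reference in the question would not resolve.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size}

object ArrayToColsDynamic {
  // Hypothetical helper: one output column per array position, named "0", "1", ...
  def spread(df: DataFrame, arrayCol: String): DataFrame = {
    // Length of the longest array; this runs one extra Spark job.
    // Assumes the DataFrame is non-empty (otherwise the aggregate is null).
    val ncols = df.agg(max(size(col(arrayCol)))).head().getInt(0)
    val cols = (0 until ncols).map(i => col(arrayCol).getItem(i).as(i.toString))
    df.select(cols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ArrayToColsDynamic")
      .getOrCreate()
    import spark.implicits._

    val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")
    spread(inputDf, "value").show(false)
    spark.stop()
  }
}
```

With ANSI mode off, arrays shorter than the longest one yield null in the missing positions; with ANSI mode on, an out-of-range getItem may raise an error instead, so ragged input would need extra handling.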

Scala - Return the largest string within each group

DataSet:
+---+--------+
|age| name|
+---+--------+
| 33| Will|
| 26|Jean-Luc|
| 55| Hugh|
| 40| Deanna|
| 68| Quark|
| 59| Weyoun|
| 37| Gowron|
| 54| Will|
| 38| Jadzia|
| 27| Hugh|
+---+--------+
Here is my attempt but it just returns the size of the largest string rather than the largest string:
AgeName.groupBy("age")
.agg(max(length(AgeName("name")))).show()
The usual row_number trick should work if you specify the Window correctly. Using @LeoC's example,
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
val df2 = df.withColumn(
"rownum",
expr("row_number() over (partition by age order by length(name) desc)")
).filter("rownum = 1").drop("rownum")
df2.show
+---+---------+
|age| name|
+---+---------+
| 22|Alexander|
| 35| Michelle|
+---+---------+
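The same row_number approach can also be written with the DataFrame Window API instead of a SQL string. This is a sketch under the same assumptions as the answer above; the object name LongestNamePerAge and the helper longestPerAge are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, length, row_number}

object LongestNamePerAge {
  // Keeps one row per age: the one whose name is longest.
  def longestPerAge(df: DataFrame): DataFrame = {
    val byAge = Window.partitionBy("age").orderBy(length(col("name")).desc)
    df.withColumn("rownum", row_number().over(byAge))
      .filter(col("rownum") === 1)
      .drop("rownum")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LongestNamePerAge")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (35, "John"),
      (22, "Jennifer"),
      (22, "Alexander"),
      (35, "Michelle"),
      (22, "Celia")
    ).toDF("age", "name")

    longestPerAge(df).show()
    spark.stop()
  }
}
```

If two names in a group have the same length, row_number keeps only one of them, and which one is not specified; swapping in rank would keep all of the tied names.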
Here's one approach using Spark's higher-order function aggregate, as shown below:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
df.
groupBy("age").agg(collect_list("name").as("names")).
withColumn(
"longest_name",
expr("aggregate(names, '', (acc, x) -> case when length(acc) < length(x) then x else acc end)")
).
show(false)
// +---+----------------------------+------------+
// |age|names |longest_name|
// +---+----------------------------+------------+
// |22 |[Jennifer, Alexander, Celia]|Alexander |
// |35 |[John, Michelle] |Michelle |
// +---+----------------------------+------------+
Note that higher-order functions are available only on Spark 2.4+.
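// Another approach: join the maximum length per age back to the name lengths.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, length, max}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object BasicDatasetTest {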
def main(args: Array[String]): Unit = {
val spark=SparkSession.builder()
.master("local[*]")
.appName("BasicDatasetTest")
.getOrCreate()
val pairs=List((33,"Will"),(26,"Jean-Luc"),
(55, "Hugh"),
(26, "Deanna"),
(26, "Quark"),
(55, "Weyoun"),
(33, "Gowron"),
(55, "Will"),
(26, "Jadzia"),
(27, "Hugh"))
val schema=new StructType(Array(
StructField("age",IntegerType,false),
StructField("name",StringType,false))
)
val dataRDD=spark.sparkContext.parallelize(pairs).map(record=>Row(record._1,record._2))
val dataset=spark.createDataFrame(dataRDD,schema)
val ageNameGroup=dataset.groupBy("age","name")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageNameGroup.printSchema()
val ageGroup=dataset.groupBy("age")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageGroup.printSchema()
ageGroup.createOrReplaceTempView("age_group")
ageNameGroup.createOrReplaceTempView("age_name_group")
spark.sql("select ag.age,ang.name from age_group as ag, age_name_group as ang " +
"where ag.age=ang.age and ag.length=ang.length")
.show()
}
}
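The self-join in the program above can be avoided. One possible alternative, sketched here and not taken from any of the answers, is to take the max of a struct whose first field is the name length: structs compare field by field, so the longest name wins. The object name LongestNameStruct and the helper longestPerAge are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, max, struct}

object LongestNameStruct {
  // max over struct(len, name) orders by len first, then by name.
  def longestPerAge(df: DataFrame): DataFrame =
    df.groupBy("age")
      .agg(max(struct(length(col("name")).as("len"), col("name"))).as("longest"))
      .select(col("age"), col("longest.name").as("name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LongestNameStruct")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (33, "Will"), (26, "Jean-Luc"), (55, "Hugh"), (26, "Deanna"), (26, "Quark"),
      (55, "Weyoun"), (33, "Gowron"), (55, "Will"), (26, "Jadzia"), (27, "Hugh")
    ).toDF("age", "name")

    longestPerAge(df).show()
    spark.stop()
  }
}
```

Unlike the join, this returns exactly one name per age; when lengths tie, the name that sorts last as a string is returned.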

How to create a Spark DF as below

I need to create a Scala Spark DF as below. This question may be silly, but I would like to know the best approach for creating small structures for testing purposes:
For creating a minimal DF.
For creating a minimal RDD.
I've tried the following code so far without success:
val rdd2 = sc.parallelize(Seq("7","8","9"))
and then converting it to a DF with
val dfSchema = Seq("col1", "col2", "col3")
and
rdd2.toDF(dfSchema: _*)
Here's a sample Dataframe I'd like to obtain :
c1 c2 c3
1 2 3
4 5 6
abc_spark, here's a sample you can use to easily create Dataframes and RDDs for testing :
import spark.implicits._
val df = Seq(
(1, 2, 3),
(4, 5, 6)
).toDF("c1", "c2", "c3")
df.show(false)
+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1 |2 |3 |
|4 |5 |6 |
+---+---+---+
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
val rdd: RDD[Row] = df.rdd
rdd.map{_.getAs[Int]("c2")}.foreach{println}
Gives (the order may vary between runs):
5
2
You are missing a pair of parentheses inside Seq: each row has to be a tuple. Use it as below:
scala> val df = sc.parallelize(Seq(("7","8","9"))).toDF("col1", "col2", "col3")
scala> df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 7| 8| 9|
+----+----+----+
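To make the difference concrete, here is a small sketch contrasting the two shapes. A Seq of plain strings becomes three rows in a single column, which is why passing three column names to toDF fails in the question, whereas a Seq containing one tuple becomes one row with three columns. The object name MinimalTestData and its helper names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object MinimalTestData {
  // Three plain strings: three rows, one column.
  def oneColumn(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("7", "8", "9").toDF("col1")
  }

  // One tuple: one row, three columns.
  def oneRow(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("7", "8", "9")).toDF("col1", "col2", "col3")
  }

  // A minimal RDD of tuples, convertible with toDF in the same way.
  def minimalRdd(spark: SparkSession): RDD[(String, String, String)] =
    spark.sparkContext.parallelize(Seq(("7", "8", "9")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MinimalTestData")
      .getOrCreate()
    import spark.implicits._

    oneColumn(spark).show()
    oneRow(spark).show()
    minimalRdd(spark).toDF("col1", "col2", "col3").show()
    spark.stop()
  }
}
```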

How to convert List to Row with multiple columns

I create a DataFrame from a CSV file and process each row. I want each processed row to become a new row with the same number of columns.
val df = spark.read.format("csv").load("data.csv")
def process(line: Row) : Seq[String] = {
val list = new ArrayList[String]
for (i <- 0 to line.size-1) {
list.add(line.getString(i).toUpperCase)
}
list.asScala.toSeq
}
val df2 = df.map(process(_))
df2.show
Expected:
+---+---+---+
| _1| _2| _3|
+---+---+---+
| X1| X2| X3|
| Y1| Y2| Y3|
+---+---+---+
Getting:
+------------+
| value|
+------------+
|[X1, X2, X3]|
|[Y1, Y2, Y3]|
+------------+
Input file data.csv:
x1,x2,x3
y1,y2,y3
Note that the code should work for this input file as well:
x1,x2,x3,x4
y1,y2,y3,y4
And for this input file, I'd like to see result
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| X1| X2| X3| X4|
| Y1| Y2| Y3| Y4|
+---+---+---+---+
Please note that I used toUpperCase() in process() just to keep the example simple. The real logic in process() can be a lot more complex.
Second update: converting the RDD elements to Row
@USML, I basically changed Seq[String] to Row, because spark.createDataFrame(rdd, schema) expects an RDD[Row].
val df2 = csvDf.rdd.map(process(_)).map(a => Row.fromSeq(a))
//df2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
// And we use a dynamic schema (i.e. the same number of columns as the CSV)
spark.createDataFrame(df2, schema = dynamicSchema).show(false)
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|X1 |X2 |X3 |
|Y1 |Y2 |Y3 |
+---+---+---+
Update on the changed requirement
As long as you are reading the CSV, the output will have the same number of columns as your CSV, because we use df.schema to create the DataFrame after calling the process method. Try this:
import java.util.ArrayList
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
val df = spark.read.format("csv").load("data.csv")
val dynamicSchema = df.schema // This makes sure to preserve the same number of columns
def process(line: Row) : Seq[String] = {
val list = new ArrayList[String]
for (i <- 0 to line.size-1) {
list.add(line.getString(i).toUpperCase)
}
list.asScala.toSeq
}
val df2 = df.rdd.map(process(_)).map(a => Row.fromSeq(a)) // df2 is actually an RDD // updated conversion to Row
val finalDf = spark.createDataFrame(df2, schema = dynamicSchema) // We use same schema
finalDf.show(false)
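When the per-row logic can be expressed with built-in column functions, the RDD round trip is not needed at all. This is an assumption about the use case, since the question says the real logic may be more complex. The sketch below applies upper to every column, so the output keeps however many columns the input has; the object name UpperAllColumns and the helper upperAll are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object UpperAllColumns {
  // Applies upper() to each column and keeps the original column names.
  // Assumes all columns are strings, as they are when reading a CSV without a schema.
  def upperAll(df: DataFrame): DataFrame =
    df.select(df.columns.map(c => upper(col(c)).as(c)): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("UpperAllColumns")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.csv("data.csv"), using the default CSV column names.
    val csvLike = Seq(("x1", "x2", "x3"), ("y1", "y2", "y3")).toDF("_c0", "_c1", "_c2")
    upperAll(csvLike).show(false)
    spark.stop()
  }
}
```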
File Contents =>
cat data.csv
a1,b1,c1,d1
a2,b2,c2,d2
Code =>
import org.apache.spark.sql.Row
val csvDf = spark.read.csv("data.csv")
csvDf.show(false)
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|a1 |b1 |c1 |d1 |
|a2 |b2 |c2 |d2 |
+---+---+---+---+
def process(cols: Row): Row = { Row("a", "b", "c","d") } // Check the Data Type
val df2 = csvDf.rdd.map(process(_)) // df2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val finalDf = spark.createDataFrame(df2,schema = csvDf.schema)
finalDf.show(false)
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|a |b |c |d |
|a |b |c |d |
+---+---+---+---+
Points to note:
The function passed to map must return a Row so that the result can be given a schema.
It is better practice to use a type-safe case class.
The rest should be easy.
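The case-class suggestion can be sketched as follows, assuming the number of columns is known up front (four here). That is a trade-off, because it gives up the variable column count the question asks for. The case class Record, its field names, and the object TypedProcess are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Must be defined at top level so that Spark can derive an encoder for it.
final case class Record(c0: String, c1: String, c2: String, c3: String)

object TypedProcess {
  // Per-row logic on a typed record instead of an untyped Row.
  def process(r: Record): Record =
    Record(r.c0.toUpperCase, r.c1.toUpperCase, r.c2.toUpperCase, r.c3.toUpperCase)

  def run(spark: SparkSession): Dataset[Record] = {
    import spark.implicits._
    // Stand-in for the CSV input, with columns renamed to match the case class.
    val ds = Seq(("a1", "b1", "c1", "d1"), ("a2", "b2", "c2", "d2"))
      .toDF("c0", "c1", "c2", "c3")
      .as[Record]
    ds.map(r => process(r))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TypedProcess")
      .getOrCreate()
    run(spark).show(false)
    spark.stop()
  }
}
```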

Perform lookup on a broadcasted Map conditioned on column value in Spark using Scala

I want to perform a lookup on myMap. When col2 value is "0000" I want to update it with the value related to col1 key. Otherwise I want to keep the existing col2 value.
val myDF :
+-----+-----+
|col1 |col2 |
+-----+-----+
|1 |a |
|2 |0000 |
|3 |c |
|4 |0000 |
+-----+-----+
val myMap : Map[String, String] = Map("2" -> "b", "4" -> "d")
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup = udf((key:String) => broadcastMyMap.value.get(key))
myDF.withColumn("col2", when ($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
I've used the code above in spark-shell and it works fine but when I build the application jar and submit it to Spark using spark-submit it throws an error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$5: (string) => string)
Caused by: java.lang.NullPointerException
Is there a way to perform the lookup without using a UDF, which isn't the best option in terms of performance, or to fix the error?
I think I can't just use a join, because some values of myDF.col2 that have to be kept could be substituted in the operation.
I could not reproduce your NullPointerException with the sample program below.
It works fine for me; try running it yourself.
package com.example
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
object MapLookupDF {
Logger.getLogger("org").setLevel(Level.OFF)
def main(args: Array[String]): Unit = {
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.
master("local[*]")
.appName("MapLookupDF")
.getOrCreate()
import spark.implicits._
val mydf = Seq((1, "a"), (2, "0000"), (3, "c"), (4, "0000")).toDF("col1", "col2")
mydf.show
val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
println(myMap.toString)
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup: UserDefinedFunction = udf((key: String) => {
println("getting the value for the key " + key)
broadcastMyMap.value.get(key)
}
)
val finaldf = mydf.withColumn("col2", when($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
finaldf.show
}
}
Result :
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2|0000|
| 3| c|
| 4|0000|
+----+----+
Map(2 -> b, 4 -> d)
getting the value for the key 2
getting the value for the key 4
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+----+----+
Note: there won't be significant degradation for a small broadcast map.
If you want to go with a DataFrame instead, you can convert the map to a DataFrame:
val df = myMap.toSeq.toDF("key", "val")
Map(2 -> b, 4 -> d) in dataframe format will be like
+---+---+
|key|val|
+---+---+
|  2|  b|
|  4|  d|
+---+---+
and then join it with myDF; the join itself is left as an exercise.
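One way the join could look is sketched below. It is not the answerer's code. It uses a left join against the small lookup DataFrame and replaces col2 only where it equals "0000" and a match exists, so all other col2 values are kept. The object name MapLookupJoin and the helper fill are hypothetical, and col1 is assumed to be a string column like the map keys.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, when}

object MapLookupJoin {
  def fill(spark: SparkSession, myDF: DataFrame, myMap: Map[String, String]): DataFrame = {
    import spark.implicits._
    val lookupDf = myMap.toSeq.toDF("key", "val")
    // The broadcast hint keeps the small lookup table on every executor.
    myDF.join(broadcast(lookupDf), myDF("col1") === lookupDf("key"), "left")
      .withColumn(
        "col2",
        // Only placeholder values with a matching key are replaced.
        when(col("col2") === "0000" && col("val").isNotNull, col("val")).otherwise(col("col2"))
      )
      .drop("key", "val")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MapLookupJoin")
      .getOrCreate()
    import spark.implicits._

    val myDF = Seq(("1", "a"), ("2", "0000"), ("3", "c"), ("4", "0000")).toDF("col1", "col2")
    fill(spark, myDF, Map("2" -> "b", "4" -> "d")).show()
    spark.stop()
  }
}
```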

How to get max value of corresponding item in many arrays in DataFrame column?

Given a DataFrame like the following:
import spark.implicits._
val df1 = List(
("id1", Array(0,2)),
("id1",Array(2,1)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+
I want to group by id and take the element-wise maximum (max pooling) of the value arrays. For id1, the result is Array(2,2). The result I want to get is:
import spark.implicits._
val res = List(
("id1", Array(2,2)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+
One approach uses the RDD API, with reduceByKey computing the element-wise maximum:
import spark.implicits._
val df1 = List(
("id1", Array(0,2,3)),
("id1",Array(2,1,4)),
("id2",Array(0,7,3))
).toDF("id", "value")
val df2rdd = df1.rdd
.map(x => (x(0).toString,x.getSeq[Int](1)))
.reduceByKey((x,y) => {
val arrlength = x.length
var i = 0
val resarr = scala.collection.mutable.ArrayBuffer[Int]()
while(i < arrlength){
if (x(i) >= y(i)){
resarr.append(x(i))
} else {
resarr.append(y(i))
}
i += 1
}
resarr
}).toDF("id","newvalue")
You can do it like this:
//Input df
+---+---------+
| id| value|
+---+---------+
|id1|[0, 2, 3]|
|id1|[2, 1, 4]|
|id2|[0, 7, 3]|
+---+---------+
//Solution approach:
import org.apache.spark.sql.functions.{collect_set, udf}
val df1=df.groupBy("id").agg(collect_set("value").as("value"))
val maxUDF = udf{(s:Seq[Seq[Int]])=>s.reduceLeft((prev,next)=>prev.zip(next).map(tup=>if(tup._1>tup._2) tup._1 else tup._2))}
df1.withColumn("value",maxUDF(df1.col("value"))).show
//Sample Output:
+---+---------+
| id| value|
+---+---------+
|id1|[2, 2, 4]|
|id2|[0, 7, 3]|
+---+---------+
I hope this helps.
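The same element-wise maximum can also be computed with built-in functions only, avoiding both the RDD conversion and the UDF. This is a sketch and not part of the answers above: posexplode emits one row per array position, max is taken per (id, position), and the values are then collected back in position order. The object name ElementwiseMax and the helper maxPool are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, max, posexplode, sort_array, struct}

object ElementwiseMax {
  def maxPool(df: DataFrame): DataFrame =
    df.select(col("id"), posexplode(col("value")))   // columns: id, pos, col
      .groupBy("id", "pos")
      .agg(max("col").as("m"))                        // maximum per array position
      .groupBy("id")
      // Sorting the (pos, m) structs restores the original element order.
      .agg(sort_array(collect_list(struct(col("pos"), col("m")))).as("s"))
      .select(col("id"), col("s.m").as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ElementwiseMax")
      .getOrCreate()
    import spark.implicits._

    val df = List(
      ("id1", Array(0, 2, 3)),
      ("id1", Array(2, 1, 4)),
      ("id2", Array(0, 7, 3))
    ).toDF("id", "value")

    maxPool(df).show()
    spark.stop()
  }
}
```

Unlike the zip-based UDF, which truncates to the shortest array in a group, this version keeps every position that appears in any array of the group.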