Spark UDF to split a column value to multiple columns - scala

I have a dataframe column called 'description' with values in the below format:
ABC XXXXXXXXXXXX STORE NAME ABC TYPE1
I would like to parse it into 3 different columns like below:
| mode | type  | store      | description                           |
|------|-------|------------|---------------------------------------|
| ABC  | TYPE1 | STORE NAME | ABC XXXXXXXXXXXX STORE NAME ABC TYPE1 |
I tried the method suggested in the link here. It works for a simple UDF function but not for the function I have written. The challenge is that the value of store could be more than 2 words, with no fixed number of words.
def myFunc1: (String => (String, String, String)) = { description =>
  var descripe = description.split(" ")
  val `type` = descripe(descripe.size - 1)
  descripe = description.substring(description.indexOf("ABC") + 4, description.lastIndexOf("ABC")).split(" ")
  val mode = descripe(0)
  descripe(0) = ""
  val store = descripe.mkString(" ").trim
  (mode, store, `type`)
}
val schema = StructType(Array(
  StructField("mode", StringType, true),
  StructField("store", StringType, true),
  StructField("type", StringType, true)
))
val myUDF = udf(myFunc1, schema)
val test = pos.withColumn("test", myUDF(col("description")))
test.printSchema()
val a = test.withColumn("mode", col("test").getItem("_1"))
  .withColumn("store", col("test").getItem("_2"))
  .withColumn("type", col("test").getItem("_3"))
  .drop(col("test"))
a.printSchema()
a.show(5, false)
I get the below error when I execute
18/10/06 21:38:02 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$myFunc1$1$1: (string) => struct(mode:string,store:string,type:string))
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -4
  at java.lang.String.substring(String.java:1967)
  at com.hasif.bank.track.trasaction.TransactionParser$$anonfun$myFunc1$1$1.apply(TransactionParser.scala:26)
  at com.hasif.bank.track.trasaction.TransactionParser$$anonfun$myFunc1$1$1.apply(TransactionParser.scala:22)
  ... 16 more
Any pointers on this will be appreciated.

Check this out.
scala> val df = Seq("ABC XXXXXXXXXXXX STORE NAME ABC TYPE1").toDF("desc")
df: org.apache.spark.sql.DataFrame = [desc: string]
scala> df.withColumn("mode",split('desc," ")(0)).withColumn("type",split('desc," ")(5)).withColumn("store",concat(split('desc," ")(2), lit(" "), split('desc," ")(3))).show(false)
+-------------------------------------+----+-----+----------+
|desc |mode|type |store |
+-------------------------------------+----+-----+----------+
|ABC XXXXXXXXXXXX STORE NAME ABC TYPE1|ABC |TYPE1|STORE NAME|
+-------------------------------------+----+-----+----------+
scala>
Update1:
scala> def splitStore(x:String):String=
| return x.split(" ").drop(2).init.init.mkString(" ")
splitStore: (x: String)String
scala> val mysplitstore = udf(splitStore(_:String):String)
mysplitstore: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df2 = Seq("ABC XXXXXXXXXXXX STORE NAME XYZ ABC TYPE1").toDF("desc")
df2: org.apache.spark.sql.DataFrame = [desc: string]
scala> val df3 = df2.withColumn("length",split('desc," "))
df3: org.apache.spark.sql.DataFrame = [desc: string, length: array<string>]
scala> val df4 = df3.withColumn("mode",split('desc," ")(size('length)-2)).withColumn("type",split('desc," ")(size('length)-1)).withColumn("store",mysplitstore('desc))
df4: org.apache.spark.sql.DataFrame = [desc: string, length: array<string> ... 3 more fields]
scala> df4.drop('length).show(false)
+-----------------------------------------+----+-----+--------------+
|desc |mode|type |store |
+-----------------------------------------+----+-----+--------------+
|ABC XXXXXXXXXXXX STORE NAME XYZ ABC TYPE1|ABC |TYPE1|STORE NAME XYZ|
+-----------------------------------------+----+-----+--------------+
scala>
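For completeness, here is a minimal sketch (not part of the answer above) that folds the same drop/init idea into a single UDF returning a struct, so the substring call that raised the StringIndexOutOfBoundsException is avoided entirely. It assumes the token layout of the samples above and the asker's pos dataframe.
import org.apache.spark.sql.functions.{col, udf}

// Sketch: one UDF that returns (mode, store, type) as a struct in a single pass.
val parseDesc = udf { (description: String) =>
  val tokens = description.split(" ")
  val mode   = tokens.head                                // leading "ABC"
  val typ    = tokens.last                                // trailing "TYPE1"
  val store  = tokens.drop(2).dropRight(2).mkString(" ")  // drop the leading "ABC <id>" and the trailing "ABC TYPE1"
  (mode, store, typ)
}

val parsed = pos
  .withColumn("parsed", parseDesc(col("description")))
  .withColumn("mode", col("parsed").getField("_1"))
  .withColumn("store", col("parsed").getField("_2"))
  .withColumn("type", col("parsed").getField("_3"))
  .drop("parsed")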

Related

Spark : converting Array[Byte] data to RDD or DataFrame

I have data in the form of Array[Byte] which I want to convert into a Spark RDD or DataFrame so that I can write it directly to a Google bucket as a file. I am not able to write Array[Byte] data to a Google bucket directly, so I am looking for this conversion.
My code below is able to write the data to the local FS, but not to a Google bucket:
val encrypted = encrypt(original, readPublicKey(pubKey), outFile, true, true)
val dfis = new FileOutputStream(outFile)
dfis.write(encrypted)
dfis.close()
def encrypt(clearData: Array[Byte], encKey: PGPPublicKey, fileName: String, withIntegrityCheck: Boolean, armor: Boolean): Array[Byte] = {
...
}
So how can I convert Array[Byte] data to RDD or DataFrame? I am using Scala.
Just use .toDF() or .toDF().rdd:
scala> val arr: Array[Byte] = Array(192.toByte, 168.toByte, 1.toByte, 4.toByte)
arr: Array[Byte] = Array(-64, -88, 1, 4)
scala> val df = arr.toSeq.toDF()
df: org.apache.spark.sql.DataFrame = [value: tinyint]
scala> df.show()
+-----+
|value|
+-----+
| -64|
| -88|
| 1|
| 4|
+-----+
scala> df.printSchema()
root
|-- value: byte (nullable = false)
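As an aside (a hedged sketch, not from the answer above): if the goal is just to ship the encrypted bytes to a bucket, the whole Array[Byte] can also be kept as a single BinaryType cell and written out by Spark. The gs:// path below is hypothetical and assumes the GCS connector is configured on the cluster.
import spark.implicits._

val encrypted: Array[Byte] = Array(1, 2, 3)        // stand-in for the real encrypted payload
val payloadDF = Seq(encrypted).toDF("payload")     // one row, column type: binary

payloadDF.printSchema()                            // |-- payload: binary (nullable = true)
payloadDF.write.mode("overwrite").parquet("gs://some-bucket/encrypted-output")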

Using Spark UDFs with struct sequences

Given a dataframe in which one column is a sequence of structs generated by the following sequence
val df = spark
  .range(10)
  .map((i) => (i % 2, util.Random.nextInt(10), util.Random.nextInt(10)))
  .toDF("a", "b", "c")
  .groupBy("a")
  .agg(collect_list(struct($"b", $"c")).as("my_list"))

df.printSchema
df.show(false)
Outputs
root
|-- a: long (nullable = false)
|-- my_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b: integer (nullable = false)
| | |-- c: integer (nullable = false)
+---+-----------------------------------+
|a |my_list |
+---+-----------------------------------+
|0 |[[0,3], [9,5], [3,1], [4,2], [3,3]]|
|1 |[[1,7], [4,6], [5,9], [6,4], [3,9]]|
+---+-----------------------------------+
I need to run a function over each struct list. The function prototype is similar to the function below
case class DataPoint(b: Int, c: Int)

def do_something_with_data(data: Seq[DataPoint]): Double = {
  // This is an example. I don't actually want the sum
  data.map(data_point => data_point.b + data_point.c).sum
}
I want to store the result of this function to another DataFrame column.
I tried to run
val my_udf = udf(do_something_with_data(_))
val df_with_result = df.withColumn("result", my_udf($"my_list"))
df_with_result.show(false)
and got
17/07/13 12:33:42 WARN TaskSetManager: Lost task 0.0 in stage 15.0 (TID 225, REDACTED, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<struct<b:int,c:int>>) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line27.$read$$iw$$iw$DataPoint
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$do_something_with_data$1.apply(<console>:29)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.do_something_with_data(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
Is it possible to use a UDF like this without first casting my rows to a container struct with the DataFrame API?
Doing something like:
case class MyRow(a: Long, my_list: Seq[DataPoint])
df.as[MyRow].map(r => (r.a, r.my_list, do_something_with_data(r.my_list)))
using the DataSet api works, but I'd prefer to stick with the DataFrame API if possible.
You cannot use a case-class as the input-argument of your UDF (but you can return case classes from the UDF). To map an array of structs, you can pass in a Seq[Row] to your UDF:
val my_udf = udf((data: Seq[Row]) => {
  // This is an example. I don't actually want the sum
  data.map { case Row(x: Int, y: Int) => x + y }.sum
})

df.withColumn("result", my_udf($"my_list")).show
+---+--------------------+------+
| a| my_list|result|
+---+--------------------+------+
| 0|[[0,3], [5,5], [3...| 41|
| 1|[[0,9], [4,9], [6...| 54|
+---+--------------------+------+
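A small follow-up sketch (not in the original answer): if you prefer to keep do_something_with_data untouched, you can rebuild the DataPoint case class from each Row inside the UDF and delegate to the original function.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Rebuild the case class from each Row, then reuse the existing function as-is.
val my_udf_rebuilt = udf((data: Seq[Row]) =>
  do_something_with_data(data.map(r => DataPoint(r.getAs[Int]("b"), r.getAs[Int]("c"))))
)

df.withColumn("result", my_udf_rebuilt($"my_list")).show(false)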

scala - Spark : How to union all dataframe in loop

Is there a way to get a dataframe that unions dataframes in a loop?
This is a sample code:
var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
)
for (x <- fruits){
  var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}
I would like to obtain something like this:
aCol | bCol | fruitsName
aaa,bbb,apple
aaa,bbb,orange
aaa,bbb,melon
Thanks again
You could create a sequence of DataFrames and then use reduce:
val results = fruits.
  map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name")).
  reduce(_.union(_))

results.show()
Steffen Schmitz's answer is the most concise one I believe.
Below is a more detailed answer if you are looking for more customization (of field types, etc):
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

// initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

// list to iterate through
var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
)

for (x <- fruits) {
  // union returns a new dataset
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}

// initialDF.show()
references:
How to create an empty DataFrame with a specified schema?
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
If you have different/multiple dataframes, you can use the below code, which is efficient:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
In a for loop:
val fruits = List("apple", "orange", "melon")
( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
You can first create a sequence and then use toDF to create a DataFrame:
scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()
scala> for ( x <- fruits){
| dseq = dseq :+ ("aaa","bbb",x)
| }
scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
scala> df.show
+----+----+------+
|aCol|bCol| name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+
Well... I think your question is a bit misguided.
As per my limited understanding of what you are trying to do, you should be doing the following:
val fruits = List(
"apple",
"orange",
"melon"
)
val df = fruits
  .map(x => ("aaa", "bbb", x))
  .toDF("aCol", "bCol", "name")
And this should be sufficient.

RDD[Array[String]] to Dataframe

I am new to Spark and Hive, and my goal is to load a delimited file (let's say CSV) into a Hive table. After a bit of reading I found out that the path to load the data into Hive is CSV -> dataframe -> Hive. (Please correct me if I am wrong.)
CSV:
1,Alex,70000,Columbus
2,Ryan,80000,New York
3,Johny,90000,Banglore
4,Cook, 65000,Glasgow
5,Starc, 70000,Aus
I read the CSV file by using the below command:
val csv =sc.textFile("employee_data.txt").map(line => line.split(",").map(elem => elem.trim))
csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[29] at map at <console>:39
Now I am trying to convert this RDD to a DataFrame, using the below code:
scala> val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
df: org.apache.spark.sql.DataFrame = [eid: string, name: string, salary: string, destination: string]
employee is a case class and I am using it as a schema definition.
case class employee(eid: String, name: String, salary: String, destination: String)
However, when I do df.show I get the below error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 22, user.hostname):
scala.MatchError: [Ljava.lang.String;@88ba3cb (of class [Ljava.lang.String;)
I was expecting a dataframe as an output. I know why I might be getting this error: the values in the RDD are stored in [Ljava.lang.String;@88ba3cb format and I need to use mkString to get the actual values, but I am not able to find how to do it. I appreciate your time.
If you fix your case class then it should work:
scala> case class employee(eid: String, name: String, salary: String, destination: String)
defined class employee
scala> val txtRDD = sc.textFile("data.txt").map(line => line.split(",").map(_.trim))
txtRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:24
scala> txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3)}.toDF.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
Otherwise you could convert the String to an Int:
scala> case class employee(eid: Int, name: String, salary: String, destination: String)
defined class employee
scala> val df = txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2, s3)}.toDF
df: org.apache.spark.sql.DataFrame = [eid: int, name: string ... 2 more fields]
scala> df.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
However the best solution would be to use spark-csv (which would treat the salary as an Int as well).
Also note that the error was thrown when you ran df.show because everything was being lazily evaluated up until that point. df.show is an action which will cause all of the queued transformations to execute (see this article for more).
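To make the laziness point concrete, here is a tiny illustrative sketch (not from the original answer): the mapping below is only a transformation, so a malformed line would not surface an error until an action such as count or show runs.
val parsed = sc.textFile("employee_data.txt")
  .map(line => line.split(",").map(_.trim))
  .map { case Array(eid, name, salary, dest) => (eid, name, salary, dest) }
// No error yet - nothing has executed.
// parsed.toDF("eid", "name", "salary", "destination").show()   // <- any MatchError would appear only here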
Use map on array elements, not on array:
val csv = sc.textFile("employee_data.txt")
  .map(line => line
    .split(",")
    .map(e => e.trim)
  )
val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
But why are you reading the CSV and then converting the RDD to a DF? Spark 1.5 can already read CSV via the spark-csv package:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", ";")
.load("employee_data.txt")
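As a hedged aside (not part of the answer above): on Spark 2.x+ the CSV reader is built in, and since the sample file shown in the question has no header row and is comma-delimited, the column names can be supplied explicitly.
val employees = spark.read
  .option("header", "false")        // the sample file has no header row
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .csv("employee_data.txt")
  .toDF("eid", "name", "salary", "destination")

employees.show()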
As you said in your comment, your case class employee (which should be named Employee) receives an Int as the first argument of its constructor, but you are passing a String. Thus, you should convert it to an Int before instantiating, or modify your case class, defining eid as a String.

Spark: convert rdd[row] to dataframe where one of the columns in the row is a list

I have an RDD[Row] with the following data for each row:
[guid, List(peopleObjects)]
["123", List(peopleObjects1, peopleObjects2, peopleObjects3)]
I want to convert this to a dataframe.
I am using the following code:
val personStructureType = new StructType()
  .add(StructField("guid", StringType, true))
  .add(StructField("personList", StringType, true))
val personDF = hiveContext.createDataFrame(personRDD, personStructureType)
Should I be using a different datatype for my schema instead of StringType?
If my list is just a string it works but when its a List I get the following error
scala.MatchError: List(personObject1, personObject2, personObject3) (of class scala.collection.immutable.$colon$colon)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It's not entirely clear what you are trying to do, but a better approach is to create a case class, map your RDD rows to the case class, and then call toDF.
Something like:
case class MyClass(guid: Int, peopleObjects: List[String])
val rdd = sc.parallelize(Array((123,List("a","b")),(1232,List("b","d"))))
val df = rdd.map(r => MyClass(r._1, r._2)).toDF
df.show
+----+-------------+
|guid|peopleObjects|
+----+-------------+
| 123| [a, b]|
|1232| [b, d]|
+----+-------------+
Or you can do it the long-hand way, but without using the case class, like this:
val df = sqlContext.createDataFrame(
  rdd.map(r => Row(r._1, r._2)),
  StructType(Array(
    StructField("guid", IntegerType),
    StructField("peopleObjects", ArrayType(StringType))
  ))
)
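As a quick sanity check on the case-class version above (a sketch; the commented output is what that schema implies), printSchema should now report the list column as array<string> rather than string:
df.printSchema()
// root
//  |-- guid: integer (nullable = false)
//  |-- peopleObjects: array (nullable = true)
//  |    |-- element: string (containsNull = true)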