DataFrame to Array of JSONs - Scala

I have a dataframe as below
+-------------+-------------+-------------+
| columnName1 | columnName2 | columnName3 |
+-------------+-------------+-------------+
| 001 | 002 | 003 |
+-------------+-------------+-------------+
| 004 | 005 | 006 |
+-------------+-------------+-------------+
I want to convert it to JSON in the expected format below.
EXPECTED FORMAT
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]
Thanks in Advance
I have tried this with the play-json APIs:
val allColumns: Seq[String] = DF.columns.toSeq
val result = DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](allColumns)
    kv.map { x =>
      Json
        .toJson(
          List(
            "key" -> x._1,
            "value" -> x._2
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10)
Now it is coming out in this format:
["[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}]","[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName3","value":"006"}]"]
But I need it in this expected format, using play-json with encoders:
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName3","value":"006"}]]
I am facing this issue:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map { row =>
Basically, I need to convert an Array[String] into an Array[Array[JsValue]].

The above exception is thrown because Spark does not have an encoder for JsValue to serialize/deserialize it in a Dataset. Have a look at writing a custom Spark encoder for it. Alternatively, instead of returning a JsValue, its toString can be returned inside the Dataset's map operation.
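For completeness, here is a minimal sketch of that custom-encoder route, assuming play-json's JsValue: register a Kryo-based binary encoder so that a Dataset[JsValue] at least compiles. The values are stored as opaque bytes inside Spark, so this mainly helps when the JsValues are only collected back to the driver.
import org.apache.spark.sql.{Encoder, Encoders}
import play.api.libs.json.JsValue

// Kryo-serialized binary encoder; Spark cannot inspect the JSON structure, it only shuttles bytes.
implicit val jsValueEncoder: Encoder[JsValue] = Encoders.kryo[JsValue]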
One approach could be:
Convert each row of the DF to a compatible JSON array - [{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}]
Collect the DF as an array/list and mkString it with a "," delimiter
Enclose the resulting string in "[]"
Note: the code below uses collect, which could choke the Spark driver.
//CSV
c1,c2,c3
001,002,003
004,005,006
//Code
import play.api.libs.json._

val df = spark.read.option("header", "true").csv("array_json.csv")
val allColumns = df.schema.map(s => s.name)
//Import Spark implicits Encoder (needed for the Dataset[String] produced by map)
import spark.implicits._
val sdf = df.map(row => {
  val kv = row.getValuesMap[String](allColumns)
  Json.toJson(kv.map(x => {
    List(
      "key" -> x._1,
      "value" -> x._2
    ).toMap
  })).toString()
})
val dfString = sdf.collect().mkString(",")
val op = s"[$dfString]"
println(op)
Output:
[[{"key":"c1","value":"001"},{"key":"c2","value":"002"},{"key":"c3","value":"003"}],[{"key":"c1","value":"004"},{"key":"c2","value":"005"},{"key":"c3","value":"006"}]]

Another approach without RDD:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = List((1, 2, 3), (11, 12, 13), (21, 22, 23)).toDF("A", "B", "C")
val fKeyValue = (name: String) =>
  struct(lit(name).as("key"), col(name).as("value"))
val lstCol = df.columns.foldLeft(List[Column]())((a, b) => fKeyValue(b) :: a)
val dsJson =
  df
    .select(collect_list(array(lstCol: _*)).as("obj"))
    .toJSON
import play.api.libs.json._
val json: JsValue = Json.parse(dsJson.first())
val arrJson = json \ "obj"
println(arrJson)
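Note that json \ "obj" is a JsLookupResult rather than a bare JsValue; a small sketch, using standard play-json accessors, of pulling the actual array back out and rendering it:
// Extract the bare JsArray from the lookup result and print it as a JSON string.
val objArray = (json \ "obj").as[JsArray]
println(Json.stringify(objArray))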

val allColumns: Seq[String] = DF.columns.toSeq
val result = Json.parse(DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](allColumns)
    kv.map { x =>
      Json
        .toJson(
          List(
            "key" -> x._1,
            "value" -> x._2
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10).mkString("[", ", ", "]"))
gives
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]]

Related

How to fix this type error after using a UDF in Spark SQL?

I want to explode my features (type: sparse vector from ml.linalg) into each feature's index and value, so I do the following:
def zipKeyValue(vec: linalg.Vector): Array[(Int, Double)] = {
  val indice: Array[Int] = vec.toSparse.indices
  val value: Array[Double] = vec.toSparse.values
  indice.zip(value)
}
val udf1 = udf(zipKeyValue _)
val df1 = df.withColumn("features", udf1(col("features")))
val df2 = df1.withColumn("features", explode(col("features")))
val udf2 = udf((f: Tuple2[Int, Double]) => f._1.toString)
val udf3 = udf((f: Tuple2[Int, Double]) => f._2)
val df3 = df2.withColumn("key", udf2(col("features"))).withColumn("value", udf3(col("features")))
df3.show()
But I got this error:
Failed to execute user defined function(anonfun$38: (struct<_1:int,_2:double>) => string)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
This is confusing to me since my function zipKeyValue returns (Int, Double) pairs, but after the explode I actually get a struct<_1:int,_2:double>. How can I fix it?
You don't need a UDF here. Just select the columns:
df2
  .withColumn("key", col("features._1"))
  .withColumn("value", col("features._2"))
In the general case you should use Rows, not Tuples:
import org.apache.spark.sql.Row
val udf2 = udf((f: Row) => f.getInt(0).toString)
val udf3 = udf((f: Row) => f.getDouble(1))
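A self-contained sketch of the same idea (the toy column names here are just for illustration), showing that a struct column reaches a UDF as a Row and can also be flattened without any UDF at all:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

// Toy DataFrame with a struct column shaped like the exploded "features".
val structDf = Seq((1, 0.5), (3, 1.5)).toDF("_1", "_2")
  .select(struct(col("_1"), col("_2")).as("features"))

// No UDF needed: select the struct fields directly.
structDf.select(col("features._1").as("key"), col("features._2").as("value")).show()

// If a UDF really is required, accept a Row, not a Tuple2.
val keyUdf = udf((f: Row) => f.getInt(0).toString)
structDf.withColumn("key", keyUdf(col("features"))).show()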

How to call a UDF with multiple arguments (currying) in Spark SQL?

How do I call the below UDF with multiple arguments (currying) on a Spark dataframe, as shown below?
Read the file and get a List[String]:
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF:
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
Call the UDF on the dataframe below:
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
Here I am missing the List[String] argument, and I am really not sure how I should pass in this argument.
I can make the following assumptions about your requirement based on your question:
a] The UDF should accept a parameter other than a dataframe column
b] The UDF should take multiple columns as parameters
Let's say you want to concat values from all columns along with a specified parameter. Here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String]((value1: Int, value2: String, value3: String) =>
  value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_"))
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
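An alternative sketch (this assumes Spark 2.2+ for typedLit, and the UDF name here is made up) is to skip the currying entirely and pass the list in as a literal array column; it arrives in the UDF as a Seq[String]:
import org.apache.spark.sql.functions.{typedLit, udf}

// Same concatenation, but the list is an ordinary (constant) column argument.
val concatUdf = udf((id: Int, s1: String, s2: String, lst: Seq[String]) =>
  (Seq(id.toString, s1, s2) ++ lst).mkString("_"))

val result2 = df.withColumn("new_col",
  concatUdf(df("id"), df("str1"), df("str2"), typedLit(dummyList)))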
Defining a UDF with multiple parameters:
val enrichUDF: UserDefinedFunction = udf((jsonData: String, id: Long) => {
  val lastOccurence = jsonData.lastIndexOf('}')
  val sid = ",\"site_refresh_stats_id\":" + id + " }]"
  val enrichedJson = jsonData.patch(lastOccurence, sid, sid.length)
  enrichedJson
})
Calling the UDF on an existing dataframe:
val enrichedDF = EXISTING_DF
  .withColumn("enriched_column",
    enrichUDF(col("jsonData"), col("id")))
An import statement is also required as:
import org.apache.spark.sql.expressions.UserDefinedFunction
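If the UDF should also be callable from SQL expressions, a hedged sketch is to register it (the function name "enrich_json" and the "existing" view name below are just examples):
// Register under a SQL-callable name and query it through a temp view.
spark.udf.register("enrich_json", enrichUDF)
EXISTING_DF.createOrReplaceTempView("existing")
spark.sql("SELECT enrich_json(jsonData, id) AS enriched_column FROM existing").show()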

How to convert var to List?

How do I convert one input variable into two List variables?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result to be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, the input is not valid JSON:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to turn it into an RDD of valid JSON strings (as you are going to use Apache Spark):
val validJsonRdd = sc.parallelize(Seq(input))
  .flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
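For reference, that chain of replace calls followed by the split should turn the single input string into one JSON object per line, roughly like this (a sketch of the intermediate result, not captured output):
validJsonRdd.collect().foreach(println)
// Expected to print something like:
// {"level":"1","var1":"name","var2":"id"}
// {"level":"1","var1":"name1","var2":"id1"}
// {"level":"2","var1":"add1","var2":"city"}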
Once you have the valid JSON RDD, you can easily convert it to a dataframe and then apply the logic you have:
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
  .groupBy("level")
  .agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
  .select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get the desired output in the dataframe as:
+------------------------------------------------+--------------------------------------------+
|var1 |var2 |
+------------------------------------------------+--------------------------------------------+
|[WrappedArray(name1, name2), WrappedArray(add1)]|[WrappedArray(id1, id2), WrappedArray(city)]|
+------------------------------------------------+--------------------------------------------+
And you can convert the arrays to lists if required.
To get the values exactly as in the question, you can do the following:
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name1, name2), List(add1))
val second = rdd(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id1, id2), List(city))
I hope the answer is helpful
reduceByKey is the important function for achieving your required output. More explanation is available in a step-by-step reduceByKey explanation.
You can do the following:
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
  val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
  (values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))
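One caveat: reduceByKey gives no ordering guarantee, which is why level 2 shows up before level 1 above. If the original level order matters, a sketch (assuming the level keys sort correctly as strings) is to sort by key before collecting:
// Sort by level so the nested lists come out in level order.
val ordered = groupedrdd.sortByKey()
val firstOrdered = ordered.map(_._2._1).collect().toList
//firstOrdered: List[List[String]] = List(List(name1, name2), List(add1))
val secondOrdered = ordered.map(_._2._2).collect().toList
//secondOrdered: List[List[String]] = List(List(id1, id2), List(city))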

Scala: read the text file and create tuples from it

How do I create tuples from the existing RDD below?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you are intending to have an Array of Tuple4, then you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field of the tuples:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have a variable-sized input file, such as:
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can use match-case pattern matching as:
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As #eliasah pointed out, the above procedure is bad practice since it loses the tuple type (you end up with Array[Product with Serializable]). Per his suggestion, we should know the maximum number of elements in the input data and use the following logic, where we assign default values for missing elements:
import scala.util.Try
val arrayTuples = rdd.map(line => line.split(","))
  .map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty")))
  .collect
And as #philantrovert pointed out, if we are not using the REPL, we can verify the output in the following way:
arrayTuples.foreach(println)
which results in:
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)

Array[Byte] Spark RDD to String Spark RDD

I'm using Cloudera's SparkOnHBase module in order to get data from HBase.
I get an RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to a row key and a list of values, all of them represented as byte arrays.
If I save getRdd to a file, what I see is:
([B#f7e2590,[([B#22d418e2,[B#12adaf4b,[B#48cf6e81), ([B#2a5ffc7f,[B#3ba0b95,[B#2b4e651c), ([B#27d0277a,[B#52cfcf01,[B#491f7520), ([B#3042ad61,[B#6984d407,[B#f7c4db0), ([B#29d065c1,[B#30c87759,[B#39138d14), ([B#32933952,[B#5f98506e,[B#8c896ca), ([B#2923ac47,[B#65037e6a,[B#486094f5), ([B#3cd385f2,[B#62fef210,[B#4fc62b36), ([B#5b3f0f24,[B#8fb3349,[B#23e4023a), ([B#4e4e403e,[B#735bce9b,[B#10595d48), ([B#5afb2a5a,[B#1f99a960,[B#213eedd5), ([B#2a704c00,[B#328da9c4,[B#72849cc9), ([B#60518adb,[B#9736144,[B#75f6bc34)])
for each record (rowKey and the columns)
But what I need is the String representation of each and every key and value, or at least the values, in order to save it to a file and see something like:
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala and it's quite hard to get anywhere.
Could you please help me with that?
First let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is what the data looks like:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
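One small refinement worth considering: new String(bytes) uses the JVM's default charset, so a hedged variant of b2s that pins the charset down (assuming the HBase bytes are UTF-8) would be:
import java.nio.charset.StandardCharsets

// Decode explicitly as UTF-8 instead of relying on the platform default charset.
def b2s(a: Array[Byte]): String = new String(a, StandardCharsets.UTF_8)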
I don't know about HBase but if those Array[Byte]s are Unicode strings, something like this should work:
rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map { case (k, l) =>
  (new String(k),
    l.map { case (a, b, c) =>
      (new String(a), new String(b), new String(c))
    })
}
Sorry for bad styling and whatnot, I am not even sure it will work.