Scala - String and Column objects

Here the variable "exprs" is of Column type (i.e. exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))).
Why does exprs: _* run into an error? Why should I provide head and tail, which as far as I understand are only for the String type?
val resGroupByDF2 = data.groupBy($"country").agg(exprs: _*) // why does this not work
case class cname(
  country: String,
  value: Double,
  price: Double
)

val data = Seq(
  cname("NA", 2, 14),
  cname("EU", 4, 61),
  cname("FE", 5, 1)
).toDF()
val exprs = data.columns.map(sum(_)) // here it returns exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))
val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*) // why just agg(exprs: _*) does not work in select or agg as it is already a column type

It is because of the signature of agg.
The signature is agg(expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.DataFrame. It expects at least one Column, plus an optional varargs list of further Columns.
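The same rule can be seen with a plain Scala method that takes one required parameter plus varargs (a minimal sketch, independent of Spark):

def f(first: Int, rest: Int*): Int = first + rest.sum

val xs = Array(1, 2, 3)
// f(xs: _*)            // does not compile: nothing fills the required `first`
f(xs.head, xs.tail: _*) // compiles and returns 6

Because agg has the same shape, exprs.head fills the required Column and exprs.tail: _* fills the varargs.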

Related

Spark scala.collection.immutable.$colon$colon is not a valid external type for schema of string

What is a possible reason why this code doesn't work, and how can I fix it? Yes, the f1, f2, f3, f4 fields are not used, but in production code getList has an argument xml_data, so I pass the xml_data field to the method and get a List[AnyRef].
def getList: List[AnyRef] = {
  List(
    "string",
    new Integer(20),
    Decimal(BigDecimal(10000), 10, 2),
    Timestamp.valueOf("2021-01-01 10:00:00"),
    List("1", "2", "3"),
    List(1, 2, 3),
    List(Decimal(BigDecimal(10000), 10, 2)),
    List(Timestamp.valueOf("2021-01-01 10:00:00")),
    null,
    List(null)
  )
}
...
val schema = StructType(Seq(
  StructField("string", StringType),
  StructField("int", IntegerType),
  StructField("decimal", DecimalType(10, 2)),
  StructField("timestamp", TimestampType),
  StructField("array_string", ArrayType(StringType)),
  StructField("array_int", ArrayType(IntegerType)),
  StructField("array_decimal", ArrayType(DecimalType(10, 2))),
  StructField("array_timestamp", ArrayType(TimestampType)),
  StructField("array_int", ArrayType(IntegerType)),
  StructField("array_string", ArrayType(StringType))
))
val encoder = RowEncoder(schema)
import spark.implicits._
List((1, 2, 3, 4))
  .toDF("f1", "f2", "f3", "f4")
  .as[Rec]
  .map(rec => {
    Row(getList)
  })(encoder)
  .show()
Row doesn't take a List but a varargs parameter. To expand the list into varargs you have to use the type ascription : _*. Otherwise the whole list is interpreted as the first (and only) parameter.
List((1, 2, 3, 4))
  .toDF("f1", "f2", "f3", "f4")
  .as[Rec]
  .map(rec => {
    Row(getList: _*)
  })(encoder)
  .show()
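A small sketch of the difference (assuming only org.apache.spark.sql.Row is in scope):

import org.apache.spark.sql.Row

val r1 = Row(List(1, 2, 3))      // one field holding the whole List
val r2 = Row(List(1, 2, 3): _*)  // three fields: 1, 2 and 3
r1.length                        // 1
r2.length                        // 3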

False/True Column constant

TL;DR: I need this Spark constant:
val False : Column = lit(1) === lit(0)
Any idea how to make it prettier?
Problem Context
I want to filter a dataframe from a collection. For example:
case class Condition(column: String, value: String)

val conditions = Seq(
  Condition("name", "bob"),
  Condition("age", "18")
)
val personsDF = Seq(
  ("bob", 30),
  ("anna", 20),
  ("jack", 18)
).toDF("name", "age")
When applying my collection to personsDF I expect:
val expected = Seq(
  ("bob", 30),
  ("jack", 18)
)
To do so, I am creating a filter from the collection and applying it to the dataframe:
val conditionsFilter = conditions.foldLeft(initialValue) {
  case (cumulatedFilter, Condition(column, value)) =>
    cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter)
Pretty sweet, right?
But to do so, I need the neutral value of the OR operator, which is False. Since False doesn't exist in Spark, I used:
val False : Column = lit(1) === lit(0)
Any idea how to do this without tricks?
You can just do:
val False : Column = lit(false)
This should be your initialValue, right? You can avoid that by using head and tail:
val buildCondition = (c: Condition) => col(c.column) === c.value
val initialValue = buildCondition(conditions.head)
val conditionsFilter = conditions.tail.foldLeft(initialValue)(
  (cumulatedFilter, condition) =>
    cumulatedFilter || buildCondition(condition)
)
Even shorter, you could use reduce:
val buildCondition = (c: Condition) => col(c.column) === c.value
val conditionsFilter = conditions.map(buildCondition).reduce(_ or _)
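For completeness, a sketch of the original foldLeft with lit(false) as the seed (assuming the Condition case class, conditions and personsDF from the question):

import org.apache.spark.sql.functions.{col, lit}

// lit(false) is the neutral element of ||, so it is a safe starting value
val conditionsFilter = conditions.foldLeft(lit(false)) {
  case (cumulatedFilter, Condition(column, value)) =>
    cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter).show()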

DataFrame to Array of Jsons

I have a dataframe as below
+-------------+-------------+-------------+
| columnName1 | columnName2 | columnName3 |
+-------------+-------------+-------------+
| 001 | 002 | 003 |
+-------------+-------------+-------------+
| 004 | 005 | 006 |
+-------------+-------------+-------------+
I want to convert it to JSON in the expected format below.
EXPECTED FORMAT
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName3","value":"006"}]]
Thanks in advance.
I have tried this with the play-json API:
val allColumns: Seq[String] = DF.columns.toSeq
val result = DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](allColumns)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10)
Now it is coming in this format:
["[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName1","value":"003"}]","[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName1","value":"006"}]"]
But I need it in this expected format, using play-json with encoders:
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName3","value":"006"}]]
I am facing this issue:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .map { row =>
Basically, this is converting an Array[String] to an Array[Array[JsValue]].
The above exception is thrown because Spark does not have an encoder for JsValue to serialize/deserialize the DataFrame. See Spark custom encoders for that. However, instead of returning a JsValue, its toString can be returned inside the DataFrame map operation.
One approach could be:
Convert each row of the DataFrame to compatible JSON - [{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}]
Collect the DataFrame as an array/list and mkString it with a "," delimiter
Enclose the resulting string in "[]"
Warning: the code below uses collect, which could choke the Spark driver.
//CSV
c1,c2,c3
001,002,003
004,005,006

//Code
val df = spark.read.option("header", "true").csv("array_json.csv")
val allColumns = df.schema.map(s => s.name)
//Import Spark implicits Encoder
import spark.implicits._
val sdf = df.map(row => {
  val kv = row.getValuesMap[String](allColumns)
  Json.toJson(kv.map(x => {
    List(
      "key" -> x._1,
      "value" -> x._2
    ).toMap
  })).toString()
})
val dfString = sdf.collect().mkString(",")
val op = s"[$dfString]"
println(op)
Output:
[[{"key":"c1","value":"001"},{"key":"c2","value":"002"},{"key":"c3","value":"003"}],[{"key":"c1","value":"004"},{"key":"c2","value":"005"},{"key":"c3","value":"006"}]]
Another approach without RDD:
import spark.implicits._
val df = List((1, 2, 3), (11, 12, 13), (21, 22, 23)).toDF("A", "B", "C")

val fKeyValue = (name: String) =>
  struct(lit(name).as("key"), col(name).as("value"))

val lstCol = df.columns.foldLeft(List[Column]())((a, b) => fKeyValue(b) :: a)

val dsJson = df
  .select(collect_list(array(lstCol: _*)).as("obj"))
  .toJSON

import play.api.libs.json._
val json: JsValue = Json.parse(dsJson.first())
val arrJson = json \ "obj"
println(arrJson)
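One caveat, assuming Play JSON 2.4 or later: the \ operator returns a JsLookupResult rather than a JsValue, so it may need to be unwrapped before further processing:

// hypothetical follow-up, only needed if a bare JsValue is required
val arrJsonValue: JsValue = (json \ "obj").get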
Alternatively, the snippet from the question can be wrapped in Json.parse after collecting:
val allColumns: Seq[String] = DF.columns.toSeq
val result = Json.parse(DF
  .limit(recordLimit)
  .map { row =>
    val kv: Map[String, String] = row.getValuesMap[String](allColumns)
    kv.map { x =>
      Json
        .toJson(
          List(
            ("key" -> x._1),
            ("value" -> x._2)
          ).toMap
        )
        .toString()
    }.mkString("[", ", ", "]")
  }
  .take(10).mkString("[", ", ", "]"))
gives
[[{"key":"columnName1","value":"001"},{"key":"columnName2","value":"002"},{"key":"columnName3","value":"003"}],[{"key":"columnName1","value":"004"},{"key":"columnName2","value":"005"},{"key":"columnName3","value":"006"}]]

Array[Byte] Spark RDD to String Spark RDD

I'm using Cloudera's SparkOnHBase module to get data from HBase.
I get an RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to a row key and a list of values, all of them represented as byte arrays.
If I save getRdd to a file, what I see is:
([B@f7e2590,[([B@22d418e2,[B@12adaf4b,[B@48cf6e81), ([B@2a5ffc7f,[B@3ba0b95,[B@2b4e651c), ([B@27d0277a,[B@52cfcf01,[B@491f7520), ([B@3042ad61,[B@6984d407,[B@f7c4db0), ([B@29d065c1,[B@30c87759,[B@39138d14), ([B@32933952,[B@5f98506e,[B@8c896ca), ([B@2923ac47,[B@65037e6a,[B@486094f5), ([B@3cd385f2,[B@62fef210,[B@4fc62b36), ([B@5b3f0f24,[B@8fb3349,[B@23e4023a), ([B@4e4e403e,[B@735bce9b,[B@10595d48), ([B@5afb2a5a,[B@1f99a960,[B@213eedd5), ([B@2a704c00,[B@328da9c4,[B@72849cc9), ([B@60518adb,[B@9736144,[B@75f6bc34)])
for each record (row key and its columns).
But what I need is the String representation of each of the keys and values, or at least the values, in order to save it to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala, and it's quite hard to get this working.
Could you please help me with that?
First let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is how the data is:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
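If the HBase bytes are known to be UTF-8 text (an assumption about the data), it may be safer to name the charset explicitly instead of relying on the JVM default:

import java.nio.charset.StandardCharsets

def b2s(a: Array[Byte]): String = new String(a, StandardCharsets.UTF_8)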
I don't know about HBase, but if those Array[Byte]s are Unicode strings, something like this should work:
val rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map { case (k, l) =>
  (new String(k),
    l.map { case (a, b, c) =>
      (new String(a), new String(b), new String(c))
    })
}
Sorry for bad styling and whatnot, I am not even sure it will work.

Make .csv files from arrays in Scala

I have two tuples in Scala of the following form:
val array1 = (bucket1, Seq((dateA, Amount11), (dateB, Amount12), (dateC, Amount13)))
val array2 = (bucket2, Seq((dateA, Amount21), (dateB, Amount22), (dateC, Amount23)))
What is the quickest way to make a .csv file in Scala such that:
date* is the pivot,
bucket* are the column names,
Amount* fills the table.
It needs to look something like this:
Dates______________bucket1__________bucket2
dateA______________Amount11________Amount21
dateB______________Amount12________Amount22
dateC______________Amount13________Amount23
You can make it shorter by chaining some operations, but:
scala> val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
array1: (String, Seq[(String, String)]) =
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)))
scala> val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
array2: (String, Seq[(String, String)]) =
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
// Single array to work with
scala> val arrays = List(array1, array2)
arrays: List[(String, Seq[(String, String)])] = List(
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13))),
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
)
// Split between buckets and the values
scala> val (buckets, values) = arrays.unzip
buckets: List[String] = List(bucket1, bucket2)
values: List[Seq[(String, String)]] = List(
List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)),
List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23))
)
// Format the data
// Note that this does not keep the 'dateX' order
scala> val grouped = values.flatten
.groupBy(_._1)
.map { case (date, list) => date::(list.map(_._2)) }
grouped: scala.collection.immutable.Iterable[List[String]] = List(
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join everything, and add the "Dates" column in front of the buckets
scala> val table = ("Dates"::buckets)::grouped.toList
table: List[List[String]] = List(
List(Dates, bucket1, bucket2),
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join the rows by ',' and the lines by "\n"
scala> val string = table.map(_.mkString(",")).mkString("\n")
string: String =
Dates,bucket1,bucket2
dateC,Amount13,Amount23
dateB,Amount12,Amount22
dateA,Amount11,Amount21
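To actually produce the .csv file, the resulting string can then be written to disk, for example with java.nio (the file name below is just a placeholder):

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// write the joined table built above (val string) to a CSV file
Files.write(Paths.get("buckets.csv"), string.getBytes(StandardCharsets.UTF_8))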