How to create DataFrame: The number of columns doesn't match - scala

I get the error:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (4): _1, _2, _3, _4
New column names (1): 'srcId', 'srcLabel', 'dstId', 'dstLabel'
in this code:
val columnNames = """'srcId', 'srcLabel', 'dstId', 'dstLabel'"""
import spark.sqlContext.implicits._
var df = Seq.empty[(String, String, String, String)]
.toDF(columnNames)

The problem with your approach is that columnNames is a single String, while you have defined a Seq of Tuple4 of strings, which produces four columns. So toDF receives one column name for four columns. You will have to split the columnNames string into four names and pass them to toDF.
The correct way is to do it as follows:
val columnNames = """'srcId', 'srcLabel', 'dstId', 'dstLabel'"""
var df = Seq.empty[(String, String, String, String)]
.toDF(columnNames.split(","): _*)
which should give you an empty DataFrame:
+-------+-----------+--------+-----------+
|'srcId'| 'srcLabel'| 'dstId'| 'dstLabel'|
+-------+-----------+--------+-----------+
+-------+-----------+--------+-----------+
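Note that splitting on "," keeps the single quotes and the leading spaces in the column names, as the output above shows. If you want plain column names, a minimal sketch (cleanNames and cleanDf are just illustrative names):
val cleanNames = columnNames.split(",").map(_.trim.stripPrefix("'").stripSuffix("'"))
val cleanDf = Seq.empty[(String, String, String, String)].toDF(cleanNames: _*)
// cleanDf.columns: Array(srcId, srcLabel, dstId, dstLabel)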
I hope the answer is helpful

scala> val columnNames = Seq("srcId", "srcLabel", "dstId", "dstLabel")
columnNames: Seq[String] = List(srcId, srcLabel, dstId, dstLabel)
scala> var d = Seq.empty[(String, String, String, String)].toDF(columnNames: _*)
d: org.apache.spark.sql.DataFrame = [srcId: string, srcLabel: string ... 2 more fields]

Related

Error while saving map value in Spark dataframe(scala) - Expected column, actual Map[int,string]

I have key-value pairs in a Map[int,string]. I need to save these values in a Hive table using a Spark DataFrame, but I am getting the error - Expected column, actual Map[int,string]
Code:
val dbValuePairs = Array(2019,10)
val dbkey = dbValuePairs.map(x => x).zipWithIndex.map(t => (t._2, t._1)).toMap
val dqMetrics = spark.sql("select * from dqMetricsStagingTbl")
.withColumn("Dataset_Name", lit(Dataset_Name))
.withColumn("Key", dbkey)
dqMetrics.createOrReplaceTempView("tempTable")
spark.sql("create table if not exists hivetable AS select * from tempTable")
dqMetrics.write.mode("append").insertInto("hivetable")
Please help! The error is in the withColumn("Key", dbkey) line.
Look at the signature of Spark's withColumn function:
def withColumn(colName: String, col: Column): DataFrame
it takes two arguments: colName as String and col as Column.
your dbkey is of type Map[Int, Int] (here it evaluates to Map(0 -> 2019, 1 -> 10)), which is not a Column:
val dbkey: Map[Int, Int] = dbValuePairs.map(x => x).zipWithIndex.map(t => (t._2, t._1)).toMap
if you want to store a Map in your table column you can use the map function, which takes a sequence of Columns:
// from object org.apache.spark.sql.functions
def map(cols: Column*): Column
so you can convert your dbkey to a Seq[Column] and pass it to the withColumn function:
val dbValuePairs = Array(2019, 10)
// dbkey: Map(0 -> 2019, 1 -> 10)
val dbkey: Map[Int, Int] = dbValuePairs.zipWithIndex.map(t => (t._2, t._1)).toMap
// flatten the map into alternating key, value literals for the map() function
val dbkeyColumnSeq: Seq[Column] = dbkey.flatMap(t => Seq(lit(t._1), lit(t._2))).toSeq
val dqMetrics = spark.sql("select * from dqMetricsStagingTbl")
  .withColumn("Dataset_Name", lit(""))          // substitute your real Dataset_Name value here
  .withColumn("Key", map(dbkeyColumnSeq: _*))

How to get the alias of a Spark Column as String?

If I declare a Column in a val, like this:
import org.apache.spark.sql.functions._
val col: org.apache.spark.sql.Column = count("*").as("col_name")
col is of type org.apache.spark.sql.Column. Is there a way to access its name ("col_name")?
Something like:
col.getName() // returns "col_name"
In this case, col.toString returns "count(1) AS col_name"
Try the code below.
scala> val cl = count("*").as("col_name")
cl: org.apache.spark.sql.Column = count(1) AS `col_name`
scala> cl.expr.argString
res14: String = col_name
scala> cl.expr.productElement(1).asInstanceOf[String]
res24: String = col_name
scala> val cl = count("*").cast("string").as("column_name")
cl: org.apache.spark.sql.Column = CAST(count(1) AS STRING) AS `column_name`
scala> cl.expr.argString
res113: String = column_name
In the code above, if you swap the order of .as and .cast (i.e. call .cast last), expr.argString will give you the wrong result.
You can also use json4s to extract name from expr.toJSON
scala> import org.json4s._
import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.JsonMethods._
scala> implicit val formats = DefaultFormats
formats: org.json4s.DefaultFormats.type = org.json4s.DefaultFormats$@16cccda5
scala> val cl = count("*").as("column_name").cast("string") // Used cast last.
cl: org.apache.spark.sql.Column = CAST(count(1) AS `column_name` AS STRING)
scala> (parse(cl.expr.toJSON) \\ "name").extract[String]
res104: String = column_name
Another easy way: in the column's toString output the alias is always wrapped in backtick (`) characters, so you can use either a regex or split the string and take the element at index 1.
with split,
col.toString.split("`")(1)
with regex,
val pattern = "`(.*)`".r
pattern.findFirstMatchIn(col.toString).get.group(1)
The advantage of doing it like this is that it still works even if you include something like .cast("string") on your column.
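One more option, if you do not mind relying on Catalyst internals (which can change between Spark versions), is to pattern match on the expression tree. This is only a sketch, and it assumes the alias is the outermost node of the expression:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.count
import org.apache.spark.sql.catalyst.expressions.Alias

def aliasName(c: Column): Option[String] = c.expr match {
  case a: Alias => Some(a.name)   // only matches when .as(...) is the last call
  case _        => None
}

aliasName(count("*").as("col_name"))                 // Some(col_name)
aliasName(count("*").as("col_name").cast("string"))  // None: Cast is the outer node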

How to use the functions.explode to flatten element in dataFrame

I've made this piece of code:
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
object TestSparkDataFrame extends App{
System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory","4g").setMaster("local[*]")
val session = SparkSession.builder().config(conf).getOrCreate()
import session.implicits._
def createAndPrintSchemaRawPanda(session:SparkSession):DataFrame = {
val newPanda = RawPanda(1,"M1B 5K7", "giant", true, Array(0.1, 0.1))
val pandaPlace = PandaPlace("torronto", Array(newPanda))
val df = session.createDataFrame(Seq(pandaPlace))
df
}
val df2 = createAndPrintSchemaRawPanda(session)
df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo = df2.explode(df2("pandas")) {
case Row(pandas: Seq[Row]) =>
pandas.map{
case (Row(
id: Long,
zip: String,
pt: String,
happy: Boolean,
attrs: Seq[Double])) => RawPanda(id, zip, pt , happy, attrs.toArray)
}
}
pandaInfo.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode function, as I used it, is deprecated, so I would like to recompute the pandaInfo dataframe using the method advised in the warning:
use flatMap() or select() with functions.explode() instead
But then when I do:
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed to use flatMap or functions.explode.
How could I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo)?
I've seen this post and this other one but none of them helped me.
Calling select with the explode function returns a DataFrame where the Array pandas is "broken up" into individual records; then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated "route":
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
.select($"name", $"pandas",
$"pandas.id" as "id",
$"pandas.zip" as "zip",
$"pandas.pt" as "pt",
$"pandas.happy" as "happy",
$"pandas.attributes" as "attributes"
)
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)
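If you prefer the flatMap route that the deprecation warning also mentions, here is a minimal typed sketch. FlatPanda is a hypothetical case class introduced only to hold the flattened shape; define it at the top level next to RawPanda and PandaPlace so its encoder can be derived:
// hypothetical case class for the flattened records
case class FlatPanda(name: String, id: Long, zip: String, pt: String,
                     happy: Boolean, attributes: Array[Double])

// session.implicits._ is already imported above
val pandaInfo4 = df2.as[PandaPlace].flatMap { place =>
  place.pandas.toSeq.map(p => FlatPanda(place.name, p.id, p.zip, p.pt, p.happy, p.attributes))
}
pandaInfo4.show()
Note that, unlike pandaInfo2 above, this drops the original pandas column.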

Array[Byte] Spark RDD to String Spark RDD

I'm using Cloudera's SparkOnHBase module in order to get data from HBase.
I get a RDD in this way:
var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)
Based on that, what I get is an object of type
RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]
which corresponds to row key and a list of values. All of them represented by a byte array.
If I save getRdd to a file, what I see is:
([B@f7e2590,[([B@22d418e2,[B@12adaf4b,[B@48cf6e81), ([B@2a5ffc7f,[B@3ba0b95,[B@2b4e651c), ([B@27d0277a,[B@52cfcf01,[B@491f7520), ([B@3042ad61,[B@6984d407,[B@f7c4db0), ([B@29d065c1,[B@30c87759,[B@39138d14), ([B@32933952,[B@5f98506e,[B@8c896ca), ([B@2923ac47,[B@65037e6a,[B@486094f5), ([B@3cd385f2,[B@62fef210,[B@4fc62b36), ([B@5b3f0f24,[B@8fb3349,[B@23e4023a), ([B@4e4e403e,[B@735bce9b,[B@10595d48), ([B@5afb2a5a,[B@1f99a960,[B@213eedd5), ([B@2a704c00,[B@328da9c4,[B@72849cc9), ([B@60518adb,[B@9736144,[B@75f6bc34)])
for each record (rowKey and the columns)
But what I need is the String representation of each of the keys and values, or at least the values, in order to save it to a file and see something like
key1,(value1,value2...)
or something like
key1,value1,value2...
I'm completely new to Spark and Scala and it's proving quite hard to get anywhere.
Could you please help me with that?
First let's create some sample data:
scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
This is how the data is:
scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))
Convert it to Array[Byte] format
scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }
arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))
Create an RDD out of this data
scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25
Create a conversion function from Array[Byte] to String:
scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
Perform our final conversion:
scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29
scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
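From here, to get to the key1,value1,value2... layout asked for in the question, you can format each record as one comma-separated line and save it; a minimal sketch (the output path is only an example):
val lines = rdd2.map { case (k, vs) =>
  (k +: vs.flatMap { case (a, b, c) => Seq(a, b, c) }).mkString(",")
}
lines.saveAsTextFile("/tmp/feedback_as_strings")   // example path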
I don't know about HBase but if those Array[Byte]s are Unicode strings, something like this should work:
val rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*

val stringRdd = rdd.map { case (k, l) =>
  (new String(k),                        // decode the row key
   l.map { case (a, b, c) =>             // decode each column triple
     (new String(a), new String(b), new String(c))
   })
}
I am not completely sure it will work, though.

Make .csv files from arrays in Scala

I have two tuples in Scala of the following form:
val array1 = (bucket1, Seq((dateA, Amount11), (dateB, Amount12), (dateC, Amount13)))
val array2 = (bucket2, Seq((dateA, Amount21), (dateB, Amount22), (dateC, Amount23)))
What is the quickest way to make a .csv file in Scala such that:
date* is the pivot.
bucket* is the column name.
Amount* fills the table.
It needs to look something like this:
Dates______________bucket1__________bucket2
dateA______________Amount11________Amount21
dateB______________Amount12________Amount22
dateC______________Amount13________Amount23
You can make it shorter by chaining some operations, but:
scala> val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
array1: (String, Seq[(String, String)]) =
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)))
scala> val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
array2: (String, Seq[(String, String)]) =
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
// Single array to work with
scala> val arrays = List(array1, array2)
arrays: List[(String, Seq[(String, String)])] = List(
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13))),
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
)
// Split between buckets and the values
scala> val (buckets, values) = arrays.unzip
buckets: List[String] = List(bucket1, bucket2)
values: List[Seq[(String, String)]] = List(
List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)),
List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23))
)
// Format the data
// Note that this does not keep the 'dateX' order
scala> val grouped = values.flatten
.groupBy(_._1)
.map { case (date, list) => date::(list.map(_._2)) }
grouped: scala.collection.immutable.Iterable[List[String]] = List(
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join everything, and add the "Dates" column in front of the buckets
scala> val table = ("Dates"::buckets)::grouped.toList
table: List[List[String]] = List(
List(Dates, bucket1, bucket2),
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join the rows by ',' and the lines by "\n"
scala> val string = table.map(_.mkString(",")).mkString("\n")
string: String =
Dates,bucket1,bucket2
dateC,Amount13,Amount23
dateB,Amount12,Amount22
dateA,Amount11,Amount21
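As the comment above notes, groupBy does not keep the dateX order; you can restore it by sorting the data rows on their first element, and then write the text to an actual .csv file with plain java.nio. A minimal sketch (the file name is just an illustration):
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// keep the header row first, sort the data rows by their date key
val sortedTable = table.head :: table.tail.sortBy(_.head)
val csv = sortedTable.map(_.mkString(",")).mkString("\n")

Files.write(Paths.get("buckets.csv"), csv.getBytes(StandardCharsets.UTF_8))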