RDD[Array[String]] to Dataframe - scala

I am new to Spark and Hive, and my goal is to load a delimited file (let's say a CSV) into a Hive table. After a bit of reading I found out that the path to load the data into Hive is CSV -> DataFrame -> Hive. (Please correct me if I am wrong.)
CSV:
1,Alex,70000,Columbus
2,Ryan,80000,New York
3,Johny,90000,Banglore
4,Cook, 65000,Glasgow
5,Starc, 70000,Aus
I read the CSV file using the command below:
val csv = sc.textFile("employee_data.txt").map(line => line.split(",").map(elem => elem.trim))
csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[29] at map at <console>:39
Now I am trying to convert this RDD to a DataFrame using the code below:
scala> val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
df: org.apache.spark.sql.DataFrame = [eid: string, name: string, salary: string, destination: string]
employee is a case class and I am using it as a schema definition.
case class employee(eid: String, name: String, salary: String, destination: String)
However, when I do df.show I get the error below:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task
0.3 in stage 10.0 (TID 22, user.hostname): scala.MatchError: [Ljava.lang.String;@88ba3cb (of class
[Ljava.lang.String;)
I was expecting a dataframe as the output. I know why I might be getting this error: the values in the RDD are stored in [Ljava.lang.String;@88ba3cb form and I need to use mkString to get the actual values, but I am not able to find out how to do that. I appreciate your time.

If you fix your case class then it should work:
scala> case class employee(eid: String, name: String, salary: String, destination: String)
defined class employee
scala> val txtRDD = sc.textFile("data.txt").map(line => line.split(",").map(_.trim))
txtRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:24
scala> txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3)}.toDF.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
|  1| Alex| 70000|   Columbus|
|  2| Ryan| 80000|   New York|
|  3|Johny| 90000|   Banglore|
|  4| Cook| 65000|    Glasgow|
|  5|Starc| 70000|        Aus|
+---+-----+------+-----------+
Otherwise you could convert the String to an Int:
scala> case class employee(eid: Int, name: String, salary: String, destination: String)
defined class employee
scala> val df = txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2, s3)}.toDF
df: org.apache.spark.sql.DataFrame = [eid: int, name: string ... 2 more fields]
scala> df.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
|  1| Alex| 70000|   Columbus|
|  2| Ryan| 80000|   New York|
|  3|Johny| 90000|   Banglore|
|  4| Cook| 65000|    Glasgow|
|  5|Starc| 70000|        Aus|
+---+-----+------+-----------+
However, the best solution would be to use spark-csv (which would treat the salary as an Int as well).
Also note that the error was only thrown when you ran df.show because everything was being lazily evaluated up to that point. df.show is an action, which causes all of the queued transformations to execute (see this article for more).
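To see that laziness in action, here is a tiny hypothetical snippet (not from the original question); the bad record only fails once an action forces evaluation:
// map is a transformation, so nothing is evaluated here, even though "oops" cannot be parsed
val lazyRdd = sc.parallelize(Seq("1", "oops", "3")).map(_.toInt)
// only an action (collect, count, show on a DataFrame, ...) triggers the work,
// and that is when the NumberFormatException for "oops" would surface
// lazyRdd.collect()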

Use map on array elements, not on array:
val csv = sc.textFile("employee_data.txt")
  .map(line => line
    .split(",")
    .map(e => e.trim)
  )
val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
But why are you reading the CSV and then converting the RDD to a DataFrame? Spark 1.5 can already read CSV via the spark-csv package:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")       // the sample file has no header row
  .option("inferSchema", "true")
  .option("delimiter", ",")        // the sample data is comma-delimited
  .load("employee_data.txt")

As you said in your comment, your case class employee (which, by convention, should be named Employee) takes an Int as the first argument of its constructor, but you are passing a String. Thus, you should either convert it to an Int before instantiating, or modify your case class to define eid as a String.
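For completeness, a hedged sketch of the two alternatives just described (the Employee names below are only illustrative):
// Alternative 1: keep eid as an Int and convert while instantiating
case class Employee(eid: Int, name: String, salary: String, destination: String)
val dfInt = csv.map { case Array(s0, s1, s2, s3) => Employee(s0.toInt, s1, s2, s3) }.toDF()
// Alternative 2: define eid as a String so no conversion is needed
case class EmployeeStr(eid: String, name: String, salary: String, destination: String)
val dfStr = csv.map { case Array(s0, s1, s2, s3) => EmployeeStr(s0, s1, s2, s3) }.toDF()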

Related

How to convert an RDD array string to a dataframe

Please help me convert the RDD of IP addresses below into a dataframe.
(Full Disclosure: I have little experience working with RDD)
RDD CREATION:
val SCND_RDD = FIRST_RDD
  .map(kv => kv._2)
  .flatMap(r => r.get("ip"))
  .map(o => o.asInstanceOf[scala.collection.mutable.Map[String, String]])
  .flatMap(ip => ip.get("address"))
SCND_RDD.take(3)
RESULTS:
SCND_RDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[33] at flatMap at <console>:38
res87: Array[String] = Array(5.42.212.99, 51.34.21.60, 63.99.831.7)
My RDD-to-DataFrame conversion attempt:
case class X(callId: String)
val userDF = SCND_RDD.map{case Array(s0)=>X(s0)}.toDF()
This is the error I get
defined class X
<console>:40: error: scrutinee is incompatible with pattern type;
found : Array[T]
required: String
val userDF = SCND_RDD.map{case Array(s0)=>X(s0)}.toDF()
I left a comment pointing to a duplicate question that might help you.
But here is my attempt as well.
val rdd = sc.parallelize(Array("test", "test2", "test3"))
rdd.take(3)
//res53: Array[String] = Array(test, test2, test3)
val df = rdd.toDF()
df.show
+-----+
|value|
+-----+
| test|
|test2|
|test3|
+-----+
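If you want a more meaningful column name than the default value, toDF also accepts explicit column names; a small sketch (the name address is just an assumption for the IP data):
// name the single column explicitly instead of the default "value"
val ipDF = SCND_RDD.toDF("address")
ipDF.printSchema()
// root
//  |-- address: string (nullable = true)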

Spark : converting Array[Byte] data to RDD or DataFrame

I have data in the form of Array[Byte] which I want to convert into a Spark RDD or DataFrame, so that I can write it directly into a Google bucket as a file. I am not able to write Array[Byte] data to a Google bucket directly, so I am looking for this conversion.
The code below is able to write data to the local FS, but not to a Google bucket:
val encrypted = encrypt(original, readPublicKey(pubKey), outFile, true, true)
val dfis = new FileOutputStream(outFile)
dfis.write(encrypted)
dfis.close()
def encrypt(clearData: Array[Byte], encKey: PGPPublicKey, fileName: String, withIntegrityCheck: Boolean, armor: Boolean): Array[Byte] = {
...
}
So how can I convert Array[Byte] data to RDD or DataFrame? I am using Scala.
Just use .toDF() or .toDF().rdd:
scala> val arr: Array[Byte] = Array(192.toByte, 168.toByte, 1.toByte, 4.toByte)
arr: Array[Byte] = Array(-64, -88, 1, 4)
scala> val df = arr.toSeq.toDF()
df: org.apache.spark.sql.DataFrame = [value: tinyint]
scala> df.show()
+-----+
|value|
+-----+
|  -64|
|  -88|
|    1|
|    4|
+-----+
scala> df.printSchema()
root
|-- value: byte (nullable = false)
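If the end goal is to land the converted data in a bucket, the resulting DataFrame can then be written out; a minimal sketch, assuming the GCS connector is configured and gs://my-bucket/encrypted is a hypothetical path (note that one byte per row does not preserve the original binary layout):
import spark.implicits._   // already in scope in spark-shell
// each byte becomes a single-column row
val bytesDF = encrypted.toSeq.toDF("value")
// write to the bucket; the format and path here are illustrative
bytesDF.write.mode("overwrite").parquet("gs://my-bucket/encrypted")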

Extract columns from unordered data in scala spark

I am learning Scala-Spark and want to know how we can extract the required columns from unordered data based on column name. Details below:
Input Data: RDD[Array[String]]
id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA
Required Output:
Name,id
abc,1
def,2
ghi,3
Any help would be much appreciated. Thanks in advance!
If you have an RDD of the raw lines shown above (an RDD[String], as textFile gives you), you can get the desired data as follows.
First, define a case class:
case class Data(Name: String, Id: Long)
Then parse each line into the case class:
val df = rdd.map(row => {
  // split the line and convert it to a map so you can extract the fields
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
})
Convert it to a DataFrame and display it:
df.toDF().show(false)
Output:
+----+---+
|Name|Id |
+----+---+
|abc |1  |
|def |2  |
|ghi |3  |
+----+---+
Here is the full code to read the file:
import org.apache.spark.sql.SparkSession

case class Data(Name: String, Id: Long)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()
  import spark.implicits._

  val rdd = spark.sparkContext.textFile("path to file")

  val df = rdd.map(row => {
    val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
    Data(data("name"), data("id").toLong)
  })

  df.toDF().show(false)
}

Scala Spark - empty map on DataFrame column for map(String, Int)

I am joining two DataFrames that have columns of type Map[String, Int].
I want the merged DF to have an empty map ([]) rather than null in the Map-type columns.
val df = dfmerged.select(
  col("id"),
  coalesce(col("map_1"), lit(null).cast(MapType(StringType, IntegerType))).alias("map_1"),
  coalesce(col("map_2"), lit(Map.empty[String, Int])).alias("map_2"))
For the map_1 column a null is inserted, but I'd like to have an empty map instead.
map_2 is giving me an error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.Map$EmptyMap$ Map()
I've also tried with a udf function like:
case class myStructMap(x:Map[String, Int])
val emptyMap = udf(() => myStructMap(Map.empty[String, Int]))
That also did not work.
When I try something like:
.select( coalesce(col("myMapCol"), lit(map())).alias("brand_viewed_count")...
or
.select(coalesce(col("myMapCol"), lit(map().cast(MapType(LongType, LongType)))).alias("brand_viewed_count")...
I get the error:
cannot resolve 'map()' due to data type mismatch: cannot cast
MapType(NullType,NullType,false) to MapType(LongType,IntType,true);
In Spark 2.2 or later you can use typedLit:
import org.apache.spark.sql.functions.{coalesce, map, typedLit}
val df = Seq((1L, null), (2L, Map("foobar" -> 42))).toDF("id", "map")
df.withColumn("map", coalesce($"map", typedLit(Map[String, Int]()))).show
// +---+-----------------+
// | id|              map|
// +---+-----------------+
// |  1|            Map()|
// |  2|Map(foobar -> 42)|
// +---+-----------------+
Before 2.2, you could build an empty map() and cast it to the right type:
df.withColumn("map", coalesce($"map", map().cast("map<string,int>"))).show
// +---+-----------------+
// | id|              map|
// +---+-----------------+
// |  1|            Map()|
// |  2|Map(foobar -> 42)|
// +---+-----------------+
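As a quick sanity check that the placeholder really is an empty map rather than null, size() can be applied after the coalesce (a hedged sketch reusing the df built above):
import org.apache.spark.sql.functions.size
df.withColumn("map", coalesce($"map", typedLit(Map[String, Int]())))
  .select($"id", size($"map").alias("n_entries"))
  .show()
// id 1 reports 0 entries; without the coalesce, size() on the null map
// would return -1 (the default behaviour in Spark 2.x)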

Read HBase in Scala - it.nerdammer

I would like to read HBase data in Spark Streaming code, for lookups and further enrichment of the streaming data. I am using spark-hbase-connector_2.10-1.0.3.jar.
In my code the following lines run successfully:
val docRdd =
  sc.hbaseTable[(Option[String], Option[String])]("hbase_customer_profile")
    .select("id", "gender")
    .inColumnFamily("data")
docRdd.count returns the right count.
docRdd is of type
HBaseReaderBuilder(org.apache.spark.SparkContext@3a49e5,hbase_customer_profile,Some(data),WrappedArray(id, gender),None,None,List())
How can I read all the rows in the id and gender columns? Also, how can I convert docRdd into a DataFrame so that Spark SQL can be used?
You can read all rows from the RDD using:
docRdd.collect().foreach(println)
To convert the RDD to a DataFrame you could define a case class:
case class Customer(rowKey: String, id: Option[String], gender: Option[String])
I have added the row key to the case class; that's not strictly necessary, so if you don't need it, you can omit it.
Then map over the RDD:
// Row key, id, gender
type Record = (String, Option[String], Option[String])

val rdd =
  sc.hbaseTable[Record]("customers")
    .select("id", "gender")
    .inColumnFamily("data")
    .map(r => Customer(r._1, r._2, r._3))
and then, based on the case class, convert the RDD to a DataFrame:
import sqlContext.implicits._
val df = rdd.toDF()
df.show()
df.printSchema()
The output from spark-shell looks like this:
scala> df.show()
+---------+----+------+
|   rowKey|  id|gender|
+---------+----+------+
|customer1|   1|  null|
|customer2|null|     f|
|customer3|   3|     m|
+---------+----+------+
scala> df.printSchema()
root
|-- rowKey: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
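To actually query it with Spark SQL, the DataFrame can be registered as a temporary table; a minimal sketch (the table name customers_df is just an assumption, and on Spark 2.x you would call createOrReplaceTempView instead):
df.registerTempTable("customers_df")
sqlContext.sql("SELECT id, gender FROM customers_df WHERE gender = 'f'").show()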