Extract columns from unordered data in Scala Spark

I am learning Scala with Spark and want to know how we can extract the required columns from unordered data based on column name. Details below:
Input Data: RDD[Array[String]]
id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA
Required Output:
Name,id
abc,1
def,2
ghi,3
Any help would be much appreciated. Thanks in advance!

If each record is a single comma-separated string (i.e. you effectively have an RDD[String]), you can get the desired data as follows.
Define a case class:
case class Data(Name: String, Id: Long)
Then parse each line into the case class:
val df = rdd.map { row =>
  // split the line into key=value pairs and convert them to a Map so fields can be looked up by name
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
}
Convert it to a DataFrame and display it:
df.toDF().show(false)
Output:
+----+---+
|Name|Id |
+----+---+
|abc |1 |
|def |2 |
|ghi |3 |
+----+---+
Here is the full code to read the file:
import org.apache.spark.sql.SparkSession

case class Data(Name: String, Id: Long)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()
  import spark.implicits._

  val rdd = spark.sparkContext.textFile("path to file")
  val df = rdd.map(row => {
    val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
    Data(data("name"), data("id").toLong)
  })
  df.toDF().show(false)
}
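If the input really is an RDD[Array[String]] with one key=value token per array element (as the question's type suggests), the same approach works without the line split. A minimal sketch under that assumption, reusing the Data case class and spark.implicits._ from above (rddOfArrays is a placeholder name):
// each record is assumed to already be split into "key=value" tokens, e.g. Array("id=1", "country=USA", "age=20", "name=abc")
val parsed = rddOfArrays.map { tokens =>
  val data = tokens.map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
}
parsed.toDF().show(false)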

Related

Aggregating a JSON array into Map<key, list> in Spark 2.x

I am quite new to Spark. I have an input JSON file which I am reading as
val df = spark.read.json("/Users/user/Desktop/resource.json");
Contents of resource.json looks like this:
{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}
Is there any way we can process this DataFrame and aggregate the result as
Map<key, List<data>>
where data is each JSON object in which the key is present?
For example, the expected result is:
Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any reference/documents/link to proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key| path| region|
+----+-------+-------+
|key1| path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
// Note that the original JSON structure is gone. Use .toJSON to get the JSON back, extract the key from each JSON string, and build an RDD[(String, String)], i.e. RDD[(key, json)]
val rdd = df.toJSON.rdd.map { m =>
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
}
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
.agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
+----+--------------------------------------------------+
|key |value |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]] |
+----+--------------------------------------------------+
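If you then need the result on the driver as a Scala Map (closer to the Map<key, List<data>> shape asked for), one hedged option is to rebuild the JSON strings with to_json before aggregating and then collect. A minimal sketch, assuming Spark 2.1+ (where to_json is available):
import org.apache.spark.sql.functions._
import spark.implicits._

val grouped: Map[String, Seq[String]] = df
  .groupBy($"key")
  .agg(collect_list(to_json(struct($"path", $"key", $"region"))) as "value")
  .collect()
  .map(row => row.getString(0) -> row.getSeq[String](1))
  .toMap
Note that collect() pulls everything to the driver, so this only makes sense when the aggregated result is small.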

Spark with Kafka Streaming saving to Elasticsearch: slow performance

I have a list of data; the value is basically a BSON document (think JSON), each ranging from 5k to 20k in size. It can either be in BSON object format or be converted to JSON directly:
Key, Value
--------
K1, JSON1
K1, JSON2
K2, JSON3
K2, JSON4
I expect the groupByKey would produce:
K1, (JSON1, JSON2)
K2, (JSON3, JSON4)
so that when I do:
val data = [...].map(x => (x.Key, x.Value))
val groupedData = data.groupByKey()
groupedData.foreachRDD { rdd =>
//the elements in the rdd here are not really grouped by the Key
}
I am very confused by the behaviour of the RDD. I read many articles on the internet, including the official Spark programming guide: https://spark.apache.org/docs/0.9.1/scala-programming-guide.html
I still couldn't achieve what I want.
-------- UPDATED ---------------------
Basically I really need the data grouped by the key; the key is the index to be used in Elasticsearch, so that I can perform batch processing per key via Elasticsearch for Hadoop:
EsSpark.saveToEs(rdd);
I can't do it per partition because Elasticsearch only accepts an RDD. I tried to use sc.makeRDD and sc.parallelize; both tell me the data is not serializable.
I tried to use:
EsSpark.saveToEs(rdd, Map(
  "es.resource.write" -> "{TheKeyFromTheObjectAbove}",
  "es.batch.size.bytes" -> "5000000"))
Documentation of the config is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
But it is VERY slow compared to not using the configuration that defines a dynamic index based on the value of each individual document; I suspect it is parsing every JSON document to fetch the value dynamically.
Here is an example.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object Test extends App {

  val session: SparkSession = SparkSession
    .builder.appName("Example")
    .config(new SparkConf().setMaster("local[*]"))
    .getOrCreate()
  val sc = session.sparkContext

  import session.implicits._

  case class Message(key: String, value: String)

  val input: Seq[Message] =
    Seq(Message("K1", "foo1"),
        Message("K1", "foo2"),
        Message("K2", "foo3"),
        Message("K2", "foo4"))

  val inputRdd: RDD[Message] = sc.parallelize(input)

  val intermediate: RDD[(String, String)] =
    inputRdd.map(x => (x.key, x.value))
  intermediate.toDF().show()
  // +---+----+
  // | _1|  _2|
  // +---+----+
  // | K1|foo1|
  // | K1|foo2|
  // | K2|foo3|
  // | K2|foo4|
  // +---+----+

  val output: RDD[(String, List[String])] =
    intermediate.groupByKey().map(x => (x._1, x._2.toList))
  output.toDF().show()
  // +---+------------+
  // | _1|          _2|
  // +---+------------+
  // | K1|[foo1, foo2]|
  // | K2|[foo3, foo4]|
  // +---+------------+

  // each partition now holds already-grouped (key, values) pairs
  output.foreachPartition(partition => if (partition.nonEmpty) {
    println(partition.toList)
  })
  // List((K1,List(foo1, foo2)))
  // List((K2,List(foo3, foo4)))
}
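Building on the example above, if each key corresponds to its own Elasticsearch index and the values are the JSON documents from the original question, one hedged alternative to the dynamic es.resource.write pattern is to issue one bulk write per key. This is only a sketch, not a tested setup: saveJsonToEs comes from the elasticsearch-hadoop connector, and the "index/type" naming here is hypothetical.
import org.elasticsearch.spark.rdd.EsSpark

// assumption: `output` is the RDD[(String, List[String])] built above and
// each value is a JSON document string, with the key naming the target index
val keys = output.keys.distinct().collect()
keys.foreach { k =>
  val docsForKey = output.filter(_._1 == k).flatMap(_._2)  // JSON strings for this key only
  EsSpark.saveJsonToEs(docsForKey, s"$k/doc")              // one bulk write per index
}
This does one filter pass over the data per distinct key, so it is only sensible when the number of keys (indices) is small; otherwise the dynamic es.resource.write pattern from the question remains the more scalable route.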

How to convert a date to a Unix timestamp in milliseconds [duplicate]

I am using Spark 2.1 with Scala.
How to convert a string column with milliseconds to a timestamp with milliseconds?
I tried the following code from the question Better way to convert a string field into timestamp in Spark
import org.apache.spark.sql.functions.unix_timestamp
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS").cast("timestamp")
tdf.withColumn("ts", tts).show(2, false)
But I get the result without milliseconds:
+---+-----------------------+---------------------+
|id |dts |ts |
+---+-----------------------+---------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.0|
|2 |#$#### |null |
+---+-----------------------+---------------------+
A UDF with SimpleDateFormat works. The idea is taken from Ram Ghadiyaram's link to the UDF logic.
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import scala.util.{Try, Success, Failure}
val getTimestamp: (String => Option[Timestamp]) = s => s match {
  case "" => None
  case _ =>
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss.SSS")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
}
val getTimestampUDF = udf(getTimestamp)
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = getTimestampUDF($"dts")
tdf.withColumn("ts", tts).show(2, false)
with output:
+---+-----------------------+-----------------------+
|id |dts |ts |
+---+-----------------------+-----------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.601|
|2 |#$#### |null |
+---+-----------------------+-----------------------+
There is an easier way than writing a UDF: just parse out the millisecond part and add it to the unix timestamp (the following code is PySpark but should be very close to the Scala equivalent):
timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
df = df.withColumn('ux_t', unix_timestamp(df.t, format=timeFmt) + substring(df.t, -3, 3).cast('float')/1000)
Result:
'2017/03/05 14:02:41.865' is converted to 1488722561.865
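For completeness, a rough Scala equivalent of that PySpark snippet (a sketch, assuming the string column is named t and uses the yyyy/MM/dd HH:mm:ss.SSS format shown above):
import org.apache.spark.sql.functions.{col, substring, unix_timestamp}

val timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
// unix_timestamp yields whole seconds; the last three characters of the string carry the milliseconds
val withUxT = df.withColumn(
  "ux_t",
  unix_timestamp(col("t"), timeFmt) + substring(col("t"), -3, 3).cast("float") / 1000
)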
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

dataFrame.withColumn(
    "time_stamp",
    dataFrame.col("milliseconds_in_string")
        .cast(DataTypes.LongType)
        .divide(1000)                      // casting a number to timestamp treats it as seconds since the epoch,
        .cast(DataTypes.TimestampType)     // so convert the epoch milliseconds to seconds first
);
The code is in Java and is easy to convert to Scala.

RDD[Array[String]] to DataFrame

I am new to Spark and Hive, and my goal is to load a delimited file (let's say CSV) into a Hive table. After a bit of reading I found out that the path to load the data into Hive is CSV -> DataFrame -> Hive. (Please correct me if I am wrong.)
CSV:
1,Alex,70000,Columbus
2,Ryan,80000,New York
3,Johny,90000,Banglore
4,Cook, 65000,Glasgow
5,Starc, 70000,Aus
I read the CSV file using the below command:
val csv =sc.textFile("employee_data.txt").map(line => line.split(",").map(elem => elem.trim))
csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[29] at map at <console>:39
Now I am trying to convert this RDD to Dataframe and using below code:
scala> val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
df: org.apache.spark.sql.DataFrame = [eid: string, name: string, salary: string, destination: string]
employee is a case class and I am using it as a schema definition.
case class employee(eid: String, name: String, salary: String, destination: String)
However, when I do df.show I get the below error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task
0.3 in stage 10.0 (TID 22, user.hostname): scala.MatchError: [Ljava.lang.String;@88ba3cb (of class
[Ljava.lang.String;)
I was expecting a DataFrame as the output. I think I am getting this error because the values in the RDD are stored in [Ljava.lang.String;@88ba3cb format and I need to use mkString to get the actual values, but I am not able to find out how to do that. I appreciate your time.
If you fix your case class then it should work:
scala> case class employee(eid: String, name: String, salary: String, destination: String)
defined class employee
scala> val txtRDD = sc.textFile("data.txt").map(line => line.split(",").map(_.trim))
txtRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:24
scala> txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3)}.toDF.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
Otherwise you could convert the String to an Int:
scala> case class employee(eid: Int, name: String, salary: String, destination: String)
defined class employee
scala> val df = txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2, s3)}.toDF
df: org.apache.spark.sql.DataFrame = [eid: int, name: string ... 2 more fields]
scala> df.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
However the best solution would be to use spark-csv (which would treat the salary as an Int as well).
Also note that the error was thrown when you ran df.show because everything was being lazily evaluated up until that point. df.show is an action which will cause all of the queued transformations to execute (see this article for more).
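To make the lazy-evaluation point concrete, here is a toy sketch (the malformed two-field row is made up purely for illustration):
// the pattern expects four fields, but this row only has two
val bad = sc.parallelize(Seq("1,Alex"))
  .map(_.split(",") match { case Array(s0, s1, s2, s3) => (s0, s1, s2, s3) })
// nothing fails yet: map is a lazy transformation
// bad.collect()   // uncommenting this action would run the map and throw the scala.MatchError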
Use map on the array elements, not on the array:
val csv = sc.textFile("employee_data.txt")
  .map(line => line
    .split(",")
    .map(e => e.trim)
  )
val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
But why are you reading the CSV and then converting the RDD to a DataFrame by hand? Spark 1.5 can already read CSV via the spark-csv package (for the file shown there is no header and the delimiter is a comma, so the options are set accordingly):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("employee_data.txt")
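Since the stated goal is CSV -> DataFrame -> Hive, the final step could look like the following minimal sketch (assuming Hive support is enabled on the context; the table name employee is hypothetical):
// assumes a Hive-enabled context (HiveContext in Spark 1.x); "employee" is a hypothetical table name
df.write.mode("overwrite").saveAsTable("employee")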
As you said in your comment, your case class employee (which, by convention, should be named Employee) receives an Int as the first argument of its constructor, but you are passing a String. Thus, you should either convert it to an Int before instantiating the class, or modify your case class to define eid as a String.