Spark with Kafka streaming: saving to Elasticsearch is slow - Scala

I have a list of key/value data where each value is basically a BSON document (think JSON), ranging from 5 KB to 20 KB in size. Each value can either stay in BSON object form or be converted to JSON directly:
Key, Value
--------
K1, JSON1
K1, JSON2
K2, JSON3
K2, JSON4
I expect groupByKey to produce:
K1, (JSON1, JSON2)
K2, (JSON3, JSON4)
so that when I do:
val data = [...].map(x => (x.Key, x.Value))
val groupedData = data.groupByKey()
groupedData.foreachRDD { rdd =>
//the elements in the rdd here are not really grouped by the Key
}
I am confused by the behaviour of the RDD. I have read many articles on the internet, including the official Spark programming guide: https://spark.apache.org/docs/0.9.1/scala-programming-guide.html
I still couldn't achieve what I want.
-------- UPDATED ---------------------
Basically I need the data grouped by the key; the key is the index to be used in Elasticsearch, so that I can perform batch processing per key via Elasticsearch for Hadoop:
EsSpark.saveToEs(rdd);
I can't do this per partition because EsSpark.saveToEs only accepts an RDD. I tried sc.makeRDD and sc.parallelize, but both tell me the data is not serializable.
I tried to use:
EsSpark.saveToEs(rdd, Map(
  "es.resource.write" -> "{TheKeyFromTheObjectAbove}",
  "es.batch.size.bytes" -> "5000000"))
Documentation of the config is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
But it is VERY slow compared to not using the configuration that defines a dynamic index based on the value of each individual document; I suspect the connector is parsing every JSON document to fetch the index value dynamically.
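One workaround worth sketching here, under assumptions not stated in the question (the values are already JSON strings, the number of distinct keys/indices is small, and each key is a valid index name), is to issue one fixed-index save per key instead of using the {field} pattern, so the connector never has to parse documents to pick the target index:

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark.rdd.EsSpark

// data: RDD[(String, String)] of (key, jsonDocument) pairs, as in the question
def saveGroupedByKey(data: RDD[(String, String)]): Unit = {
  data.cache()                               // reused once per key below
  val keys = data.keys.distinct().collect()  // assumes a small number of distinct keys

  keys.foreach { key =>
    val docsForKey: RDD[String] = data.filter(_._1 == key).values
    // Fixed index per key, so no per-document parsing is needed to resolve the index.
    // "<key>/doc" assumes an index/type resource string; adjust to your ES-Hadoop version.
    EsSpark.saveJsonToEs(docsForKey, s"$key/doc",
      Map("es.batch.size.bytes" -> "5000000"))
  }
  data.unpersist()
}

This trades one Spark job per key against the per-document parsing of the dynamic-resource write, which only pays off when there are few distinct keys.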

Here is the example.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test extends App {
  val session: SparkSession = SparkSession
    .builder.appName("Example")
    .config(new SparkConf().setMaster("local[*]"))
    .getOrCreate()
  val sc = session.sparkContext

  import session.implicits._

  case class Message(key: String, value: String)

  val input: Seq[Message] =
    Seq(Message("K1", "foo1"),
        Message("K1", "foo2"),
        Message("K2", "foo3"),
        Message("K2", "foo4"))

  val inputRdd: RDD[Message] = sc.parallelize(input)

  val intermediate: RDD[(String, String)] =
    inputRdd.map(x => (x.key, x.value))
  intermediate.toDF().show()
  // +---+----+
  // | _1|  _2|
  // +---+----+
  // | K1|foo1|
  // | K1|foo2|
  // | K2|foo3|
  // | K2|foo4|
  // +---+----+

  val output: RDD[(String, List[String])] =
    intermediate.groupByKey().map(x => (x._1, x._2.toList))
  output.toDF().show()
  // +---+------------+
  // | _1|          _2|
  // +---+------------+
  // | K1|[foo1, foo2]|
  // | K2|[foo3, foo4]|
  // +---+------------+

  output.foreachPartition(partition => if (partition.nonEmpty) {
    println(partition.toList)
  })
  // List((K1,List(foo1, foo2)))
  // List((K2,List(foo3, foo4)))
}

Related

Scala: how to pass a variable to a UDF and use it in withColumn

I have a variable of type Map[String, Set[String]]:
val metadata = Map(a -> Set(b ,c))
val colToUse = "existingcol" // Option[String]
I am trying to add a new column to my DataFrame using metadata and colToUse, which is an existing column in my DataFrame.
Each value of metadata is a Set of strings, and each key is just a string that appears as a value of a column in the df.
e.g.:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name" // student_name is a column name in df
"mike" will be a value of the "student_name" column.
I am trying to add a new column to the existing DF that holds the subjects of each student, based on student_name and metadata:
myDF.withColumn("subjects", metadata.getOrElse(col(colToUse), Set.empty))
The above will not work in Scala, since withColumn needs a Column and getOrElse cannot look up a Column key in a plain Scala Map.
I tried using a UDF:
def logic: (Map[String, Set[String]], String) => Set[String] =
  (metadata: Map[String, Set[String]], colToUse: String) => {
    metadata.getOrElse(colToUse, Set("a"))
  }
def myUDF = udf(logic)
def getVal: Column = { myUDF(metadata, col(colToUse.get)) }
and using it in withColumn:
myDF.withColumn("newCol", getVal(metadata, colToUse))
I get the error: Unsupported literal type class scala.Tuple2
What is the simplest way to approach this?
Issue 2: in getVal, a Column is expected for the metadata argument, but I am passing a Map.
Is something like this what you need?
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("mike"))),
  StructType(List(StructField("student_name", StringType)))
)
df.show()
df.show()
First test dataframe:
+------------+
|student_name|
+------------+
| mike|
+------------+
And now, create the udf that uses the map:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name"
def createUdf =
udf((key: String) => metadata.getOrElse(key, Set.empty))
and use it in the withColumn function:
df.withColumn("subjects", createUdf(col(colToUse))).show()
it gives:
+------------+--------------------+
|student_name| subjects|
+------------+--------------------+
| mike|[physics, chemistry]|
+------------+--------------------+
am I missing something?
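If the metadata map were large, a variation would be to broadcast it instead of capturing it directly in the UDF closure. A minimal sketch, assuming the same spark, df, metadata and colToUse as above (and returning a Seq rather than a Set, which maps cleanly to an array column):

import org.apache.spark.sql.functions.{col, udf}

// ship one read-only copy of the map to each executor
val metadataBc = spark.sparkContext.broadcast(metadata)

val subjectsUdf = udf((key: String) =>
  metadataBc.value.getOrElse(key, Set.empty[String]).toSeq)

df.withColumn("subjects", subjectsUdf(col(colToUse))).show()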

spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map(lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement: fields(4).toDate is not working. What is the alternative, or what is the correct usage?
What I have tried:
1. Replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working.
2. The multi-step approach below:
Step 1.
val refinedRDD = rawrdd.map(lines => {
  val fields = lines.split("\\|")
  (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6))
})
Now these tuple elements are all strings.
Step 2.
val mySchema = StructType(Array(
  StructField("empno", IntegerType, true),
  StructField("ename", StringType, true),
  StructField("designation", StringType, true),
  StructField("manager", IntegerType, true),
  StructField("hire_date", DateType, true),
  StructField("sal", DoubleType, true),
  StructField("deptno", IntegerType, true)))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives an error related to types. To solve this I changed Step 1 to:
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now it gives an error for the date column and I am back at the main problem.
Use case: use the textFile API and convert the data to a DataFrame using a custom schema (StructType) on top of it.
This can also be done with a case class, but there I would be stuck at the same point, needing a fields(4).toDate (I know I can cast the string to a date later in the code, but I'd prefer a solution to the problem above if possible).
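For the direct RDD route the question asks about, one alternative to the non-existent .toDate is java.sql.Date.valueOf, which parses exactly the yyyy-MM-dd form shown in the data. A minimal sketch (the header line would still need to be filtered out of rawrdd first):

import java.sql.Date

val refinedRDD = rawrdd.map { lines =>
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt,
    Date.valueOf(fields(4)),   // "2010-12-17" -> java.sql.Date
    fields(5).toFloat, fields(6).toInt)
}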
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
There are multiple ways to cast your data to proper data types.
First: use inferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
df.printSchema
Sometimes it doesn't work as expected; see details here.
Second: provide your own data type conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), (**"hire_date", "date"**) , ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
  val column = c.asInstanceOf[String]
  val typ = t.asInstanceOf[String]
  col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third: case class
Create the schema from a case class with ScalaReflection and provide this customized schema when loading the DataFrame.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
import java.sql.Date

case class MySchema(empno: Int, ename: String, hire_date: Date, sal: Double)

val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").load(path)
rawDF.printSchema
Hope this will help.

Aggregating a JSON array into Map<key, list> in Spark 2.x

I am quite new to Spark. I have an input JSON file which I am reading as
val df = spark.read.json("/Users/user/Desktop/resource.json");
Contents of resource.json looks like this:
{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}
Is there any way to process this DataFrame and aggregate the result as
Map<key, List<data>>
where data is each JSON object in which the key is present?
For example, the expected result is:
Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any reference/documents/link to proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key| path| region|
+----+-------+-------+
|key1| path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
//Please note that original json structure is gone. Use .toJSON to get json back and extract key from json and create RDD[(String, String)] RDD[(key, json)]
val rdd = df.toJSON.rdd.map(m => {
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
})
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
.agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
+----+--------------------------------------------------+
|key |value |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]] |
+----+--------------------------------------------------+
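If the end result really has to be a driver-side Scala Map of key to the original JSON strings, as the question sketches, one option is to rebuild the JSON per record with to_json and collect. A sketch assuming the same df as above and that the collected result fits in driver memory:

import org.apache.spark.sql.functions.{collect_list, struct, to_json}
import spark.implicits._

val grouped: Map[String, Seq[String]] =
  df.groupBy($"key")
    .agg(collect_list(to_json(struct($"path", $"key", $"region"))) as "value")
    .as[(String, Seq[String])]
    .collect()
    .toMap
// e.g. Map(key1 -> List({"path":"path1","key":"key1","region":"region1"}, ...), key2 -> ...)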

Extract columns from unordered data in scala spark

I am learning Scala Spark and want to know how we can extract the required columns from unordered data based on the column name. Details below.
Input Data: RDD[Array[String]]
id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA
Required Output:
Name,id
abc,1
def,2
ghi,3
Any help would be much appreciated. Thanks in advance!
If you have the input as an RDD of those lines (RDD[String]), you can get the desired data as follows.
Define a case class:
case class Data(Name: String, Id: Long)
Then parse each line into the case class:
val df = rdd.map(row => {
  // split the line and convert to a map so you can extract the data
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
})
Convert to a DataFrame and display:
df.toDF().show(false)
Output:
+----+---+
|Name|Id |
+----+---+
|abc |1 |
|def |2 |
|ghi |3 |
+----+---+
Here is the full code to read the file:
case class Data(Name: String, Id: Long)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()
  import spark.implicits._

  val rdd = spark.sparkContext.textFile("path to file ")
  val df = rdd.map(row => {
    val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
    Data(data("name"), data("id").toLong)
  })
  df.toDF().show(false)
}

How to pass dataset column value to a function while using spark filter with scala?

I have action data which consists of a user id and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+
I want to keep only the actions that belong to users who have at least one SEARCH action.
So I created a Bloom filter with the user ids that have a SEARCH action.
Then I tried to filter all actions based on whether the user id is in the Bloom filter:
val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))
But the code gives an exception
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how I can pass a column value to the BloomFilter.mightContainString method.
Create filter:
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
Use udf for filtering:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting.
You can do something like this,
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
At this point, I'll mention that collect is not a good idea. Next you can do something like this:
import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)
val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))
df.filter(filterUDF($"user_id"))
You can remove the broadcasting if the bloomFilter instance is serializable.
Hope this helps, Cheers.