I am getting millions of messages from a Kafka stream in Spark Streaming. There are 15 different types of message, all coming from a single topic, and I can only differentiate a message by its content, so I am using the rdd.contains method to get the different types of RDD.
Sample messages:
{"a":"foo", "b":"bar","type":"first" .......}
{"a":"foo1", "b":"bar1","type":"second" .......}
{"a":"foo2", "b":"bar2","type":"third" .......}
{"a":"foo", "b":"bar","type":"first" .......}
... and so on
Code:
DStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rdd_first = rdd.filter { ele =>
      ele.contains("First")
    }
    if (!rdd_first.isEmpty()) {
      insertIntoTableFirst(hivecontext.read.json(rdd_first))
    }
    val rdd_second = rdd.filter { ele =>
      ele.contains("Second")
    }
    if (!rdd_second.isEmpty()) {
      insertIntoTableSecond(hivecontext.read.json(rdd_second))
    }
    // ... the same way for the 15 different RDDs
  }
}
Is there any way to get different RDDs from the Kafka topic messages?
There's no rdd.contains. The function contains used here is applied to the Strings in the RDD.
Like here:
val rdd_first = rdd.filter { element =>
  element.contains("First") // each `element` is a String
}
This method is not robust, because other content in the String might satisfy the comparison and produce wrong matches. For example:
{"a":"foo", "b":"bar","type":"second", "c": "first", .......}
One way to deal with this would be to first transform the JSON data into proper records, and then apply grouping or filtering logic on those records. For that, we first need a schema definition of the data. With the schema, we can parse the JSON into records and apply any processing on top of that:
case class Record(a: String, b: String, `type`: String)

import org.apache.spark.sql.types._

val schema = StructType(
  Array(
    StructField("a", StringType, true),
    StructField("b", StringType, true),
    StructField("type", StringType, true)
  )
)
// needs: import org.apache.spark.sql.functions.from_json and import spark.implicits._
val processPerType: Map[String, Dataset[Record] => Unit] = Map(...)

stream.foreachRDD { rdd =>
  val records = rdd.toDF("value")
    .select(from_json($"value", schema) as "record") // parse each JSON string
    .select("record.*")                              // flatten the struct into columns
    .as[Record]
  processPerType.foreach { case (tpe, process) =>
    val target = records.filter(entry => entry.`type` == tpe)
    process(target)
  }
}
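As an aside, for a flat case class like Record the hand-written schema above can also be derived from the case class itself; a small sketch using the Encoders API, which should be equivalent for this simple structure:
import org.apache.spark.sql.Encoders

// derive the StructType directly from the case class definition
val derivedSchema = Encoders.product[Record].schema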
The question does not specify what kind of logic needs to be applied to each type of record. What's presented here is a generic way of approaching the problem where any custom logic can be expressed as a function Dataset[Record] => Unit.
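For illustration only, the processPerType map could be filled with one writer per message type, in the spirit of the original insertIntoTable* calls (the table names below are hypothetical):
// hypothetical writers, one per message type
val processPerType: Map[String, Dataset[Record] => Unit] = Map(
  "first"  -> ((ds: Dataset[Record]) => ds.write.mode("append").saveAsTable("table_first")),
  "second" -> ((ds: Dataset[Record]) => ds.write.mode("append").saveAsTable("table_second"))
  // ... one entry for each of the 15 types
)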
If the logic could be expressed as an aggregation, the built-in Dataset aggregation functions would probably be more appropriate.
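For example, a per-type count over the parsed records could look like this (a minimal sketch reusing the records Dataset from the snippet above):
// count how many records of each type arrived in this micro-batch
val countsPerType = records.groupBy("type").count()
countsPerType.show()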
Related
I'm importing a collection from MongoDB to Spark. All the documents have a field 'data', which in turn is a structure and has a field 'configurationName' (which is always null).
val partitionDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "db")
  .option("collection", collectionName)
  .load()
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
When I try to save the dataframe as Parquet
partitionDF.write.mode("overwrite").parquet(collectionName + ".parquet")
I get the following error:
AnalysisException: Parquet data source does not support struct<configurationName:null, ...
It looks like the problem is that I have that NullType buried in the data column's type. I'm looking at How to handle null values when writing to parquet from Spark, but it only shows how to solve this NullType problem for top-level columns.
But how do you solve this when a NullType is not at the top level? The only idea I have so far is to flatten the dataframe completely (exploding arrays and so on) so that all the NullTypes pop up at the top level. But then I would lose the original structure of the data, which I don't want to lose.
Is there a better solution?
@Roman Puchkovskiy: here is your function rewritten using pattern matching.
def deNullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, fixNullType(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def fixNullType(dt: DataType): DataType = {
  dt match {
    case _: StructType => deNullifyStruct(dt.asInstanceOf[StructType])
    case _: ArrayType =>
      val array = dt.asInstanceOf[ArrayType]
      ArrayType(fixNullType(array.elementType), array.containsNull)
    case _: NullType => StringType
    case _ => dt
  }
}
Building on How to handle null values when writing to parquet from Spark and How to pass schema to create a new Dataframe from existing Dataframe? (the second was suggested by @pasha701, thanks!), I constructed this:
def denullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def denullify(dt: DataType): DataType = {
  dt match {
    case struct: StructType => denullifyStruct(struct)
    case array: ArrayType => ArrayType(denullify(array.elementType), array.containsNull)
    case _: NullType => StringType
    case _ => dt
  }
}
which effectively replaces all NullType instances with StringType ones.
And then
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
fixedDF.printSchema
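With the patched schema applied, the original Parquet write should now go through, since the nested NullType fields have become (all-null) string columns:
// same write as before, now against the NullType-free schema
fixedDF.write.mode("overwrite").parquet(collectionName + ".parquet")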
I have a text file that I read from and parse to create a dataframe. However, the columns amount and code should be IntegerTypes. Here's what I have:
def getSchema: StructType = {
  StructType(Seq(
    StructField("carrier", StringType, false),
    StructField("amount", StringType, false),
    StructField("currency", StringType, false),
    StructField("country", StringType, false),
    StructField("code", StringType, false)
  ))
}
def getRow(x: String): Row = {
  val columnArray = new Array[String](5)
  columnArray(0) = x.substring(40, 43)
  columnArray(1) = x.substring(43, 46)
  columnArray(2) = x.substring(46, 51)
  columnArray(3) = x.substring(51, 56)
  columnArray(4) = x.substring(56, 64)
  Row.fromSeq(columnArray)
}
Because I use an Array[String], the columns can only be StringType, not a mix of String and Integer. To explain my problem in detail, here's what happens:
First I create an empty dataframe:
var df = spark.sqlContext.createDataFrame(spark.sparkContext.emptyRDD[Row], getSchema)
Then I have a for loop that goes through each file in all the directories. Note: I need to validate every file, so I cannot read them all at once.
for (each file to parse):
  df2 = spark.sqlContext.createDataFrame(spark.sparkContext.textFile(inputPath)
    .map(x => getRow(x)), schema)
  df = df.union(df2)
I now have a complete dataframe of all the files. However, the columns amount and code are still StringType. How can I make them IntegerType?
Please note: I cannot cast the columns during the for loop because it takes a lot of time, and I'd like to keep my current structure as similar as possible. At the end of the for loop I could cast the columns to IntegerType; however, what if a column contains a value that is not an integer? I'd like for the columns to not be NULL.
Is there a way to make the two specified columns IntegerType without a lot of changes to the code?
What about using datasets?
First create a case class modelling your data:
case class MyObject(
  carrier: String,
  amount: Double,
  currency: String,
  country: String,
  code: Int)
Create another case class wrapping the first one with additional info (potential errors, source file):
case class MyObjectWrapper(
  myObject: Option[MyObject],
  someError: Option[String],
  source: String
)
Then create a parser, transforming a line from your file into a MyObject:
import scala.util.{Failure, Success, Try}

object Parser {
  def parse(line: String, file: String): MyObjectWrapper = {
    Try {
      MyObject(
        carrier = line.substring(40, 43),
        amount = line.substring(43, 46).toDouble,
        currency = line.substring(46, 51),
        country = line.substring(51, 56),
        code = line.substring(56, 64).toInt)
    } match {
      case Success(objectParsed) => MyObjectWrapper(Some(objectParsed), None, file)
      case Failure(error) => MyObjectWrapper(None, Some(error.getLocalizedMessage), file)
    }
  }
}
Finally, parse your files:
import org.apache.spark.sql.functions._
val ds = files
  .filter( {METHOD TO SELECT CORRECT FILES} )
  .map( { GET INPUT PATH FROM FILES} )
  .map(path => spark.read.textFile(path).map(Parser.parse(_, path)))
  .reduce(_.union(_))
This should give you a Dataset[MyObjectWrapper] with the types and APIs you wish.
Afterwards you can take those you could parse:
ds.filter(_.someError == None)
Or take those you failed to parse (for investigation):
ds.filter(_.someError != None)
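If you then want a strongly typed Dataset of just the successfully parsed rows (with amount and code already numeric), one option is to unwrap the Option, for example (assuming spark.implicits._ is in scope, as the snippets above already require):
// drop the wrapper and keep only the rows that parsed successfully
val parsed = ds.flatMap(_.myObject.toSeq) // Dataset[MyObject]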
I want to traverse the stream of data, run a query on it, and write the results to ElasticSearch. I tried to use the mapPartitions method to create the connection to the database once per partition; however, I get the following error, which indicates that the partition returns None to the RDD (I guess some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into the RDD and send it to ElasticSearch without any trouble?
Also, I had a variant of the solution using flatMap in foreachRDD; however, it creates a connection to the database for each RDD, which is inefficient.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { part => {
    val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
    part.map(
      data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        val recommendationsMap = convertDataToMap(recommendations, calendarTime)
        recommendationsMap
      })
    }
  }
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into an Array, although it holds multiple rows of my records. That is why ElasticSearch was not able to map this collection of records to the defined single-record schema.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { partition => {
    val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
    val result = partition.map( data => {
      val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
      val calendarTime = Calendar.getInstance.getTime
      convertDataToMap(recommendations, calendarTime)
    }).toList.flatten
    result.iterator
    }
  }.saveToEs("rdd-timed/output")
})
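A slight variant of the same idea (an untested sketch, assuming convertDataToMap returns a collection of records, as the toList.flatten above suggests) is to flatMap the partition iterator directly, which avoids materializing the whole partition as a List while still opening the Neo4j connection once per partition:
wordsArrays.foreachRDD { rdd =>
  rdd.mapPartitions { partition =>
    val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
    partition.flatMap { data =>
      val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
      // assumed to return a collection of records, as in the accepted code above
      convertDataToMap(recommendations, Calendar.getInstance.getTime)
    }
  }.saveToEs("rdd-timed/output")
}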
Spark SQL/DataFrames: please help me out or provide some good suggestion on how to read this JSON:
{
"billdate":"2016-08-08',
"accountid":"xxx"
"accountdetails":{
"total":"1.1"
"category":[
{
"desc":"one",
"currentinfo":{
"value":"10"
},
"subcategory":[
{
"categoryDesc":"sub",
"value":"10",
"currentinfo":{
"value":"10"
}
}]
}]
}
}
Thanks,
You can try the following code to read the JSON file based on a schema in Spark 2.2:
import org.apache.spark.sql.types.{DataType, StructType}

// read the JSON once to derive its schema as a JSON string
val schema_json = spark.read.json("/user/Files/ActualJson.json").schema.json

// build the schema from it
val newSchema = DataType.fromJson(schema_json).asInstanceOf[StructType]

// read the JSON files based on that schema
val df = spark.read.schema(newSchema).json("Json_Files/Folder Path")
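Once it loads, you can verify that the nested structure came through and reach into it with dot notation (field names here taken from the JSON in the question):
df.printSchema()
// nested fields are addressed with dot notation
df.select("accountid", "accountdetails.total").show()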
It seems like your JSON is not valid.
Please check it with http://www.jsoneditoronline.org/
Please see an-introduction-to-json-support-in-spark-sql.html
If you want to register it as a table, you can register it like below and print the schema.
DataFrame df = sqlContext.read().json("/path/to/validjsonfile").toDF();
df.registerTempTable("df");
df.printSchema();
Below is a sample code snippet:
DataFrame app = df.select("toplevel");
app.registerTempTable("toplevel");
app.printSchema();
app.show();
DataFrame appName = app.select("toplevel.sublevel");
appName.registerTempTable("sublevel");
appName.printSchema();
appName.show();
Example with Scala:
{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}
val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame
Reading top level field
val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])
names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)
Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.
Flatten and Read a JSON Array
Each person has an array of "cities". Let's flatten these arrays and read out all their elements.
val flattened = people.explode("cities", "city"){c: List[String] => c}
flattened: org.apache.spark.sql.DataFrame
val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]
allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)
The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.
Read an Array of Nested JSON Objects, Unflattened
Now read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:
val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]
val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]
schoolsArr.foreach(schools => {
  schools.map(row => print(row.getString(0), row.getLong(1)))
  print("\n")
})
(stanford,2010)(berkeley,2012)
(ucsb,2011)
(berkeley,2014)
Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Now, each "schools" array is of type List[Row], so we read it out with the getSeq[Row]() method. Finally, we can read the information for each individual school, by calling getString() for the school name and getLong() for the school year.
I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I getting it wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(userRatings => keywords.map(userRatings))
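As a quick sanity check you can pull a single user's array back to the driver with lookup (the user id 123 below is just a placeholder):
// Seq of Array[Double] (one element, since keys are unique after reduceByKey)
val oneUser = predictionArrays.lookup(123)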
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.