For our use case we need to load JSON files from an S3 bucket. As the processing tool we are using AWS Glue, but because we will soon be migrating to Amazon EMR, we are already developing our Glue jobs with plain Spark functionality only, so that the migration will be easier later on. This means we can't use Glue-specific features such as grouping input files.
The problem we are facing is that when we read these JSON files, the driver's memory climbs to 100% until the job eventually fails with OOM exceptions.
We already tried maxing out the driver memory by using G.2X instances and adding the --conf spark.driver.memory=20g argument to our Glue job.
The code we are using is as simple as:
spark.read.option("inferSchema", value = true).json("s3://bucket_with_json/sub_folder")
The input data are 21 JSON files of 100 MB. The files themselves are not valid JSON documents; instead, each file contains multiple JSON objects, for example:
{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
}
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}
(not the real dataset)
The Glue Job specs we are currently using:
Worker type: G.2X
Number of workers: 20
Additional Spark arguments: '--conf': 'spark.driver.maxResultSize=2g --conf spark.yarn.executor.memory=7g --conf spark.driver.memory=20g'
Job language: scala
Glue version: 3.0
This visual shows how the driver's memory exceeds its maximum, compared to the memory of the executors:
And the error we are getting after roughly 10 minutes is:
Command Failed due to Out of Memory
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 8"...
Also worth noting: when we run on a smaller set of data, everything works fine.
I'm kind of out of options at this point. Can someone help me fix this or point me in the right direction?
Also, could someone explain why my driver gets flooded? I always thought that the JSON files are read by the executors. I'm not collecting any data to the driver after reading it in, so I can't explain why this is happening.
** EDIT **
I tried converting the input files into one valid JSON document each, i.e. transforming them to this format:
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
And used option:
.option("multiline", "true")
But unfortunately this gives me the same result/error.
** EDIT **
I would like to add that the data example above and its structure do not resemble the data I am actually using. To give you some information about my data:
The structure is very nested. It contains 25 top-level fields, 7 of which are nested. If you flatten everything you end up with roughly 200 fields. It is possible that the inferSchema option is the cause of my issue.
I think setting inferSchema == true could be the problem, as schema inference is performed on the driver. So why infer the schema while reading? It requires an extra pass over the data and more driver resources. Maybe the reasoning is lost on this toy example, but perhaps you can try the following.
First... your second file format worked fine (the first did not)... I created a few files like this and put them all in a folder on S3.
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
One alternative I'd try is to provide the schema yourself when you read.
import org.apache.spark.sql.types.{ IntegerType, StringType, StructField, StructType }
val schema = {
new StructType()
.add(StructField("RecordNumber", IntegerType, false))
.add(StructField("Zipcode", IntegerType, true))
.add(StructField("ZipCodeType", StringType, true))
.add(StructField("City", StringType, true))
.add(StructField("State", StringType, true))
}
val df = spark.read.option("multiline", value = true).schema(schema).json("s3://bucket/path/")
One other thing to try... just skip inferring the schema up front while reading. I don't know if the following uses driver resources in the same way, but I seem to recall that it may only use a small subset of rows.
val df = spark.read.option("multiline", value = true).json("s3://bucket/path/")
val schema = df.schema
schema.printTreeString
df.printSchema
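A related variant (just a sketch, assuming the files share one layout; the single-file path below is hypothetical): infer the schema once from a single file, or from a fraction of the rows via the JSON reader's samplingRatio option, then reuse that schema for the full read so the big pass over all 21 files skips inference entirely.
// Infer the schema from one file only (path is made up), sampling ~10% of its rows
val sampleDF = spark.read
  .option("samplingRatio", "0.1")
  .json("s3://bucket_with_json/sub_folder/one_file.json")

val inferredSchema = sampleDF.schema

// Reuse the schema for the full dataset; no inference pass on the big read
val fullDF = spark.read
  .schema(inferredSchema)
  .json("s3://bucket_with_json/sub_folder")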
EDIT - in response to a comment indicating the above was no good.
One last thing to try... here I'm just trying to get the driver out of the mix, so I'm doing the following:
reading the data in as plain text, with JSON records spanning multiple lines
using .mapPartitions to iterate over each partition and merge JSON that is split over multiple lines into one record per JSON string
finally... parsing it as JSON using your favorite parser (I use json4s for no particular reason)
If after this you still run into memory errors, they should be on the executors, where you have more options.
Of course, if you're looking for Spark to automatically read it into a 200-column dataframe, maybe you just need a bigger driver.
So here's the function that iterates over lines of text and tries to merge them into one record per line. It works for the toy example, but you'll likely have to do something smarter.
.mapPartitions treats each partition as an iterator... so you need to give it a function of type Iterator[A] => Iterator[B], which in this case is just a .foldLeft that uses a regex to figure out whether it has reached the end of a record.
import org.apache.spark.rdd.RDD // RDD because that's what I use; probably similar on dataframes
import org.json4s._ // json4s for no particular reason
import org.json4s.jackson.JsonMethods._
/** `RDD.mapPartitions` treats each partition as an iterator,
 * so use a .foldLeft on the partition and a little regex
 * to merge multiple lines into one record
 *
 * probably needs something smarter for more deeply nested JSON
 */
val mergeJsonRecords: (Iterator[String] => Iterator[String]) = (oneRawPartition) => {
// val patternStart = "^\\[?\\{".r
val patternEnd = "(.*?\\})[,\\]]?$".r // end of JSON record
oneRawPartition
.foldLeft(List[String]())((list, next) => list match {
case Nil => List(next.trim.drop(1))
case x :: Nil => {
x.trim match {
case patternEnd(e) => List(next.trim, e)
case _ => List(x + next.trim)
}
}
case x :: xs => {
x.trim match {
case patternEnd(e) => next.trim :: e :: xs
case _ => x.trim + next.trim :: xs
}
}
})
.map { case patternEnd(e) => e; case x => x } // crude way to clean up the last JSON record in each partition
.iterator
}
Here we just read the data in... merge lines... then parse. Again, I'm using an RDD because it's what I usually use, but I'm sure you can keep it in a dataframe if you need to.
// read JSON in as plain text; each JSON record over multiple lines
val rdd: RDD[String] = spark.read.text("s3://bucket/path/").rdd.map(_.getAs[String](0))
rdd.count // 56 rows == 8 records
// one record per JSON object
val rdd2: RDD[String] = rdd.mapPartitions(mergeJsonRecords)
rdd2.collect.foreach(println)
rdd2.count // 8
// parsed JSON
object Parser extends Serializable {
implicit val formats = DefaultFormats
val func: (String => JValue) = (s) => parse(s)
}
val rddJson = rdd2.map(Parser.func)
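If you ultimately want a DataFrame rather than an RDD of parsed JSON, one option (a sketch, reusing the explicit schema defined earlier so the driver still never has to infer anything) is to hand the merged one-record-per-line strings back to the JSON reader:
import org.apache.spark.sql.Encoders

// Turn the merged strings into a Dataset[String] and parse them with the known schema
val mergedDS = spark.createDataset(rdd2)(Encoders.STRING)
val parsedDF = spark.read.schema(schema).json(mergedDS)
parsedDF.printSchema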
Related
I'm developing a Spark application in Scala. My application consists of only one operation that requires a shuffle (namely cogroup). It runs flawlessly and in a reasonable time. The issue I'm facing is when I want to write the results back to the file system; for some reason it takes longer than running the actual program. At first I tried writing the results without re-partitioning or coalescing, and I realized that the number of generated files is huge, so I thought that was the issue. I tried re-partitioning (and coalescing) before writing, but then the application took a long time performing these tasks. I know that re-partitioning (and coalescing) is costly, but is what I'm doing the right way? If it's not, could you please give me hints on the right approach?
Notes:
My file system is Amazon S3.
My input data size is around 130GB.
My cluster contains a driver node and five worker nodes, each with 16 cores and 64 GB of RAM.
I'm assigning 15 executors to my job, each with 5 cores and 19 GB of RAM.
P.S. I tried using Dataframes, same issue.
Here is a sample of my code just in case:
val sc = spark.sparkContext
// loading the samples
val samplesRDD = sc
.textFile(s3InputPath)
.filter(_.split(",").length > 7)
.map(parseLine)
.filter(_._1.nonEmpty) // skips any un-parsable lines
// pick random samples
val samples1Ids = samplesRDD
.map(_._2._1) // map to id
.distinct
.takeSample(withReplacement = false, 100, 0)
// broadcast it to the cluster's nodes
val samples1IdsBC = sc broadcast samples1Ids
val samples1RDD = samplesRDD
.filter(samples1IdsBC.value contains _._2._1)
val samples2RDD = samplesRDD
.filter(sample => !samples1IdsBC.value.contains(sample._2._1))
// compute
samples1RDD
.cogroup(samples2RDD)
.flatMapValues { case (left, right) =>
left.map(sample1 => (sample1._1, right.filter(sample2 => isInRange(sample1._2, sample2._2)).map(_._1)))
}
.map {
case (timestamp, (sample1Id, sample2Ids)) =>
s"$timestamp,$sample1Id,${sample2Ids.mkString(";")}"
}
.repartition(10)
.saveAsTextFile(s3OutputPath)
UPDATE
Here is the same code using Dataframes:
// loading the samples
val samplesDF = spark
.read
.csv(inputPath)
.drop("_c1", "_c5", "_c6", "_c7", "_c8")
.toDF("id", "timestamp", "x", "y")
.withColumn("x", ($"x" / 100.0f).cast(sql.types.FloatType))
.withColumn("y", ($"y" / 100.0f).cast(sql.types.FloatType))
// pick random ids as samples 1
val samples1Ids = samplesDF
.select($"id") // map to the id
.distinct
.rdd
.takeSample(withReplacement = false, 1000)
.map(r => r.getAs[String]("id"))
// broadcast it to the executor
val samples1IdsBC = sc broadcast samples1Ids
// get samples 1 and 2
val samples1DF = samplesDF
.where($"id" isin (samples1IdsBC.value: _*))
val samples2DF = samplesDF
.where(!($"id" isin (samples1IdsBC.value: _*)))
samples2DF
.withColumn("combined", struct("id", "lng", "lat"))
.groupBy("timestamp")
.agg(collect_list("combined").as("combined_list"))
.join(samples1DF, Seq("timestamp"), "rightouter")
.map {
case Row(timestamp: String, samples: mutable.WrappedArray[GenericRowWithSchema], sample1Id: String, sample1X: Float, sample1Y: Float) =>
val sample2Info = samples.filter {
case Row(_, sample2X: Float, sample2Y: Float) =>
Misc.isInRange((sample2X, sample2Y), (sample1X, sample1Y), 20)
case _ => false
}.map {
case Row(sample2Id: String, sample2X: Float, sample2Y: Float) =>
s"$sample2Id:$sample2X:$sample2Y"
case _ => ""
}.mkString(";")
(timestamp, sample1Id, sample1X, sample1Y, sample2Info)
case Row(timestamp: String, _, sample1Id: String, sample1X: Float, sample1Y: Float) => // no overlapping samples
(timestamp, sample1Id, sample1X, sample1Y, "")
case _ =>
("error", "", 0.0f, 0.0f, "")
}
.where($"_1" notEqual "error")
// .show(1000, truncate = false)
.write
.csv(outputPath)
The issue here is that Spark normally commits tasks and jobs by renaming files, and on S3 renames are really, really slow. The more data you write, the longer it takes at the end of the job. That is what you are seeing.
Fix: switch to the S3A committers, which don't do any renames.
Some tuning options to massively increase the number of threads used for IO and commits, and the connection pool size (a sketch of how these might be set follows below):
fs.s3a.threads.max: raise from the default of 10 to something bigger
fs.s3a.committer.threads: the number of files committed by a POST in parallel; the default is 8
fs.s3a.connection.maximum: try (fs.s3a.committer.threads + fs.s3a.threads.max + 10)
The defaults are all fairly small because many jobs work with multiple buckets, and if each had big numbers it would be really expensive to create an S3A client... but if you have many thousands of files, it is probably worthwhile.
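For example, a rough sketch of how those options might be passed from the Spark side (the exact numbers are placeholders to tune for your workload, and enabling the S3A committer itself is configured separately):
import org.apache.spark.sql.SparkSession

// Sketch only: widen the S3A thread and connection pools; tune the values to your job
val spark = SparkSession.builder()
  .appName("s3a-write-tuning")
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .config("spark.hadoop.fs.s3a.committer.threads", "32")
  .config("spark.hadoop.fs.s3a.connection.maximum", "128") // >= committer.threads + threads.max + 10
  .getOrCreate()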
I have a text file that has data like below:
productId|price|saleEvent|rivalName|fetchTS
123|78.73|Special|VistaCart.com|2017-05-11 15:39:30
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I have to find the minimum price of each product across websites, e.g. my output should be like this:
productId|price|saleEvent|rivalName|fetchTS
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I am trying it like this:
case class Product(productId:String, price:Double, saleEvent:String, rivalName:String, fetchTS:String)
val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header,values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble,
x(2).toString, x(3).toString, x(4).toString))
I am getting an exception while running the last line:
java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
  at org.apache.spark.sql.Row$class.apply(Row.scala:163)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
  at $anonfun$1.apply(<console>:28)
  at $anonfun$1.apply(<console>:28)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  ... 49 elided
Printing the values:
scala> values
res2: Array[org.apache.spark.sql.Row] =
Array([123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
[123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
[123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
[678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
[678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
[678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
[777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])

scala>
I understand that I need to split on "|".
scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()
So after splitting, it's giving me Unit, i.e. void, so I am unable to load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use Dataset for now, although Dataset is type safe.
I'm using Spark 2.3 and Scala 2.11.
The issue is due to split taking a regex, which means you need to use "\\|" instead of a plain "|". Also, the foreach needs to be changed to map to actually give a return value, i.e.:
val xy = values.map(x => x.toString.split("\\|"))
However, a better approach would be to read the data as a CSV file with | as the separator. That way you do not need to treat the header in a special way, and by inferring the column types there is no need to make any conversions (here I changed fetchTS to a timestamp):
import java.sql.Timestamp

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("sep", "|")
.csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
.as[Product]
The final line will convert the dataframe to use the Product case class. If you want to use it as an RDD instead, simply add .rdd at the end.
After this is done, use groupBy and agg to get the final results, for example as sketched below.
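A minimal sketch (using the column names from the sample data; note that if two rivals tie on the minimum price, both rows are returned):
import org.apache.spark.sql.functions.min

// Minimum price per product, then join back to recover the full matching rows
val minPrices = df.groupBy("productId").agg(min("price").as("price"))
val cheapest = df.join(minPrices, Seq("productId", "price"))
cheapest.show(truncate = false)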
I am getting millions of messages from a Kafka stream in Spark Streaming. There are 15 different types of messages, and they all come from a single topic. I can only differentiate a message by its content, so I am using the rdd.contains method to get the different types of RDD.
sample message
{"a":"foo", "b":"bar","type":"first" .......}
{"a":"foo1", "b":"bar1","type":"second" .......}
{"a":"foo2", "b":"bar2","type":"third" .......}
{"a":"foo", "b":"bar","type":"first" .......}
..............
...............
.........
so on
code
DStream.foreachRDD { rdd =>
if (!rdd.isEmpty()) {
val rdd_first = rdd.filter {
ele => ele.contains("First")
}
if (!rdd_first.isEmpty()) {
insertIntoTableFirst(hivecontext.read.json(rdd_first))
}
val rdd_second = rdd.filter {
ele => ele.contains("Second")
}
if (!rdd_second.isEmpty()) {
insertIntoTableSecond(hivecontext.read.json(rdd_second))
}
.............
......
same way for 15 different rdd
Is there any way to get different RDDs from the messages of a Kafka topic?
There's no rdd.contains. The function contains used here is applied to the Strings in the RDD.
Like here:
val rdd_first = rdd.filter {
element => element.contains("First") // each `element` is a String
}
This method is not robust, because other content in the String might match the comparison, resulting in errors.
e.g.
{"a":"foo", "b":"bar","type":"second", "c": "first", .......}
One way to deal with this would be to first transform the JSON data into proper records, and then apply grouping or filtering logic on those records. For that, we first need a schema definition of the data. With the schema, we can parse the records as JSON and apply any processing on top of that:
case class Record(a: String, b: String, `type`: String)

import org.apache.spark.sql.types._
val schema = StructType(
  Array(
    StructField("a", StringType, true),
    StructField("b", StringType, true),
    StructField("type", StringType, true)
  )
)
val processPerType: Map[String, Dataset[Record] => Unit ] = Map(...)
import org.apache.spark.sql.functions.from_json
import spark.implicits._

stream.foreachRDD { rdd =>
  // parse each raw string with the schema, then flatten the struct into columns a, b, type
  val records = rdd.toDF("value")
    .select(from_json($"value", schema).as("json"))
    .select("json.*")
    .as[Record]
  processPerType.foreach { case (tpe, process) =>
    val target = records.filter(entry => entry.`type` == tpe)
    process(target)
  }
}
The question does not specify what kind of logic needs to be applied to each type of record. What's presented here is a generic way of approaching the problem, where any custom logic can be expressed as a function Dataset[Record] => Unit.
If the logic can be expressed as an aggregation, the Dataset aggregation functions will probably be more appropriate.
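Purely as an illustration, processPerType could map each message type to a function that reuses the insert helpers from the question (assuming they accept a DataFrame):
// Illustrative sketch: route each typed slice to the corresponding insert function
val processPerType: Map[String, Dataset[Record] => Unit] = Map(
  "first"  -> (ds => insertIntoTableFirst(ds.toDF())),
  "second" -> (ds => insertIntoTableSecond(ds.toDF()))
  // ... one entry per message type
)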
I am getting JSON filter conditions for Spark SQL. The JSON will be of the format:
{
"x": {
"LT": "2"
}
}
This should become something like: spark.sql("Select * from df where x < 2")
Any idea how I can proceed? The data is read from a parquet file using:
spark.read.parquet(filePath)
So far my code is as follows:
val df = spark.read.parquet(filePath)
implicit val formats = org.json4s.DefaultFormats
parse(filterJson).extract[Map[String, Any]]
// Once tables have been registered, you can run SQL queries over them.
for ((k, v) <- Map) {
  v match {
    case "EQ" => "==="
    case "GT" => ">"
    case "LT" => "<"
  }
}
df.filter(k)
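Roughly what I have in mind is something like the sketch below (untested; toSqlPredicate is just a made-up helper, and it assumes the values are numeric, so string quoting is not handled):
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Made-up helper: turns {"x": {"LT": "2"}} into the SQL condition "x < 2"
def toSqlPredicate(filterJson: String): String = {
  implicit val formats: Formats = DefaultFormats
  val conditions = parse(filterJson).extract[Map[String, Map[String, String]]]
  conditions.flatMap { case (column, ops) =>
    ops.map { case (op, value) =>
      val sqlOp = op match {
        case "EQ" => "="
        case "GT" => ">"
        case "LT" => "<"
      }
      s"$column $sqlOp $value"
    }
  }.mkString(" AND ")
}

val df = spark.read.parquet(filePath)
df.filter(toSqlPredicate("""{"x": {"LT": "2"}}""")).show()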
Any help will be appreciated.
I want to show the data from HDInsight Spark using Tableau. I was following this video, where they describe how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvLines is an RDD of strings, each representing a line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer#mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")
// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
s => MyData(s(0),
s(1),
s(2),
s(3),
s(4).toDouble,
s(5)
)
).toDF()
// Register as a temporary table called "processdata"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still fails, now with this error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave it as an answer, as it was not readily available in any of the blogs or forum answers. Hopefully it will help someone like me who is starting with Spark.
I figured out that .toDF() actually creates a DataFrame based on the sqlContext and not on the hiveContext. So I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
s => MyData(s(0),
s(1),
s(2),
s(3),
s(4).toDouble,
s(5)
)
)
// Create the DataFrame with the hiveContext and register/persist it as "mydata_stored"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) has a proper double value, else add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: Throwable => 0.00 }
parseDouble(s(4))