In our application, most of our code just applies filter, group by and aggregate operations on a DataFrame and saves the result to a Cassandra database.
As in the code below, we have several methods that perform the same kind of operations (filter, group by, join, agg) on different sets of fields and return a DataFrame, which is then saved to Cassandra tables.
Sample code is:
val filteredDF = df.filter(col("hour") <= LocalDateTime.now().getHour())
  .groupBy("country")
  .agg(sum(col("volume")) as "pmtVolume")
saveToCassandra(filteredDF)
def saveToCassandra(df: DataFrame): Unit = {
  try {
    df.write.format("org.apache.spark.sql.cassandra")
      .options(Map("Table" -> "tableName", "keyspace" -> keyspace))
      .mode("append").save()
  }
  catch {
    case e: Throwable => log.error(e)
  }
}
Since I trigger the action by saving the DataFrame to Cassandra, I assume I only need to handle the exception on that line, as per this thread.
If I get any exception, I can see it in the detailed Spark log by default.
Do I really have to surround the filter and group by code with Try or try/catch?
I don't see any examples of exception handling in the Spark SQL DataFrame API examples.
How do I use Try on the saveToCassandra method? It returns Unit.
There is no point in wrapping the lazy DAG in try/catch.
You would need to wrap the lambda function in Try().
Unfortunately, AFAIK there is no way to do row-level exception handling in DataFrames.
You can use an RDD or Dataset, as mentioned in the answer to the post below:
apache spark exception handling
You don't really need to surround the filter and group by code with Try or try/catch. Since all of these operations are transformations, they don't get executed until an action is performed on them, like saveToCassandra in your case.
However, if an error occurs while filtering, grouping or aggregating the DataFrame, the catch clause in the saveToCassandra function will log it, because that is where the action is performed.
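On the question of using Try with a method that returns Unit: Try works with any result type, so wrapping the call simply gives a Try[Unit]. Below is a minimal sketch in plain Scala. The saveToCassandra here is a hypothetical stand-in that takes a Boolean flag instead of a DataFrame, only so that the example runs without Spark or Cassandra.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the real saveToCassandra(df): it returns Unit
// and throws when the write fails. The flag only simulates a failure.
def saveToCassandra(shouldFail: Boolean): Unit =
  if (shouldFail) throw new RuntimeException("write failed")

// Wrapping a Unit-returning call yields a Try[Unit].
val ok: Try[Unit] = Try(saveToCassandra(shouldFail = false))
val failed: Try[Unit] = Try(saveToCassandra(shouldFail = true))

// Pattern match on the outcome instead of letting the exception escape.
def describe(result: Try[Unit]): String = result match {
  case Success(_) => "saved"
  case Failure(e) => s"save failed: ${e.getMessage}"
}

println(describe(ok))
println(describe(failed))
```

Note that Try only catches non-fatal exceptions, so it is not a drop-in replacement for catching Throwable as the original code does.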
I am observing a weird issue. I am not sure whether it is due to my lack of knowledge of Spark or something else.
I have a DataFrame as shown in the code below. I create a temp view from it, and I observe that after a merge operation the temp view becomes empty. I am not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: the lines below display all the records
myDf.show()
spark.table("myView").show()
// performing the merge operation
val sql = s"""MERGE INTO employee AS a
  USING myView AS b
  ON a.Id = b.Id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf and myView are both empty
myDf.show()
spark.table("myView").show()
Edit
The getEmployeeData method performs a join between two DataFrames and returns the result:
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
DataFrames in Spark are lazily evaluated, i.e. they are not executed until an action like .show or .collect is called, or until they are used in a SQL statement that writes data. This also means that if you refer to a DataFrame again, it gets re-evaluated.
Assuming there is no other background activity that could interfere, your getEmployeeData function apparently depends on the employee table. It is executed both before and after the merge and may yield a different result each time.
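To prevent this, you can checkpoint the DataFrame (checkpoint returns a new DataFrame and needs a checkpoint directory set via spark.sparkContext.setCheckpointDir):
val checkpointedDf = myDf.checkpoint()
or explicitly materialize it:
myDf.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.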
I agree with the points @Kombajn zbozowy made: DataFrames in Spark are lazily evaluated and will be re-evaluated if you call an action on them again.
I would like to point out that this is normal, expected behavior and might not have anything to do with the merge operation.
For example, suppose the DataFrame you get as the output of the join contains only inserts, and you perform those inserts on the target table using the DataFrame write API. If you then run df.show() again, it shows an empty result, because re-evaluating the DataFrame performs the join again, finds no difference, and outputs no records.
The same holds true for the merge operation: it also updates the target table with inserts and updates, so when you rerun the DataFrame it won't show any output rows.
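The re-evaluation effect described above can be reproduced with plain Scala collections, which may make it easier to see. This is only an analogy, not Spark code: employee, incoming and newRows are made-up names, and it assumes (as the answers do) that the source DataFrame is derived from the target table and yields only rows not yet present there.

```scala
import scala.collection.mutable.ArrayBuffer

// Stand-in for the target table: ids that are already stored.
val employee = ArrayBuffer(1, 2)
// Stand-in for the incoming source data.
val incoming = Seq(2, 3, 4)

// Like a lazy DataFrame, this is a recipe that is re-run on every use,
// and its result depends on the current contents of `employee`.
def newRows: Seq[Int] = incoming.filterNot(id => employee.contains(id))

// First evaluation: the rows that are not in the target yet.
val beforeMerge = newRows
// The "merge" inserts those rows into the target.
employee ++= beforeMerge
// Second evaluation of the same recipe: nothing is new any more.
val afterMerge = newRows

println(s"before merge: $beforeMerge, after merge: $afterMerge")
```

Here beforeMerge plays the role of the checkpointed or saved copy: it was computed once and keeps its contents, while every fresh call to newRows reflects the updated target.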
I'm trying to count the number of valid and invalid records present in a file. Below is the code that does this:
val badDataCountAcc = spark.sparkContext.longAccumulator("BadDataAcc")
val goodDataCountAcc = spark.sparkContext.longAccumulator("GoodDataAcc")
val dataframe = spark
  .read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(path)
  .filter(data => {
    val matcher = regex.matcher(data.toString())
    if (matcher.find()) {
      goodDataCountAcc.add(1)
      println("GoodDataCountAcc: " + goodDataCountAcc.value)
      true
    } else {
      badDataCountAcc.add(1)
      println("BadDataCountAcc: " + badDataCountAcc.value)
      false
    }
  })
  .withColumn("FileName", input_file_name())
dataframe.show()
val filename = dataframe
  .select("FileName")
  .distinct()
val name = filename.collectAsList().get(0).toString()
println("" + filename)
println("Bad data Count Acc: " + badDataCountAcc.value)
println("Good data Count Acc: " + goodDataCountAcc.value)
I ran this code on sample data with 2 valid and 3 invalid records. Inside the filter, where I print the counts, the values are correct. But outside the filter, the printed counts are 4 for good data and 6 for bad data.
Questions:
1. When I remove the withColumn statement at the end - along with the code which calculates distinct filename - values are printed correctly. I'm not sure why?
2. I do have a requirement to get the input filename as well. What would be best way to do that here?
First of all, Accumulator belongs to the RDD API, while you are using Dataframes. Dataframes are compiled down to RDDs in the end, but they are at a higher level of abstraction. It is better to use aggregations instead of Accumulators in this context.
From the Spark Accumulators documentation:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map().
Your DataFrame filter will be compiled to an RDD filter, which is not an action, but a transformation (and thus lazy), so this only-once guarantee does not hold in your case.
How many times your code is executed is implementation-dependent and may change between Spark versions, so you should not rely on it.
Regarding your two questions:
1. (BEFORE EDIT) This cannot be answered based on your code snippet because it doesn't contain any actions. Is it even the exact code snippet you use? I suspect that if you actually execute the code you posted without any additions except for the missing imports, it should print 0 two times because nothing is executed. Either way, you should always assume that an accumulator inside an RDD transformation is potentially executed multiple times (or even not at all if it is in a DataFrame operation which can possibly be optimized out).
2. Your approach of using withColumn is perfectly fine.
I'd suggest using DataFrame expressions and aggregations (or equivalent Spark SQL if you prefer that). The regex matching can be done using rlike on the columns instead of relying on toString(), e.g. .withColumn("IsGoodData", $"myColumn1".rlike(regex1) && $"myColumn2".rlike(regex2)).
Then you can count the good and bad records using an aggregation like dataframe.groupBy($"IsGoodData").count()
EDIT: With the additional lines the answer to your first question is also clear: The first time was from the dataframe.show() and the second time from the filename.collectAsList(), which you probably also removed as it depends on the added column. Please make sure you understand the distinction between Spark transformations and actions and the lazy evaluation model of Spark. Otherwise you won't be very happy with it :-)
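The doubling from 2/3 to 4/6 can be reproduced without Spark. The sketch below is an analogy using a lazy Scala view in place of the DataFrame and two plain counters in place of the accumulators; the sample records and the startsWith check are made up for illustration and stand in for the regex match.

```scala
// Five sample records: 2 valid ("id=...") and 3 invalid.
val records = Seq("id=1", "broken", "id=2", "???", "n/a")
var goodCount = 0
var badCount = 0

// A view is lazy: the predicate runs only when the view is traversed,
// and it runs again on every traversal.
val filtered = records.view.filter { r =>
  val ok = r.startsWith("id=")
  if (ok) goodCount += 1 else badCount += 1
  ok
}

// First "action" (compare dataframe.show()).
val firstPass = filtered.toList
val countsAfterFirst = (goodCount, badCount)

// Second "action" (compare filename.collectAsList()) re-runs the filter.
val secondPass = filtered.toList
val countsAfterSecond = (goodCount, badCount)

// Counting from the data itself does not depend on how often it is evaluated.
val (good, bad) = records.partition(_.startsWith("id="))

println(s"after first pass: $countsAfterFirst, after second pass: $countsAfterSecond")
println(s"good = ${good.size}, bad = ${bad.size}")
```

The last step is the plain-collection counterpart of the groupBy($"IsGoodData").count() aggregation suggested above: the counts are derived from the result rather than from side effects inside the filter.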
I need to do certain operation on a set of BQ tables but I want to do the operation if and only if I know for certain that all the BQ tables exist.
I have checked the Google BigQuery package and it has a sample for reading data from BQ tables, which is fine. But what if my tables are really huge? I can't load all the tables just for an existence check, as it would take too much time and seems redundant.
Is there another way to achieve this? I would be very glad if I could get some pointers in the right direction.
Thank you in advance.
Gaurav
spark.read.option(...).load does not load all the objects into a DataFrame.
spark.read.option(...) returns a DataFrameReader. When you call load on it, it will test the connection and issue a query like
SELECT * FROM (select * from objects) SPARK_GEN_SUBQ_11 WHERE 1=0
The query will not scan any records and will error out when the table does not exist. I am not sure about the BigQuery driver, but JDBC drivers throw a Java exception here, which you need to handle in a try {} catch {} block.
Thus you can just call load, catch exceptions and check whether all DataFrames could be instantiated. Here is some example code:
def query(q: String) = {
  val reader = spark.read.format("bigquery").option("query", q)
  try {
    Some(reader.load())
  } catch {
    case e: Exception => None
  }
}
val dfOpts = Seq(
  query("select * from foo"),
  query("select * from bar"),
  query("select * from baz")
)
if (dfOpts.exists(_.isEmpty)) {
  println("Some table is missing")
}
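If it is useful to know which tables are missing rather than only that one is, the same idea can be written with Try and a filter. This is a sketch in plain Scala under stated assumptions: load is a hypothetical stand-in for the reader's load call, and the existing set simulates the catalog so that the example runs without BigQuery.

```scala
import scala.util.Try

// Simulated catalog; in the real job the check is the load call itself.
val existing = Set("foo", "bar")

// Hypothetical stand-in for reader.load(): succeeds for known tables
// and throws for unknown ones, as described above for JDBC drivers.
def load(table: String): String =
  if (existing(table)) s"DataFrame($table)"
  else throw new NoSuchElementException(s"table not found: $table")

// Try every table and keep the names of those that failed to load.
def missingTables(tables: Seq[String]): Seq[String] =
  tables.filter(t => Try(load(t)).isFailure)

val missing = missingTables(Seq("foo", "bar", "baz"))
if (missing.isEmpty) println("All tables exist")
else println(s"Missing tables: ${missing.mkString(", ")}")
```

Keep in mind that this treats any failure (for example a permission or network error) as a missing table, just like the catch-all in the code above.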
You could use the tables.get method:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get
Otherwise, you can run a BQ CLI command in a bash script, which can be called from your Spark program.
I am new to Spark. I have some JSON data that comes as an HttpResponse, and I need to store it in Hive tables. Every HttpGet request returns a JSON document that becomes a single row in the table. Because of this, I have to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. Is there a way to keep adding new rows to a DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain the single records as a Seq[DataFrame] and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
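One caveat with the reduce(_ union _) pattern above: reduce throws an exception on an empty collection, which can happen if a batch yields no records. A possible guard is reduceOption, sketched here with a generic helper. The name combineBatches is made up, and Lists stand in for DataFrames so that the example runs without Spark.

```scala
// Combine batches pairwise, or return None when there are no batches.
// With DataFrames the call would look like combineBatches(frames)(_ union _).
def combineBatches[A](batches: Seq[A])(union: (A, A) => A): Option[A] =
  batches.reduceOption(union)

// Lists stand in for DataFrames; ++ stands in for union.
val combined = combineBatches(Seq(List(1), List(2, 3), List(4)))(_ ++ _)
val nothing = combineBatches(Seq.empty[List[Int]])(_ ++ _)

// Write only when there is something to write.
combined.foreach(rows => println(s"would write ${rows.size} rows"))
```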
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit in using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, Spark can load an entire directory at once), you can then load that file (or files) all at once into Spark to do your processing. I would also check whether the web API has any endpoints to fetch all the data you need instead of just one record at a time.
I have a Spark Streaming job which reads from Kafka and does some comparisons with an existing table in Postgres before writing to Postgres again. This is what it looks like:
val message = KafkaUtils.createStream(...).map(_._2)
message.foreachRDD( rdd => {
  if (!rdd.isEmpty){
    val kafkaDF = sqlContext.read.json(rdd)
    println("First")
    kafkaDF.foreachPartition(
      i =>{
        val jdbcDF = sqlContext.read.format("jdbc").options(
          Map("url" -> "jdbc:postgresql://...",
            "dbtable" -> "table", "user" -> "user", "password" -> "pwd" )).load()
        createConnection()
        i.foreach(
          row =>{
            println("Second")
            connection.sendToTable()
          }
        )
        closeConnection()
      }
    )
  }
})
This code gives me a NullPointerException at the line val jdbcDF = ...
What am I doing wrong? Also, my log "First" works, but "Second" doesn't show up anywhere in the logs. I tried the entire code with kafkaDF.collect().foreach(...) and it works perfectly, but has very poor performance. I am looking to replace it with foreachPartition.
Thanks
It is not clear whether there are any issues inside createConnection, closeConnection or connection.sendToTable, but the fundamental problem is the attempt to nest actions / transformations. This is not supported in Spark, and Spark Streaming is no different.
It means that the nested DataFrame initialization (val jdbcDF = sqlContext.read.format ...) simply cannot work and should be removed. If you use it as a reference, it should be created at the same level as kafkaDF and referenced using standard transformations (unionAll, join, ...).
If for some reason that is not an acceptable solution, you can create a plain JDBC connection inside foreachPartition and operate on the PostgreSQL table (I guess that is what you already do inside sendToTable).
As @zero323 correctly pointed out, you can't broadcast your JDBC connection around and you cannot create nested RDDs either. Spark simply does not support using sparkContext or sqlContext within an existing closure such as foreachPartition, hence the null pointer exception.
The only way to solve this efficiently is to create a JDBC connection within foreachPartition and execute SQL directly on it to do whatever you intended and then use that same connection to write back the records.
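The "one connection per partition" idea can be sketched as a small helper that opens a connection once, sends every row through it, and always closes it. This is a plain Scala sketch, not the poster's code: sendPartition and FakeConnection are made-up names, and the fake connection only records calls so that the example runs without a database.

```scala
import scala.collection.mutable.ListBuffer

// Process one partition with a single connection: open once, send every row,
// and always close, even if sending a row fails.
def sendPartition[C, R](rows: Iterator[R])(open: () => C)(send: (C, R) => Unit)(close: C => Unit): Unit = {
  val conn = open()
  try rows.foreach(row => send(conn, row))
  finally close(conn)
}

// Fake connection that records what happened to it.
final class FakeConnection {
  val sent = ListBuffer.empty[String]
  var closed = false
}

val fake = new FakeConnection
sendPartition(Iterator("a", "b", "c"))(() => fake)((c, r) => c.sent += r)(c => c.closed = true)

println(s"sent: ${fake.sent.toList}, closed: ${fake.closed}")
```

In the streaming job, a helper like this would be called from inside kafkaDF.foreachPartition with the partition's row iterator, with open creating a plain java.sql connection (for example via DriverManager.getConnection) instead of going through sqlContext.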
As to your second, edited, question:
Change:
kafkaDF.foreachPartition(..)
to
kafkaDF.repartition(numPartition).foreachPartition(..)
where numPartition is the desired number of partitions. This will increase the number of partitions. If you have multiple executors (and multiple tasks per executor), these will run in parallel.