How to prevent Apache Spark to read JDBC DataFrame multiple times? - scala

I have to read data from an Oracle Database using JDBC with Spark (2.2). To minimize the transfered data, I use a pushdown query, which already filters the data to be loaded. That data then is appended to an existing Hive table.
To log what has been loaded, I count the records loaded via JDBC. My code basically looks like this:
val query = "(select a, b from table where c = 1) t"
val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache()
myDF.write.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("my_table_name")
val numRecs = myDF.count()
My assumption was, due to the cache(), the DataFrame is read once via JDBC, saved and used to count the records. But when I look at the sessions created in Oracle, I looks like the count operation itself creates 10 sessions on the Oracle database. At first I see 10 sessions, which basicaly look like that:
SELECT * FROM (select a, b from table where c = 1) t WHERE a >= x AND < a y
And, after that is done, another 10 sessions appear like that:
SELECT 1 FROM (select a, b from table where c = 1) t WHERE a >= x AND < a y
So it looks like Spark is loading data from the JDBC source only to count records, where it should already be sufficient to use the data already loaded. How can this be prevented and Spark forced to only read the data once from the JDBC source?
UPDATE
Turns out, I was blind: there was another count() within my code before saveAsTable was called. So it totally makes sense, the first action called on the DataFrame was indeed count(). After eliminating this, it al behaved like expected.

Related

Spark Persist method odd behavior

I am exploring spark persist function. It seems for some dataframe it is persisting whereas for others it is not, even though I have used the persist method on all the dataframes
Here is my code with explaination
// loading csv as dataframe and creating a view
val src_data=spark.read.option("header",true).csv("sources/data.csv")
src_data.createTempView("src_data")
**There is alreading a table called test in hive**
Here I am creating 3 dataframes using src and test and using persist on all 3 for later use
//dataframe 1
val changed_data= spark.sql("select sc.* from src_data sc inner join default.test t on sc.id=t.id where t.value!=sc.value or t.description!=sc.description ")
changed_data.persist().show()
changed_data.createOrReplaceTempView("changed_data")
// dataframe 2
val new_data= spark.sql("select * from src_data where id not in (select distinct id from default.test)")
println("new_data")
new_data.persist().show()
new_data.createOrReplaceTempView("new_data")
// dataframe 3
val unchanged_data= spark.sql("select * from test where id not in (select id from changed_data)")
unchanged_data.persist().show()
unchanged_data.createTempView("unchanged_data")
**then I truncate the table test**
spark.sql("truncate table test")
***Then i print the 3 dataframes I persisted***
new_data.show()
unchanged_data.show()
changed_data.show()
Before truncating test I can see data for all 3 dataframes using show but after I see only data for one dataframe....
I get data for only new_data(which is dataframe 2) eventhough I persisted all 3 dataframes and all three use table test??
Why this odd behaviour
The Dataframes will only be persisted if you invoke an action that goes through every record of the Dataframe.
Remember, that show() only shows the Top 20 rows as documented in the ScalaDocs.
In contrast, you could apply something like count() but that obviously will have some negative impact on your performance.

Spark Job simply stalls when querying full cassandra table

I have a rather peculiar problem. In a DSE spark analytics engine I produce frequent stats that I store to cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in spark and send the results over an API. I have tried two methods for this:
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a scala program. When I use method 1, spark job mysteriously gets stuck showing stage 10 of 12 forever. Verified in logs and spark jobs page. When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine, the key difference being that in each of them the main partition key always has a condition on it unlike in this query (holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
userid text,
stat_type text,
event_time bigint,
stat_value double,
PRIMARY KEY (userid, stat_type))
WITH CLUSTERING ORDER BY (stat_type ASC)
Any solid diagnosis of the problem or a work around would be much appreciated
When you do select * without where clause in cassandra, you're actually performing a full range query. This is not intended use case in cassandra (aside from peeking at the data perhaps). Just for the fun of it, try replacing with select * from keyspace.table limit 10 and see if it works, it might...
Anyway, my gut feeling says you're problem isn't with spark, but with cassandra. If you have visibility for cassandra metrics, look for the range query latencies.
Now, if your code above is complete - the reason that method 1 freezes, while method 2 doesn't, is that method 1 contains an action (collect), while method 2 doesn't involve any spark action, just schema inference. Should you add to method 2 df.collect you will face the same issue with cassandra

Iterating in Scala DataFrame

I have a DataFrame in spark with Sample accounts which has 5 different columns.
val sampledf= sqlContext.sql(select * from Sampledf)
I have other table in oracle db which has millions of records. OracleTable
I want to filter Accounts present in OracleTable with respect to SampleDF
Select * from OracleTable where column in (select column from SamplesDf)
I realized that in oracle we can not provide more than 1000 values in IN condition.
And below subquery query is not working. Due to huge data in OracleTable
I want to achieve below query
select column from OracleTable where (acctnum in (1,2,3,...1000) or acctnum in (1001,....2000) ....
Basically all the accounts from SampleDF (every 1000 accounts)
Since we cant give more than 1000 at once (that's the limitation in Oracle) we can give 1000 every time.
How can I generate this kind of dynamic query. DO I need to create Array from Dataframe?
I just need a work around, how can I proceed. Any suggestions will be helpful.
broadcast join is the best option which will broadcast the smaller dataframe across the cluster. As it’s mentioned the reading oracle data it’s taking time, it might be due to the profile restrictions of number of parallel sessions.
See below work around to build a dynamic in condition.
Val newsampledf = sampledf.withColumn(“seq”,row_number().over(Window.orderBy(“yourcolumn”)).select(“yourcolumn”, “seq”)
var i = 1L
var j = 0L
while(i <= (cnt/999))
{ var sql = newsampledf.select(“yourcolumn”).where(col(“seq” >= j).where(col(“seq”) <j + 999) j=j+999 i=i+1}
You can try to join the both tables based on the column.
Load the Oracle table as dataframe
Join the oracleDF with sampleDF
val resultDF=oracleDF.join(sampleDF,seq("column"))
Use broadcast if sampleDF is small for better performance
val resultDF=oracleDF.join(broadcast(sampleDF),seq("column"))
Hope it helps you.

Recursively adding rows to a dataframe

I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track, what you want to do is to obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, spark can load an entire directory at once) you can then load that file(or files) all at once into spark to do your processing. I would also check to see it the webapi as any endpoints to fetch all the data you need instead of just one record at a time.

Scala & Spark: Recycling SQL statements

I spent quite some time to code multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
The clue is, that for some queries a result from a prior query is required - why creating VIEWs in the database itself does not solve this problem. With Spark 2.0 I already figured out a way to do just that through
// create a dataframe using a jdbc connection to the database
val tableDf = spark.read.jdbc(...)
var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME",tempTableName)
tableDf.createOrReplaceTempView(tempTableName)
var data = spark.sql(sqlQuery)
But this is in my humble opinion very fiddly. Also, more complex queries, e.g. queries that incooporate subquery factoring currently don't work. Is there a more robust way like re-implementing the SQL code into Spark.SQL code using filter($""), .select($""), etc.
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (which always a few JOINs, WITHs, etc.). So n queries leading to n DataFrames.
Is there a better option than the provided two?
Setup: Hadoop v.2.7.3, Spark 2.0.0, Intelli J IDEA 2016.2, Scala 2.11.8, Testcluster on Win7 Workstation
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that I would enhance the input file so each sql statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS=SELECT * FROM PEOPLEPLACES WHERE age>18
Then execute in a loop like
parseSqlFile().foreach({case (name, query) => {
val data: DataFrame = execute(query)
data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order so all required tables have been created. Other do a little more parsing and sort by dependencies.
In an RDMS I'd call these tables Materialised Views. i.e. a transform on other data, like a view, but with the result cached for later reuse.