Iterating in Scala DataFrame - scala

I have a DataFrame in spark with Sample accounts which has 5 different columns.
val sampledf= sqlContext.sql(select * from Sampledf)
I have other table in oracle db which has millions of records. OracleTable
I want to filter Accounts present in OracleTable with respect to SampleDF
Select * from OracleTable where column in (select column from SamplesDf)
I realized that in oracle we can not provide more than 1000 values in IN condition.
And below subquery query is not working. Due to huge data in OracleTable
I want to achieve below query
select column from OracleTable where (acctnum in (1,2,3,...1000) or acctnum in (1001,....2000) ....
Basically all the accounts from SampleDF (every 1000 accounts)
Since we cant give more than 1000 at once (that's the limitation in Oracle) we can give 1000 every time.
How can I generate this kind of dynamic query. DO I need to create Array from Dataframe?
I just need a work around, how can I proceed. Any suggestions will be helpful.

broadcast join is the best option which will broadcast the smaller dataframe across the cluster. As it’s mentioned the reading oracle data it’s taking time, it might be due to the profile restrictions of number of parallel sessions.
See below work around to build a dynamic in condition.
Val newsampledf = sampledf.withColumn(“seq”,row_number().over(Window.orderBy(“yourcolumn”)).select(“yourcolumn”, “seq”)
var i = 1L
var j = 0L
while(i <= (cnt/999))
{ var sql = newsampledf.select(“yourcolumn”).where(col(“seq” >= j).where(col(“seq”) <j + 999) j=j+999 i=i+1}

You can try to join the both tables based on the column.
Load the Oracle table as dataframe
Join the oracleDF with sampleDF
val resultDF=oracleDF.join(sampleDF,seq("column"))
Use broadcast if sampleDF is small for better performance
val resultDF=oracleDF.join(broadcast(sampleDF),seq("column"))
Hope it helps you.

Related

Spark performance issues when iteratively adding multiple columns

I am seeing performance issues when iteratively adding columns (around 100) to a Dataframe.
I know that it is more efficient to use select to add multiple columns however I have to add the columns in order because column2 may depend on column 1 etc. etc.
The columns are being added following a join which resulted in skew so I have explicitly repartitioned by a salt key to evenly distribute data on the cluster.
When I ran locally I was seeing OOM errors even for fairly small (100 row, 500 column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns so I suspect spark lineage issues are causing my problems however I am still unable to run the job at scale on the cluster.
Any advice on where to look or on best practice in this scenario would be greatly received.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collectAsList.map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(joined) {
case (df, row) => df.withColumn(col(row.expression).as(row.name))
} // Checkpointing after ever x columns seems to help
Cheers
Terry

save dropped duplicates in pyspark RDD

From here, Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame, we learned how to drop duplicated observations based on some specific variables. What if I want to save those duplicate observations in form of RDD, how shall I do? I guess rdd.substract() may be not efficient if RDD contains billions of observations. So besides using rdd.substract(), is there any other way I can use?
If you need both the datasets, one having only the distinct values and the other having the duplicates, you should use subtract. That will provide an accurate result. In case you need only the duplicates, you can use sql to get that.
df.createOrReplaceTempView('mydf')
df2 = spark.sql("select *,row_number() over(partition by <<list of columns used to identify duplicates>> order by <<any column/s not used to identify duplicates>>) as row_num from mydf having row_num>1").drop('row_num')

How to prevent Apache Spark to read JDBC DataFrame multiple times?

I have to read data from an Oracle Database using JDBC with Spark (2.2). To minimize the transfered data, I use a pushdown query, which already filters the data to be loaded. That data then is appended to an existing Hive table.
To log what has been loaded, I count the records loaded via JDBC. My code basically looks like this:
val query = "(select a, b from table where c = 1) t"
val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache()
myDF.write.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("my_table_name")
val numRecs = myDF.count()
My assumption was, due to the cache(), the DataFrame is read once via JDBC, saved and used to count the records. But when I look at the sessions created in Oracle, I looks like the count operation itself creates 10 sessions on the Oracle database. At first I see 10 sessions, which basicaly look like that:
SELECT * FROM (select a, b from table where c = 1) t WHERE a >= x AND < a y
And, after that is done, another 10 sessions appear like that:
SELECT 1 FROM (select a, b from table where c = 1) t WHERE a >= x AND < a y
So it looks like Spark is loading data from the JDBC source only to count records, where it should already be sufficient to use the data already loaded. How can this be prevented and Spark forced to only read the data once from the JDBC source?
UPDATE
Turns out, I was blind: there was another count() within my code before saveAsTable was called. So it totally makes sense, the first action called on the DataFrame was indeed count(). After eliminating this, it al behaved like expected.

Spark DataFrame row count is inconsistent between runs

When I am running my spark job (version 2.1.1) on EMR, each run counts a different amount of rows on a dataframe. I first read data from s3 to 4 different dataframes, these counts are always consistent an then after joining the dataframes, the result of the join have different counts. afterwards I also filter the result and that also has a different count on each run. The variations are small, 1-5 rows difference but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir
is it possible that one of the data sources changes over over time?
since impressionsJoined is not cached, spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
try caching impressionsJoined after the join.

Scala & Spark: Recycling SQL statements

I spent quite some time to code multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
The clue is, that for some queries a result from a prior query is required - why creating VIEWs in the database itself does not solve this problem. With Spark 2.0 I already figured out a way to do just that through
// create a dataframe using a jdbc connection to the database
val tableDf = spark.read.jdbc(...)
var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME",tempTableName)
tableDf.createOrReplaceTempView(tempTableName)
var data = spark.sql(sqlQuery)
But this is in my humble opinion very fiddly. Also, more complex queries, e.g. queries that incooporate subquery factoring currently don't work. Is there a more robust way like re-implementing the SQL code into Spark.SQL code using filter($""), .select($""), etc.
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (which always a few JOINs, WITHs, etc.). So n queries leading to n DataFrames.
Is there a better option than the provided two?
Setup: Hadoop v.2.7.3, Spark 2.0.0, Intelli J IDEA 2016.2, Scala 2.11.8, Testcluster on Win7 Workstation
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that I would enhance the input file so each sql statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS=SELECT * FROM PEOPLEPLACES WHERE age>18
Then execute in a loop like
parseSqlFile().foreach({case (name, query) => {
val data: DataFrame = execute(query)
data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order so all required tables have been created. Other do a little more parsing and sort by dependencies.
In an RDMS I'd call these tables Materialised Views. i.e. a transform on other data, like a view, but with the result cached for later reuse.