Spark Efficiently Extract Large Query Result From Relational DB


(Spark beginner here) I wrote a Spark job to extract data from an Oracle database and write it to Parquet files. It works, but I am not satisfied with the batching solution I used. Is there a more efficient way?
I need to extract data from queries, not simply tables. I used a straightforward solution: get 1000 IDs (the max that can fit in a WHERE clause in Oracle), and built the query as a string. Then I passed it to Spark and extracted the data to Parquet. This works, but I wonder whether there are better, more efficient/idiomatic ways to do this. For instance, I am doing all the batching myself, whereas Spark was built to split and distribute work.
My code works on a small data set. I will have to scale it up by a factor of 100, and I would like to know the most idiomatic way of doing this.
val query =
  s"""
     |(SELECT tbl.*
     |FROM $tableName tbl
     |WHERE tbl.id IN (${idList.mkString("'", "','", "'")})
     |) $tableName
  """.stripMargin
private def queryToDf(query: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", query)
    .option("user", appConfig.dbUsername).option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
}
Using Spark 2.4.0, Scala 2.12, Oracle DB.

This would probably work better if you let Spark do all the work of distributing the loaded data and deciding how to process it. Here you are filtering before the data is loaded. I haven't worked with a JDBC source before, but I would assume that the query is passed to the database through JDBC and executed there before the data is loaded into Spark.
So the solution is to hand the heavy work of filtering the data to Spark, by setting the dbtable property to the actual table name and passing the query to Spark:
// Filter condition handed to Spark instead of being embedded in the dbtable subquery
val filterExpr = s"id IN (${idList.mkString("'", "','", "'")})"
private def queryToDf(tableName: String, filterExpr: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", tableName)
    .option("user", appConfig.dbUsername).option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
    .where(filterExpr) // where() takes a condition string; a full SELECT statement is not valid here
}
I have not tested this though, so there may be some mistakes (the filter string might need small adjustments to be valid for where(), but it shouldn't be that different).
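Another way to keep the ID batches but let Spark distribute them is the predicates overload of DataFrameReader.jdbc, where each predicate string becomes one partition of the resulting DataFrame. The following is only a minimal sketch of building those predicates: the names IdBatchPredicates, build and batchSize are hypothetical, and it assumes the IDs are strings (as the quoting in the question suggests).

```scala
// Hypothetical helper (not part of Spark): turns a list of IDs into one SQL
// predicate per batch, so that each batch can become one Spark partition.
object IdBatchPredicates {
  def build(idColumn: String, ids: Seq[String], batchSize: Int = 1000): Array[String] =
    ids.grouped(batchSize).map { batch =>
      // Single quotes inside an ID are doubled so the SQL literal stays valid.
      val literals = batch.map(id => "'" + id.replace("'", "''") + "'")
      s"$idColumn IN (${literals.mkString(",")})"
    }.toArray
}

// Assumed usage with the names from the question (not run here):
//   val predicates = IdBatchPredicates.build("id", idList)
//   val df = spark.read.jdbc(appConfig.dbURL, tableName, predicates, props)
```

With this shape the driver no longer loops over batches; Spark issues one query per predicate, in parallel up to the number of available task slots, so the database has to be able to handle that many concurrent connections.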

Related

How to access the last two character of each cell of Spark DataFrame to do some calculations on its value using Scala

I am using Spark with Scala. After loading the data into a Spark DataFrame, I want to access each cell of the DataFrame to do some calculations. The code is as follows:
val spark = SparkSession
.builder
.master("local[4]")
.config("spark.executor.memory", "8g")
.config("spark.executor.cores", 4)
.config("spark.task.cpus",1)
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val df = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
.option("dbtable", "table")
.option("user", "orcl")
.option("password", "*******")
.load()
val row_cnt = df.count()
for(i<-0 to row_cnt.toInt){
val t = df.select("CST_NUM").take(i)
println("start_date:",t)
}
The Output is like this:
(start_date:,[Lorg.apache.spark.sql.Row;@38ae69f7)
I can access the value of each cell using forEach, but I cannot return it to another function.
Could you please guide me on how to access the value of each cell of a Spark DataFrame?
Any help is really appreciated.

You need to learn how to work with Spark efficiently; right now your code is far from optimal. I recommend reading the first chapters of the book Learning Spark, 2nd edition, to understand how to work with Spark. It's freely downloadable from Databricks' site.
In your case, you just need to do a single .select instead of calling it in a loop, and then you can return the data to the caller. How to return it depends on the amount of data: usually you should return the DataFrame itself (maybe only a subset of columns) and let callers transform the data as they need. That way you can take advantage of distributed computation.
If you have a small dataset (hundreds or thousands of rows), you can materialize it as a Scala/Java collection and return it to the caller. For example, this could be done as follows:
val t = df.select("CST_NUM")
val coll = t.collect().map(_.getInt(0))
In this case, we select only one column from the DataFrame (CST_NUM), use .collect to bring all rows to the driver node, and then extract the column value from each Row object. I've used .getInt for that, but you can use whichever getter from the Row API matches the type of your column: .getLong, .getString, etc.
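For the goal in the question title (the last two characters of each cell), the collect-then-extract approach could look roughly like the sketch below. CellSuffix and lastTwo are hypothetical names, not part of the Spark API, and the helper works on the string form of the cell so it does not depend on the column type.

```scala
// Hypothetical helper, not part of the Spark API.
object CellSuffix {
  // Last two characters of a cell's string form; None for null cells.
  // Values shorter than two characters are returned unchanged.
  def lastTwo(cell: Any): Option[String] =
    Option(cell).map(_.toString.takeRight(2))
}

// Assumed usage with the DataFrame from the question (not run here):
//   val suffixes: Array[Option[String]] =
//     df.select("CST_NUM").collect().map(row => CellSuffix.lastTwo(row.get(0)))
```

Because the result is an ordinary array, it can be returned from a function or passed to another one, which is what the println loop in the question could not do. This is only reasonable for small results, since collect brings every row to the driver.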

spark OOM on simple read and write

I'm using Spark to read from a Postgres table and dump it to Google Cloud Storage as JSON. The table is quite big, many hundreds of GB. The code is relatively straightforward (please see below), but it fails with OOM. It seems like Spark is trying to read the entire table into memory before starting to write it. Is this true? How can I change the behavior so that it reads and writes in a streaming fashion?
Thanks.
SparkSession sparkSession = SparkSession
.builder()
.appName("01-Getting-Started")
.getOrCreate();
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", properties);
dataset.write().mode(SaveMode.Append).json("gs://some/path");

There are a couple of overloaded DataFrameReader.jdbc() methods that are useful for splitting up JDBC data on input.
jdbc(String url, String table, String[] predicates, java.util.Properties connectionProperties) - the resulting DataFrame will have one partition for each predicate given, e.g.
String[] preds = {"state='Alabama'", "state='Alaska'", "state='Arkansas'", …};
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", preds, properties);
jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions, java.util.Properties connectionProperties) - Spark will split the read into numPartitions partitions using ranges of the numeric column columnName. lowerBound and upperBound only determine the partition stride; they do not filter rows, so values outside that range still end up in the first or last partition, e.g.:
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", "<idColumn>", 1, 1000, 100, properties);
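To make the range-based variant more concrete, here is a simplified sketch of how one WHERE clause per partition can be derived from the bounds. It is an approximation of Spark's behaviour as far as I understand it, not Spark's actual code, and RangePredicates and build are hypothetical names. It shows why the bounds only set the stride: the first and last partitions are open-ended.

```scala
// Hypothetical illustration, not Spark's implementation.
object RangePredicates {
  def build(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "sketch only covers the multi-partition case")
    require(lower < upper, "lower bound must be below upper bound")
    // Width of each range; computed per bound to avoid overflow on large values.
    val stride = upper / numPartitions - lower / numPartitions
    (0 until numPartitions).map { i =>
      val start = lower + i * stride
      val end = start + stride
      if (i == 0) s"$column < $end OR $column IS NULL"      // also catches values below lower
      else if (i == numPartitions - 1) s"$column >= $start" // also catches values above upper
      else s"$column >= $start AND $column < $end"
    }
  }
}

// RangePredicates.build("id", 1, 1000, 4) yields:
//   id < 251 OR id IS NULL
//   id >= 251 AND id < 501
//   id >= 501 AND id < 751
//   id >= 751
```

If the real IDs are spread unevenly between the bounds, some partitions will be much larger than others, so the bounds are best taken from the actual minimum and maximum of the column.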

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala
// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type: SOME_COL BIGINT

Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
.as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
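For tables with many columns, the customSchema string can be assembled from a list of overrides instead of being typed by hand. This is a minimal sketch; CustomSchema and build are hypothetical names, and the column names and types in the usage comment are placeholders rather than the real table definition.

```scala
// Hypothetical helper: builds the value for the JDBC "customSchema" setting
// from (column name, Spark SQL type) pairs.
object CustomSchema {
  def build(columns: Seq[(String, String)]): String =
    columns.map { case (name, sqlType) => s"$name $sqlType" }.mkString(", ")
}

// Assumed usage (not run here); only the listed columns are overridden,
// the remaining columns keep the default type mapping:
//   props.put("customSchema", CustomSchema.build(Seq("SOME_COL" -> "LONG")))
```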

Create Hive Table from Spark using API, rather than SQL?

I want to create a hive table with partitions.
The schema for the table is:
val schema = StructType(Seq(StructField("name", StringType, true), StructField("age", IntegerType, true)))
I can do this with Spark-SQL using:
val query = "CREATE TABLE some_new_table (name string, age integer) USING org.apache.spark.sql.parquet OPTIONS (path '<some_path>') PARTITIONED BY (age)"
spark.sql(query)
When I try to do the same with the Spark API (using Scala), the table is filled with data. I only want to create an empty table and define its partitions. This is what I am doing; what am I doing wrong?
val df = spark.createDataFrame(sc.emptyRDD[Row], schema)
val options = Map("path" -> "<some_path>", "partitionBy" -> "age")
df.sqlContext().createExternalTable("some_new_table", "org.apache.spark.sql.parquet", schema, options);
I am using Spark-2.1.1.

If you can skip partitioning, you can try saveAsTable:
spark.createDataFrame(sc.emptyRDD[Row], schema)
.write
.format("parquet")
//.partitionBy("age")
.saveAsTable("some_new_table")
Spark partitioning and Hive partitioning are not compatible, so if you want access from Hive you have to use SQL: https://issues.apache.org/jira/browse/SPARK-14927

Spark JDBC returning dataframe only with column names

I am trying to connect to a HiveTable using spark JDBC, with the following code:
val df = spark.read.format("jdbc").
option("driver", "org.apache.hive.jdbc.HiveDriver").
option("user","hive").
option("password", "").
option("url", jdbcUrl).
option("dbTable", tableName).load()
df.show()
but all I get back is an empty DataFrame with modified column names, like this:
--------------|---------------|
tableName.uuid|tableName.name |
--------------|---------------|
I've tried to read the DataFrame in a lot of ways, but the result is always the same.
I'm using JDBC Hive Driver, and this HiveTable is located in an EMR cluster. The code also runs in the same cluster.
Any help will be really appreciated.
Thank you all.

Please set fetchsize in the options; it should work.
Dataset<Row> referenceData
= sparkSession.read()
.option("fetchsize", "100")
.format("jdbc")
.option("url", jdbc.getJdbcURL())
.option("user", "")
.option("password", "")
.option("dbtable", hiveTableName).load();