Spark JDBC returning dataframe only with column names - scala

I am trying to connect to a Hive table using Spark JDBC, with the following code:
val df = spark.read.format("jdbc").
  option("driver", "org.apache.hive.jdbc.HiveDriver").
  option("user", "hive").
  option("password", "").
  option("url", jdbcUrl).
  option("dbTable", tableName).load()
df.show()
but all I get back is an empty DataFrame with modified column names, like this:
+--------------+---------------+
|tableName.uuid|tableName.name |
+--------------+---------------+
I've tried to read the DataFrame in a lot of ways, but the result is always the same.
I'm using the Hive JDBC driver, and the Hive table is located in an EMR cluster. The code also runs in the same cluster.
Any help will be really appreciated.
Thank you all.

Please set fetchsize in the options; it should work.
Dataset<Row> referenceData
    = sparkSession.read()
        .option("fetchsize", "100")
        .format("jdbc")
        .option("url", jdbc.getJdbcURL())
        .option("user", "")
        .option("password", "")
        .option("dbtable", hiveTableName).load();

Related

Filter rows of snowflake table while reading in pyspark dataframe

I have a huge Snowflake table. I want to do some transformations on it in PySpark. My Snowflake table has a column called 'snapshot'. I only want to read the current snapshot data into a PySpark DataFrame and run the transformations on that filtered data.
So, is there a way to filter the rows while reading the Snowflake table into a Spark DataFrame (I don't want to read the entire table into memory because that is inefficient), or do I need to read the entire Snowflake table into a Spark DataFrame and then apply a filter to get the latest snapshot, something like below?
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
snowflake_database = "********"
snowflake_schema = "********"
source_table_name = "********"

snowflake_options = {
    "sfUrl": "********",
    "sfUser": "********",
    "sfPassword": "********",
    "sfDatabase": snowflake_database,
    "sfSchema": snowflake_schema,
    "sfWarehouse": "COMPUTE_WH"
}

df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name) \
    .load()

df = df.where(df.snapshot == current_timestamp()).collect()
There are forms of filters (the filter or where functionality of a Spark DataFrame) that Spark doesn't pass on to the Spark Snowflake connector. That means that, in some situations, you may get more records than you expect.
The safest way would be to use a SQL query directly:
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", "SELECT X, Y, Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()") \
    .load()
Of course, if you want to use the filter/where functionality of the Spark DataFrame, check the Query History in the Snowflake UI to see whether the generated query has the right filter applied.

How to access the last two character of each cell of Spark DataFrame to do some calculations on its value using Scala

I am using Spark with Scala. After loading the data into a Spark DataFrame, I want to access each cell of the DataFrame to do some calculations. The code is as follows:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
  .option("dbtable", "table")
  .option("user", "orcl")
  .option("password", "*******")
  .load()

val row_cnt = df.count()
for (i <- 0 to row_cnt.toInt) {
  val t = df.select("CST_NUM").take(i)
  println("start_date:", t)
}
The output is like this:
(start_date:,[Lorg.apache.spark.sql.Row;#38ae69f7)
I can access the value of each cell using foreach, but I cannot return it to another function.
Would you please guide me on how to access the value of each cell of a Spark DataFrame?
Any help is really appreciated.
You need to learn how to work with Spark efficiently - right now your code isn't very optimal. I recommend reading the first chapters of the Learning Spark, 2nd edition book to understand how to work with Spark - it's freely downloadable from Databricks' site.
Regarding your case, you need to change your code to do a single .select instead of doing it in a loop, and then you can return your data to the caller. How exactly depends on the amount of data you need to return - usually you should return the DataFrame itself (maybe only a subset of columns), and callers will transform the data as they need; this way you keep the advantage of distributed computation.
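For instance, a minimal sketch of that first approach might look like this (the function name is illustrative; CST_NUM is the column from your code):
import org.apache.spark.sql.DataFrame

// Return the DataFrame itself (still lazy and distributed); the caller applies its own transformations
def selectCustomerNumbers(df: DataFrame): DataFrame =
  df.select("CST_NUM")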
If you have a small dataset (hundreds/thousands of rows), then you can materialize the dataset as a Scala/Java collection and return it to the caller. For example, this could be done as follows:
val t = df.select("CST_NUM")
val coll = t.collect().map(_.getInt(0))
In this case, we select only one column from our DataFrame (CST_NUM), use .collect to bring all rows to the driver node, and then extract the column value from each Row object (I've used .getInt for that, but you can use another function from the Row API that matches the type of your column - .getLong, .getString, etc.).

Writing SQL table directly to file in Scala

Team,
I'm working on Azure Databricks, and I'm able to write a DataFrame to a CSV file using the following options:
df2018JanAgg
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("dbfs:/FileStore/output/df2018janAgg.csv")
but I'm looking for a way to write data directly from a SQL table to CSV in Scala.
Can someone please let me know if such an option exists?
Thanks,
Srini
Yes, data can be loaded directly from a SQL table into a DataFrame and vice versa. Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
// JDBC -> DataFrame -> CSV
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("dbfs:/FileStore/output/df2018janAgg.csv")

// DataFrame -> JDBC
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()

Spark Efficiently Extract Large Query Result From Relational DB

(Spark beginner here) I wrote a Spark job to extract data from an Oracle database and write it to Parquet files. It works, but I am not satisfied with the batching solution I used. Is there a more efficient way?
I need to extract data from queries, not simply tables. I used a straightforward solution: get 1000 IDs (the maximum that fits in an IN list in Oracle), build the query as a string, pass it to Spark, and extract the data to Parquet. This works, but I wonder whether there are better, more efficient/idiomatic ways to do this. For instance, I am doing all the batching myself, whereas Spark was built to split and distribute work.
My code works on a small data set. I will have to scale it up by a factor of 100, and I would like to know the most idiomatic way of doing this.
val query =
  s"""
     |(SELECT tbl.*
     |FROM $tableName tbl
     |WHERE tbl.id IN (${idList.mkString("'", "','", "'")})
     |) $tableName
  """.stripMargin
private def queryToDf(query: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", query)
    .option("user", appConfig.dbUsername).option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
}
Using Spark 2.4.0, Scala 2.12, Oracle DB.
This would probably work better if you just let Spark do all the work of distributing the loaded data and deciding how to process it. Here you are applying a filter before loading the data. I haven't worked with a JDBC source before, but I would assume that the query is passed to the JDBC source before the data is loaded into Spark.
So the solution is to pass the heavy work of filtering the data to Spark, by setting the dbtable property to the actual table name and passing the query to Spark:
val query =
  s"""
     |(SELECT tbl.*
     |FROM $tableName tbl
     |WHERE tbl.id IN (${idList.mkString("'", "','", "'")})
     |) $tableName
  """.stripMargin

private def queryToDf(tableName: String, query: String, props: Properties)(implicit spark: SparkSession, appConfig: AppConfig): sql.DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", tableName)
    .option("user", appConfig.dbUsername).option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
    .selectExpr(query)
}
I have not tested this though, so there may be some mistakes (the query may not be valid for selectExpr(), but it shouldn't be that different).
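A variation on the same idea, as an untested sketch: load the table by name and express the ID filter with the DataFrame API instead of selectExpr, since Spark's JDBC source can usually push simple filters such as IN down to the database (AppConfig, appConfig and idList are the names from the question; the Seq[String] type for idList is an assumption):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

private def idsToDf(tableName: String, idList: Seq[String])(implicit spark: SparkSession, appConfig: AppConfig): DataFrame = {
  spark.read.format("jdbc")
    .option("url", appConfig.dbURL)
    .option("dbtable", tableName) // plain table name, no hand-built subquery
    .option("user", appConfig.dbUsername)
    .option("password", appConfig.dbPassword)
    .option("fetchsize", appConfig.fetchsize.toString)
    .option("driver", appConfig.jdbcDriver)
    .load()
    .where(col("id").isin(idList: _*)) // a simple IN filter Spark can push down to Oracle
}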

Scala Spark - illegal start of definition

This is probably a stupid newbie mistake, but I'm getting an error running what I thought was basic Scala code (in a Spark notebook, via a Jupyter notebook):
val sampleDF = spark.read.parquet("/data/my_data.parquet")
sampleDF
  .limit(5)
  .write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sql.example.com;database=my_database")
  .option("dbtable", "my_schema.test_table")
  .option("user", "foo")
  .option("password", "bar")
  .save()
The error:
<console>:1: error: illegal start of definition
.limit(5)
^
What am I doing wrong?
I don't know anything about Jupyter internals, but I suspect that it's an artifact of the Jupyter-REPL interaction: sampleDF is for some reason considered a complete statement on its own. Try
(sampleDF
  .limit(5)
  .write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sql.example.com;database=my_database")
  .option("dbtable", "my_schema.test_table")
  .option("user", "foo")
  .option("password", "bar")
  .save())
Jupyter tries to interpret each line as a complete command, so sampleDF is first interpreted as a valid expression on its own, and then it moves to the next line, producing the error. Move the dots to the previous line to let the interpreter know that "there's more stuff coming":
sampleDF.
  limit(5).
  write.
  format("jdbc").
  option("url", "jdbc:sqlserver://sql.example.com;database=my_database").
  option("dbtable", "my_schema.test_table").
  option("user", "foo").
  option("password", "bar").
  save()