Scala Spark - illegal start of definition

This is probably a stupid newbie mistake, but I'm getting an error running what I thought was basic Scala code (in a Spark notebook, via Jupyter notebook):
val sampleDF = spark.read.parquet("/data/my_data.parquet")
sampleDF
  .limit(5)
  .write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sql.example.com;database=my_database")
  .option("dbtable", "my_schema.test_table")
  .option("user", "foo")
  .option("password", "bar")
  .save()
The error:
<console>:1: error: illegal start of definition
.limit(5)
^
What am I doing wrong?

I don't know much about Jupyter internals, but I suspect this is an artifact of the Jupyter/REPL interaction: sampleDF is, for some reason, considered a complete statement on its own. Try:
(sampleDF
  .limit(5)
  .write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sql.example.com;database=my_database")
  .option("dbtable", "my_schema.test_table")
  .option("user", "foo")
  .option("password", "bar")
  .save())

Jupyter tries to interpret each line as a complete command, so sampleDF on its own is first interpreted as a valid expression, and the next line then produces the error. Move the dots to the end of the previous line to let the interpreter know that "there's more stuff coming":
sampleDF.
  limit(5).
  write.
  format("jdbc").
  option("url", "jdbc:sqlserver://sql.example.com;database=my_database").
  option("dbtable", "my_schema.test_table").
  option("user", "foo").
  option("password", "bar").
  save()
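For the same reason, another variant that works line by line is to collect the connection settings in a Map first: the unclosed parenthesis tells the interpreter that more input is coming, just like the leading parenthesis above. A minimal sketch reusing the settings from the question:

val limited = sampleDF.limit(5)
val jdbcOptions = Map(
  "url" -> "jdbc:sqlserver://sql.example.com;database=my_database",
  "dbtable" -> "my_schema.test_table",
  "user" -> "foo",
  "password" -> "bar")
// options(Map[String, String]) sets all the JDBC options at once
limited.write.format("jdbc").options(jdbcOptions).save()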

Related

How to access the last two characters of each cell of a Spark DataFrame to do some calculations on its value using Scala

I am using Spark with Scala. After loading the data into a Spark DataFrame, I want to access each cell of the DataFrame to do some calculations. The code is the following:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
  .option("dbtable", "table")
  .option("user", "orcl")
  .option("password", "*******")
  .load()

val row_cnt = df.count()

for (i <- 0 to row_cnt.toInt) {
  val t = df.select("CST_NUM").take(i)
  println("start_date:", t)
}
The output is like this:
(start_date:,[Lorg.apache.spark.sql.Row;@38ae69f7)
I can access the value of each cell using foreach, but I cannot return it to another function.
Would you please guide me on how to access the value of each cell of a Spark DataFrame?
Any help is really appreciated.
You need to learn how to work with Spark efficiently - right now your code isn't very optimal. I recommend reading the first chapters of the Learning Spark, 2nd edition book to understand how to work with Spark - it's freely downloadable from Databricks' site.
Regarding your case, you need to change your code to do a single .select instead of doing it in a loop, and then you can return your data to the caller. How depends on the amount of data that you need to return: usually you should return the DataFrame itself (maybe only a subset of columns) and let callers transform the data as they need, so that you can take advantage of distributed computation.
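For example, here is a minimal sketch of a function that returns the (still distributed) DataFrame with only the column the caller needs - the function name is just illustrative, and the column name is taken from your question:

import org.apache.spark.sql.DataFrame

// Returns a lazy, still-distributed DataFrame; the caller decides
// how to transform or materialize it.
def customerNumbers(df: DataFrame): DataFrame =
  df.select("CST_NUM")

val cstNums = customerNumbers(df)  // no data is pulled to the driver yet
cstNums.show(5)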
If you have a small dataset (hundreds or thousands of rows), then you can materialize the dataset as a Scala/Java collection and return it to the caller. For example, this could be done as follows:
val t = df.select("CST_NUM")
val coll = t.collect().map(_.getInt(0))
In this case, we're selecting only one column from our DataFrame (CST_NUM), then using .collect to bring all rows to the driver node, and then extracting the column value from the Row object. I've used .getInt for that, but you can use something else from the Row API - pick the function that matches the type of your column (.getLong, .getString, etc.).
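If you prefer to skip the Row API entirely, the same thing can be written with a typed Dataset; a sketch that assumes CST_NUM is numeric (the explicit cast avoids encoder mismatches if the JDBC source maps it to a decimal type):

import spark.implicits._

// Collect the single column as a typed Array[Long] instead of Array[Row]
val coll: Array[Long] = df
  .select($"CST_NUM".cast("long"))
  .as[Long]
  .collect()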

Writing SQL table directly to file in Scala

Team,
I'm working on Azure Databricks, and I'm able to write a DataFrame to a CSV file using the following options:
df2018JanAgg
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("dbfs:/FileStore/output/df2018janAgg.csv")
but I'm seeking an option to write data directly from a SQL table to CSV in Scala.
Can someone please let me know if such an option exists?
Thanks,
Srini
Yes, data can be loaded directly from a SQL table into a DataFrame and vice versa. Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
// JDBC -> DataFrame -> CSV
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("dbfs:/FileStore/output/df2018janAgg.csv")

// DataFrame -> JDBC
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
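Note that on Spark 2.0 and later the built-in csv source can be used instead of the external com.databricks.spark.csv package; a sketch of the same write from the question using it:

df2018JanAgg
  .write.format("csv")  // built-in CSV source since Spark 2.0
  .option("header", "true")
  .save("dbfs:/FileStore/output/df2018janAgg.csv")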

Pyspark saving is not working when called from inside a foreach

I am building a pipeline that receives messages from Azure EventHub and saves them into Databricks Delta tables.
All my tests with static data went well; see the code below:
body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n'
tableLocation = "/delta/tables/myTableName"
spark = SparkSession.builder.appName("CSV converter").getOrCreate()
csvData = spark.sparkContext.parallelize(body.split('\n'))
df = spark.read \
    .option("header", True) \
    .option("delimiter", "|") \
    .option("quote", "\"") \
    .option("nullValue", "\\N") \
    .option("inferShema", "true") \
    .option("mergeSchema", "true") \
    .csv(csvData)
df.write.format("delta").mode("append").save(tableLocation)
However, in my case each EventHub message is a CSV string, and they may come from many sources, so each message must be processed separately, because each message may end up saved to a different Delta table.
When I try to execute this same code inside a foreach statement, it doesn't work. There are no errors shown in the logs, and I can't find any table saved.
So maybe I am doing something wrong when calling the foreach. See the code below:
def SaveData(row):
    # ... the same code as above ...

dfEventHubCSV.rdd.foreach(SaveData)
I tried to do this in a streaming context, but I sadly ran into the same problem.
What is it about the foreach that makes it behave differently?
Below is the full code I am running:
import pyspark.sql.types as t
from pyspark.sql import SQLContext

# row contains the fields Body and SdIds
#   Body: CSV string
#   SdIds: a string ID
def SaveData(row):
    # Each row's data is going to be added to a different table
    rowInfo = GetDestinationTableData(row['SdIds']).collect()
    table = rowInfo[0][4]
    schema = rowInfo[0][3]
    database = rowInfo[0][2]
    body = row['Body']

    tableLocation = "/delta/" + database + '/' + schema + '/' + table
    checkpointLocation = "/delta/" + database + '/' + schema + "/_checkpoints/" + table

    spark = SparkSession.builder.appName("CSV").getOrCreate()
    csvData = spark.sparkContext.parallelize(body.split('\n'))

    df = spark.read \
        .option("header", True) \
        .option("delimiter", "|") \
        .option("quote", "\"") \
        .option("nullValue", "\\N") \
        .option("inferShema", "true") \
        .option("mergeSchema", "true") \
        .csv(csvData)

    df.write.format("delta").mode("append").save(tableLocation)

dfEventHubCSV.rdd.foreach(SaveData)
Well, in the end, as always, it was something very simple, but I didn't see this anywhere.
Basically, when you perform a foreach, the dataframe you want to save is built inside the loop. The worker, unlike the driver, won't automatically set up the "/dbfs/" path for the save, so if you don't manually add the "/dbfs/" prefix, it will save the data locally on the worker and you will never find the saved data.
That is why my loops weren't working.

Create Index thru SPARK for JDBC

I am trying to create an index on a Postgres table through Spark, and the code is as below:
val df3 = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://URL")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "(ALTER TABLE abc.test1 ADD PRIMARY KEY (test))as t")
  .option("driver", "org.postgresql.Driver")
  .option("lowerBound", 1L)
  .option("upperBound", 10000000L)
  .option("numPartitions", 100)
  .option("fetchSize", "1000000")
  .load()
The error is
Exception in thread "main" org.postgresql.util.PSQLException: ERROR: syntax error at or near "TABLE"
Just wondering whether we can do that, or whether the above DataFrame is wrong. Appreciate your help.

Spark JDBC returning dataframe only with column names

I am trying to connect to a Hive table using Spark JDBC, with the following code:
val df = spark.read.format("jdbc").
  option("driver", "org.apache.hive.jdbc.HiveDriver").
  option("user", "hive").
  option("password", "").
  option("url", jdbcUrl).
  option("dbTable", tableName).load()

df.show()
but what I get back is only an empty DataFrame with modified column names, like this:
--------------|---------------|
tableName.uuid|tableName.name |
--------------|---------------|
I've tried to read the DataFrame in a lot of ways, but it always gives the same result.
I'm using the JDBC Hive driver, and this Hive table is located in an EMR cluster. The code also runs on the same cluster.
Any help will be really appreciated.
Thank you all.
Please set fetchsize in the options; it should work.
Dataset<Row> referenceData = sparkSession.read()
    .option("fetchsize", "100")
    .format("jdbc")
    .option("url", jdbc.getJdbcURL())
    .option("user", "")
    .option("password", "")
    .option("dbtable", hiveTableName)
    .load();