Snowflake Table from Databricks using Python/Scala

Can anyone help me out, please? I want to create a table in Snowflake and load data into it from Databricks using Python/Scala.
Below is my code snippet, and it gives me the error below. Could you please let me know how I can first create the table (if it does not exist) from a Databricks notebook using Python or Scala, and then load the data?
If so, what functions do I need to use? Thanks!
df1.write.format("snowflake").options(sfOptions).option("dbtable", "TEST_TABLE")
.mode(SaveMode.Append)

If you use Scala code then your df write should look like this:
df.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable", "t2")
.mode(SaveMode.Append)
.save()
If you use Python code then your df write should look like this:
(df.write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("dbtable", "t2")
    .mode("append")
    .save())
where:
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
Note the difference in how the options are passed: .options(sfOptions) in Scala vs. .options(**sfOptions) in Python.
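For completeness, here is a minimal end-to-end Scala sketch. The values in sfOptions are placeholders (not real connection settings), and df1 is assumed to be the DataFrame from the question. Per the Snowflake Spark connector docs, the connector should create the target table from the DataFrame schema if it does not already exist, so a separate CREATE TABLE step is usually not needed:
import org.apache.spark.sql.SaveMode

val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

// Placeholder connection options; replace with your own values.
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

df1.write
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "TEST_TABLE")
  .mode(SaveMode.Append)  // the connector creates TEST_TABLE if it does not exist, then appends
  .save()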

Related

How to create a temporary table in snowflake based on pyspark dataframe

I can read a Snowflake table into a PySpark DataFrame using sqlContext:
sql = f"""select * from table1"""
df = (sqlContext.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("query", sql)
    .load())
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?
Just save as usual, using the Snowflake format:
snowflake_options = {
...
'sfDatabase': 'dbabc',
'dbtable': 'tablexyz',
...
}
(df
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**snowflake_options)
.save()
)
I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table; but persisting it is something I have had a great deal of difficulty finding out how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it successfully returns a Java object indicating the temp table was created; but once you try to run any further statements against it, you'll get nasty errors that mean it no longer exists. Somehow we mere mortals would need to find a way to access and persist the Snowflake session through the JVM API. That being said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create "temporary" permanent/transient tables, designing Snowflake queries that insert directly into those, and then cleaning them up (DROP them) when you are done; a sketch of this second approach follows below.
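A rough Scala sketch of that second pattern, assuming the connector's Utils.runQuery(options, query) helper (the Scala-side equivalent of the JVM call above) and placeholder table/column names:
import net.snowflake.spark.snowflake.Utils

// Placeholder connection options; replace with your own values.
val sfOptions = Map(
  "sfURL"      -> "<account>.snowflakecomputing.com",
  "sfUser"     -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema"   -> "<schema>"
)

// Use a transient table as a stand-in for a temp table, since it survives across connector sessions.
Utils.runQuery(sfOptions, "CREATE TRANSIENT TABLE IF NOT EXISTS tmp_table (id INT, value TEXT)")

// Run the Snowflake-side inserts/transformations you need against it (the source table name is hypothetical)...
Utils.runQuery(sfOptions, "INSERT INTO tmp_table SELECT id, value FROM some_source_table")

// ...and clean it up when you are done.
Utils.runQuery(sfOptions, "DROP TABLE IF EXISTS tmp_table")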

How to get Create Statement of Table in some other database in Spark using JDBC

Problem statement:
I have an Impala database where multiple tables are present.
I am creating a Spark JDBC connection to Impala and loading these tables into a Spark DataFrame for my validations, like this, which works fine:
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","tablename")
.load()
Now the next step, and my actual problem, is that I need to find the CREATE statement that was used to create the tables in Impala itself.
Since I cannot run a command like the one below (it gives an error), is there any way I can fetch the SHOW CREATE statement for tables present in Impala?
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","show create table tablename")
.load()
Perhaps you can use Spark SQL "natively" to execute something like
val createstmt = spark.sql("show create table <tablename>")
The resulting dataframe will have a single column (type string) which contains a complete CREATE TABLE statement.
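For example, a small sketch (assuming the table is visible in the catalog/metastore that Spark is configured with), pulling the statement out as a plain string:
val createstmt = spark.sql("show create table <tablename>")
val ddl: String = createstmt.first().getString(0)  // the complete CREATE TABLE text
println(ddl)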
But if you still choose to go the JDBC route, there is always the option of using the good old JDBC interface. Scala understands everything written in Java, after all...
import java.sql.{Connection, DriverManager, ResultSet, Statement}

val conn: Connection = DriverManager.getConnection("url")
val stmt: Statement = conn.createStatement()
val rs: ResultSet = stmt.executeQuery("show create table <tablename>")
...etc...

Cassandra: no viable alternative at input

I am a newbie to the Cassandra database, and I am trying to save my Spark DataFrame to Cassandra.
While creating the table I am getting the exception "SyntaxException: no viable alternative at input".
val sparkContext = spark.sparkContext
//Set the Log file level
sparkContext.setLogLevel("WARN")
//Connect Spark to Cassandra and execute CQL statements from Spark applications
val connector = CassandraConnector(sparkContext.getConf)
connector.withSessionDo(session =>
{
session.execute("DROP KEYSPACE IF EXISTS my_keyspace")
session.execute("CREATE KEYSPACE my_keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1}")
session.execute("USE my_keyspace")
session.execute("CREATE TABLE mytable('Inbound_Order_No' varchar,'Material' varchar,'Container_net_weight' double,'Shipping_Line' varchar,'Container_No' varchar,'Month' int,'Day' int,'Year' int,'Job_Run_Date' timestamp, PRIMARY KEY(Inbound_Order_No,Container_No))")
df.write
.format("org.apache.spark.sql.cassandra")
.mode("overwrite")
.option("confirm.truncate", "true")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.option("keyspace", "my_keyspace")
.option("table", "mytable")
.save()
}
)
I am unable to trace the error, hence I am seeking help.
Please note: I am doing this work on a Windows system and everything is set up locally. I have also shared my Spark code; if you find any other error, please do share it with me.
session.execute("CREATE TABLE mytable(\"Inbound_Order_No\" varchar,\"Material\" varchar,\"Container_net_weight\" double,\"Shipping_Line\" varchar,\"Container_No\" varchar,\"Month\" int,\"Day\" int,\"Year\" int,\"Job_Run_Date\" timestamp, PRIMARY KEY(\"Inbound_Order_No\",\"Container_No\"))")
Double quotes are used for case-sensitive column names, not single quotes.
session.execute("CREATE TABLE mytable(Inbound_Order_No varchar,Material varchar,Container_net_weight double,Shipping_Line varchar,Container_No varchar,Month int,Day int,Year int,Job_Run_Date timestamp, PRIMARY KEY(Inbound_Order_No,Container_No))")
If you want column names in lower case, use the query above; Cassandra creates column names in lower case by default (if they are not enclosed in double quotes).
As requested in the comments, the command to run in cqlsh:
CREATE TABLE mytable("Inbound_Order_No" varchar,"Material" varchar,"Container_net_weight" double,"Shipping_Line" varchar,"Container_No" varchar,"Month" int,"Day" int,"Year" int,"Job_Run_Date" timestamp, PRIMARY KEY("Inbound_Order_No","Container_No"))

To Compute statistics of Hive table in Spark

I have created a DataFrame to load CSV files and created a temp table to get the column statistics.
However, when I try to run the ANALYZE command I am facing the below error.
The same Analyze command ran in Hive successfully.
Spark Version : 1.6.3
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("/bn_data/bopis/*.csv")

// To get the statistics of columns
df.registerTempTable("bopis")
val stat = sqlContext.sql("analyze table bopis compute statistics for columns").show()
Error:
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier analyze found
analyze table bopis compute statistics for columns
^
Please let us know how to achieve the column statistics using Spark. Thanks!
If you use the FOR COLUMNS option, you have to pass a list of column names, see https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
In any case, even if you do, you are going to get an error, because you can't run COMPUTE STATISTICS on a temp table (you will get Table or view 'bopis' not found in database 'default').
You'll have to create a full-blown Hive table first, either via df.write.saveAsTable("bopis_hive") or via sqlContext.sql("CREATE TABLE bopis_hive AS SELECT * FROM bopis"), as sketched below.
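Putting those two steps together, a minimal Scala sketch (col1 and col2 are hypothetical placeholders for the columns you actually want statistics on, and this assumes a Spark/Databricks version that supports ANALYZE TABLE ... FOR COLUMNS as in the linked docs):
// Persist the data as a real Hive table first
df.write.saveAsTable("bopis_hive")

// Then compute column-level statistics, naming the columns explicitly
sqlContext.sql("ANALYZE TABLE bopis_hive COMPUTE STATISTICS FOR COLUMNS col1, col2")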

Spark unable to read a database table to a dataframe with jdbc using a query as a table name

I'm trying to read a Postgres/PostGIS table into a Spark 2.0 DataFrame like this:
val jdbcUrl = s"jdbc:postgresql://${host}:${port}/${dbName}"
val connectionProperties = new Properties()
connectionProperties.put("user", s"${user}")
connectionProperties.put("password", s"${password}")
connectionProperties.setProperty("Driver", "org.postgresql.Driver")
def readTable ( table: String ): DataFrame = {
  spark.read.jdbc(jdbcUrl, s"(select st_astext(geom) as geom from ${table}) as t;", connectionProperties)
}
readTable("myschema.mytable")
readTable("myschema.mytable")
I get this error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "WHERE"
I'm pretty sure this is caused by a where clause being added to the query as described in this question.
However, according to the docs, this method should work: https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
I need to use a query as a table name because I need to get the PostGIS geometry as a WKT string. My question is: has anyone found a way to read a table with a query as a table name like this? Or does anyone see anything wrong with my code? Or perhaps another way? Thanks.
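For reference, a sketch of the pushdown pattern from the linked page: Spark wraps the dbtable string in its own SELECT ... WHERE query, so the trailing semicolon inside the aliased subquery is the likely cause of the syntax error (an assumption worth checking against your case):
def readTable(table: String): DataFrame = {
  // No trailing semicolon: Spark embeds this string inside its own SELECT ... WHERE query.
  val pushdownQuery = s"(select st_astext(geom) as geom from ${table}) as t"
  spark.read.jdbc(jdbcUrl, pushdownQuery, connectionProperties)
}

readTable("myschema.mytable")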