PostgreSQL count higher than Spark SQL

When I write a DataFrame to PostgreSQL using Spark (Scala), I have noticed that the count on the PostgreSQL side is always higher than the count Spark reports. I expected them to be the same.
Is there any known issue with Spark writes to PostgreSQL?
Writing to PostgreSQL
val connection="jdbc:postgresql://localhost:5449/adb?user=aschema&password=abc"
val prop = new java.util.Properties
prop.setProperty("driver", "org.postgresql.Driver")
df.write.mode("Overwrite").jdbc(url= connection, table = "adb.aschema.TABLE", connectionProperties = prop)
The command I use to read the source and take a count, which gives the accurate count, while PostgreSQL shows a higher count:
sqlContext.read.option("compression","snappy")
.parquet("/user-data/xyz/input/TABLE/").count

Related

writing psycopg2 query result to pyspark dataframe

Is there a way to directly fetch the contents of a table from a PostgreSQL database into a PySpark DataFrame using the psycopg2 library?
The solutions I have found online only talk about using a pandas DataFrame, but that is not feasible for a very large dataset in Spark, since it would load all the data onto the driver node.
The code I am using is as follows:
conn = psycopg2.connect(database="databasename", user='user', password='pass',
                        host='postgres.host', port='5432')
cur = conn.cursor()
cur.execute("select * from database.table limit 10")
data = cur.fetchall()
The resulting data is a list of tuples, which is difficult to convert to a DataFrame.
Any suggestions would be greatly appreciated
Use Spark's JDBC data source directly to connect to PostgreSQL and read the data; it returns a DataFrame.
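A minimal sketch of that approach (shown in Scala to match the rest of the examples here; the PySpark DataFrameReader exposes an equivalent jdbc method, and the URL, table name and credentials below are placeholders):
// Sketch: read a PostgreSQL table through Spark's JDBC data source.
// Host, database, table and credentials are placeholders.
val props = new java.util.Properties
props.setProperty("user", "user")
props.setProperty("password", "pass")
props.setProperty("driver", "org.postgresql.Driver")
val df = spark.read.jdbc("jdbc:postgresql://postgres.host:5432/databasename", "schema.tablename", props)
df.show(10)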

How to get Create Statement of Table in some other database in Spark using JDBC

Problem statement:
I have an Impala database where multiple tables are present.
I am creating a Spark JDBC connection to Impala and loading these tables into Spark DataFrames for my validations, like this, which works fine:
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","tablename")
.load()
Now the next step, and my actual problem, is that I need to find the CREATE statement that was used to create each table in Impala itself.
Since I cannot run a command like the one below (it throws an error), is there any way I can fetch the SHOW CREATE TABLE output for the tables present in Impala?
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","show create table tablename")
.load()
Perhaps you can use Spark SQL "natively" to execute something like
val createstmt = spark.sql("show create table <tablename>")
The resulting DataFrame will have a single column (type string) which contains the complete CREATE TABLE statement.
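If you need the raw DDL text, it can be pulled out of that single row, e.g. (a small sketch; note that SHOW CREATE TABLE only works for tables registered in Spark's own catalog):
// Sketch: extract the DDL text from the one-row, one-column result.
val ddl: String = createstmt.first().getString(0)
println(ddl)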
But if you still choose to go the JDBC route, there is always the option of using the good old JDBC interface directly. Scala can call Java libraries, after all...
import java.sql.DriverManager
val conn = DriverManager.getConnection("url")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("show create table <tablename>")
// ...iterate over rs, then close the result set, statement and connection...

Cassandra: no viable alternative at input

I am a newbie to Cassandra and I am trying to save my Spark DataFrame to the Cassandra DB.
While creating the table I am getting the exception "SyntaxException: no viable alternative at input".
val sparkContext = spark.sparkContext
//Set the Log file level
sparkContext.setLogLevel("WARN")
//Connect Spark to Cassandra and execute CQL statements from Spark applications
val connector = CassandraConnector(sparkContext.getConf)
connector.withSessionDo(session =>
{
session.execute("DROP KEYSPACE IF EXISTS my_keyspace")
session.execute("CREATE KEYSPACE my_keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1}")
session.execute("USE my_keyspace")
session.execute("CREATE TABLE mytable('Inbound_Order_No' varchar,'Material' varchar,'Container_net_weight' double,'Shipping_Line' varchar,'Container_No' varchar,'Month' int,'Day' int,'Year' int,'Job_Run_Date' timestamp, PRIMARY KEY(Inbound_Order_No,Container_No))")
df.write
.format("org.apache.spark.sql.cassandra")
.mode("overwrite")
.option("confirm.truncate", "true")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.option("keyspace", "my_keyspace")
.option("table", "mytable")
.save()
}
)
I am unable to trace the error, hence seeking help.
Please note: I am doing this work on a Windows system and everything is set up locally. I have also shared my Spark code; if you find any other error, please do share it with me.
session.execute("CREATE TABLE mytable(\"Inbound_Order_No\" varchar,\"Material\" varchar,\"Container_net_weight\" double,\"Shipping_Line\" varchar,\"Container_No\" varchar,\"Month\" int,\"Day\" int,\"Year\" int,\"Job_Run_Date\" timestamp, PRIMARY KEY(\"Inbound_Order_No\",\"Container_No\"))")
Double quotes are used for case-sensitive column names, not single quotes.
session.execute("CREATE TABLE mytable(Inbound_Order_No varchar,Material varchar,Container_net_weight double,Shipping_Line varchar,Container_No varchar,Month int,Day int,Year int,Job_Run_Date timestamp, PRIMARY KEY(Inbound_Order_No,Container_No))")
If you want column names in lower case, use the query above. Cassandra creates column names in lower case by default (if they are not enclosed in double quotes).
As requested in the comments, the command to run in cqlsh:
CREATE TABLE mytable("Inbound_Order_No" varchar,"Material" varchar,"Container_net_weight" double,"Shipping_Line" varchar,"Container_No" varchar,"Month" int,"Day" int,"Year" int,"Job_Run_Date" timestamp, PRIMARY KEY("Inbound_Order_No","Container_No"))

Read from a hive table and write back to it using spark sql

I am reading a Hive table using Spark SQL and assigning it to a Scala val:
val x = sqlContext.sql("select * from some_table")
Then I do some processing on the DataFrame x and finally come up with a DataFrame y, which has the exact same schema as the table some_table.
Finally I try to insert-overwrite the DataFrame y into the same Hive table some_table:
y.write.mode(SaveMode.Overwrite).insertInto("some_table")
Then I am getting the error
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from
I tried creating an INSERT SQL statement and firing it using sqlContext.sql(), but it gave me the same error.
Is there any way I can bypass this error? I need to insert the records back into the same table.
Hi, I tried doing as suggested, but I am still getting the same error:
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Actually you can also use checkpointing to achieve this. Since checkpointing breaks the data lineage, Spark is no longer able to detect that you are reading from and overwriting the same table:
sqlContext.sparkContext.setCheckpointDir(checkpointDir)
val ds = sqlContext.sql("select * from some_table").checkpoint()
ds.write.mode("overwrite").saveAsTable("some_table")
You should first save your DataFrame y in a temporary table
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in your target table
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
You should first save your DataFrame y as a Parquet file:
y.write.parquet("temp_table")
Then load it back like this:
val parquetFile = sqlContext.read.parquet("temp_table")
And finally insert the data into your table:
parquetFile.write.insertInto("some_table")
In the context of Spark 2.2:
This error means that the process is reading from and writing to the same table.
Normally this should work, as the process first writes to a .hiveStaging... directory.
This error occurs with the saveAsTable method, as it overwrites the entire table instead of individual partitions.
This error should not occur with the insertInto method, as it overwrites partitions, not the whole table.
One reason this happens is that the Hive table has the following Spark TBLPROPERTIES in its definition. The problem goes away for the insertInto method if you remove the following Spark TBLPROPERTIES:
'spark.sql.partitionProvider'
'spark.sql.sources.provider'
'spark.sql.sources.schema.numPartCols'
'spark.sql.sources.schema.numParts'
'spark.sql.sources.schema.part.0'
'spark.sql.sources.schema.part.1'
'spark.sql.sources.schema.part.2'
'spark.sql.sources.schema.partCol.0'
'spark.sql.sources.schema.partCol.1'
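One possible way to drop them is a plain ALTER TABLE statement. This is only a sketch: it assumes your Hive/Spark version supports ALTER TABLE ... UNSET TBLPROPERTIES, uses a placeholder table name, and if Spark refuses to modify its own reserved properties the same statement can be run from the Hive shell or beeline instead:
// Sketch: strip the Spark-specific properties so the table behaves as a plain Hive table.
// "db.some_table" is a placeholder; adjust the property list to what your table actually has.
spark.sql("""
  ALTER TABLE db.some_table UNSET TBLPROPERTIES IF EXISTS (
    'spark.sql.partitionProvider', 'spark.sql.sources.provider',
    'spark.sql.sources.schema.numPartCols', 'spark.sql.sources.schema.numParts',
    'spark.sql.sources.schema.part.0', 'spark.sql.sources.schema.part.1',
    'spark.sql.sources.schema.part.2', 'spark.sql.sources.schema.partCol.0',
    'spark.sql.sources.schema.partCol.1')
""")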
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
When we upgraded our HDP to 2.6.3, Spark was updated from 2.2 to 2.3, which resulted in the error below:
Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
This error occurs for jobs in which we are reading from and writing to the same path, like jobs with SCD logic.
Solution:
Set --conf "spark.sql.hive.convertMetastoreOrc=false"
or update the job so that it writes the data to a temporary table, then reads from the temporary table and inserts it into the final table, as sketched below.
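A rough sketch of that second workaround (the table names are placeholders):
// Sketch: stage the results in a temporary table, then load the final table from it,
// so no single step reads and overwrites the same path.
df.write.mode("overwrite").saveAsTable("db.tmp_stage")
spark.table("db.tmp_stage").write.mode("overwrite").insertInto("db.final_table")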
https://querydb.blogspot.com/2020/09/orgapachesparksqlanalysisexception.html
Read the data from the Hive table in Spark:
import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat
val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")
val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])
You'll also get the error "Cannot overwrite a path that is also being read from" in a case where you are doing this:
You are running an "insert overwrite" into a Hive TABLE "A" from a VIEW "V" (that executes your logic),
and that VIEW also references the same TABLE "A". I found this out the hard way, as the VIEW was deeply nested code that also queried "A". Bummer.
It is like cutting the very branch on which you are sitting :-(
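A hypothetical sketch of that circular setup (the table and view names are made up):
// The view reads from table_a ...
spark.sql("CREATE VIEW v_logic AS SELECT * FROM table_a WHERE active = 'Y'")
// ... so overwriting table_a from the view reads and overwrites the same path:
spark.sql("INSERT OVERWRITE TABLE table_a SELECT * FROM v_logic")
// => AnalysisException: Cannot overwrite a path that is also being read from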
What you need to keep in mind before doing the below is that the Hive table you are overwriting should have been created by Hive DDL, not by Spark (df.write.saveAsTable("<table_name>")).
If the above is not true, this won't work.
I tested this in Spark 2.3.0:
val tableReadDf=spark.sql("select * from <dbName>.<tableName>")
val updatedDf=tableReadDf.<transformation> //any update/delete/addition
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as(select * from myUpdatedTable) insert overwrite table
<dbName>.<tableName> <partition><partition_columns> select * from tempView""")
This is a good solution for me:
Extract the RDD and schema from the DataFrame.
Create a new clone DataFrame.
Overwrite the table.
private def overWrite(df: DataFrame): Unit = {
  val schema = df.schema
  val rdd = df.rdd
  // Rebuilding the DataFrame from its RDD detaches it from the source table,
  // so Spark no longer sees a read and overwrite of the same table.
  val dfForSave = spark.createDataFrame(rdd, schema)
  dfForSave.write
    .mode(SaveMode.Overwrite)
    .insertInto(s"${tableSource.schema}.${tableSource.table}")
}

How to execute a spark sql query from a map function (Python)?

How does one execute spark sql queries from routines that are not the driver portion of the program?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

def doWork(rec):
    data = SQLContext.sql("select * from zip_data where STATEFP ='{sfp}' and COUNTYFP = '{cfp}' ".format(sfp=rec[0], cfp=rec[1]))
    for item in data.collect():
        print(item)
    # do something
    return (rec[0], rec[1])

if __name__ == "__main__":
    sc = SparkContext(appName="Some app")
    print("Starting some app")
    SQLContext = SQLContext(sc)
    parquetFile = SQLContext.read.parquet("/path/to/data/")
    parquetFile.registerTempTable("zip_data")
    df = SQLContext.sql("select distinct STATEFP,COUNTYFP from zip_data where STATEFP IN ('12') ")
    rslts = df.map(doWork)
    for rslt in rslts.collect():
        print(rslt)
In this example I'm attempting to query the same table but would like to query other tables registered in Spark SQL too.
One does not execute nested operations on a distributed data structure. It is simply not supported in Spark. You have to use joins, local (optionally broadcast) data structures, or access external data directly instead.
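For instance, a join-based sketch of the example above (shown in Scala; the PySpark DataFrame API exposes the same join and broadcast calls, and the column names are taken from the question):
// Sketch: instead of issuing one SQL query per record inside a map,
// build the key set once and join it back against zip_data in a single job.
import org.apache.spark.sql.functions.broadcast
val zipData = spark.read.parquet("/path/to/data/")
val keys = zipData.select("STATEFP", "COUNTYFP").where("STATEFP IN ('12')").distinct()
val matched = zipData.join(broadcast(keys), Seq("STATEFP", "COUNTYFP"))
matched.show()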
In case you can't accomplish your task with joins and want to run the SQL queries in memory:
You can consider using an in-memory database like H2, Apache Derby, or Redis to execute fast parallel SQL queries without losing the benefits of in-memory computation.
In-memory databases provide faster access than disk-based databases such as MySQL or PostgreSQL.