I have a function:
def PopulatePostgres(df: DataFrame, df1: DataFrame, df2: DataFrame, table: String): Result = {
val result = Try({
df
.write
.format("jdbc")
.mode(SaveMode.Append)
.option("url", config.url)
.option("user", config.username)
.option("password", config.password)
.option("dbtable", table)
.option("driver", "org.postgresql.Driver")
.save()
})
result match {
case Success(_) => Result(s"Created ${table}")
case Failure(problem) => {
log.error(problem.getMessage)
Result(s"Failed to create ${table}")
}
}
}
But I am not sure how to dump the three DataFrames one by one into the Postgres table. I need to insert df, df1, and df2 all into the same Postgres table. Can anyone please help me?
If you want to store all the DataFrames in the same table:
val finalDf = df.union(df1).union(df2)
Then you can apply your persistence logic.
But if you want to store each DataFrame separately:
List(df, df1, df2).map(_.write.format("jdbc")
.mode(SaveMode.Append)
.option("url", config.url)
.option("user", config.username)
.option("password", config.password)
.option("dbtable", table)
.option("driver", "org.postgresql.Driver")
.save())
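If you prefer to keep the original helper, here is a minimal sketch of a variant that accepts any number of DataFrames and reuses the same write logic; it assumes the same config, log, and Result definitions as in the question:
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical refactor: write each DataFrame in turn to the same Postgres table.
// Assumes the same `config`, `log`, and `Result` as in the question.
def populatePostgres(dfs: Seq[DataFrame], table: String): Result = {
  val result = Try {
    dfs.foreach { df =>
      df.write
        .format("jdbc")
        .mode(SaveMode.Append)
        .option("url", config.url)
        .option("user", config.username)
        .option("password", config.password)
        .option("dbtable", table)
        .option("driver", "org.postgresql.Driver")
        .save()
    }
  }
  result match {
    case Success(_) => Result(s"Created $table")
    case Failure(problem) =>
      log.error(problem.getMessage)
      Result(s"Failed to create $table")
  }
}
It would then be called as populatePostgres(Seq(df, df1, df2), table).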
If the schema of all the DataFrames is the same (and of course you want them in one table), then you can combine them all into one and push the data to Postgres; otherwise, loop over each DataFrame and push the data.
val result = df.unionAll(df1).unionAll(df2)
I can read the Oracle table using this simple Scala program:
val spark = SparkSession
.builder
.master("local[4]")
.config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
.config("spark.executor.memory", "8g")
.config("spark.executor.cores", 4)
.config("spark.task.cpus", 1)
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:#x.x.x.x:1521:orcl")
.option("dbtable", "big_table")
.option("user", "test")
.option("password", "123456")
.load()
jdbcDF.show()
However, the table is huge and each node has to read only part of it, so I must use a hash function to distribute the data among the Spark nodes. For that, Spark supports predicates. In fact, I did this in Python: the table has a column named NUM, the hash function receives each value and returns an integer between 0 and num_partitions, and the predicate list is the following:
hash_function = lambda x: 'ora_hash({}, {})'.format(x, num_partitions)
hash_df = connection.read_sql_full(
'SELECT distinct {0} hash FROM {1}'.format(hash_function(var.hash_col), source_table_name))
hash_values = list(hash_df.loc[:, 'HASH'])
hash_values for num_partitions=19 is:
hash_values=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
predicates = [
"to_date({0},'YYYYMMDD','nls_calendar=persian')= to_date({1} ,'YYYYMMDD','nls_calendar=persian') " \
"and hash_func({2},{3}) = {4}"
.format(partition_key, current_date, hash_col, num_partitions, hash_val) for hash_val in
hash_values]
Then I read the table based on the predicates like this:
dataframe = spark.read \
.option('driver', 'oracle.jdbc.driver.OracleDriver') \
.jdbc(url=spark_url,
table=table_name,
predicates=predicates)
Would you please guide me on how to create the predicates list in Scala, as I did in Python?
Any help is really appreciated.
Problem solved.
I changed the code like this, and then it works:
import org.apache.spark.sql.SparkSession
import java.sql.Connection
import oracle.jdbc.pool.OracleDataSource
object main extends App {
def read_spark(): Unit = {
val numPartitions = 19
val partitionColumn = "name"
val field_date = "test"
val current_date = "********"
// Define JDBC properties
val url = "jdbc:oracle:thin:#//x.x.x.x:1521/orcl"
val properties = new java.util.Properties()
properties.put("url", url)
properties.put("user", "user")
properties.put("password", "pass")
// Define the where clauses to assign each row to a partition
val predicateFct = (partition: Int) => s"""ora_hash("$partitionColumn",$numPartitions) = $partition"""
val predicates = (0 until numPartitions).map{partition => predicateFct(partition)}.toArray
val test_table = s"(SELECT * FROM table where $field_date=$current_date) dbtable"
// Load the table into Spark
val df = spark.read
.format("jdbc")
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("dbtable", test_table)
.jdbc(url, test_table, predicates, properties)
df.show()
}
val spark = SparkSession
.builder
.master("local[4]")
.config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
.config("spark.executor.memory", "8g")
.config("spark.executor.cores", 4)
.config("spark.task.cpus", 1)
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
read_spark()
}
I have a general question about Spark.
Should PySpark and Scala Spark always have the same behaviour when we use the exact same code?
If yes, how can you explain this example:
Scala version:
val inputDf = spark
.readStream
.format("csv")
.schema(schema)
.option("ignoreChanges", "true")
.option("delimiter", ";").option("header", true)
.load("/input/")
def processIsmedia(df: DataFrame, batchId: Long): Unit = {
val ids = df
.select("id").distinct().collect().toList
.map(el => s"$el")
ids.foreach { id =>
val datedDf = df.filter(col("id") === id)
datedDf
.write
.format("delta")
.option("mergeSchema", "true")
.partitionBy("id")
.option("replaceWhere", s"id == '$id'")
.mode("overwrite")
.save("/res/")
}
}
inputDf
.writeStream
.format("delta")
.foreachBatch(processIsmedia _)
.queryName("tgte")
.option("checkpointLocation", "/check")
.trigger(Trigger.Once)
.start()
Python version:
inputDf = spark \
.readStream \
.format("csv") \
.schema(schema) \
.option("ignoreChanges", "true") \
.option("delimiter", ";").option("header", True) \
.load("/in/") \
def processDf(df, epoch_id):
PartitionKey = "id"
df.cache()
ids=[x.id for x in df.select("id").distinct().collect()]
for idd in ids:
idd =str(idd)
tmp = df.filter(df.id == idd)
tmp.write.format("delta").option("mergeSchema", "true").partitionBy(PartitionKey).option("replaceWhere", "id == '$i'".format(i=idd)).save("/res/")
inputDf.writeStream.format("delta").foreachBatch(processDf).queryName("aaaa").option("checkpointLocation", "/check").trigger(once=True).start()
Both code snippets are exactly equivalent.
They are supposed to write data (appending new partitions and overwriting existing ones).
With Scala it works perfectly fine.
With Python I am getting an error:
Data written out does not match replaceWhere 'id == '$i''.
So my question is: isn't Spark the same thing whether it is used with Scala, Java, Python, or even R? How can this error be possible then?
The Python code is not substituting the value of idd, so the resulting string is literally "id == '$i'", which is not the case in your Scala code, i.e.
.option("replaceWhere", "id == '$i'".format(i=idd))
should be
.option("replaceWhere", "id == '{i}'".format(i=idd))
Let me know if this change works for you.
I am quite a newbie to Spark and Scala ;)
Code summary :
Reading data from CSV files --> creating a simple inner join on the 2 files --> writing the data to a Hive table --> submitting the job on the cluster.
Can you please help identify what went wrong?
The code is not really complex.
The job executes well on the cluster.
However, when I try to view the data written to the Hive table, I am facing this issue:
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000.csv not a SequenceFile
object LapeyreSparkDemo extends App {
//Getting spark ready
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","Spark for Lapeyre")
//Creating Spark Session
val spark = SparkSession.builder()
.config(sparkConf)
.enableHiveSupport()
.config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
.getOrCreate()
Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
//Reading
Logger.getLogger(getClass.getName).info("Data loading in DF started")
val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus
String, age String, amount Int"
val orders2019Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv0006666/lapeyrePoc/orders2019.csv")
.load
val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
.withColumnRenamed("customername","oldCustomerName")
val orders2020Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
.load
Logger.getLogger(getClass.getName).info("Data loading in DF complete")
//processing
Logger.getLogger(getClass.getName).info("Processing Started")
val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
val joinType = "inner"
val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
.select("custId","customername")
//Writing
spark.sql("create database if not exists updatedCustomers")
joinData.write
.format("csv")
.mode(SaveMode.Overwrite)
.bucketBy(4, "custId")
.sortBy("custId")
.saveAsTable("updatedCustomers.Customers")
//Stopping Spark Session
spark.stop()
}
Please let me know in case more information is required.
Thanks in advance.
This is the culprit:
joinData.write
.format("csv")
Instead, I used this and it worked:
joinData.write
.format("Hive")
Since I am writing data to a Hive table (ORC format), the format should be "Hive" and not "csv".
Also, do not forget to enable Hive support while creating the Spark session.
Also, in Spark 2, bucketBy and sortBy are not supported here; maybe they are in Spark 3.
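For completeness, a minimal sketch of the corrected write, assuming the same joinData and table name as in the question and dropping bucketBy/sortBy as noted above:
// Write the joined data as a Hive table (Hive support must be enabled on the session).
joinData.write
  .format("Hive")
  .mode(SaveMode.Overwrite)
  .saveAsTable("updatedCustomers.Customers")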
I am a newbie to Cassandra and I want to implement SCD Type 1 in a Cassandra DB.
This SCD Type 1 job will be executed from Spark.
The data will be stored as time-series partitioned data, viz. Year/Month/Day.
Example: I have records for the last 300 days, and my new batch may contain new records as well as updated records.
I want to compare the incoming records against the last 100 days and, if a record is new, perform an insert; otherwise, an update.
I have no clue how to perform this operation, hence I am not sharing any CQL :(
Sample table structure is:
CREATE TABLE crossfit_gyms_by_city_New (
country_code text,
state_province text,
city text,
gym_name text,
PRIMARY KEY ((country_code, state_province), gym_name)
) WITH CLUSTERING ORDER BY (gym_name ASC );
My Sample Spark Code:
object SparkUpdateCassandra {
System.setProperty("hadoop.home.dir", "C:\\hadoop\\")
def main(args: Array[String]): Unit = {
val spark = org.apache.spark.sql.SparkSession
.builder()
.master("local[*]")
.config("spark.cassandra.connection.host", "localhost")
.appName("Spark Cassandra Connector Example")
.getOrCreate()
import spark.implicits._
//Read Cassandra data using DataFrame
val FirstDF = Seq(("India", "WB", "Kolkata", "Cult Fit"),("India", "KA", "Bengaluru", "Cult Fit")).toDF("country_code", "state_province","city","gym_name")
FirstDF.show(10)
FirstDF.write
.format("org.apache.spark.sql.cassandra")
.mode("append")
.option("confirm.truncate", "true")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.option("keyspace", "emc_test")
.option("table", "crossfit_gyms_by_city_new")
.save()
val loaddf1 = spark.read
.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.options(Map( "table" -> "crossfit_gyms_by_city_new", "keyspace" -> "emc_test"))
.load()
loaddf1.show(10)
// spark.implicits.wait(5000)
val SecondDF = Seq(("India", "WB", "Siliguri", "CultFit"),("India", "KA", "Bengaluru", "CultFit")).toDF("country_code", "state_province","city","gym_name")
SecondDF.show(10)
SecondDF.write
.format("org.apache.spark.sql.cassandra")
.mode("append")
.option("confirm.truncate", "true")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.option("keyspace", "emc_test")
.option("table", "crossfit_gyms_by_city_new")
.save()
val loaddf2 = spark.read
.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host", "localhost")
.option("spark.cassandra.connection.port", "9042")
.options(Map( "table" -> "crossfit_gyms_by_city_new", "keyspace" -> "emc_test"))
.load()
loaddf2.show(10)
}
}
Note: I am using Scala for the Spark framework.
In Cassandra, everything is an upsert: if a row doesn't exist it will be inserted, and if it exists it will be updated. So you just need to get your data into an RDD or DataFrame and use the corresponding function of the Spark Cassandra Connector:
saveToCassandra for the RDD API:
rdd.saveToCassandra("keyspace", "table")
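Note that saveToCassandra is added by the connector's implicits; a minimal sketch, assuming the spark-cassandra-connector dependency is on the classpath and using the keyspace/table from the question:
import com.datastax.spark.connector._  // brings saveToCassandra into scope

// Each tuple maps positionally to the listed columns of the table from the question.
val rdd = spark.sparkContext.parallelize(
  Seq(("India", "WB", "Siliguri", "CultFit")))
rdd.saveToCassandra("emc_test", "crossfit_gyms_by_city_new",
  SomeColumns("country_code", "state_province", "city", "gym_name"))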
Or just write via the DataFrame API:
df.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
.mode(SaveMode.Append)
.save()
To achieve this, there are some facts that will help you navigate the code examples you will run into.
In older Spark 1 code we would use:
1. A SparkContext (see the docs)
2. A CassandraSQLContext, constructed with the SparkContext, to connect to Cassandra
For Spark 2 this has mostly changed:
Create a SparkSession and a CassandraConnector [1].
You would then run your native SQL with the session, as shown in [1].
Once you have this set up and working, you can just execute the appropriate SQL for the SCD Type 1 operations; good examples of the SQL involved can be found, and a minimal sketch follows below.
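A minimal sketch of the Spark 2 approach, assuming the spark-cassandra-connector dependency and the keyspace/table from the question (in Cassandra, an INSERT on an existing primary key overwrites the non-key columns, which is exactly the SCD Type 1 behaviour):
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.cql.CassandraConnector

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.cassandra.connection.host", "localhost")
  .appName("SCD Type 1 sketch")
  .getOrCreate()

// CassandraConnector gives direct access to a CQL session for native statements.
val connector = CassandraConnector(spark.sparkContext.getConf)
connector.withSessionDo { session =>
  // Inserting an existing primary key simply overwrites the row (upsert).
  session.execute(
    """INSERT INTO emc_test.crossfit_gyms_by_city_new
      |(country_code, state_province, city, gym_name)
      |VALUES ('India', 'WB', 'Kolkata', 'Cult Fit')""".stripMargin)
}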
I am trying to return a list or tuple of DataFrames and display them, but I am unable to display them, so can someone please help?
I have tried returning a List of DataFrames and then calling .show in my main method.
def dbReader():List[DataFrame]= {
val abcd_df_user = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://localhost/abcd?user=postgres&password=postgres")
.option("dbtable", "User_Testing")
.option("user", "postgres")
.option("password", "postgres")
.option("driver", "org.postgresql.Driver")
.load()
.createOrReplaceTempView("abcd_user_testing")
val user_Testing = spark.sql("""select * from abcd_user_testing""")
//.createOrReplaceTempView("All_User__Data")
val abcd_df_employee = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://localhost/abcd?user=postgres&password=postgres")
.option("dbtable", "employee")
.option("user", "postgres")
.option("password", "postgres")
.option("driver", "org.postgresql.Driver")
.load()
.createOrReplaceTempView("Employee_Table")
val employee_df = spark.sql("""select emp_id,emp_name from Employee_Table""")
List(employee_df,user_Testing)
}
Main Method:
object readinfromdbusingdf extends App {
  val readingfromdbusingdf = new readingfromdbusingdf()
  readingfromdbusingdf.dbReader().show(10)
}
Expected result: it should display both DataFrames on the console.
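Since dbReader() returns a List[DataFrame], .show cannot be called on the list itself; a minimal sketch of iterating over the returned list instead (names taken from the question):
// Call .show on each DataFrame in the returned list.
readingfromdbusingdf.dbReader().foreach(_.show(10))

// Or destructure the list when the order of the elements is known:
val List(employeeDf, userTestingDf) = readingfromdbusingdf.dbReader()
employeeDf.show(10)
userTestingDf.show(10)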