I want to display multiple DataFrames on the console at once - Scala

I am trying to return a list or tuple of DataFrames and display them, but I am unable to do so. Can someone please help?
I have tried returning a List of DataFrames and then calling .show in my main method:
def dbReader(): List[DataFrame] = {
  val abcd_df_user = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost/abcd?user=postgres&password=postgres")
    .option("dbtable", "User_Testing")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
    .createOrReplaceTempView("abcd_user_testing")
  val user_Testing = spark.sql("""select * from abcd_user_testing""")
  //.createOrReplaceTempView("All_User__Data")
  val abcd_df_employee = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost/abcd?user=postgres&password=postgres")
    .option("dbtable", "employee")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
    .createOrReplaceTempView("Employee_Table")
  val employee_df = spark.sql("""select emp_id, emp_name from Employee_Table""")
  List(employee_df, user_Testing)
}
Main method:
object readinfromdbusingdf extends App {
  val readingfromdbusingdf = new readingfromdbusingdf()
  readingfromdbusingdf.dbReader().show(10)
}
Expected result: both DataFrames should be displayed on the console.
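Since List[DataFrame] has no .show method, the last line of the main method above does not compile. A minimal sketch of one way to display every DataFrame in the returned list, assuming the dbReader and spark definitions above:
// Show the first 10 rows of each DataFrame returned by dbReader
readingfromdbusingdf.dbReader().foreach(_.show(10))
// Or, to label each output while showing it:
readingfromdbusingdf.dbReader().zipWithIndex.foreach { case (df, i) =>
  println(s"DataFrame #$i")
  df.show(10)
}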

Related

How to create Predicate for reading data using Spark SQL in Scala

I can read the Oracle table using this simple Scala program:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcl")
  .option("dbtable", "big_table")
  .option("user", "test")
  .option("password", "123456")
  .load()

jdbcDF.show()
However, the table is huge and each node has to read only part of it, so I must use a hash function to distribute the data among the Spark nodes. For that, Spark has predicates. In fact, I did this in Python: the table has a column named NUM, the hash function receives each value and returns an integer between 0 and num_partitions, and the predicate list is built as follows:
hash_function = lambda x: 'ora_hash({}, {})'.format(x, num_partitions)
hash_df = connection.read_sql_full(
    'SELECT distinct {0} hash FROM {1}'.format(hash_function(var.hash_col), source_table_name))
hash_values = list(hash_df.loc[:, 'HASH'])
For num_partitions = 19, hash_values is:
hash_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
predicates = [
    "to_date({0},'YYYYMMDD','nls_calendar=persian') = to_date({1},'YYYYMMDD','nls_calendar=persian') "
    "and hash_func({2},{3}) = {4}"
    .format(partition_key, current_date, hash_col, num_partitions, hash_val)
    for hash_val in hash_values]
Then I read the table based on the predicates like this:
dataframe = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=spark_url,
          table=table_name,
          predicates=predicates)
Would you please guide me on how to create the predicates list in Scala, as I did in Python?
Any help is really appreciated.
Problem solved. I changed the code like this, and then it works:
import org.apache.spark.sql.SparkSession
import java.sql.Connection
import oracle.jdbc.pool.OracleDataSource

object main extends App {

  def read_spark(): Unit = {
    val numPartitions = 19
    val partitionColumn = "name"
    val field_date = "test"
    val current_date = "********"

    // Define JDBC properties
    val url = "jdbc:oracle:thin:@//x.x.x.x:1521/orcl"
    val properties = new java.util.Properties()
    properties.put("url", url)
    properties.put("user", "user")
    properties.put("password", "pass")

    // Define the where clauses to assign each row to a partition
    val predicateFct = (partition: Int) => s"""ora_hash("$partitionColumn",$numPartitions) = $partition"""
    val predicates = (0 until numPartitions).map { partition => predicateFct(partition) }.toArray

    val test_table = s"(SELECT * FROM table where $field_date=$current_date) dbtable"

    // Load the table into Spark
    val df = spark.read
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("dbtable", test_table)
      .jdbc(url, test_table, predicates, properties)

    df.show()
  }

  val spark = SparkSession
    .builder
    .master("local[4]")
    .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", 4)
    .config("spark.task.cpus", 1)
    .appName("Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  read_spark()
}
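As a design note, DataFrameReader.jdbc builds the JDBC relation from its own arguments, so the .format("jdbc") and .option("dbtable", ...) calls above appear redundant. A more compact sketch of the same partitioned read, reusing the url, test_table, predicates, and properties values defined above, might look like this:
// The driver class can travel in the connection properties instead of a separate .option
properties.put("driver", "oracle.jdbc.driver.OracleDriver")

// One Spark partition per predicate in the predicates array
val df = spark.read.jdbc(url, test_table, predicates, properties)
df.show()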

How to Unit Test Spark.Read in Scala

I wonder how I can write a unit test case for the piece of code below:
val df = spark.read
  .format("com.microsoft.sqlserver.jdbc.spark")
  .option("url", [JDBCURL])
  .option("query", [QUERY])
  .option("user", [Username])
  .option("password", [Password])
  .load()
I am using ScalaTest as the testing library and Mockito for mocking.
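No answer is shown here, but one common approach (a minimal sketch, not the only way, and using a plain stub rather than mocking the DataFrameReader chain with Mockito) is to hide the JDBC read behind a small trait so the code under test depends on the trait instead of on spark.read directly. The SqlServerSource, JdbcSqlServerSource, ReportJob, and ReportJobSpec names below are hypothetical, introduced only for illustration:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

// Production code depends on this trait instead of calling spark.read directly
trait SqlServerSource {
  def load(query: String): DataFrame
}

// Real implementation used outside of tests
class JdbcSqlServerSource(spark: SparkSession, url: String, user: String, password: String)
    extends SqlServerSource {
  override def load(query: String): DataFrame =
    spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", url)
      .option("query", query)
      .option("user", user)
      .option("password", password)
      .load()
}

// Example logic under test: it only sees the trait
class ReportJob(source: SqlServerSource) {
  def activeUserCount(): Long =
    source.load("SELECT id, active FROM users").filter("active = true").count()
}

class ReportJobSpec extends AnyFunSuite {
  private val spark = SparkSession.builder().master("local[1]").appName("test").getOrCreate()
  import spark.implicits._

  test("activeUserCount counts only active rows") {
    // Stub the source with an in-memory DataFrame instead of a real JDBC call
    val stub = new SqlServerSource {
      override def load(query: String): DataFrame =
        Seq((1, true), (2, false), (3, true)).toDF("id", "active")
    }
    assert(new ReportJob(stub).activeUserCount() == 2)
  }
}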

Insert multiple Dataframes into postgres table in a function using scala

I have a function:
def PopulatePostgres(df: DataFrame, df1: DataFrame, df2: DataFrame, table: String): Result = {
  val result = Try({
    df
      .write
      .format("jdbc")
      .mode(SaveMode.Append)
      .option("url", config.url)
      .option("user", config.username)
      .option("password", config.password)
      .option("dbtable", table)
      .option("driver", "org.postgresql.Driver")
      .save()
  })
  result match {
    case Success(_) => Result(s"Created ${table}")
    case Failure(problem) => {
      log.error(problem.getMessage)
      Result(s"Failed to create ${table}")
    }
  }
}
But I am not sure how to write the three DataFrames one by one into the Postgres table. I need to insert df, df1, and df2 all into the Postgres table. Can anyone please help me?
If you want to store all the DataFrames in the same table:
val finaldf = df.union(df1).union(df2)
Then you can use your persistence logic.
But if you want to store each DataFrame separately:
List(df, df1, df2).map(_.write.format("jdbc")
  .mode(SaveMode.Append)
  .option("url", config.url)
  .option("user", config.username)
  .option("password", config.password)
  .option("dbtable", table)
  .option("driver", "org.postgresql.Driver")
  .save())
If the schemas of all the DataFrames are the same (and of course you want them in one table), then you can combine them all into one and push the data to Postgres; otherwise, loop over each DataFrame and push the data.
val result = df.union(df1).union(df2)
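Putting the two ideas together, one possible reshaping of PopulatePostgres (a sketch only, assuming the same config, Result, and log values as in the question) accepts a sequence of DataFrames and writes them in a loop, collecting one Result per write:
import org.apache.spark.sql.{DataFrame, SaveMode}
import scala.util.{Failure, Success, Try}

def PopulatePostgres(dfs: Seq[DataFrame], table: String): Seq[Result] =
  dfs.map { df =>
    Try {
      df.write
        .format("jdbc")
        .mode(SaveMode.Append)
        .option("url", config.url)
        .option("user", config.username)
        .option("password", config.password)
        .option("dbtable", table)
        .option("driver", "org.postgresql.Driver")
        .save()
    } match {
      case Success(_) => Result(s"Created ${table}")
      case Failure(problem) =>
        log.error(problem.getMessage)
        Result(s"Failed to create ${table}")
    }
  }

// Usage: PopulatePostgres(Seq(df, df1, df2), "target_table")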

How to checkpoint many sources in Spark Streaming

I have many CSV spark.readStream sources in different locations, and I have to checkpoint all of them in Scala. I specified a query for every stream, but when I run the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name "query1" as a query with that name is already active
I solved my problem by creating multiple streaming queries like this:
val spark = SparkSession
  .builder
  .appName("test")
  .config("spark.local", "local[*]")
  .getOrCreate()

spark.sparkContext.setCheckpointDir(path_checkpoint)

val event1 = spark
  .readStream
  .schema(schema_a)
  .option("header", "true")
  .option("sep", ",")
  .csv(path_a)

val query = event1.writeStream
  .outputMode("append")
  .format("console")
  .start()

spark.streams.awaitAnyTermination()
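The "already active" error in the question comes from starting a second query with the same queryName while the first is still running. A minimal sketch of running several file streams side by side, each with its own query name and checkpoint location (schema_b, path_b, and the checkpoint paths are hypothetical placeholders for a second source), could look like this:
// One query per source, each with a distinct name and checkpoint location
val queryA = spark.readStream
  .schema(schema_a)
  .option("header", "true")
  .csv(path_a)
  .writeStream
  .queryName("query_a")
  .option("checkpointLocation", "/tmp/checkpoints/query_a")
  .outputMode("append")
  .format("console")
  .start()

val queryB = spark.readStream
  .schema(schema_b)
  .option("header", "true")
  .csv(path_b)
  .writeStream
  .queryName("query_b")
  .option("checkpointLocation", "/tmp/checkpoints/query_b")
  .outputMode("append")
  .format("console")
  .start()

// Block until any of the active queries terminates
spark.streams.awaitAnyTermination()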

Checkpoint for many streaming sources

I'm working with Zeppelin. I read many files from many sources in Spark Streaming like this:
val var1 = spark
  .readStream
  .schema(var1_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var1)

val chekpoint_var1 = var1
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var1)
  .option("path", path_checkpoint)
  .option("header", true)
  .outputMode("Append")
  .queryName("var1_backup")
  .start().awaitTermination()
val var2 = spark
  .readStream
  .schema(var2_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var2)

val chekpoint_var2 = var2
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var2)
  .option("path", path_checkpoint_2)
  .option("header", true)
  .outputMode("Append")
  .queryName("var2_backup")
  .start().awaitTermination()
When I re-run the job, I get this message:
java.lang.IllegalArgumentException: Cannot start query with name var1_backup as a query with that name is already active
The solution:
val spark = SparkSession
  .builder
  .appName("test")
  .config("spark.local", "local[*]")
  .getOrCreate()

spark.sparkContext.setCheckpointDir(path_checkpoint)
and then I call the checkpoint function on the DataFrame.