I want to run a custom function on all tables in a SQLite database. The function is more or less the same, but depends on the schema of the individual table. Also, the tables and their schemata are only known at runtime (the program is called with an argument that specifies the path of the database).
This is what I have so far:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// somehow bind sqlContext to DB
val allTables = sqlContext.tableNames
for (t <- allTables) {
  val df = sqlContext.table(t)
  val schema = df.columns
  sqlContext.sql("SELECT * FROM " + t + "...").map(x => myFunc(x, schema))
}
The only hint I found so far needs to know the table in advance, which is not the case in my scenario:
val tableData = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
  .load()
I am using the xerial SQLite JDBC driver. So how can I connect solely to a database, not to a table?
Edit: Using Beryllium's answer as a start I updated my code to this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val metaData = sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
"dbtable" -> "(SELECT * FROM sqlite_master) AS t")).load()
val myTableNames = metaData.select("tbl_name").distinct()
for (t <- myTableNames) {
println(t.toString)
val tableData = sqlContext.table(t.toString)
for (record <- tableData.select("*")) {
println(record)
}
}
At least I can read the table names at runtime, which is a huge step forward for me, but I can't read the tables themselves. I tried both
val tableData = sqlContext.table(t.toString)
and
val tableData = sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
"dbtable" -> t.toString)).load()
in the loop, but in both cases I get a NullPointerException. Although I can print the table names, it seems I cannot connect to them.
Last but not least, I always get an SQLITE_ERROR: Connection is closed error. It looks like the same issue described in this question: SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database
There are two options you can try
Use JDBC directly
Open a separate, plain JDBC connection in your Spark job
Get the table names from the JDBC metadata
Feed these into your for comprehension (a minimal sketch follows below)
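A minimal sketch of that approach, assuming the xerial driver is on the driver's classpath; the database path and variable names are illustrative placeholders:

import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

val url = "jdbc:sqlite:/path/to/file.db" // assumption: taken from the program argument
val conn = DriverManager.getConnection(url)
val tableNames = ArrayBuffer.empty[String]
try {
  // Ask the JDBC metadata for all plain tables in the database
  val rs = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
  while (rs.next()) tableNames += rs.getString("TABLE_NAME")
} finally {
  conn.close()
}

// Feed the names into the Spark JDBC reader, one table per iteration
for (t <- tableNames) {
  val df = sqlContext.read.format("jdbc")
    .options(Map("url" -> url, "dbtable" -> t))
    .load()
  // df.columns now gives the schema-dependent part needed by myFunc, as in the question
}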
Use a SQL query for the "dbtable" argument
You can specify a query as the value for the dbtable argument. Syntactically, this query must "look" like a table, so it must be wrapped in a subquery.
In that query, get the meta data from the database:
val df = sqlContext.read.format("jdbc").options(
Map(
"url" -> "jdbc:postgresql:xxx",
"user" -> "x",
"password" -> "x",
"dbtable" -> "(select * from pg_tables) as t")).load()
This example works with PostgreSQL; you have to adapt it for SQLite, for example as sketched below.
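A hedged sketch of that adaptation, querying sqlite_master for the table names (the file path is a placeholder):

val metaData = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:sqlite:/path/to/file.db",
    "dbtable" -> "(SELECT tbl_name FROM sqlite_master WHERE type = 'table') AS t")).load()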
Update
It seems that the JDBC driver only supports iterating over one result set at a time.
In any case, once you materialize the list of table names using collect(), the following snippet should work:
val myTableNames = metaData.select("tbl_name").map(_.getString(0)).collect()
for (t <- myTableNames) {
  println(t.toString)
  val tableData = sqlContext.read.format("jdbc")
    .options(
      Map(
        "url" -> "jdbc:sqlite:/x.db",
        "dbtable" -> t)).load()
  tableData.show()
}
Related
When learning Spark SQL, I've been using the following approach to register a collection into the Spark SQL catalog and query it.
val persons: Seq[MongoPerson] = Seq(MongoPerson("John", "Doe"))
sqlContext.createDataset(persons)
.write
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.mode("append")
.save()
sqlContext.read
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.load()
.as[Peeps]
.show()
However, when querying it, it seems that I need to register it as a temporary view in order to access it using SparkSQL.
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:37017/test", "collection" -> "morepeeps"), Some(ReadConfig(spark)))
val people: DataFrame = MongoSpark.load[Peeps](spark, readConfig)
people.show()
people.createOrReplaceTempView("peeps")
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
sqlContext.sql("SELECT * FROM peeps")
.as[Peeps]
.show()
For a database with quite a few collections, is there a way to hydrate the Spark SQL schema catalog so that this op isn't so verbose?
So there are a couple of things going on. First of all, simply loading the Dataset using sqlContext.read will not register it with the SparkSQL catalog. The function chain in your first code sample ends by returning a Dataset at .as[Peeps]; you need to tell Spark that you want to use it as a view.
Depending on what you're doing with it, I might recommend leaning on the Scala Dataset API rather than SparkSQL (a quick sketch of that route follows below). However, if SparkSQL is absolutely essential, you can likely speed things up programmatically.
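For instance, a minimal sketch of the Dataset-API route, reusing readConfig from above; the firstName field is an assumption about the Peeps case class:

import spark.implicits._

// No temp view and no SQL string needed: load once and stay typed.
val peopleDs: Dataset[Peeps] = MongoSpark.load[Peeps](spark, readConfig).as[Peeps]
// Assumed field name; equivalent in spirit to SELECT * FROM peeps WHERE firstName = 'John'
peopleDs.filter(_.firstName == "John").show()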
If you do go the SparkSQL route, in my experience you'll need to run that view-registration boilerplate for each table you want to import. Fortunately, Scala is a proper programming language, so we can cut down on code duplication substantially by writing a function and calling it like this:
val MongoDbUri: String = "mongodb://localhost:37017/test" // store this as a constant somewhere
// T must be passed in as some case class
// Note, you can also add a second parameter to change the view name if so desired
import scala.reflect.runtime.universe.TypeTag

def loadTableAsView[T <: Product : TypeTag](table: String)(implicit spark: SparkSession): Dataset[T] = {
  import spark.implicits._ // provides the Encoder needed by .as[T]
  val configMap = Map(
    "uri" -> MongoDbUri,
    "collection" -> table
  )
  val readConfig = ReadConfig(configMap, Some(ReadConfig(spark)))
  val df: DataFrame = MongoSpark.load[T](spark, readConfig)
  df.createOrReplaceTempView(table) // registers the collection in the Spark SQL catalog
  df.as[T]
}
And to call it:
// Note: if spark is defined implicitly, e.g. implicit val spark: SparkSession = spark, you won't need to pass it explicitly
val peepsDS: Dataset[Peeps] = loadTableAsView[Peeps]("peeps")(spark)
val chocolatesDS: Dataset[Chocolates] = loadTableAsView[Chocolates]("chocolates")(spark)
val candiesDS: Dataset[Candies] = loadTableAsView[Candies]("candies")(spark)
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
peepsDS.show()
chocolatesDS.show()
candiesDS.show()
This will substantially cut down your boilerplate, and also allow you to more easily write tests for that repeated bit of code. There is probably also a way to create a map of table names to case classes that you can then iterate over, but I don't have an IDE handy to test it out; a rough sketch is below.
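A rough, untested sketch of that idea: type erasure makes a literal map of names to case classes awkward, so one workable variant is a map of collection names to loader thunks (the names reuse the hypothetical case classes above):

val loaders: Map[String, SparkSession => Dataset[_]] = Map(
  "peeps"      -> (s => loadTableAsView[Peeps]("peeps")(s)),
  "chocolates" -> (s => loadTableAsView[Chocolates]("chocolates")(s)),
  "candies"    -> (s => loadTableAsView[Candies]("candies")(s))
)

// Registers every view (and produces the typed Datasets) in one pass
loaders.values.foreach(load => load(spark))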
I'm using Azure Databricks and want to push down a query to an Azure SQL database using PySpark. I've tried many ways and found a solution using Scala (code below), but doing this means I have to convert part of my code to Scala and then bring the results back to PySpark again.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to achieve the pushdown of DML code using PySpark instead of Scala?
I found something related, but it only works for reading data and DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing as the Scala example you provided in Python.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case I would try:
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
I am a new Spark Scala developer and I want to ask you about my problem.
I have two huge dataframes, my second dataframe is computed from the first dataframe (it contains a distinct column from the first one).
To optimize my code, I thought about this approach:
Save my first dataframe as a .csv file in HDFS
Then simply read this .csv file back to compute the second dataframe.
So I wrote this:
//val temp1 is my first DF
writeAsTextFileAndMerge("result1.csv", "/user/result", temp1, spark.sparkContext.hadoopConfiguration)
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
.csv("/user/result/result1.csv").select("ID").distinct
writeAsTextFileAndMerge("result2.csv", "/user/result",
temp2, spark.sparkContext.hadoopConfiguration)
And this is my save function:
def writeAsTextFileAndMerge(fileName: String, outputPath: String, df: DataFrame, conf: Configuration) {
  val sourceFile = WorkingDirectory
  df.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv(sourceFile)
  merge(fileName, sourceFile, outputPath, conf)
}

def merge(fileName: String, srcPath: String, dstPath: String, conf: Configuration) {
  val hdfs = FileSystem.get(conf)
  val destinationPath = new Path(dstPath)
  if (!hdfs.exists(destinationPath))
    hdfs.mkdirs(destinationPath)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName),
    true, conf, null)
}
It seems "logical" to me but I got errors doing this. I guess it's not possible for Spark to "wait" until registering my first DF in HDFS and AFTER read this new file (or maybe I have some errors on my save function ?).
Here is the exception that I got :
19/02/16 17:27:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 1
java.lang.ArrayIndexOutOfBoundsException: 1
Can you help me to fix this please ?
The problem is the merge: Spark is not aware of, and thus not synchronized with, the HDFS operations you are making.
The good news is that you don't need to do that. Just do df.write and then create a new dataframe with the read (Spark will read all the parts into a single df).
That is, the following would work just fine:
temp1.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv("/user/result/result1.csv")
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
.csv("/user/result/result1.csv").select("ID").distinct
I'm reading a text file where all the Hive queries are stored. I need to loop through the queries, execute them against the Hive database, and store the results back in Hive.
The code and its output are shown below. The RDD is read, and it invokes a method which executes the SQL queries against the Hive database and stores the results there.
[abbi1680#gw01 ~]$ hdfs dfs -put SQLQueries.csv /user/abbi1680/data/SQLQueries50.csv
--HDFS File
[abbi1680#gw01 ~]$ hdfs dfs -cat /user/abbi1680/data/SQLQueries50.csv
"abbi1680.PPPP","XXXX","select * from abbi1680.tbl1"
"abbi1680.QQQQ","YYYY","select * from abbi1680.tbl2"
scala> def HiveExec(TblName:String,dfName : String,HiveSQL: String) ={
| val dfName = sqlContext.sql(HiveSQL)
| dfName.write.mode("overwrite").saveAsTable(TblName)
| }
HiveExec: (TblName: String, dfName: String, HiveSQL: String)Unit
scala> val ReadQuery = sc.textFile("/user/abbi1680/data/SQLQueries50.csv").map(line => line.split(",")).map(x => HiveExec(x(0), x(1), x(2)))
ReadQuery: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[3] at map at <console>:29
hive (default)> use abbi1680;
hive (abbi1680)> show tables;
I'm expecting tables named PPPP and QQQQ to be created with the data from tbl1 and tbl2 respectively.
It hasn't created any tables or thrown any error.
Could someone please help?
Thanks for helping me out.
Since I don't have much time, I've chosen a different approach. The code below works:
cat SQLQueries50.csv
abbi1680.RKKKK,AAAAA,select * from abbi1680.tbl1
abbi1680.SPPPP,QQQQQ,select * from abbi1680.tbl2
val ReadQuery = sc.textFile("/user/abbi1680/data/SQLQueries50.csv")
val cnt = ReadQuery.count().toInt
for (line <- ReadQuery.take(cnt)) {
  val cols = line.split(",").map(_.trim)
  val TblName = cols(0)
  val dfName = cols(1)
  val HivSQL = cols(2)
  println(s"${TblName}|${dfName}|${HivSQL}")
  HiveExec(TblName, dfName, HivSQL)
}
def HiveExec(TblName: String, dfName: String, HiveSQL: String) = {
  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val df = sqlContext.sql(HiveSQL) // note: the dfName parameter is not actually used
  df.write.mode("overwrite").saveAsTable(TblName)
}
I'd like to avoid the for loop and complete the task using the map function instead, but that didn't work.
Any help would be appreciated.
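For what it's worth, a minimal sketch of a map-based variant: the RDD has to be collected to the driver first, because an RDD .map on its own is lazy (so HiveExec never runs without an action), and sqlContext cannot be used inside executor-side transformations anyway:

sc.textFile("/user/abbi1680/data/SQLQueries50.csv")
  .collect()                          // bring the (small) list of query lines to the driver
  .map(_.split(",").map(_.trim))      // parse each line into its three fields
  .foreach(cols => HiveExec(cols(0), cols(1), cols(2)))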
I am new to Spark. Here is something I wanna do.
I have created two data streams. The first one reads data from a text file and registers it as a temp table using HiveContext. The other one continuously gets RDDs from Kafka and, for each RDD, creates a DataFrame and registers its contents as a temp table. Finally, I join these two temp tables on a key to get the final result set. I want to insert that result set into a Hive table, but I am out of ideas. I tried to follow some examples, but they only create a table with one column in Hive, and even that is not readable. Could you please show me how to insert the results into a particular database and table in Hive? Please note that I can see the results of the join using the show function, so the real challenge lies with the insertion into the Hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10))
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) // extract the message payload from each (key, value) pair
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
We didn't follow through on our attempts to use partitions, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes when the table does not exist yet: any task that tries to create the table (because it didn't exist when that task first attempted to write to it) will throw an exception.
Be aware that writing to Hive from streaming applications is usually bad design, as you will often write many small files, which are very inefficient to read and store. So if you write to Hive more often than every hour or so, you should make sure you include logic for compaction, or add an intermediate storage layer better suited to transactional data.
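As a rough illustration only, this is how that write could slot into the question's foreachRDD loop; the table and variable names come from the question, while the coalesce factor and appending per micro-batch are assumptions (and the small-files caveat above still applies):

MSCCDRDStream.foreachRDD { MSCCDR =>
  if (MSCCDR.count() > 0) {
    val dataf = sqlContext.createDataFrame(MSCCDR)
    dataf.registerTempTable("hive_msc")
    val joined = sqlContext.sql(
      "select a.cdr_type, a.CGI, a.cdr_time, a.mins_int, b.Lat, b.Long, b.SiteID " +
        "from hive_msc a left join my_cgi_list b on a.CGI = b.CGI")
    joined
      .coalesce(1)              // assumed: one file per micro-batch to keep the file count down
      .write
      .format("orc")
      .mode(SaveMode.Append)
      .saveAsTable("omeralvi.msc_data")
  }
}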