I have an HQL file which accepts several arguments, and in a standalone Spark application I am calling this HQL script to create a DataFrame.
This is a sample of the HQL code from my script:
select id , name, age, country , created_date
from ${db1}.${table1} a
inner join ${db2}.${table2} b
on a.id = b.id
And this is how I am calling it in my Spark script:
import scala.io.Source
val queryFile = "path/to/my/file"
val db1 = "cust_db"
val db2 = "cust_db2"
val table1 = "customer"
val table2 = "products"
val query = Source.fromFile(queryFile).mkString
val df = spark.sql(query)
When I run it this way, I get:
org.apache.spark.sql.catalyst.parser.ParseException
Is there a way to pass arguments directly to my HQL file and then create a DataFrame out of the Hive code?
Parameters can be injected with such code:
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))
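Putting the pieces together, here is a minimal self-contained sketch. The `injectParams` helper name and the sample values are made up for illustration; only the final commented-out call assumes a live SparkSession:

```scala
// Replace every ${key} placeholder in the query with its value.
def injectParams(query: String, params: Map[String, String]): String =
  params.foldLeft(query) { case (acc, (k, v)) => acc.replace("${" + k + "}", v) }

val raw =
  """select id, name, age, country, created_date
    |from ${db1}.${table1} a
    |inner join ${db2}.${table2} b
    |on a.id = b.id""".stripMargin

val injected = injectParams(raw, Map(
  "db1" -> "cust_db", "table1" -> "customer",
  "db2" -> "cust_db2", "table2" -> "products"))

// With a live SparkSession you would then run:
// val df = spark.sql(injected)
```

Because the substitution happens before `spark.sql` sees the text, the parser only ever receives fully resolved table names, which avoids the ParseException on the `${...}` tokens.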
I need to use a variable that I've created earlier in Spark to select data from a Teradata table:
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
val df = sqlContext.sql(query)
val dfv = df.select("cod_contrato")
the variable is a string.
So I would like to query the database using that vector of strings:
If I use:
%spark
val sql = s"(SELECT * FROM xx2.CONTRATOS where cod_contrato in '$dfv') as query"
I get:
(SELECT * FROM xx2.CONTRATOS where cod_contrato in '[cod_contrato: string]') as query
The desired result would be:
SELECT * FROM xx2.CONTRATOS where cod_contrato in ('11111', '11112' )
How can I transform the vector into a list enclosed by () with quotation marks around each element?
Thanks
This is my attempt. Starting from some DataFrame,
val test = df.select("id").as[String].collect
> test: Array[String] = Array(6597, 8011, 2597, 5022, 5022, 6852, 6852, 5611, 14838, 14838, 2588, 2588)
and so test is now an Array[String]. Thus, using mkString,
val sql = s"SELECT * FROM xx2.CONTRATOS where cod_contrato in " + test.mkString("('", "','", "')") + " as query"
> sql: String = SELECT * FROM xx2.CONTRATOS where cod_contrato in ('6597','8011','2597','5022','5022','6852','6852','5611','14838','14838','2588','2588') as query
where the final result is now a string.
Make a temp view of the values you want to filter on, and then reference it in the query:
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
sqlContext.sql(query).selectExpr("cast(cod_contrato as string)").createOrReplaceTempView("dfv_table")
val sql = "(SELECT * FROM xx2.CONTRATOS where cod_contrato in (select * from dfv_table)) as query"
This will work for the query in Spark SQL, but it will not return a query string. Lamanus's answer should be sufficient if all you want is the query as a string.
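If you do want the query as a string, a small helper can build the IN list more safely than plain concatenation. This is a hypothetical sketch (not from the answers above): it deduplicates the collected values and escapes embedded single quotes so the generated SQL stays well-formed:

```scala
// Build an IN-clause literal from collected values, deduplicating and
// doubling single quotes so the generated SQL stays well-formed.
def toInClause(values: Seq[String]): String =
  values.distinct
    .map(v => "'" + v.replace("'", "''") + "'")
    .mkString("(", ", ", ")")

val codes = Seq("11111", "11112", "11112")
val sql = s"SELECT * FROM xx2.CONTRATOS WHERE cod_contrato IN ${toInClause(codes)}"
```

Note that this only guards against quoting accidents in the values; for genuinely untrusted input, the temp-view join above is the safer route.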
I am using Spark and I would like to know how to create a temporary table named C by executing a SQL query on tables A and B:
sqlContext
.read.json(file_name_A)
.createOrReplaceTempView("A")
sqlContext
.read.json(file_name_B)
.createOrReplaceTempView("B")
val tableQuery = "(SELECT A.id, B.name FROM A INNER JOIN B ON A.id = B.fk_id) C"
sqlContext.read
.format(SQLUtils.FORMAT_JDBC)
.options(SQLUtils.CONFIG())
.option("dbtable", tableQuery)
.load()
You need to save your result as a temp table. Note that createOrReplaceTempView is a method on a DataFrame, not on the query string:
val df = sqlContext.read.format(SQLUtils.FORMAT_JDBC).options(SQLUtils.CONFIG()).option("dbtable", tableQuery).load()
df.createOrReplaceTempView("C")
For permanent storage in an external table you can use JDBC:
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "vaquar")
prop.setProperty("password", "khan")
//jdbc mysql url - destination database is named "temp"
val url = "jdbc:mysql://localhost:3306/temp"
//destination database table
val dbtable = "sample_data_table"
//write data from spark dataframe to database
df.write.mode("append").jdbc(url, dbtable, prop)
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
http://spark.apache.org/docs/latest/sql-programming-guide.html#saving-to-persistent-tables
sqlContext.read.json(file_name_A).createOrReplaceTempView("A")
sqlContext.read.json(file_name_B).createOrReplaceTempView("B")
val tableQuery = "SELECT A.id, B.name FROM A INNER JOIN B ON A.id = B.fk_id"
sqlContext.sql(tableQuery).createOrReplaceTempView("C")
Try the above code; it should work. Note that sqlContext.sql takes a plain SELECT, without the "(...) C" subquery-alias wrapping that the JDBC dbtable option expects.
I want to get all the tables names from a sql query in Spark using Scala.
Let's say a user sends a SQL query which looks like:
select * from table_1 as a left join table_2 as b on a.id=b.id
I would like to get all tables list like table_1 and table_2.
Is regex the only option?
Thanks a lot @Swapnil Chougule for the answer. That inspired me to offer an idiomatic way of collecting all the tables in a structured query.
scala> spark.version
res0: String = 2.3.1
def getTables(query: String): Seq[String] = {
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
Hope it helps.
Parse the given query using the Spark SQL parser (Spark internally does the same). You can get the sqlParser from the session's state; it returns the logical plan of the query. Iterate over the logical plan, check whether each node is an instance of UnresolvedRelation (the leaf logical operator representing a table reference that has yet to be resolved), and read the table name from it.
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def getTables(query: String): Seq[String] = {
  val logical: LogicalPlan = localsparkSession.sessionState.sqlParser.parsePlan(query)
  val tables = scala.collection.mutable.LinkedHashSet.empty[String]
  var i = 0
  // logical(i) returns the i-th node of the plan tree, or null past the end
  while (logical(i) != null) {
    logical(i) match {
      case r: UnresolvedRelation => tables += r.tableIdentifier.unquotedString.toLowerCase
      case _ =>
    }
    i += 1
  }
  tables.toSeq
}
I had some complicated SQL queries with nested subqueries and iterated on @Jacek Laskowski's answer to get this:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

def getTables(spark: SparkSession, query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  val tables = new ListBuffer[String]()
  var i = 0
  while (logicalPlan(i) != null) {
    logicalPlan(i) match {
      case t: UnresolvedRelation => tables += t.tableName
      case _ =>
    }
    i += 1
  }
  tables.toList
}
def __sqlparse2table(self, query):
    '''
    Get the schemas and table names referenced in a query,
    excluding CTE names. Requires: import re
    '''
    import re
    plan = self.spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_string = plan.toString().replace('`.`', '.')
    unr = re.findall(r"UnresolvedRelation `(.*?)`", plan_string)
    cte = re.findall(r"CTE \[(.*?)\]", plan.toString())
    cte = [tt.strip() for tt in cte[0].split(',')] if cte else cte
    schema = set()
    tables = set()
    for table_name in unr:
        if table_name not in cte:
            schema.update([table_name.split('.')[0]])
            tables.update([table_name])
    return schema, tables
Since you need to list all the column names in table1 and table2, you can run show columns against each table in your Hive db:
val tbl_column1 = sqlContext.sql("show columns in db.table1");
val tbl_column2 = sqlContext.sql("show columns in db.table2");
You will get the list of columns in both tables.
tbl_column1.show
name
id
data
Unix did the trick:
grep 'INTO\|FROM\|JOIN' *.sql | sed -r 's/.*?(FROM|INTO|JOIN)\s*?([^ ]*).*/\2/g' | sort -u
grep 'overwrite table' *.txt | sed -r 's/.*?(overwrite table)\s*?([^ ]*).*/\2/g' | sort -u
How can I execute lengthy, multiline Hive Queries in Spark SQL? Like query below:
val sqlContext = new HiveContext (sc)
val result = sqlContext.sql ("
select ...
from ...
");
Use """ instead, so for example
val results = sqlContext.sql ("""
select ....
from ....
""");
or, if you want to format code, use:
val results = sqlContext.sql ("""
|select ....
|from ....
""".stripMargin);
You can use triple-quotes at the start/end of the SQL code, or (in PySpark only) a backslash at the end of each line.
val results = sqlContext.sql ("""
create table enta.scd_fullfilled_entitlement as
select *
from my_table
""");
results = sqlContext.sql (" \
create table enta.scd_fullfilled_entitlement as \
select * \
from my_table \
")
val query = """(SELECT
a.AcctBranchName,
c.CustomerNum,
c.SourceCustomerId,
a.SourceAccountId,
a.AccountNum,
c.FullName,
c.LastName,
c.BirthDate,
a.Balance,
case when [RollOverStatus] = 'Y' then 'Yes' Else 'No' end as RollOverStatus
FROM
v_Account AS a left join v_Customer AS c
ON c.CustomerID = a.CustomerID AND c.Businessdate = a.Businessdate
WHERE
a.Category = 'Deposit' AND
c.Businessdate= '2018-11-28' AND
isnull(a.Classification,'N/A') IN ('Contractual Account','Non-Term Deposit','Term Deposit')
AND IsActive = 'Yes' ) tmp """
It is worth noting that the length is not the issue, only how the string is written. For this you can use """ as Gaweda suggested, or simply build the query in a string variable, e.g. with a StringBuilder. For example:
val selectElements = Seq("a", "b", "c")
val builder = new StringBuilder
builder.append("select ")
builder.append(selectElements.mkString(","))
builder.append(" from my_table where d < 10")
val results = sqlContext.sql(builder.toString())
In addition to the above ways, you can use the below-mentioned way as well:
val results = sqlContext.sql("select .... " +
" from .... " +
" where .... " +
" group by ....
");
Write your SQL inside triple quotes, like """ sql code """:
df = spark.sql(""" select * from table1 """)
The same triple-quote approach works in both Scala Spark and PySpark.
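When the multiline query also needs parameters, Scala's s-interpolator combines naturally with triple quotes and stripMargin. A small sketch (the table and filter values here are made up for illustration; only the commented-out call assumes a live SparkSession):

```scala
// Hypothetical table/filter values interpolated into a multiline query.
val table = "cust_db.customer"
val minAge = 18
val query =
  s"""select id, name, age
     |from $table
     |where age >= $minAge""".stripMargin
// With a live SparkSession: val df = spark.sql(query)
```

This keeps the formatted, multiline layout while still letting you substitute table names and values at runtime.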
Using Spark 1.5.1, Hive 1.2.1
When I run this snippet under spark-shell --master yarn --deploy-mode client:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
var queryLeft = "SELECT t1.* FROM (SELECT t2.*, row_number() over (PARTITION BY CAST(TRIM(t2.pk) as DECIMAL(31,8)) ORDER BY t2.create_dt DESC) AS R FROM myschema.mytable t2 WHERE t2.part_dt='mydate' AND t2.part_seq='myseq') t1 WHERE t1.R = 1"
val dfLeft = hiveContext.sql(queryLeft)
val firstCount = dfLeft.count
val secondCount = dfLeft.count
I get this result; both counts are wrong (and unequal!):
scala> print (firstCount, secondCount)
(1865,2373)
When I run the same snippet under a local spark-shell, I get the right results:
scala> print (firstCount, secondCount)
(2395,2395)
Is there anything I'm doing wrong?