I need to use a variable that I've created before in Spark to select data from a Teradata table:
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
val df = sqlContext.sql(query)
val dfv = df.select("cod_contrato")
The variable is a string, so I would like to query the database using that vector of strings.
If I use:
%spark
val sql = s"(SELECT * FROM xx2.CONTRATOS where cod_contrato in '$dfv') as query"
I get:
(SELECT * FROM xx2.CONTRATOS where cod_contrato in '[cod_contrato: string]') as query
The desired result would be:
SELECT * FROM xx2.CONTRATOS where cod_contrato in ('11111', '11112' )
How can I transform the vector into a list enclosed by parentheses, with quotation marks around each element?
Thanks
This is my attempt. From some dataframe,
val test = df.select("id").as[String].collect
> test: Array[String] = Array(6597, 8011, 2597, 5022, 5022, 6852, 6852, 5611, 14838, 14838, 2588, 2588)
so test is now an array. Thus, by using mkString,
val sql = s"SELECT * FROM xx2.CONTRATOS where cod_contrato in " + test.mkString("('", "','", "')") + " as query"
> sql: String = SELECT * FROM xx2.CONTRATOS where cod_contrato in ('6597','8011','2597','5022','5022','6852','6852','5611','14838','14838','2588','2588') as query
where the final result is now a string.
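If the goal, as in the question, is to push this filter down to Teradata as a JDBC subquery, the string would typically be passed as the dbtable option, wrapped in parentheses and aliased. A minimal sketch, assuming the Teradata JDBC driver is available and using placeholder connection details:
// a sketch only: build the parenthesised subquery and use it as the JDBC "table"
val inList = test.mkString("('", "','", "')")
val dbtable = s"(SELECT * FROM xx2.CONTRATOS WHERE cod_contrato IN $inList) AS query"
val filtered = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:teradata://<host>/DATABASE=xx2")   // placeholder URL
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("user", "<user>")                                // placeholder credentials
  .option("password", "<password>")
  .option("dbtable", dbtable)
  .load()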
Make a temp view of the values you want to filter on and then reference it in the query:
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
sqlContext.sql(query).selectExpr("cast(cod_contrato as string)").createOrReplaceTempView("dfv_table")
val sql = "(SELECT * FROM xx2.CONTRATOS where cod_contrato in (select * from dfv_table)) as query"
This will work for the query in Spark SQL, but it will not return a query string. Lamanus's answer should be sufficient if all you want is the query as a string.
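For completeness, the filter itself can then run entirely in Spark SQL against the temp view; a sketch, assuming xx2.CONTRATOS is also visible to Spark (for example as a Hive table or another registered view):
// a sketch: IN-subquery filter against the temp view created above
val contratosFiltrados = sqlContext.sql(
  """SELECT *
    |FROM xx2.CONTRATOS
    |WHERE cod_contrato IN (SELECT cod_contrato FROM dfv_table)""".stripMargin)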
Related
I'm writing because I don't know how to execute a Snowflake procedure from Azure Databricks.
This is my Snowflake procedure:
CREATE OR REPLACE PROCEDURE getBalanceFrontAndInTotalFront(tableName VARCHAR, stringBalanceFront VARCHAR, stringInTotalFront VARCHAR)
RETURNS VARCHAR
NOT NULL
LANGUAGE javascript
AS
$$
var tableName = TABLENAME;
var balanceFront = STRINGBALANCEFRONT;
var inTotalFront = STRINGINTOTALFRONT;
// Dynamically compose the SQL statement to execute.
var sqlCommand = "SELECT BALANCE_FRONT, IN_TOTAL_FRONT, SUM(AMOUNT) AS AMOUNT FROM (SELECT " + balanceFront + " AS \"BALANCE_FRONT\", AMOUNT, " + inTotalFront + " AS \"IN_TOTAL_FRONT\" FROM " + tableName + ") GROUP BY BALANCE_FRONT, IN_TOTAL_FRONT";
// Prepare statement.
var stmt = snowflake.createStatement({sqlText: sqlCommand});
// Execute Statement
var rs = stmt.execute();
var arrayValues = [];
while (rs.next()) {
var column1 = rs.getColumnValue(1);
var column2 = rs.getColumnValue(2);
var column3 = rs.getColumnValue(3);
arrayValues.push([column1 + ':' + column3 + ':' + column2]);
}
return arrayValues;
$$;
When I execute the procedure in Snowflake
set stringBalanceFront = 'CASE WHEN Balance_Type like (\'%A%\')THEN \'ACTIVO\' WHEN Balance_Type like (\'%P%\') THEN \'PASIVO\' WHEN Balance_Type like (\'%N%\') THEN \'NETO\' ELSE \'RESTO\' END';
set stringInTotalFront = 'CASE WHEN Balance_Type like (\'%A%\')THEN \'true\' ELSE \'false\' END';
CALL getBalanceFrontAndInTotalFront('DMAAS_OUTPUT_DATA_TABLE_0049_D18CER', $stringBalanceFront, $stringInTotalFront);
I obtain the following array of strings:
RESTO:-184281744:false,ACTIVO:-17881395:true,NETO:20599:false,PASIVO:12672:false
I am trying to run this procedure from Spark with the following code, and it obviously fails:
val stringBalanceFront = Funciones.generarCondiciones(dfOrdenado, Variables.CAMPO_BALANCE_FRONT.toLowerCase())
val stringInTotalFront = Funciones.generarCondiciones(dfOrdenado, Variables.CAMPO_IN_TOTAL_FRONT.toLowerCase())
val query = s"CALL getBalanceFrontAndInTotalFront(${cfgVal.getRutaMasterNoAgregada}, ${stringBalanceFront}, ${stringInTotalFront});"
val arrayBalanceFront = spark.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(snowOptionsRead)
.option("query", query)
.load()
And I get the following error:
21/07/15 17:14:36 ERROR Uncaught throwable from user code: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
syntax error line 1 at position 15 unexpected 'CALL'.
What is the correct way to execute a Snowflake procedure from Spark? Keep in mind that I want to return the results to a val in Spark.
Thanks in advance!
Best regards.
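One likely cause of the error is that the connector's query option wraps the statement in an outer SELECT, which cannot contain a CALL. A common workaround is to run the CALL over plain JDBC with the Snowflake driver and read the returned VARCHAR from the ResultSet. A minimal sketch, assuming the Snowflake JDBC driver is on the classpath and using placeholder connection details; quoting/escaping of the argument strings is left to the caller:
import java.sql.DriverManager
import java.util.Properties

// a sketch, not the connector API: execute the CALL over plain JDBC and collect its return value
val props = new Properties()
props.put("user", "<user>")            // placeholder connection details
props.put("password", "<password>")
props.put("warehouse", "<warehouse>")
props.put("db", "<database>")
props.put("schema", "<schema>")

val conn = DriverManager.getConnection("jdbc:snowflake://<account>.snowflakecomputing.com", props)
try {
  // the argument strings must be passed as properly quoted SQL literals;
  // embedded single quotes would need escaping before this point
  val call = "CALL getBalanceFrontAndInTotalFront(" +
    s"'${cfgVal.getRutaMasterNoAgregada}', '$stringBalanceFront', '$stringInTotalFront')"
  val rs = conn.createStatement().executeQuery(call)
  val arrayBalanceFront = if (rs.next()) rs.getString(1) else ""  // the procedure returns one VARCHAR
} finally {
  conn.close()
}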
I have an HQL file which accepts several arguments, and in a standalone Spark application I am calling this HQL script to create a dataframe.
This is a sample hql code from my script:
select id , name, age, country , created_date
from ${db1}.${table1} a
inner join ${db2}.${table2} b
on a.id = b.id
And this is how I am calling it in my Spark script:
import scala.io.Source
val queryFile = "path/to/my/file"
val db1 = "cust_db"
val db2 = "cust_db2"
val table1 = "customer"
val table2 = "products"
val query = Source.fromFile(queryFile).mkString
val df = spark.sql(query)
When I run it this way, I am getting:
org.apache.spark.sql.catalyst.parser.ParseException
Is there a way to pass arguments directly to my HQL file and then create a DataFrame out of the Hive code?
Parameters can be injected with code like this:
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))
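Putting it together with the question's code, the substituted query can then be handed to spark.sql; a short sketch using the variable names from the question:
import scala.io.Source

val query = Source.fromFile(queryFile).mkString
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
// replace each ${name} placeholder in the HQL text with its value before parsing
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))
val df = spark.sql(injectedQuery)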
I want to generate a query by using a list in PySpark
list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM table WHERE email IN (" + list + ")"
This is my desired output:
query
SELECT * FROM table WHERE email IN ("hi#gmail.com", "goodbye#gmail.com")
Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects
Can anyone help me achieve this? Thanks
If someone's having the same issue, I found that you can use the following code:
"'"+"','".join(map(str, emails))+"'"
and you will have the following output:
SELECT * FROM table WHERE email IN ('hi#gmail.com', 'goodbye#gmail.com')
Try this:
Dataframe based approach -
from pyspark.sql.functions import col

df = spark.createDataFrame([(1,"hi#gmail.com"), (2,"goodbye#gmail.com"), (3,"abc#gmail.com"), (4,"xyz#gmail.com")], ['id','email_id'])
email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
df.where(col('email_id').isin(email_filter_list)).show()
Spark SQL based approach -
df = spark.createDataFrame([(1,"hi#gmail.com") ,(2,"goodbye#gmail.com",),(3,"abc#gmail.com",),(4,"xyz#gmail.com")], ['id','email_id'])
df.createOrReplaceTempView('t1')
sql_filter = ','.join(["'" +i + "'" for i in email_filter_list])
spark.sql("SELECT * FROM t1 WHERE email_id IN ({})".format(sql_filter)).show()
I want to get all the tables names from a sql query in Spark using Scala.
Let's say a user sends a SQL query which looks like:
select * from table_1 as a left join table_2 as b on a.id=b.id
I would like to get all tables list like table_1 and table_2.
Is regex the only option?
Thanks a lot @Swapnil Chougule for the answer. That inspired me to offer an idiomatic way of collecting all the tables in a structured query.
scala> spark.version
res0: String = 2.3.1
def getTables(query: String): Seq[String] = {
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
Hope it will help you.
Parse the given query using the Spark SQL parser (Spark internally does the same). You can get sqlParser from the session's state; it will give you the logical plan of the query. Iterate over the logical plan, check whether each node is an instance of UnresolvedRelation (a leaf logical operator representing a table reference that has yet to be resolved), and get the table from it.
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def getTables(query: String): Seq[String] = {
val logical: LogicalPlan = localsparkSession.sessionState.sqlParser.parsePlan(query)
val tables = scala.collection.mutable.LinkedHashSet.empty[String]
var i = 0
while (true) {
if (logical(i) == null) {
return tables.toSeq
} else if (logical(i).isInstanceOf[UnresolvedRelation]) {
val tableIdentifier = logical(i).asInstanceOf[UnresolvedRelation].tableIdentifier
tables += tableIdentifier.unquotedString.toLowerCase
}
i = i + 1
}
tables.toSeq
}
I had some complicated SQL queries with nested queries and iterated on @Jacek Laskowski's answer to get this:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

def getTables(spark: SparkSession, query: String): Seq[String] = {
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
var tables = new ListBuffer[String]()
var i: Int = 0
while (logicalPlan(i) != null) {
logicalPlan(i) match {
case t: UnresolvedRelation => tables += t.tableName
case _ =>
}
i += 1
}
tables.toList
}
def __sqlparse2table(self, query):
'''
#description: get table names from a query
'''
plan = self.spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
plan_string = plan.toString().replace('`.`', '.')
unr = re.findall(r"UnresolvedRelation `(.*?)`", plan_string)
cte = re.findall(r"CTE \[(.*?)\]", plan.toString())
cte = [tt.strip() for tt in cte[0].split(',')] if cte else cte
schema = set()
tables = set()
for table_name in unr:
if table_name not in cte:
schema.update([table_name.split('.')[0]])
tables.update([table_name])
return schema, tables
Since you need to list all the column names in table1 and table2, what you can do is run show tables in db.table_name in your Hive db.
val tbl_column1 = sqlContext.sql("show tables in table1");
val tbl_column2 = sqlContext.sql("show tables in table2");
You will get the list of columns in both tables.
tbl_column1.show
name
id
data
Unix did the trick:
grep 'INTO\|FROM\|JOIN' *.sql | sed -r 's/.*?(FROM|INTO|JOIN)\s*?([^ ]*).*/\2/g' | sort -u
grep 'overwrite table' *.txt | sed -r 's/.*?(overwrite table)\s*?([^ ]*).*/\2/g' | sort -u
How can I execute lengthy, multiline Hive queries in Spark SQL? Like the query below:
val sqlContext = new HiveContext (sc)
val result = sqlContext.sql ("
select ...
from ...
");
Use """ instead; for example:
val results = sqlContext.sql ("""
select ....
from ....
""");
or, if you want to format the code, use:
val results = sqlContext.sql ("""
|select ....
|from ....
""".stripMargin);
You can use triple-quotes at the start/end of the SQL code or a backslash at the end of each line.
val results = sqlContext.sql ("""
create table enta.scd_fullfilled_entitlement as
select *
from my_table
""");
results = sqlContext.sql (" \
create table enta.scd_fullfilled_entitlement as \
select * \
from my_table \
")
val query = """(SELECT
a.AcctBranchName,
c.CustomerNum,
c.SourceCustomerId,
a.SourceAccountId,
a.AccountNum,
c.FullName,
c.LastName,
c.BirthDate,
a.Balance,
case when [RollOverStatus] = 'Y' then 'Yes' Else 'No' end as RollOverStatus
FROM
v_Account AS a left join v_Customer AS c
ON c.CustomerID = a.CustomerID AND c.Businessdate = a.Businessdate
WHERE
a.Category = 'Deposit' AND
c.Businessdate= '2018-11-28' AND
isnull(a.Classification,'N/A') IN ('Contractual Account','Non-Term Deposit','Term Deposit')
AND IsActive = 'Yes' ) tmp """
It is worth noting that the length is not the issue, just how the query is written. For this you can use """ as Gaweda suggested, or simply build a string variable, e.g. with a StringBuilder. For example:
val selectElements = Seq("a","b","c")
val builder = StringBuilder.newBuilder
builder.append("select ")
builder.append(selectElements.mkString(","))
builder.append(" where d<10")
val results = sqlContext.sql(builder.toString())
In addition to the above ways, you can use plain string concatenation as well:
val results = sqlContext.sql("select .... " +
  " from .... " +
  " where .... " +
  " group by ....")
Write your SQL inside triple quotes, like """ sql code """.
df = spark.sql(f""" select * from table1 """)
This is the same for Scala Spark and PySpark.