How to pass CASE statements from an external file into .selectExpr("*", "all case statements") in Spark code - Scala

I have the CASE statements below in a SQL file.
Note - they are just sample statements, and I saved the file as col_sql.sql.
"CASE WHEN a = 1 THEN ONE END AS INT_VAL"
, "CASE WHEN a = 'DE' THEN 'ALPHABET' END AS STR_VAL"
In my Spark Scala code
I am reading col_sql.sql as below:
val col_file = "dir/path/col_sql.sql"
val col_query = readFile(col_file) // internally converted to a string using .mkString
Then I pass it to the select query in my Spark code:
.selectExpr("*", col_query )
Expectation --
My expectation is that when my Spark job runs, the CASE statements should be passed to the .selectExpr() function exactly as they are given in the SQL file, like below:
.selectExpr("*", "CASE WHEN a = 1 THEN ONE END AS INT_VAL", "CASE WHEN a = 'DE' THEN 'ALPHABET' END AS STR_VAL")
When running it manually in spark2-shell it works correctly, but in a spark2-submit job it throws a ParseDriver error.
Kindly assist me on this.

Each argument to selectExpr should resolve to one column (see the examples in the docs). In this case you will have to split the expression read from the file, e.g.:
// Example given the complete string; you could split already when reading the file
val col_query = "\"CASE WHEN a = 1 THEN ONE END AS INT_VAL\", \"CASE WHEN a = 'DE' THEN 'ALPHABET' END AS STR_VAL\""
val cols_queries = col_query.split(",").map(x => x.trim().stripPrefix("\"").stripSuffix("\""))
df.selectExpr("*" +: cols_queries: _*) // prepend "*" and expand the array into varargs
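For completeness, a minimal end-to-end sketch, assuming the expressions are stored one per line in col_sql.sql without surrounding double quotes and using the same df as in the answer above:
import scala.io.Source

// Sketch: read one SQL expression per line, drop blank lines,
// then prepend "*" and expand the array into selectExpr's varargs.
val src = Source.fromFile("dir/path/col_sql.sql")
val colExprs: Array[String] =
  try src.getLines().map(_.trim).filter(_.nonEmpty).toArray
  finally src.close()

val result = df.selectExpr("*" +: colExprs: _*)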

Related

Spark / Scala / SparkSQL dataframes filter issue "data type mismatch"

My problem is that I have code that receives the filter column and values as parameters in a list:
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser: List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList: List[String] = ListParser(0).split(",").map(_.trim).toList
if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
} else {
  responses.filter(vars + " AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
For all fields except the ones that contain dates the function works flawlessly, but for date fields it throws an error.
Note: I'm working with parquet files.
Here is the error; when I try to write the query manually I get the same one.
Here is how the query is sent to Spark SQL: the first one, where there is revenue, works, but the second one doesn't.
And when I just filter on dates, without the value of "vars" (which contains other columns), it works.
My issue was that I was mixing SQL and Spark. When I concatenated the SQL query in my variable "vars" with df.filter(), and especially when I used the between operator, it produced an output format unrecognized by Spark SQL:
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might seem correct, but after looking at the SQL documentation I saw it was missing parentheses (around vars); it needed to be:
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
The solution was to combine the two correctly; to do that I had to wrap the variable vars in expr, which produces the desired syntax:
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
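Putting it together, a sketch of the corrected branches, using the same variables as above (expr comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.expr

// Sketch: parse the SQL fragment in `vars` with expr() and combine it with the
// Column-based between() using &&, instead of concatenating strings.
val filtered =
  if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action")
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
  else
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))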

Trying to update a PostgreSQL database using Doobie but no update happens

I'm trying to update a table in a PostgreSQL database, passing a dynamic value using Doobie (functional JDBC). While executing the SQL statement I get the error below. Any help will be appreciated.
Code
Working code
sql"""UPDATE layout_lll
|SET runtime_params = 'testing string'
|WHERE run_id = '123-ksdjf-oreiwlds-9dadssls-kolb'
|""".stripMargin.update.quick.unsafeRunSync
Not working code
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = '${abcRunTimeParams}'
|WHERE run_id = '$runID'
|""".stripMargin.update.quick.unsafeRunSync
Error
Exception in thread "main" org.postgresql.util.PSQLException: The column index is out of range: 3, number of columns: 2.
Remove the ' quotes; Doobie makes sure they aren't needed. Doobie (and virtually any other DB library) uses parametrized queries, like:
UPDATE layout_lll
SET runtime_params = ?
WHERE run_id = ?
where ? will be replaced by the parameters passed later on. This:
makes SQL injection impossible
helps with spotting errors in SQL syntax
When you want to pass a parameter, the ' is part of the value passed, not part of the parametrized query, and Doobie (or the JDBC driver) will "add" it for you. The variables you pass are processed by Doobie; they aren't just pasted in like in normal string interpolation.
TL;DR Try running
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = ${abcRunTimeParams}
|WHERE run_id = $runID
|""".stripMargin.update.quick.unsafeRunSync

Saving Spark SQL output as CSV

I am trying to save the output of Spark SQL to a path, but I'm not sure what function to use. I want to do this without using Spark DataFrames. I tried write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out") but was not successful. Can someone tell me how to do it?
Note: Spark SQL will give the output in multiple files. I need to ensure that the data is sorted globally across all the files (parts), so that all words in part 0 come alphabetically before the words in part 1.
case class Docword(docId: Int, vocabId: Int, count: Int)
case class VocabWord(vocabId: Int, word: String)
// Read the input data
val docwords = spark.read.
  schema(Encoders.product[Docword].schema).
  option("delimiter", " ").
  csv("hdfs:///user/bdc_data/t3/docword.txt").
  as[Docword]
val vocab = spark.read.
  schema(Encoders.product[VocabWord].schema).
  option("delimiter", " ").
  csv("hdfs:///user/bdc_data/t3/vocab.txt").
  as[VocabWord]
docwords.createOrReplaceTempView("docwords")
vocab.createOrReplaceTempView("vocab")
spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""").show(10)
write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
// Required to exit the spark-shell
sys.exit(0)
.show() returns Unit (void); you should do something like below:
val writeDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""")
writeDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
writeDf.show() // this should not be used in prod environment
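If a single, globally sorted CSV file is acceptable, a sketch (not part of the original answer) is to coalesce to one partition after the ORDER BY; note that ORDER BY alone already gives a global sort across the part files via range partitioning:
// Sketch: coalesce(1) merges the sorted partitions into a single part file,
// preserving the ORDER BY. Only do this when the result fits on one node.
writeDf
  .coalesce(1)
  .write
  .mode("overwrite")
  .csv("file:///home/user204943816622/Task_3a-out")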

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First, read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression (adjust the syntax according to your database):
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid using the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3, 10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
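For completeness, a third sketch (not from either answer) that keeps the SQL string but stays inside a select: wrap the CASE expression in functions.expr, which is available since Spark 1.5:
import org.apache.spark.sql.functions.expr

// Sketch: same CASE expression as the selectExpr version, wrapped in expr()
val result2 = sd.select(
  sd("standardizationId"),
  expr("case when numberOfShares = 0 and isnull(derivatives, 0) = 0 " +
       "and holdingTypeId not in (3, 10) then 8 else holdingTypeId end").as("holdingTypeId")
)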

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but then when I do a SQL join I get an error that the column name cannot be resolved given the inputs. I have simplified the join just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the data frames. I modified your query just a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now, if you do a 'select *' as in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the above 'test' data frame definition, it has two fields with the same name (device_id). So to avoid this, I recommend changing the field name in one of the dataframes:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use either the DataFrame API or sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered, $"device" === $"device_id"
  && $"mnvr_bgn" < $"idx_trip_id", "inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and b.mnvr_bgn < a.idx_trip_id")
The above queries will work for your problem. If your data set is very large, I would recommend not using the '>' or '<' operators in the join condition, as they can cause a cross join, which is a costly operation on a large data set. Instead, use them in the WHERE condition, as in the sketch below.
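A sketch of that WHERE-clause variant (column names follow the simplified schema in the question, with the renamed b.device; a.idx_trip_end stands in for the OP's idx_trip_id, which is not in the simplified code, and testWhere is just an illustrative name):
// Sketch: keep the equi-join key in ON and move the range predicate to WHERE
val testWhere = sqlContext.sql("""SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end
  FROM raw_data_filtered_temp a
  INNER JOIN mnvr_temp_idx_prev_temp b
    ON a.device_id = b.device
  WHERE b.mnvr_bgn < a.idx_trip_end""")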