Spark / Scala / SparkSQL dataframes filter issue "data type mismatch" - scala

My problem is that I have code that receives the filter column and values as parameters, in a list:
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser : List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList : List[String] = ListParser(0).split(",").map(_.trim).toList
if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action"){
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
}else{
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
For all the fields except the ones that contain dates, the function works flawlessly, but for date fields it throws an error.
Note: I'm working with Parquet files.
Here is the error. When I try to write the filter manually, I get the same one. Here is how the query is sent to Spark SQL: the first one, with revenue, works, but the second one doesn't. And when I filter only on the dates, without the other columns in "vars", it works.

My issue was that I was mixing SQL and Spark: I was concatenating an SQL fragment (my variable "vars") with df.filter(), in particular together with the between operator, which produced a filter string that Spark SQL could not recognize:
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might look correct, but after checking the SQL documentation I saw that it was missing parentheses (around vars); it needed to be:
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
So the solution was to combine the two parts correctly: wrap the vars string in expr, which produces the desired syntax:
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
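Putting it together with the branching logic from the question, the corrected code might look like this (just a sketch that combines the original if/else with the expr-based filter):
import org.apache.spark.sql.functions.expr

val filtered =
  if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
    // date-like columns: keep the bounds as strings
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
  } else {
    // numeric columns: convert the bounds to Int
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
  }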

Related

Springboot JPA Cast numeric substring to string

I have a JPA/Spring Boot application backed by a Postgres database. I need to get the records where a column value equals a substring passed back to the server.
For example:
Select * from dp1_attachments where TRIM(RIGHT(dp1_submit_date_dp1_number::text, 5)) ='00007'
This query works in PgAdmin, but not in the JPA @Query statement.
#Query("SELECT a.attachmentsFolder as attachmentsFolder, a.attachmentNumber as attachmentNumber, a.attachmentName as attachmentName, a.dp1SubmitDateDp1Number as dp1SubmitDateDp1Number,a.attachmentType as attachmentType, a.attachmentDate as attachmentDate, a.attachmentBy as attachmentBy "
+ "FROM DP1Attachments a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number::text, 5 )) = :dp1Number")
I've also tried CASTing the parameter like this:
#Query("SELECT a.attachmentsFolder as attachmentsFolder, a.attachmentNumber as attachmentNumber, a.attachmentName as attachmentName, a.dp1SubmitDateDp1Number as dp1SubmitDateDp1Number,a.attachmentType as attachmentType, a.attachmentDate as attachmentDate, a.attachmentBy as attachmentBy "
+ "FROM DP1Attachments a WHERE TRIM(SUBSTRING(CAST(a.dp1SubmitDateDp1Number as string, 5 ))) = :dp1Number")
but the application won't even run, and returns an error that the query isn't valid.
If I make no attempt to cast it, I get an error that function pg_catalog.substring(numeric, integer) does not exist
UPDATE
I've also tried creating a native query instead but that also doesn't seem to work.
List<DP1AttachmentsProjection> results = em.createNativeQuery("Select * FROM dp1_attachments WHERE TRIM(RIGHT(CAST(dp1_submit_date_dp1_number as varchar),5)) =" + dp1Number).getResultList();
In place of varchar I have also tried string and text.
Errors come back similar to ERROR: operator does not exist: text = integer. It's like the CAST is being ignored, and I'm not sure why.
I also tried the following as a native query:
em.createNativeQuery("Select * FROM dp1_attachments WHERE TRIM(RIGHT(dp1_submit_date_dp1_number::varchar),5)) =" + dp1Number).getResultList();
and get ERROR: syntax error at or near ":"
FINAL SOLUTION
Thanks to @Nenad J, I altered the query to get the final working solution:
#Query(value = "SELECT a.attachments_Folder as attachmentsFolder, a.attachment_Number as attachmentNumber, a.attachment_Name as attachmentName, a.dp1_Submit_Date_Dp1_Number as dp1SubmitDateDp1Number,a.attachment_Type as attachmentType, a.attachment_Date as attachmentDate, a.attachment_By as attachmentBy FROM DP1_Attachments a WHERE TRIM(RIGHT(CAST(a.dp1_Submit_Date_Dp1_Number as varchar ), 5 )) = :dp1Number", nativeQuery = true)"
By default, substring returns a string, so substring(integer data, 5) returns a string; thus there is no need for the cast.
#Query("SELECT * FROM DP1Attachments a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number, 5)) = :dp1Number")
But in this case I recommend using a native query, like this:
Put this code in your attachment repository.
#Query(value="SELECT * FROM DP1Attachments AS a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number, 5 )) = :dp1Number", nativeQuery=true)
Be careful with the column names.

How to pass CASE statements into .selectExpr("*", "all case statements") in Spark code, from an external file

I have the CASE statements below in a SQL file.
Note: this is just a sample, and I saved it as col_sql.sql.
"CASE WHEN a = 1 THEN ONE END AS INT_VAL"
, "CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL"
In the Spark Scala code, I'm reading col_sql.sql as below:
val col_file = "dir/path/col_sql.sql"
val col_query = readFile(col_file) // internally converted to a String using .mkString
Then I pass it to the select in the Spark code:
.selectExpr("*", col_query )
Expectation:
My expectation is that when my Spark job runs, the CASE statements are passed to .selectExpr() exactly as given in the SQL file, like the call shown below.
When running it manually in spark2-shell it works correctly, but in a spark2-submit job it throws a ParserDriver error.
Kindly assist me with this.
.selectExpr("*", "CASE WHEN a = 1 THEN ONE END AS INT_VAL", "CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL")
Each argument in selectExpr should resolve to one column (see examples in the doc). In this case you will have to split the expression read from the file, e.g.:
// Example given the complete string, you could split already when reading the file
val col_query = "\"CASE WHEN a = 1 THEN ONE END AS INT_VAL\", \"CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL\""
val cols_queries = col_query.split(",").map(x => x.trim().stripPrefix("\"").stripSuffix("\""))
df.selectExpr("*", cols_queries: _*) // to expand the list into arguments

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE WHEN THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First, read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression; adjust the syntax to your database:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3, 10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
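For reference, here is a minimal, self-contained sketch of the same when/otherwise logic on a small hypothetical DataFrame (the sample data and the sqlContext.implicits._ import are assumptions for illustration, not part of the question):
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical rows: (standardizationId, numberOfShares, derivatives, holdingTypeId)
val demo = Seq[(Int, Int, Option[Int], Int)](
  (1, 0, None, 5),    // meets all three conditions -> holdingTypeId becomes 8
  (2, 10, Some(0), 3) // does not meet them -> holdingTypeId stays 3
).toDF("standardizationId", "numberOfShares", "derivatives", "holdingTypeId")

val result = demo.select(
  demo("standardizationId"),
  when(
    (demo("numberOfShares") === 0) and
    (coalesce(demo("derivatives"), lit(0)) === 0) and
    !demo("holdingTypeId").isin(3, 10),
    8
  ).otherwise(demo("holdingTypeId")).as("holdingTypeId")
)
result.show()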

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two DataFrames and print out their column names, and they are as expected, but then when I do a SQL join I get an error that the column name cannot be resolved given the inputs. I have simplified the join just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the DataFrame mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the DataFrames. I modified your query just a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' as in the statement above, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the definition of the 'test' DataFrame above, it has two fields with the same name (device_id). To avoid this, I recommend changing the field name in one of the DataFrames:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use the DataFrame API or the sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
The above queries will work for your problem. And if your data set is very large, I would recommend not using the '>' or '<' operators in the join condition, as they cause a cross join, which is a costly operation on a large data set. Instead, use them in the WHERE condition, as sketched below.
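For illustration, a minimal sketch of that pattern with the DataFrames above (this assumes idx_trip_id is also present in raw_data_filtered, and uses the renamed device column):
// Equality condition in the join, range condition pushed into where()
val test = raw_data_filtered
  .join(mnvr_temp_idx_prev, $"device_id" === $"device", "inner")
  .where($"mnvr_bgn" < $"idx_trip_id")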

npgsql: Selecting null data throws an exception with the error "Column is Null"

I am running npgsql v3.7 with .NET Core on Ubuntu.
When I execute a select query and any cell in the results is null, an exception is thrown with the error message "Column is null".
I am having to work around this by wrapping every column in the select clause in a CASE statement that tests for NULL:
"CASE WHEN " + fieldName + " IS NULL THEN '' ELSE " + fieldName + " END "
This seems a bit extreme and should not be necessary. Has anyone else come across this?
Thanks.
You are probably trying to read the column like this:
using (var reader = cmd.ExecuteReader()) {
    reader.Read();
    var o = reader.GetString(0); // Or any of the other Get methods on the reader
    ...
}
This code will fail if the column contains a NULL, and that is the expected behavior. In ADO.NET, you need to check for a null value with reader.IsDBNull(0) before actually getting the value; that's just how the database API works.
I don't know why NULL values are giving you errors, but you can do away with the ugly CASE statement in favor of using COALESCE:
"COALESCE(" + fieldName + ", '')"
Ideally you should make a configuration change such that NULL values do not cause this problem.