How to write the select and case statement together in Scala

I am new to Scala. I have the SQL below that needs to be converted to Scala; I have pasted what I tried, but I am getting an error.
SQL code:
select jess,
mark,
timestamp1,
timestamp2,
(CASE WHEN timestamp1 > timestamp2 THEN null ELSE salary END) as salary,
(CASE WHEN timestamp1 > timestamp2 THEN null ELSE manager END) as manager
Scala code I tried:
df.select (jess,
mark,
timestamp1,
timestamp2,
salary
)
.withColumn("salary", when($"timestamp1">$"timstamp2", salary ).otherwise("null"))
Is there a different way to write this?

As mentioned in the comments above, this would be easier with the error message, but here is what I can see for now:
You have a missing comma in your select after the "mark" column.
In your example you are first doing a select without the "salary" column, but then you are trying to use this column in when/otherwise. Include salary in the first select, or do the withColumn before selecting.
Edit: If you want to have case/when inside select you can do it this way:
import org.apache.spark.sql.functions.{when, col, lit}
df.select(col("jess"),
col("mark"),
col("timestamp1"),
col("timestamp2"),
when(col("timestamp1") > col("timestamp2"), lit(null))
.otherwise(col("salary")).alias("salary")
)
(Note that the when/otherwise order follows the SQL above: null when timestamp1 > timestamp2, otherwise the salary.)
If you want to find out more about case/when please read this: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/

Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
---------+--------------------------------------------------------------------------------------------------------------------------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------
12345678 | null              | null              | 06/01/2020        | E                 |                   |
98765432 | 05/18/2020        | B                 |                   |                   | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------+--------------+------------+-----
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
select id, e.k, e.v,
regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
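If you do want real date columns, the same final select can wrap the values in to_date(). A sketch, reusing the parse CTE above and assuming the MM/DD/YYYY format shown in the sample data:
select id,
to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;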
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an average of 7 sec on my local machine to an average of 0.9 sec.
Here is the resulting query:
with parse as (
select id, e.k, e.v,
split_part(e.k, ' ', 1) as k_no_date,
trim(split_part(e.k, ' ', 2),'()') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
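To see what those two calls produce for a single key, here is a quick standalone check:
select split_part('Date (05/19/2020)', ' ', 1) as k_no_date,
trim(split_part('Date (05/19/2020)', ' ', 2), '()') as k_date_only;
-- k_no_date = 'Date', k_date_only = '05/19/2020'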

TSQL - Parsing substring out of larger string

I have a bunch of rows with values that look like below. It's a JSON extract that I unfortunately have to parse out and load. My JSON parsing tool for some reason doesn't want to parse this full column out, so I need to do it in T-SQL. I only need the unique_id field:
[{"unique_id":"12345","system_type":"Test System."}]
I tried the SQL below, but it's only returning the first 5 characters of the whole column. I know what the issue is: I need to tell the substring to continue until the 4th set of quotes, which comes after the value. I'm not sure how to code the substring like that.
select substring([jsonfield],CHARINDEX('[{"unique_id":"',[jsonfield]),
CHARINDEX('"',[jsonfield]) - CHARINDEX('[{"unique_id":"',[jsonfield]) +
LEN('"')) from etl.my_test_table
Can anyone help me with this?
Thank you, I appreciate it!
Since you tagged SQL Server 2016, why not use OPENJSON()?
Here's an example:
DECLARE @TestData TABLE
(
[SampleData] NVARCHAR(MAX)
);
INSERT INTO @TestData (
[SampleData]
)
VALUES ( N'[{"unique_id":"12345","system_type":"Test System."}]' )
,( N'[{"unique_id":"1234567","system_type":"Test System."},{"unique_id":"1234567_2","system_type":"Test System."}]' )
SELECT b.[unique_id]
FROM @TestData [a]
CROSS APPLY
OPENJSON([a].[SampleData], '$')
WITH (
[unique_id] NVARCHAR(100) '$.unique_id'
) AS [b];
Giving you:
unique_id
---------------
12345
1234567
1234567_2
You can get all the fields as well, just add them to the WITH clause:
SELECT [b].[unique_id]
, [b].[system_type]
FROM @TestData [a]
CROSS APPLY
OPENJSON([a].[SampleData], '$')
WITH (
[unique_id] NVARCHAR(100) '$.unique_id'
, [system_type] NVARCHAR(100) '$.system_type'
) AS [b];
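Giving you (derived from the same test data):
unique_id       system_type
--------------- ---------------
12345           Test System.
1234567         Test System.
1234567_2       Test System.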
Take it step by step
First, get everything to the left of ","system_type":"
SELECT LEFT(jsonfield, CHARINDEX('","system_type":"', jsonfield) - 1) AS S
FROM -- etc
Then take everything to the right of "unique_id":"
SELECT RIGHT(S, LEN(S) - (CHARINDEX('"unique_id":"', S) + 12)) AS Result -- 12 = LEN('"unique_id":"') - 1
FROM (
SELECT LEFT(jsonfield, CHARINDEX('","system_type":"', jsonfield) - 1) AS S
FROM -- etc
) X
Note, I did not test this so it could be off by one or have a syntax error, but you get the idea.
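As a quick self-contained check (the question's sample value inlined as test data), the expressions above do isolate the id:
SELECT RIGHT(S, LEN(S) - (CHARINDEX('"unique_id":"', S) + 12)) AS Result
FROM (
SELECT LEFT(jsonfield, CHARINDEX('","system_type":"', jsonfield) - 1) AS S
FROM (VALUES (N'[{"unique_id":"12345","system_type":"Test System."}]')) AS t(jsonfield)
) AS X;
-- Result: 12345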
If your larger string is just a simple JSON as posted, the solution is very easy:
SELECT
JSON_VALUE(N'[{"unique_id":"12345","system_type":"Test System."}]','$[0].unique_id');
JSON_VALUE() needs SQL Server 2016 and will extract one single value from a specified path.
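Applied to the question's table (jsonfield and etl.my_test_table taken from the question's query; note this reads only the first array element), that would look something like:
SELECT JSON_VALUE([jsonfield], '$[0].unique_id') AS unique_id
FROM etl.my_test_table;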

QueryDSL coalesce in order by

I'm trying to join two tables, output them, and sort them alphabetically by two fields, like order by coalesce(tableA.name, tableB.name) (NOT order by tableA.name, tableB.name), so the result should be something like:
tableA.name tableB.name
A null
B null
null C
D null
null E
In plain SQL it works fine, but when I try to do it with QueryDSL it adds an additional column to the generated select statement and sorts only by the first specified column:
//java code
query.orderBy(qTableA.name.coalesce(qTableB.name).asc());
//generated sql code
SELECT ...
COALESCE(tablea_.NAME, tableb_.NAME) AS col_9_0_
FROM ...
WHERE ...
ORDER BY tablea1_.NAME ASC
Can somebody tell me why it does that, and whether it is possible to make it work as I expect?
Try this:
final Coalesce<String> coalesce =
new Coalesce<>(String.class).add(qTableA.name).add(qTableB.name);
Use the coalesce in your select fields as well as in your order by clause:
.orderBy(coalesce.asc()) // or desc()

MDX query not accepting date values

I'm an SSAS newbie and I'm trying to query a cube to retrieve data against some measure groups, ordered by date. I wish to specify the date range in my query. The query I'm using is this:
SELECT
{
[Measures].[Measure1],
[Measures].[Measure2],
[Measures].[Measure3]
}
ON COLUMNS,
NON EMPTY{
[Date].[AllMembers]
}
ON ROWS
FROM (SELECT ( STRTOMEMBER('2/23/2013', CONSTRAINED) :
STRTOMEMBER('3/1/2013', CONSTRAINED) ) ON COLUMNS
FROM [MyCube])
However it gives me the following error
Query (10, 16) The restrictions imposed by the CONSTRAINED flag in the STRTOMEMBER function were violated.
I tried removing the CONSTRAINED keyword and then even the STRTOMEMBER function, but in each case I got the following errors, respectively:
Query (10, 16) The STRTOMEMBER function expects a member expression for the 1 argument. A string or numeric expression was used.
and
Query (10, 14) The : function expects a member expression for the 1 argument. A string or numeric expression was used.
I can understand from the last two errors that I need to include the CONSTRAINED keyword. But can anyone tell me why this query won't execute?
The string that you pass as the member expression must be a fully-qualified member name, or resolve to one. Use the same format as you did in the SELECT.
For example:
STRTOMEMBER('[Date].[2/23/2013]', CONSTRAINED)
Edit: I just noticed the syntax of your range select looks wrong -- you need to use {...}, not (...).
SELECT {
STRTOMEMBER('2/23/2013', CONSTRAINED) :
STRTOMEMBER('3/1/2013', CONSTRAINED) }
Please execute the script below. Extract your date dimension attribute's member, copy it by right-clicking, and paste it into the STRTOMEMBER value. It will work fine.
SELECT NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM ( SELECT ( STRTOMEMBER('[Date].[Date].&[20050701]') :
STRTOMEMBER('[Date].[Date].&[20061007]') ) ON COLUMNS
FROM [Adventure Works])
The same range subselect, parameterized (SSRS-style @ parameters):
FROM ( SELECT (
STRTOMEMBER(@FromDateCalendarDate, CONSTRAINED) :
STRTOMEMBER(@ToDateCalendarDate, CONSTRAINED) ) ON COLUMNS
FROM [Adventure Works])

Combining INSERT INTO and WITH/CTE

I have a very complex CTE and I would like to insert the result into a physical table.
Is the following valid?
INSERT INTO dbo.prf_BatchItemAdditionalAPartyNos
(
BatchID,
AccountNo,
APartyNo,
SourceRowID
)
WITH tab (
-- some query
)
SELECT * FROM tab
I am thinking of using a function to create this CTE, which would allow me to reuse it. Any thoughts?
You need to put the CTE first and then combine the INSERT INTO with your select statement. Also, the "AS" keyword following the CTE's name is not optional:
WITH tab AS (
bla bla
)
INSERT INTO dbo.prf_BatchItemAdditionalAPartyNos (
BatchID,
AccountNo,
APartyNo,
SourceRowID
)
SELECT * FROM tab
Please note that the code assumes that the CTE will return exactly four fields and that those fields match, in order and type, those specified in the INSERT statement.
If that is not the case, just replace the "SELECT *" with a specific select of the fields that you require.
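For example, with the question's columns:
SELECT BatchID, AccountNo, APartyNo, SourceRowID FROM tab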
As for your question on using a function, I would say "it depends". If you are putting the data in a table just because of performance reasons, and the speed is acceptable when using it through a function, then I'd consider a function to be an option.
On the other hand, if you need to use the result of the CTE in several different queries, and speed is already an issue, I'd go for a table (either regular, or temp).
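If you do go the function route, an inline table-valued function can wrap the CTE so it is reusable. A minimal sketch (the function name and the stub row are hypothetical; the real CTE body goes inside):
CREATE FUNCTION dbo.fn_AdditionalAPartyNos ()
RETURNS TABLE
AS
RETURN
(
WITH tab AS (
-- stand-in for the complex CTE body
SELECT 1 AS BatchID, N'ACC-1' AS AccountNo, N'AP-1' AS APartyNo, 100 AS SourceRowID
)
SELECT BatchID, AccountNo, APartyNo, SourceRowID
FROM tab
);
The insert then becomes:
INSERT INTO dbo.prf_BatchItemAdditionalAPartyNos (BatchID, AccountNo, APartyNo, SourceRowID)
SELECT BatchID, AccountNo, APartyNo, SourceRowID
FROM dbo.fn_AdditionalAPartyNos();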
WITH common_table_expression (Transact-SQL)
The WITH clause for Common Table Expressions goes at the top.
Wrapping every insert in a CTE has the benefit of visually segregating the query logic from the column mapping.
Spot the mistake:
WITH _INSERT_ AS (
SELECT
[BatchID] = blah
,[APartyNo] = blahblah
,[SourceRowID] = blahblahblah
FROM Table1 AS t1
)
INSERT Table2
([BatchID], [SourceRowID], [APartyNo])
SELECT [BatchID], [APartyNo], [SourceRowID]
FROM _INSERT_
Same mistake:
INSERT Table2 (
[BatchID]
,[SourceRowID]
,[APartyNo]
)
SELECT
[BatchID] = blah
,[APartyNo] = blahblah
,[SourceRowID] = blahblahblah
FROM Table1 AS t1
A few lines of boilerplate make it extremely easy to verify the code inserts the right number of columns in the right order, even with a very large number of columns. Your future self will thank you later.
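For reference, here is the corrected version of the examples above, with the column lists matching in both places:
WITH _INSERT_ AS (
SELECT
[BatchID] = blah
,[APartyNo] = blahblah
,[SourceRowID] = blahblahblah
FROM Table1 AS t1
)
INSERT Table2
([BatchID], [APartyNo], [SourceRowID])
SELECT [BatchID], [APartyNo], [SourceRowID]
FROM _INSERT_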
Yep:
WITH tab AS (
bla bla
)
INSERT INTO dbo.prf_BatchItemAdditionalAPartyNos
(BatchID, AccountNo, APartyNo, SourceRowID)
SELECT * FROM tab
Note that this is for SQL Server, which supports multiple CTEs:
WITH x AS (...), y AS (...) INSERT INTO z (a, b, c) SELECT a, b, c FROM y
Teradata allows only one CTE and the syntax is as your example.
Late to the party here, but for my purposes I wanted to be able to run the code the user inputted and store it in a temp table. Using Oracle there were no such issues: the insert is at the start of the statement, before the with clause.
For this to work in SQL Server, the following worked:
INSERT INTO #stagetable EXECUTE (@InputSql)
(so the select statement in @InputSql can start with a WITH clause).
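A minimal self-contained sketch of that pattern (the CTE text and table layout are just for illustration):
DECLARE @InputSql NVARCHAR(MAX) =
N'WITH cte AS (SELECT 1 AS id, N''A'' AS val) SELECT id, val FROM cte';

CREATE TABLE #stagetable (id INT, val NVARCHAR(10));

-- the dynamic batch may begin with a WITH clause; the INSERT wraps it from outside
INSERT INTO #stagetable
EXECUTE (@InputSql);

SELECT * FROM #stagetable; -- returns: 1, A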