String to JSON converter in Scala Spark

I have a string message in one of my tables:
+------+---------------------------------------------------------------------------------------------------------------+
| cola | colb |
+------+---------------------------------------------------------------------------------------------------------------+
| 1234 | {"root":[{"key1":{"key1a":["3"],"key1b":["hj"],"key1c": []}},{"key2":{"key2a":["3"],"key2b":[],"key2c":[]}}]} |
| 1235 | {"root":[{"key1":{"key1a":["3"],"key1b":["hj"],"key1c":[]}},{"key2":{"key2a":["3"],"key2b":[],"key2c":[]}}]} |
+------+---------------------------------------------------------------------------------------------------------------+
I'm looking for output like the one below. Can you please help me with this?
1234, key1,key1a,{3,..}
1234, key2,key2a,{3,..}
I tried various options such as to_json and get_json_object, but did not get the desired result.
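The desired flattening can be sketched outside Spark in plain Python, assuming the JSON structure shown above (in Spark the equivalent would be from_json with a schema plus explode, but the traversal logic is the same):

```python
import json

def flatten(cola, colb):
    """Flatten the nested JSON string into (cola, outer_key, inner_key, values) rows."""
    rows = []
    for entry in json.loads(colb)["root"]:        # each list element wraps one key ("key1", "key2", ...)
        for outer_key, inner in entry.items():
            for inner_key, values in inner.items():
                if values:                        # skip empty arrays such as "key1c": []
                    rows.append((cola, outer_key, inner_key, values))
    return rows

colb = ('{"root":[{"key1":{"key1a":["3"],"key1b":["hj"],"key1c":[]}},'
        '{"key2":{"key2a":["3"],"key2b":[],"key2c":[]}}]}')
print(flatten(1234, colb))
```

Running this prints one row per non-empty inner array, e.g. `(1234, 'key1', 'key1a', ['3'])`, which mirrors the `1234, key1, key1a, {3,..}` shape asked for above.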

Related

pyspark join dataframes on each word and compare list with string

I am using pyspark and I have 2 tables:
table REF_A
id | name
---------
1 | help
2 | need
3 | hello
4 | hel
Table DATA_B contains a list
| sentence |
----------------------------
| [I , say , hello, to, you]|
| [I , need , your, help] |
I need to join the 2 tables in a way to have this result:
id | name | sentence |
---------------------------------
1 | help | I need your help
2 | need | I need your help |
3 | hello | I say hello to you|
Because REF_A contains the keyword "need", I need to match it with the sentence containing "need", which is "I need your help".
Thank you for your help
REF_A.createOrReplaceTempView('REF_A')
DATA_B.createOrReplaceTempView('DATA_B')
spark.sql('SELECT * FROM REF_A LEFT JOIN DATA_B ON ARRAY_CONTAINS(DATA_B.sentence, REF_A.name);')
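The ARRAY_CONTAINS join above can be sketched in plain Python to check the expected result (the names here mirror the tables in the question):

```python
ref_a = [(1, "help"), (2, "need"), (3, "hello"), (4, "hel")]
data_b = [["I", "say", "hello", "to", "you"], ["I", "need", "your", "help"]]

# Emulate: SELECT * FROM REF_A JOIN DATA_B ON ARRAY_CONTAINS(DATA_B.sentence, REF_A.name)
# ARRAY_CONTAINS requires an exact element match, so "hel" (id 4) matches nothing,
# which is why it does not appear in the expected output above.
result = [(id_, name, " ".join(sentence))
          for id_, name in ref_a
          for sentence in data_b
          if name in sentence]
print(result)
```

Note that with the LEFT JOIN in the SQL above, id 4 would still appear, with nulls in the DATA_B columns; an inner join drops it as in this sketch.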

Spark scala finding value in another dataframe

Hello, I'm fairly new to Spark and I need help with this little exercise. I want to find certain values in another dataframe, but if those values aren't present I want to reduce the length of each value until I find a match. I have these dataframes:
----------------
|values_to_find|
----------------
| ABCDE |
| CBDEA |
| ACDEA |
| EACBA |
----------------
------------------
| list | Id |
------------------
| EAC | 1 |
| ACDE | 2 |
| CBDEA | 3 |
| ABC | 4 |
------------------
And I expect the next output:
--------------------------------
| Id | list | values_to_find |
--------------------------------
| 4 | ABC | ABCDE |
| 3 | CBDEA | CBDEA |
| 2 | ACDE | ACDEA |
| 1 | EAC | EACBA |
--------------------------------
For example, ABCDE isn't present, so I reduce its length by one (ABCD); again it doesn't match anything, so I reduce it again and this time I get ABC, which matches, so I use that value to join and form a new dataframe. There is no need to worry about duplicate values when reducing the length, but I need to find the exact match. Also, I would like to avoid using a UDF if possible.
I'm using a foreach to get every value in the first dataframe, and I can do a substring there (if there is no match), but I'm not sure how to look up these values in the 2nd dataframe. What's the best way to do it? I've seen tons of UDFs that could do the trick, but I want to avoid that as stated before.
df1.foreach { values_to_find =>
df1.get(0).toString.substring(0, 4)}
Edit: Those dataframes are examples; I have many more values. The solution should be dynamic: iterate over some values and find their match in another dataframe, with the catch that I need to reduce their length if they aren't present.
Thanks for the help!
You can register the dataframes as temporary views and write SQL. Is this the first time you are implementing this scenario in Spark, or did you already implement it in a legacy system before Spark? With Spark you have the freedom to write a UDF in Scala or to use SQL. Sorry, I don't have a solution handy, so I'm just giving a pointer.
The following should help you:
val dataDF1 = Seq((4,"ABC"),(3,"CBDEA"),(2,"ACDE"),(1,"EAC")).toDF("Id","list")
val dataDF2 = Seq(("ABCDE"),("CBDEA"),("ACDEA"),("EACBA")).toDF("compare")
dataDF1.createOrReplaceTempView("table1")
dataDF2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 on table1.list like concat('%',SUBSTRING(table2.compare,1,3),'%')").show()
Output:
+---+-----+-------+
| Id| list|compare|
+---+-----+-------+
| 4| ABC| ABCDE|
| 3|CBDEA| CBDEA|
| 2| ACDE| ACDEA|
| 1| EAC| EACBA|
+---+-----+-------+
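Note that the SQL above hardcodes a 3-character prefix, which happens to fit this sample but is not the dynamic reduction the question asks for. The shorten-until-match logic can be sketched in plain Python (in Spark the same idea works by generating all prefixes of each value and joining on equality, keeping the longest match):

```python
lookup = {"EAC": 1, "ACDE": 2, "CBDEA": 3, "ABC": 4}  # list value -> Id

def find_match(value):
    """Shorten value one character at a time until it equals an entry in lookup."""
    candidate = value
    while candidate:
        if candidate in lookup:
            return (lookup[candidate], candidate, value)  # (Id, list, values_to_find)
        candidate = candidate[:-1]                        # drop the last character and retry
    return None                                           # no prefix matched at all

print([find_match(v) for v in ["ABCDE", "CBDEA", "ACDEA", "EACBA"]])
```

This reproduces the expected output table: ABCDE shortens to ABC (Id 4), CBDEA matches directly (Id 3), and so on.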

Include null values in collect_list in pyspark

I am trying to include null values in collect_list while using pyspark; however, the collect_list operation excludes nulls. I have looked into the following post: Pyspark - Retain null values when using collect_list. However, the answer given is not what I am looking for.
I have a dataframe df like this.
| id | family | date |
----------------------------
| 1 | Prod | null |
| 2 | Dev | 2019-02-02 |
| 3 | Prod | 2017-03-08 |
Here's my code so far:
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
This gives me an output like this:
| family | date |
-----------------------
| Prod |[2017-03-08]|
| Dev |[2019-02-02]|
What I really want is as follows:
| family | date |
-----------------------------
| Prod |[null, 2017-03-08]|
| Dev |[2019-02-02] |
Can someone please help me with this? Thank you!
A possible workaround for this could be to replace all null-values with another value. (Perhaps not the best way to do this, but it's a solution nonetheless)
df = df.na.fill("my_null") # Replace null with "my_null"
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
Should give you:
| family | date |
-----------------------------
| Prod |[my_null, 2017-03-08]|
| Dev |[2019-02-02] |
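The underlying goal, grouping while keeping nulls in the collected list, can be sketched in plain Python (after the sentinel trick above, you could replace "my_null" back with null in a later step):

```python
from collections import defaultdict

rows = [(1, "Prod", None), (2, "Dev", "2019-02-02"), (3, "Prod", "2017-03-08")]

# Group dates by family, keeping None entries
# (unlike Spark's collect_list, which silently drops them).
grouped = defaultdict(list)
for _id, family, date in rows:
    grouped[family].append(date)

print(dict(grouped))
```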

Combine multiple rows of the same column into one row

I have following data from my database:
+------+-----------+-------------+---------------+
| ID | SomeValue | SomeDate | SomeOtherDate |
+------+-----------+-------------+---------------+
| 123 | 12345 | 01.01.2017 | 01.01.2018 |
| 123 | 54321 | 01.01.2017 | 01.01.2019 |
| 123 | 25314 | 01.01.2017 | 01.01.2020 |
+------+-----------+-------------+---------------+
I want the following format in Crystal Reports:
+------+---------------+---------------+---------------+
| ID | SomeValue2018 | SomeValue2019 | SomeValue2020 |
+------+---------------+---------------+---------------+
| 123 | 12345 | 54321 | 25314 |
+------+---------------+---------------+---------------+
How can I do this, if it's even possible? I've tried multiple examples but can't seem to make it work. I was able to make the headings successfully.
It is difficult to make Crystal Reports evaluate things horizontally. The entire system is designed to evaluate vertically. (Things on the top of a report are evaluated before things below them.)
However, you might be able to get a CrossTab to do this. First you'd want to make a new Formula field for your columns. It would be structured something like:
"SomeValue" + ToText(Year({#SomeOtherDate}), 0, "")
When you design the crosstab, you'll want to set ID as the Row, your new formula from above as the Column, and the Summarized Field will be SomeValue. You'll also want to suppress the Grand Totals in the Customize Style.

T SQL merge example needed to help comprehension

The following:
MERGE dbo.commissions_history AS target
USING (SELECT @amount, @requestID) AS source (amount, request)
ON (target.request = source.request)
WHEN MATCHED THEN
UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN
INSERT (request, amount)
VALUES (source.request, source.amount);
from https://stackoverflow.com/a/2967983/857994 is a pretty nifty way to do insert/update (and delete with some added work). I'm finding it hard to follow though even after some googling.
Can someone please:
explain this a little in simple terms - the MSDN documentation mutilated my brain in this case.
show me how it could be modified so the user can type in values for amount & request instead of having them selected from another database location?
Basically, I'd like to use this to insert/update from a C# app with information taken from XML files I'm getting. So, I need to understand how I can formulate a query manually to get my parsed data into the database with this mechanism.
If you aren't familiar with join statements then that is where you need to start. Understanding how joins work is key to the rest. Once you're familiar with joins then understanding the merge is easiest by thinking of it as a full join with instructions on what to do for rows that do or do not match.
So, using the code sample provided, let's look at the table commissions_history:
| Amount | Request | <other fields> |
--------------------------------------------
| 12.00 | 1234 | <other data> |
| 14.00 | 1235 | <other data> |
| 15.00 | 1236 | <other data> |
The merge statement creates a full join between a table, called the "target" and an expression that returns a table (or a result set that is logically very similar to a table like a CTE) called the "source".
In the example given it is using variables as the source which we'll assume have been set by the user or passed as a parameter.
DECLARE @Amount Decimal = 18.00;
DECLARE @Request Int = 1234;
MERGE dbo.commissions_history AS target
USING (SELECT @amount, @requestID) AS source (amount, request)
ON (target.request = source.request)
Creates the following result set when thought of as a join.
| Amount | Request | <other fields> | Source.Amount | Source.Request |
------------------------------------------------------------------------------
| 12.00 | 1234 | <other data> | 18.00 | 1234 |
| 14.00 | 1235 | <other data> | null | null |
| 15.00 | 1236 | <other data> | null | null |
The statement then applies the instruction given for rows where a match was found:
WHEN MATCHED THEN
UPDATE SET amount = source.amount
The resulting target table now looks like this; the row with request 1234 has been updated to 18.00:
| Amount | Request | <other fields> |
--------------------------------------------
| 18.00 | 1234 | <other data> |
| 14.00 | 1235 | <other data> |
| 15.00 | 1236 | <other data> |
Since a match WAS found, nothing else happens. But let's say that the values from the source were like this:
DECLARE @Amount Decimal = 18.00;
DECLARE @Request Int = 1239;
The resulting join would look like this:
| Amount | Request | <other fields> | Source.Amount | Source.Request |
------------------------------------------------------------------------------
| 12.00 | 1234 | <other data> | null | null |
| 14.00 | 1235 | <other data> | null | null |
| 15.00 | 1236 | <other data> | null | null |
| null | null | null | 18.00 | 1239 |
Since a matching row was not found in the target the statement executes the other clause.
WHEN NOT MATCHED THEN
INSERT (request, amount)
VALUES (source.request, source.amount);
Resulting in a target table that now looks like this:
| Amount | Request | <other fields> |
--------------------------------------------
| 12.00 | 1234 | <other data> |
| 14.00 | 1235 | <other data> |
| 15.00 | 1236 | <other data> |
| 18.00 | 1239 | <other data> |
The merge statement's true potential shows when the source and target are both large tables, since it can perform a large number of updates and/or inserts for each source row with a single, simple statement.
A final note: it's important to keep in mind that NOT MATCHED defaults to the full clause NOT MATCHED BY TARGET; however, you can specify NOT MATCHED BY SOURCE in place of, or in addition to, the default clause. The merge statement supports both types of mismatch (records in source not in target, or records in target not in source, as defined by the ON clause). You can find full documentation, restrictions, and complete syntax on MSDN.
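The matched/not-matched behaviour walked through above is essentially an upsert keyed on request. A plain Python sketch of the same semantics, using a list of rows as the target table:

```python
def merge(target_rows, source):
    """Emulate the MERGE: for each source row, update the matching target row
    (WHEN MATCHED) or append a new one (WHEN NOT MATCHED)."""
    for s_request, s_amount in source:
        matched = False
        for row in target_rows:
            if row["request"] == s_request:   # ON (target.request = source.request)
                row["amount"] = s_amount      # WHEN MATCHED THEN UPDATE
                matched = True
        if not matched:                       # WHEN NOT MATCHED THEN INSERT
            target_rows.append({"request": s_request, "amount": s_amount})
    return target_rows

commissions = [{"request": 1234, "amount": 12.00},
               {"request": 1235, "amount": 14.00},
               {"request": 1236, "amount": 15.00}]
merge(commissions, [(1234, 18.00), (1239, 18.00)])  # one update, one insert
print(commissions)
```

After the call, request 1234 holds 18.00 and a new row for request 1239 has been appended, matching the two scenarios in the tables above.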
In the given answer example you've declared
DECLARE @Request Int
but reference it in the SQL as follows:
SELECT @amount, @requestID
so the declared names (@Amount, @Request) don't match the referenced names (@amount, @requestID). The fix is to name and reference each variable identically, e.g. declare @Amount and @Request and use exactly those names in the SELECT.