Converting DataFrame column into array using group by key - scala

I am working with Spark DataFrames and I need to group by the employee, designation and company columns, and collect the remaining column values of the grouped rows into an array in a new column. Example:
Input:
employee | Company Address | designation | company | Home Adress
--------------------------------------------------
Micheal | NY | Head | xyz | YN
Micheal | NJ | Head | xyz | YM
Output:
employee | designation | company | Address
--------------------------------------------------
Micheal | Head | xyz | [Company Address : NY , Home Adress : YN], [Company Address : NJ , Home Adress : YM]
Any help is highly appreciated!

Below is a solution in Spark that collects the addresses into an array of structs instead of JSON:
from pyspark.sql.functions import col, collect_list, struct

# Build the sample input
df1 = sc.parallelize([['Micheal', 'NY', 'head', 'XYZ', 'YN'],
                      ['Micheal', 'NJ', 'head', 'XYZ', 'YM']]) \
    .toDF(["Employee", "Company Address", "designation", "company", "Home Adress"])

# Group by the key columns and collect the two address columns into an array of structs
df2 = df1.groupBy("Employee", "designation", "company") \
    .agg(collect_list(struct(col("Company Address"), col("Home Adress"))).alias("Address"))

df2.show(1, False)
Output:
+--------+-----------+-------+--------------------+
|Employee|designation|company|Address |
+--------+-----------+-------+--------------------+
|Micheal |head |XYZ |[[NY, YN], [NJ, YM]]|
+--------+-----------+-------+--------------------+
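Since the question asks for Scala, here is a minimal sketch of the same approach with the Scala DataFrame API (assuming a SparkSession in scope as spark; the column names mirror the PySpark example above):
import org.apache.spark.sql.functions.{col, collect_list, struct}
import spark.implicits._

// Sample input, mirroring the PySpark example above
val df1 = Seq(
  ("Micheal", "NY", "head", "XYZ", "YN"),
  ("Micheal", "NJ", "head", "XYZ", "YM")
).toDF("Employee", "Company Address", "designation", "company", "Home Adress")

// Group by the key columns and collect the two address columns into an array of structs
val df2 = df1.groupBy("Employee", "designation", "company")
  .agg(collect_list(struct(col("Company Address"), col("Home Adress"))).alias("Address"))

df2.show(false)
The resulting Address column is an array of structs, so each element keeps the company address and home address together.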

Related

pyspark join dataframes on each word and compare list with string

I am using pyspark and I have 2 tables:
table REF_A
id | name
---------
1 | help
2 | need
3 | hello
4 | hel
Table DATA_B has a column containing lists:
| sentence |
----------------------------
| [I , say , hello, to, you]|
| [I , need , your, help] |
I need to join the 2 tables in a way to have this result:
id | name | sentence |
---------------------------------
1 | help | I need your help
2 | need | I need your help |
3 | hello | I say hello to you|
Because REF_A has the keyword "need", I need to match it with the sentence containing "need", which is "I need your help".
Thank you for your help
# Register both DataFrames as temp views so they can be joined in SQL
REF_A.createOrReplaceTempView('REF_A')
DATA_B.createOrReplaceTempView('DATA_B')

# ARRAY_CONTAINS(sentence, name) is true when the keyword appears in the sentence array
spark.sql('SELECT * FROM REF_A LEFT JOIN DATA_B ON ARRAY_CONTAINS(DATA_B.sentence, REF_A.name)')
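If you prefer the DataFrame API over the SQL string, the same left join can be expressed with array_contains. A minimal Scala sketch (the function has the same name in pyspark.sql.functions, so the PySpark version is nearly identical):
import org.apache.spark.sql.functions.array_contains

// Keep every keyword from REF_A and attach any sentence array that contains it
val joined = REF_A.join(
  DATA_B,
  array_contains(DATA_B("sentence"), REF_A("name")),
  "left"
)
Because it is a left join, keywords that never appear in any sentence (such as "hel") come back with a null sentence.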

postgresql write a materialized view query to include base record and no of records matching

I have two tables in PostgreSQL: users and orders.
users table
userid | username | usertype
1 | John | F
2 | Bob | P
orders table
userid | orderid | ordername
1 | 001 | Mobile
1 | 002 | TV
1 | 003 | Laptop
2 | 001 | Book
2 | 002 | Kindle
Now I want to write a query for a PostgreSQL materialized view that will give me output like below:
userid | username | Base Order Name |No of Orders | User Type
1 | John | Mobile | 3 | F - Free
2 | Bob | Book | 2 | P - Premium
I have tried the query below, but it gives five records as output instead of two, and I could not figure out how to show the usertype as F - Free / P - Premium.
CREATE MATERIALIZED VIEW userorders
TABLESPACE pg_default
AS
SELECT
u.userid,
username,
(select count(orderid) from orders where userid = u.userid)
as no_of_orders,
(select ordername from orders where orderid=1 and userid = u.userid)
as baseorder
FROM users u
INNER JOIN orders o ON u.userid = o.userid
WITH DATA;
It gives a result like below:
userid | username | no_of_orders | baseorder
1 | John | 3 | Mobile
1 | John | 3 | Mobile
1 | John | 3 | Mobile
2 | Bob | 2 | Book
2 | Bob | 2 | Book
Assume the base order id is always 001. In the final materialized view, the user type should be returned as F - Free / P - Premium by some mapping in the query.
Use GROUP BY and this becomes pretty trivial. The only slightly complex part is getting the base order name, which can be accomplished with FILTER. Wrap the query below in your CREATE MATERIALIZED VIEW ... AS ... WITH DATA statement as in your attempt:
SELECT users.userid,
       username,
       MAX(ordername) FILTER (WHERE orderid = '001') AS "Base Order Name",
       COUNT(orderid) AS "No of Orders",
       CASE WHEN usertype = 'F' THEN 'F - Free'
            WHEN usertype = 'P' THEN 'P - Premium'
       END AS "User Type"
FROM users
JOIN orders ON users.userid = orders.userid
GROUP BY users.userid, users.username, users.usertype;

Spark SQL Map only one column of DataFrame

Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a Scala function on one or more columns to produce another column
2) use a Scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
You can use withColumn to add a new column or to replace an existing column, as in:
import org.apache.spark.sql.functions.lower
import spark.implicits._  // spark is your SparkSession; provides $ and toDF

val df = Seq(
  (1, "Mary", "ABCD"),
  (2, "Joey", "DOGE"),
  (3, "Lane", "POOP"),
  (4, "Jack", "MEGA"),
  (5, "Lynn", "ARGH")
).toDF("id", "name", "data")

// startsWith adds a boolean column; lower replaces data with its lowercase value
val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))
If you want separate DataFrames, then:
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if an existing column name is provided, and creates a new column if a new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+
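If the transformation you need is an arbitrary Scala function rather than a built-in like lower or startsWith, you can wrap it in a UDF and use it with withColumn in exactly the same way. A minimal sketch (startsWithVowel is a made-up function for illustration):
import org.apache.spark.sql.functions.udf

// Hypothetical Scala function to apply to the "data" column
def startsWithVowel(s: String): Boolean =
  s != null && s.nonEmpty && "AEIOU".indexOf(s.head.toUpper) >= 0

// Wrap it as a UDF so it can be used in withColumn / select
val startsWithVowelUdf = udf(startsWithVowel _)

val withFlag = df.withColumn("startsWithVowel", startsWithVowelUdf($"data"))
Built-in functions are generally preferable when they exist, since UDFs are opaque to the Catalyst optimizer.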

Translating SQL query to Tableau

I am trying to translate the following SQL query into Tableau:
select store1.name, store1.city, store1.order_date
from store1
where order_date = (select max(store2.order_date) from store2
where store2.name = store1.name
and store2.city = store1.city)
I am quite new to Tableau and can't figure out how to translate the where clause as it is selecting from another table.
For example, given the following tables
Store 1:
Name | City | Order Date
Andrew | Boston | 23-Aug-16
Bob | Boston | 31-Jan-17
Cathy | Boston | 31-Jan-17
Cathy | San Diego | 19-Jan-17
Dan | New York | 3-Dec-16
Store 2:
Name | City | Order Date
Andrew | Boston | 2-Sep-16
Brandy | Miami | 4-Feb-17
Cathy | Boston | 31-Jan-17
Cathy | Boston | 2-Mar-16
Dan | New York | 2-Jul-16
My query would return the following from Store 1:
Name | City | Order Date
Bob | Boston | 31-Jan-17
Cathy | Boston | 31-Jan-17
Point for point, converting that SQL query into a Tableau Custom SQL query would be:
SELECT [Store1].[Name], [Store1].[City], [Store1].[Order Date]
FROM [Store1]
WHERE [Order Date] = (SELECT MAX([Store2].[Order Date]) FROM [Store2]
WHERE [Store2].[Name] = [Store1].[Name]
AND [Store2].[City] = [Store1].[City])
In the preview you will notice that it only returns Cathy. But once you join the Custom SQL query onto your primary table on Order Date, you will see both Bob and Cathy, as you expect.

Add New Line Character with Multiple Columns in T-SQL

I have a table that has ID, AddrID, and Addr columns.
The ID can be attached to multiple Addr values, where each address has its own AddrID.
I am trying to make it so that when an ID has multiple addresses loaded, each address goes on a new line without repeating the ID; in essence, not repeating the ID on every row.
Hope it makes sense.
This will eventually become an SSRS report.
The desired output would be something as so:
+----+--------+------------+
| ID | AddrID | Addr |
+----+--------+------------+
| 1 | S1 | 123 N St |
| 2 | S2 | 456 S ST |
| | S3 | 789 W ST |
| | S4 | 987 E ST |
| 3 | S1 | 123 N St |
| | S5 | 147 Elm ST |
| | S6 | 258 SQL St |
+----+--------+------------+
I tried to use:
declare @nl as char(2) = char(13) + char(10)
but it's just not working.
Presentation should be done in the presentation layer (Reporting Services in this instance), not in the database or query.
You can do this two ways:
Grouping
Add a Row Group on ID and this will happen automatically.
Expression
You can hide the ID field by putting an expression on the Visibility-Hidden property:
=Fields!ID.Value = Previous(Fields!ID.Value)
This hides the ID field if it is the same as the one on the previous row.