Replace specific substrings in a column of a PySpark dataframe

I have the PySpark dataframe below.
column_a
name, varchar(10) country, age
name, age, decimal(15) percentage
name, varchar(12) country, age
name, age, decimal(10) percentage
I have to remove the varchar(n) and decimal(n) tokens from the dataframe above, irrespective of their length. Below is the expected output.
column_a
name, country, age
name, age, percentage
name, country, age
name, age, percentage
How can I achieve this in PySpark?

You can remove substrings matching decimal(...) and varchar(...) using regexp_replace:
from pyspark.sql import functions as F

data = [
    ("name, varchar(10) country, age",),
    ("name, age, decimal(15) percentage",),
    ("name, varchar(12) country, age",),
    ("name, age, decimal(10) percentage",),
]
df = spark.createDataFrame(data, ["column_a"])

# Strip "varchar(n) " / "decimal(n) " regardless of the length n
df.withColumn(
    "column_a",
    F.regexp_replace("column_a", r"varchar\(\d*\)\s|decimal\(\d*\)\s", ""),
).show(truncate=False)
"""
+---------------------+
|column_a |
+---------------------+
|name, country, age |
|name, age, percentage|
|name, country, age |
|name, age, percentage|
+---------------------+
"""


The objective is to display the most recent transaction by customer_id

My code ranks each customer's transactions by row number as planned, but I cannot filter the join to show only the last transaction per customer. The objective is to display the last detailed customer transaction per customer_id. I attempted to use the window function and then filter on the resulting column.
CREATE TABLE customer1 (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    email VARCHAR(255),
    created_at TIMESTAMP WITH TIME ZONE NOT NULL
);
CREATE TABLE purchase (
    purchase_id INT PRIMARY KEY,
    purchase_time TIMESTAMP WITH TIME ZONE NOT NULL,
    customer_id INT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer1(customer_id)
);
CREATE TABLE purchase_item (
    purchase_item_id INT PRIMARY KEY,
    purchase_id INT NOT NULL,
    sku VARCHAR(255),
    quantity INT NOT NULL,
    total_amount_paid DECIMAL(10,2) NOT NULL,
    FOREIGN KEY (purchase_id) REFERENCES purchase(purchase_id)
);
INSERT INTO customer1 (customer_id, first_name, last_name, email, created_at) VALUES
(1, 'James', 'Smith', 'jamessmith@example.com', clock_timestamp()),
(2, 'Mary', 'Johnson', 'maryjohnson@example.com', clock_timestamp()),
(3, 'John', 'Williams', 'johnwilliams@example.com', clock_timestamp()),
(4, 'Patricia', 'Brown', 'patriciabrown@example.com', clock_timestamp()),
(5, 'Michael', 'Garcia', 'michaelgarcia@example.com', clock_timestamp());
INSERT INTO purchase (purchase_id, purchase_time, customer_id) VALUES
(100, clock_timestamp(), 1),
(101, clock_timestamp(), 1),
(102, clock_timestamp(), 1),
(103, clock_timestamp(), 2),
(104, clock_timestamp(), 3),
(105, clock_timestamp(), 5);
INSERT INTO purchase_item(purchase_item_id, purchase_id, sku, quantity, total_amount_paid) VALUES
(200, 100, 'shoe_blk_42', 3, 300),
(201, 100, 'shoe_lace_white', 3, 2.5),
(202, 101, 'shorts', 1, 40),
(203, 102, 'bike', 1, 1995),
(204, 103, 'bike', 2, 3990),
(205, 103, 'shoe_wht_39', 2, 200),
(206, 104, 'shirt', 1, 60),
(207, 105, 'headphones', 1, 400);
SELECT DISTINCT customer1.customer_id,
       first_name,
       last_name,
       email,
       purchase.purchase_id,
       purchase.purchase_time,
       purchase_item.quantity,
       purchase_item.total_amount_paid,
       ROW_NUMBER() OVER (
           PARTITION BY purchase.customer_id
           ORDER BY purchase.purchase_time DESC
       ) AS order_queue
FROM customer1
JOIN purchase ON customer1.customer_id = purchase.customer_id
JOIN purchase_item ON purchase.purchase_id = purchase_item.purchase_id
WHERE order_queue = 1;
You can use DISTINCT ON to solve this. (Your original query fails because a window function's result cannot be referenced in the same query's WHERE clause; window functions are evaluated after WHERE.)
SELECT DISTINCT ON (customer1.customer_id)
       customer1.customer_id,
       first_name,
       last_name,
       email,
       purchase.purchase_id,
       purchase.purchase_time,
       purchase_item.quantity,
       purchase_item.total_amount_paid
FROM customer1
LEFT JOIN purchase ON customer1.customer_id = purchase.customer_id
LEFT JOIN purchase_item ON purchase.purchase_id = purchase_item.purchase_id
ORDER BY customer1.customer_id, purchase_time DESC;
customer_id | first_name | last_name | email | purchase_id | purchase_time | quantity | total_amount_paid
-------------+------------+-----------+---------------------------+-------------+-------------------------------+----------+-------------------
 1 | James | Smith | jamessmith@example.com | 102 | 2019-06-14 20:17:26.759086+00 | 1 | 1995.00
 2 | Mary | Johnson | maryjohnson@example.com | 103 | 2019-06-14 20:17:26.759098+00 | 2 | 200.00
 3 | John | Williams | johnwilliams@example.com | 104 | 2019-06-14 20:17:26.759109+00 | 1 | 60.00
 4 | Patricia | Brown | patriciabrown@example.com | | | |
 5 | Michael | Garcia | michaelgarcia@example.com | 105 | 2019-06-14 20:17:26.75912+00 | 1 | 400.00
(5 rows)
You can change the LEFT JOINs to plain JOINs (inner joins) if you don't want to see customers with no purchases.
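Alternatively, the window-function approach from the question works once the ranking query is wrapped in a subquery, because a window result can only be filtered from an outer query; a sketch of that fix:

SELECT *
FROM (
    SELECT customer1.customer_id,
           first_name,
           last_name,
           email,
           purchase.purchase_id,
           purchase.purchase_time,
           purchase_item.quantity,
           purchase_item.total_amount_paid,
           ROW_NUMBER() OVER (
               PARTITION BY purchase.customer_id
               ORDER BY purchase.purchase_time DESC
           ) AS order_queue
    FROM customer1
    JOIN purchase ON customer1.customer_id = purchase.customer_id
    JOIN purchase_item ON purchase.purchase_id = purchase_item.purchase_id
) ranked
WHERE order_queue = 1;

Note that when the latest purchase has several items (purchase 103 above), ROW_NUMBER() keeps an arbitrary one of them; RANK() would keep all items of the latest purchase.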

Selecting on a condition in a window function in PostgreSQL

I am using PostgreSQL and applying a window function. Previously I had to find the first g_id with the same last name and address (street_address and city), so I simply put those columns in the PARTITION BY clause of the window function.
But now I have a requirement to find the first g_id whose last name is not the same, while the address is the same. How can I do it?
This is what I was doing previously:
SELECT g_id AS g_id,
       FIRST_VALUE(g_id) OVER (PARTITION BY l_name, street_address, city
                               ORDER BY last_date DESC NULLS LAST) AS c_id,
       street_address AS street_address
FROM mytable;
Let's say this is my table:
g_id | l_name | street_address | city | last_date
_________________________________________________
x1 | bar | abc road | khi | 11-6-19
x2 | bar | abc road | khi | 12-6-19
x3 | foo | abc road | khi | 19-6-19
x4 | harry | abc road | khi | 17-6-19
x5 | bar | xyz road | khi | 11-6-19
_________________________________________________
In the previous scenario:
if I run it for the first row, c_id should return 'x2', as it considers these rows:
_________________________________________________
g_id | l_name | street_address | city | last_date
_________________________________________________
x1 | bar | abc road | khi | 11-6-19
x2 | bar | abc road | khi | 12-6-19
_________________________________________________
and returns the row with the latest last_date.
What I want now is to select these rows (rows with the same street_address and city but a different l_name):
g_id | l_name | street_address | city | last_date
_________________________________________________
x1 | bar | abc road | khi | 11-6-19
x3 | foo | abc road | khi | 19-6-19
x4 | harry | abc road | khi | 17-6-19
_________________________________________________
and the output will be x3.
Somehow I want to compare the last-name column, keeping rows where it is not equal to the current row's last name, and then partition by the address fields. And if no rows satisfy the condition, c_id should be equal to the current g_id.
Looking at your expected output, it's not clear whether you want the earliest or the latest row for each group. You may change the ORDER BY for last_date accordingly in this query, which uses DISTINCT ON:
SELECT DISTINCT ON (street_address, city, l_name) *
FROM mytable
ORDER BY street_address,
         city,
         l_name,
         last_date -- change this to last_date DESC if you want the latest
DEMO
After discussing the details in this chat:
demo:db<>fiddle
SELECT DISTINCT ON (t1.g_id)
t1.*,
COALESCE(t2.g_id, t1.g_id) AS g_id
FROM
mytable t1
LEFT JOIN mytable t2
ON t1.street_address = t2.street_address AND t1.l_name != t2.l_name
ORDER BY t1.g_id, t2.last_date DESC
Here is how I solved it using a subquery.
Creating the example table:
CREATE TABLE mytable
("g_id" varchar(2), "l_name" varchar(5), "street_address" varchar(8), "city" varchar(3), "last_date" date)
;
INSERT INTO mytable
("g_id", "l_name", "street_address", "city", "last_date")
VALUES
('x1', 'bar', 'abc road', 'khi', '11-6-19'),
('x2', 'bar', 'abc road', 'khi', '12-6-19'),
('x3', 'foo', 'abc road', 'khi', '19-6-19'),
('x4', 'harry', 'abc road', 'khi', '17-6-19'),
('x5', 'bar', 'xyz road', 'khi', '11-6-19')
;
Query to get the g_ids:
SELECT *,
       (SELECT b.g_id
        FROM mytable b
        WHERE (base.g_id = b.g_id)
           OR (base.l_name <> b.l_name
               AND base.street_address = b.street_address
               AND base.city = b.city)
        ORDER BY b.last_date DESC
        LIMIT 1)
FROM mytable base

Flatten hierarchy on self-join table

I have data in a self-join hierarchical table where Continents have many Countries, which have many Regions, which have many States, which have many Cities.
Self-joining table structure:
|-------------------------------------------------------------|
| ID | Name | Type | ParentID | IsTopLevel |
|-------------------------------------------------------------|
| 1 | North America | Continent | NULL | 1 |
| 12 | United States | Country | 1 | 0 |
| 113 | Midwest | Region | 12 | 0 |
| 155 | Kansas | State | 113 | 0 |
| 225 | Topeka | City | 155 | 0 |
| 2 | South America | Continent | NULL | 1 |
| 22 | Argentina | Country | 2 | 0 |
| 223 | Southern | Region | 22 | 0 |
| 255 | La Pampa | State | 223 | 0 |
| 777 | Santa Rosa | City | 255 | 0 |
|-------------------------------------------------------------|
I have been able to successfully use a recursive CTE to get the tree structure and depth of each node. Where I am failing is using a pivot to create a nice list of all bottom locations and their corresponding parents at each level.
The expected results:
|------------------------------------------------------------------------------------|
| Continent | Country | Region | State | City | Bottom_Level_ID |
|------------------------------------------------------------------------------------|
| North America | United States | Midwest | Kansas | Topeka | 225 |
| South America | Argentina | Southern | La Pampa | Santa Rosa | 777 |
|------------------------------------------------------------------------------------|
There are a few key points I should clarify.
Every single entry has a bottom level and a top level; there is no case where any of the five Types is missing for a given location.
If I filled out this data, I'd have 50 entries for North America at the State level, so you can imagine how immense this table is at the City level for every continent on the planet. Billions of rows.
The reason this is a necessity is that I need to be able to join onto a historical table of all addresses a person has lived at, and journey up the tree. I figure if I have the LocationID from that table, I can just LEFT JOIN onto a view of this query and grab the appropriate columns.
This is an old database (2005), and I don't have sysadmin rights or control of the schema.
My CTE Code
--CTE
;WITH Tree
AS (
    SELECT ID, Name, ParentID, Type, 1 AS Depth
    FROM LocationTable
    WHERE IsTopLevel = 1
    UNION ALL
    SELECT L.ID, L.Name, L.ParentID, L.Type, T.Depth + 1
    FROM Tree T
    JOIN LocationTable L
        ON L.ParentID = T.ID
)
SELECT * FROM Tree;
This gave good, solid data in a mostly useful format. But then I got to thinking about it: isn't the table structure already in this format? So why would I bother doing a depth-tree search if I wasn't going to join the entries together at the same time?
Anyway, here was the rest.
The Pivot Attempt
;WITH Tree
AS (
    SELECT ID, Name, ParentID, Type
    FROM LocationTable
    WHERE IsTopLevel = 1
    UNION ALL
    SELECT L.ID, L.Name, L.ParentID, L.Type
    FROM Tree T
    JOIN LocationTable L
        ON L.ParentID = T.ID
)
SELECT *
FROM Tree
PIVOT (
    MAX(Name)
    FOR Type IN ([Continent],[Country],[Region],[State],[City])
) pvt
And now I have everything by Type in a column, with NULLs for everything else. As I have struggled with before, I need to filter/join the CTE data before I attempt my pivot, but I have no idea where to start with that piece. Everything I have tried is painfully slow.
Every time I think I understand CTEs and PIVOT, something new humbles me. Please help me.
If your structure is as clean as you describe it (no gaps, always five levels), you might go the easy way.
(As an aside, this data really calls for a classical 1:n table tree, where your Countries, States etc. live in their own tables and link to their parent records.)
Make sure there's an index on ParentID and ID!
DECLARE @tbl TABLE(ID INT, Name VARCHAR(100), Type VARCHAR(100), ParentID INT, IsTopLevel BIT);
INSERT INTO @tbl VALUES
 (1,'North America','Continent',NULL,1)
,(12,'United States','Country',1,0)
,(113,'Midwest','Region',12,0)
,(155,'Kansas','State',113,0)
,(225,'Topeka','City',155,0)
,(2,'South America','Continent',NULL,1)
,(22,'Argentina','Country',2,0)
,(223,'Southern','Region',22,0)
,(255,'La Pampa','State',223,0)
,(777,'Santa Rosa','City',255,0);
SELECT Level1.Name AS Continent
      ,Level2.Name AS Country
      ,Level3.Name AS Region
      ,Level4.Name AS State
      ,Level5.Name AS City
      ,Level5.ID AS Bottom_Level_ID
FROM @tbl AS Level1
INNER JOIN @tbl AS Level2 ON Level1.ID = Level2.ParentID
INNER JOIN @tbl AS Level3 ON Level2.ID = Level3.ParentID
INNER JOIN @tbl AS Level4 ON Level3.ID = Level4.ParentID
INNER JOIN @tbl AS Level5 ON Level4.ID = Level5.ParentID
WHERE Level1.ParentID IS NULL
The result
Continent Country Region State City Bottom_Level_ID
North America United States Midwest Kansas Topeka 225
South America Argentina Southern La Pampa Santa Rosa 777
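Applied to the real table from the question (rather than the demo table variable), the indexing advice above might look like this; the index name is made up, and ID is assumed to already be covered by the primary key:

CREATE INDEX IX_LocationTable_ParentID ON LocationTable (ParentID);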
Another solution with a CTE could be:
;WITH Tree
AS (
    SELECT CAST(NULL AS VARCHAR(100)) AS C1,
           CAST(NULL AS VARCHAR(100)) AS C2,
           CAST(NULL AS VARCHAR(100)) AS C3,
           CAST(NULL AS VARCHAR(100)) AS C4,
           Name AS C5,
           ID AS B_Level
    FROM LocationTable
    WHERE IsTopLevel = 1
    UNION ALL
    SELECT T.C2, T.C3, T.C4, T.C5, L.Name, L.ID
    FROM Tree T
    JOIN LocationTable L
        ON L.ParentID = T.B_Level
)
SELECT *
FROM Tree
WHERE C1 IS NOT NULL
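Either query can then be wrapped in a view and joined from the person's address history, as the question intends. A sketch with assumed names (LocationFlat for the view, AddressHistory with a LocationID column for the history table):

SELECT ah.PersonID,
       lf.Continent, lf.Country, lf.Region, lf.State, lf.City
FROM AddressHistory AS ah
LEFT JOIN LocationFlat AS lf
    ON lf.Bottom_Level_ID = ah.LocationID;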

Oracle: How to group records by certain columns before fetching results

I have a table in Redshift that looks like this:
col1 | col2 | col3 | col4 | col5 | col6
=======================================
123 | AB | SSSS | TTTT | PQR | XYZ
---------------------------------------
123 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | AB | PQRS | WXYZ | PQR | XYZ
---------------------------------------
123 | CD | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | CD | PQRS | WXYZ | PQR | XYZ
---------------------------------------
456 | AB | GGGG | RRRR | OPQ | RST
---------------------------------------
456 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
456 | AB | PQRS | WXYZ | PQR | XYZ
I have another table that also has a similar structure and data.
From these tables, I need to select values that don't have 'SSSS' in col3 and 'TTTT' in col4 in either of the tables. I'd also need to group my results by the values in col1 and col2.
Here, I'd like my query to return:
123,CD
456,AB
I don't want 123, AB to be in my results, since one of the rows corresponding to 123, AB has SSSS and TTTT in col3 and col4 respectively. That is, I want to omit items that have SSSS and TTTT in col3 and col4 in either of the two tables that I'm looking up.
I am very new to writing queries to extract information from a database, so please bear with my ignorance. I was told to explore GROUP BY and ORDER BY, but I am not sure I understand their usage well enough yet.
The query I have looks like:
SELECT * from table1 join table2 on
table1.col1 = table2.col1 AND
table1.col2 = table2.col2
WHERE
col3 NOT LIKE 'SSSS' AND
col4 NOT LIKE 'TTTT'
GROUP BY col1,col2
However, this query throws an error: col5 must appear in the GROUP BY clause or be used in an aggregate function;
I'm not sure how to proceed. I'd appreciate any help. Thank you!
It seems you also want DISTINCT results. In this case a solution with MINUS is probably as efficient as any other (and, remember, MINUS automatically also means DISTINCT):
select col1, col2 from table_name -- enter your column and table names here
minus
select col1, col2 from table_name where col3 = 'SSSS' and col4 = 'TTTT'
;
No need to group by anything!
With that said, here is a solution using GROUP BY. Note that the HAVING condition uses a non-trivial aggregate: it is a COUNT(), but what is counted is a CASE expression that takes care of the requirement. Note also that the aggregate function in the HAVING clause does not have to appear in the SELECT list!
select col1, col2
from table_name
group by col1, col2
having count(case when col3 = 'SSSS' and col4 = 'TTTT' then 1 else null end) = 0
;
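If your engine supports boolean aggregates (PostgreSQL does, and Redshift documents BOOL_OR as well), the same exclusion can be written more directly; a sketch assuming col3 and col4 are never NULL:

select col1, col2
from table_name
group by col1, col2
having not bool_or(col3 = 'SSSS' and col4 = 'TTTT')
;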
You should use the EXCEPT operator.
EXCEPT and MINUS are two different versions of the same operator.
Here is the syntax of what your query should look like:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' AND col4 = 'TTTT';
One important consideration is whether your desired answer requires the AND or the OR operator. Do you want to see the records where col3 = 'SSSS' but col4 has a value different from 'TTTT'?
If the answer is no, you should use the version below:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' OR col4 = 'TTTT';
You can learn more about the MINUS or EXCEPT operator in the Amazon Redshift documentation.

T-SQL group by and filter for all

Hi, I have a table:
id|Col_A|Col_B|
--|-----|-----|
1| A | NULL|
2| A | a |
3| B | a |
4| C | NULL|
5| C | NULL|
I want to select only those Col_A groups where Col_B is NULL in every row (i.e. only C should be returned in this example).
Tried
SELECT Col_A FROM MYTABLE WHERE Col_B IS NULL GROUP BY Col_A
which gives me A and C. Also tried
SELECT Col_A, Col_B FROM MYTABLE GROUP BY Col_A, Col_B HAVING Col_B IS NULL
which also gives me A and C.
How do you write a T-SQL query to select a group only when all match the criteria?
Thank you
This seems to work:
declare @t table (id int, Col_A char(1), Col_B char(1))

insert into @t (id, Col_A, Col_B) values
(1, 'A', NULL),
(2, 'A', 'a'),
(3, 'B', 'a'),
(4, 'C', NULL),
(5, 'C', NULL)

select Col_A from @t group by Col_A having MAX(Col_B) is null
Because MAX (or MIN) can only return NULL if all of the inputs were NULL.
Result:
Col_A
-----
C
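An equivalent way to write the all-NULL check, offered here only as an alternative, counts the non-NULL values directly, since COUNT of a column ignores NULLs:

select Col_A from @t group by Col_A having COUNT(Col_B) = 0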
Similarly, if you wanted to check that all of the values were some specific, non-NULL value, you could check that the MIN() and MAX() of that column are equal and then compare either one of them against the sought-after value. Since MIN() and MAX() ignore NULLs, you also have to rule out groups that contain a NULL, for example by comparing COUNT(*) with COUNT(Col_B):
select Col_A from @t group by Col_A
having MAX(Col_B) = MIN(Col_B) and
       MIN(Col_B) = 'a' and
       COUNT(*) = COUNT(Col_B)
This returns just B, since that's the only group where all input values are equal to 'a'. (Without the COUNT check, group A would slip through, because its NULL is ignored by MIN and MAX.)