Pyspark - compute min after group by ignoring null values

I would like to group a dataset and to compute for each group the min of a variable, ignoring the null values. For instance:
NAME | COUNTRY | AGE
Marc | France | 20
Anne | France | null
Claire | France | 18
Harry | USA | 20
David | USA | null
George | USA | 28
If I compute
from pyspark.sql import functions as F
min_values = data.groupBy("COUNTRY").agg(F.min("AGE").alias("MIN_AGE"))
I obtain
COUNTRY | MIN_AGE
France | null
USA | null
Instead of
COUNTRY | MIN_AGE
France | 18
USA | 20
Do you know how to fix it? Thank you very much!

You can drop the null values before aggregating:
min_values = data.na.drop().groupBy("COUNTRY").agg(F.min("AGE").alias("MIN_AGE"))
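If you only want to ignore nulls in AGE (rather than dropping rows that have a null in any column), na.drop also accepts a subset argument; a minimal variant:
from pyspark.sql import functions as F
# Drop only the rows where AGE is null; nulls in other columns are kept
min_values = data.na.drop(subset=["AGE"]).groupBy("COUNTRY").agg(F.min("AGE").alias("MIN_AGE"))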

Related

Selecting rows with max value, in correlation to two other columns

I have a table with three columns: region, country, count.
I want to reduce my table to rows of region, country and count - where count is maximal within each region.
For example, if I have the following table:
region | country | count
asia | jo | 12
asia | ir | 12
asia | il | 10
europe | fr | 8
europe | it | 2
I'd expect to get in return:
region | country | count
asia | jo | 12
asia | ir | 12
europe | fr | 8
You can achieve this as follows: first group the data by the region field and take the maximum count per region, then filter with a simple IN condition on the (region, count) pair:
SELECT * FROM my_table WHERE (region, count) IN (SELECT region, MAX(count) FROM my_table GROUP BY region)
Demo in sqldaddy.io
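For comparison, the same group-then-filter idea in PySpark would look roughly like this (a sketch, assuming a DataFrame df with columns region, country and count):
from pyspark.sql import functions as F
# Maximum count per region
max_counts = df.groupBy("region").agg(F.max("count").alias("count"))
# Keep only the rows whose (region, count) matches the per-region maximum
result = df.join(max_counts, on=["region", "count"], how="inner")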

Postgresql: Looping through a date_trunc generated group

I've got some records on my database that have a 'createdAt' timestamp.
What I'm trying to get out of postgresql is those records grouped by 'createdAt'
So far I've got this query:
SELECT date_trunc('day', "updatedAt") FROM goal GROUP BY 1
Which gives me:
+-----------------+
| date_trunc      |
+-----------------+
| Sep 20 00:00:00 |
+-----------------+
Which are the days where the records got created.
My question is: Is there any way to generate something like:
| Sep 20 00:00:00 |

| id | name        | gender | state | age |
|----|-------------|--------|-------|-----|
| 1  | John Kenedy | male   | NY    | 32  |

| Sep 24 00:00:00 |

| id | name        | gender | state | age |
|----|-------------|--------|-------|-----|
| 1  | John Kenedy | male   | NY    | 32  |
| 2  | John De     | male   | NY    | 32  |
That means group by date_trunc and select all the columns of those rows?
Thanks a lot!
Please try SELECT date_trunc('day', "updatedAt"), name, gender, state, age FROM goal GROUP BY 1, 2, 3, 4, 5. It will not produce the sectioned structure you expect, but it will "group by date_trunc and select all the columns".
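If you also want the per-day, sectioned presentation, one option is to run the flat query and do the sectioning client-side. A rough Python sketch, assuming psycopg2 and placeholder connection details:
from itertools import groupby
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("""
    SELECT date_trunc('day', "updatedAt") AS day, id, name, gender, state, age
    FROM goal
    ORDER BY 1
""")

# groupby expects rows sorted by the grouping key, hence ORDER BY 1
for day, rows in groupby(cur.fetchall(), key=lambda r: r[0]):
    print(day)
    for row in rows:
        print(row[1:])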

how to migrate relational tables to dynamoDB table

I am new to DynamoDB. In my current project, I am trying to migrate most of our relational tables to DynamoDB, and I am facing a tricky scenario that I don't know how to solve.
In PostgreSQL, I have 2 tables:
Student
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
2 | Tome | 19 | bbbbbb | 99999
3 | Mary | 18 | ccccc | 00000
4 | Peter | 20 | dddddd | 00000
Registration
id | class | student | year
---+--------+---------+---------
1 | A1 | 1 | 2018
2 | A1 | 3 | 2018
3 | A1 | 4 | 2017
4 | B1 | 2 | 2018
My query:
select s.id, s.name, s.age, s.address, s.phone
from Registration r inner join Student s on r.student = s.id
where r.class = 'A1' and r.year = '2018'
Result:
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
3 | Mary | 18 | ccccc | 00000
So, how can I design the DynamoDB table to achieve this result and, by extension, support CRUD operations?
Any advice is appreciated.
DynamoDB table design is going to depend largely on your access patterns. Without knowing the full requirements and queries needed by your app, it's not going to be possible to write a proper answer. But given your example here's a table design that might work:
studentId  | itemType        | name  | age | address | phone | year
(P. Key)   | (Sort / GSI PK) |       |     |         |       | (GSI Sort)
-----------+-----------------+-------+-----+---------+-------+-----------
1          | Details         | Alex  | 18  | aaaaaa  | 88888 |
1          | Class_A1        |       |     |         |       | 2018
2          | Details         | Tome  | 19  | bbbbbb  | 99999 |
2          | Class_B1        |       |     |         |       | 2018
3          | Details         | Mary  | 18  | ccccc   | 00000 |
3          | Class_A1        |       |     |         |       | 2018
4          | Details         | Peter | 20  | dddddd  | 00000 |
4          | Class_A1        |       |     |         |       | 2017
Note the global secondary index with the partition key on the item type and the sort key on the year.
With this design we have a few query options:
1) Get student for a given id: GetItem(partitionKey: studentId, sortkey: Details)
2) Get all classes for a given student id: Query(partitionKey: studentId, sortkey: STARTS_WITH("Class"));
3) Get all students in class A1 and year 2018: Query(GSI partitionkey: "Class_A1", sortkey: equals(2018))
For global secondary indexes, the partition and sort key don't need to be unique, so you can have many (Class_A1, 2018) combinations. If you haven't already read the Best Practices for DynamoDB, I highly recommend reading it in full.
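To make those three access patterns concrete, here is a minimal boto3 sketch; the table name, the GSI name and the literal key values are assumptions for illustration:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Students")  # assumed table name

# 1) Details for student 1
details = table.get_item(Key={"studentId": 1, "itemType": "Details"})

# 2) All classes for student 1
classes = table.query(
    KeyConditionExpression=Key("studentId").eq(1) & Key("itemType").begins_with("Class")
)

# 3) All students in class A1 and year 2018, via the GSI
a1_2018 = table.query(
    IndexName="itemType-year-index",  # assumed GSI name
    KeyConditionExpression=Key("itemType").eq("Class_A1") & Key("year").eq(2018),
)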

Translating SQL query to Tableau

I am trying to translate the following SQL query into Tableau:
select store1.name, store1.city, store1.order_date
from store1
where order_date = (select max(store2.order_date) from store2
where store2.name = store1.name
and store2.city = store1.city)
I am quite new to Tableau and can't figure out how to translate the where clause as it is selecting from another table.
For example, given the following tables
Store 1:
Name | City | Order Date
Andrew | Boston | 23-Aug-16
Bob | Boston | 31-Jan-17
Cathy | Boston | 31-Jan-17
Cathy | San Diego | 19-Jan-17
Dan | New York | 3-Dec-16
Store 2:
Name | City | Order Date
Andrew | Boston | 2-Sep-16
Brandy | Miami | 4-Feb-17
Cathy | Boston | 31-Jan-17
Cathy | Boston | 2-Mar-16
Dan | New York | 2-Jul-16
My query would return the following from Store 1:
Name | City | Order Date
Bob | Boston | 31-Jan-17
Cathy | Boston | 31-Jan-17
Point for point, converting that SQL query into a Tableau Custom SQL query would be:
SELECT [Store1].[Name], [Store1].[City], [Store1].[Order Date]
FROM [Store1]
WHERE [Order Date] = (SELECT MAX([Store2].[Order Date]) FROM [Store2]
WHERE [Store2].[Name] = [Store1].[Name]
AND [Store2].[City] = [Store1].[City])
In the preview you will notice that it only returns Cathy. But once you join the custom SQL query onto your primary table on Order Date, you will see both Bob and Cathy, as you expect.

MS Access Group By breaks when using a date

For some reason, using a date/time field in a SELECT query with GROUP BY in Access 2010 breaks the grouping (records are not properly "grouped by" the text field first, and the same "aTextField" value shows up multiple times). I am able to replicate the issue in a simple, one-table query. Ex:
SELECT aTextField, SUM(aIntField) AS SumOfaIntField
FROM simpleTable
GROUP BY aTextField, aDateField
HAVING aDateField >= Date()
ORDER BY aTextField;
As soon as you remove "aDateField" from the query (the GROUP BY and HAVING lines), it works properly. I can even remove just the HAVING line and it still breaks, which leads me to believe that the problem is with the GROUP BY.
Any feedback would be great. Thanks!
EDIT More details
**simpleTable**
--------------------------------------------
| ID | aTextField | aIntField | aDateField |
============================================
| 1 | John Doe | 1 | 3/14/2013 |
| 2 | John Doe | | 3/15/2013 |
| 3 | Jane Doe | 1 | 3/15/2013 |
| 4 | John Doe | 2 | 3/18/2013 |
| 5 | Jane Doe | 1 | 3/19/2013 |
| 6 | John Doe | | 3/20/2013 |
| 7 | John Doe | 3 | 3/21/2013 |
| 8 | Jane Doe | 1 | 3/19/2013 |
| 9 | John Doe | | 3/22/2013 |
| 10 | Jane Doe | 2 | 3/20/2013 |
| 11 | Jane Doe | | 3/21/2013 |
| 12 | Jane Doe | | 3/22/2013 |
--------------------------------------------
**Expected Result**
-------------------------------
| aTextField | SumOfaIntField |
===============================
| Jane Doe | 4 |
| John Doe | 3 |
-------------------------------
**Actual Result**
-------------------------------
| aTextField | SumOfaIntField |
===============================
| Jane Doe | 2 |
| Jane Doe | 2 |
| Jane Doe | |
| Jane Doe | |
| John Doe | |
| John Doe | 3 |
| John Doe | |
-------------------------------
So what appears to be happening is that there is a separate row for each date as well. I just need to filter by the date, not necessarily GROUP BY it. However, Access will not accept the query without grouping it. Options?
You're grouping by aTextField and aDateField. Perhaps simpleTable includes rows where the date is the same, but the time of day is different. In that case your grouping would produce a row for each date/time combination.
Whether or not that was the explanation, you should check what the db engine actually evaluates by including aDateField in the SELECT list.
SELECT aTextField, aDateField, SUM(aIntField)
FROM simpleTable
GROUP BY aTextField, aDateField
HAVING aDateField >= Date()
ORDER BY aTextField;
Also consider using a WHERE instead of HAVING clause:
WHERE aDateField >= Date()
Based on your sample data, I suspect you want ...
SELECT aTextField, SUM(aIntField)
FROM simpleTable
WHERE aDateField >= Date()
GROUP BY aTextField
ORDER BY aTextField;
You should be able to use the following:
SELECT aTextField, SUM(aIntField) AS SumOfaIntField
FROM simpleTable
WHERE aDateField >= Date()
GROUP BY aTextField
ORDER BY aTextField;
You will notice that I removed the GROUP BY on the aDateField column and moved the date filter from HAVING to a WHERE clause. Since you want the total for each aTextField, you do not need to group by the date; grouping by the date results in a separate row for each distinct date.
Note: this query was tested in MS Access 2010 and generated your desired result.
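For comparison, the same filter-before-group pattern in PySpark would be (a sketch, assuming a DataFrame simple_df that mirrors simpleTable):
from pyspark.sql import functions as F

result = (
    simple_df
    .filter(F.col("aDateField") >= F.current_date())  # filter first, like the WHERE clause
    .groupBy("aTextField")
    .agg(F.sum("aIntField").alias("SumOfaIntField"))
    .orderBy("aTextField")
)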
I think you are misunderstanding how GROUP BY works. You should see the same aTextField once for each unique text field/date-time combination.
Sample
a 2012-01-01
a 2012-01-01
b 2012-01-01
b 2012-01-02
b 2012-01-02
group by aTextField, aDateField
a 2012-01-01
b 2012-01-01
b 2012-01-02
group by aTextField
a
b