Delete redundant entries within groups - PostgreSQL

I want to remove redundant rows in my database within each group (in this case datasource). I define a row as redundant if it contains strictly less information than, or only duplicates the information of, some other row in its group.
For example, in the table below, row 1 is redundant because row 0 in the same group contains exactly the same information but with more data.
For the same reason, row 6 is redundant: rows 3, 4 and 5 in its group all contain more information than it does. However, I keep both rows 4 and 5, as each carries some additional information that the other rows in the group do not.
datasource city country
0 1 Shallotte US
1 1 None US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
6 3 None None
Here is an example with more columns: rows 0, 1 and 4 each carry different information, whereas rows 2 and 3 are redundant (row 2 contains strictly less information than row 0, and row 3 duplicates row 1, so either row 1 or row 3 could be dropped).
datasource city country Count
0 1 None US 11
1 1 austin None None
2 1 None None 11
3 1 austin None None
4 1 None CA None
Expected output
datasource city country Count
0 1 None US 11
1 1 austin None None
4 1 None CA None
Is there a simple way to achieve such logic in pandas or SQL (PostgreSQL) for any number of columns?
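For reference, the rule above can also be spelled out directly (if inefficiently) in pandas for any number of columns. The following is only an illustrative brute-force sketch (the helper name drop_redundant is made up here), applied to a reconstruction of the second example:

import pandas as pd

# Second example from above (None becomes NaN in pandas)
df = pd.DataFrame({
    "datasource": [1, 1, 1, 1, 1],
    "city": [None, "austin", None, "austin", None],
    "country": ["US", None, None, None, "CA"],
    "Count": [11, None, 11, None, None],
})

def drop_redundant(group):
    # within one group, drop rows whose information is contained in another row
    group = group.drop_duplicates()            # exact duplicates: keep one copy
    vals = group.drop(columns="datasource")
    keep = []
    for i, row in vals.iterrows():
        dominated = False
        for j, other in vals.iterrows():
            if i == j:
                continue
            # the other row agrees wherever this row has data...
            agrees = ((row == other) | row.isna()).all()
            # ...and carries strictly more non-null values
            more_info = other.notna().sum() > row.notna().sum()
            if agrees and more_info:
                dominated = True
                break
        keep.append(not dominated)
    return group[keep]

print(df.groupby("datasource", group_keys=False).apply(drop_redundant))

This keeps rows 0, 1 and 4, matching the expected output above.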

Here's a different approach using the same basic strategy as Bharath shetty's solution. This way feels a bit neater to me.
First, construct the example data frame:
import pandas as pd

data = {"datasource": [1, 1, 2, 3, 3, 3, 3],
        "city": ["Shallotte", None, "austin", "Casselberry", None, "Springfield", None],
        "country": ["US", "US", "US", "US", "AU", None, None]}

df = pd.DataFrame(data)
df['null'] = df.isnull().sum(axis=1)   # count of missing values per row
print(df)
city country datasource null
0 Shallotte US 1 0
1 None US 1 1
2 austin US 2 0
3 Casselberry US 3 0
4 None AU 3 1
5 Springfield None 3 1
6 None None 3 2
Now make a boolean mask using groupby and apply - we simply drop the rows with the largest null count in each group:
def null_filter(d):
    # keep everything except the rows with the group's maximum null count
    if len(d) > 1:
        return d.null < d.null.max()
    return d.null == d.null

mask = df.groupby("datasource").apply(null_filter).values
df.loc[mask].drop("null", axis=1)
Output:
city country datasource
0 Shallotte US 1
2 austin US 2
3 Casselberry US 3
4 None AU 3
5 Springfield None 3

One way is based on the None count per row: remove the rows with the maximum number of None values in each group, i.e.
# Count the missing (None) values across each row
df['Null'] = df.isnull().sum(axis=1)
# Get the maximum count within each datasource group
df['Max'] = df.groupby('datasource')['Null'].transform('max')
# Drop rows whose count equals a non-zero group maximum, then drop the helper columns
df = df[~((df['Max'] != 0) & (df['Max'] == df['Null']))].drop(['Null', 'Max'], axis=1)
Output:
datasource city country
0 1 Shallotte US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
Hope it helps

Related

Is there a simple way in PySpark to find out the number of promotions it took to convert someone into a customer?

I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2 + 3) promotions to convert ID 1 into a customer and (19) to convert ID 2.
E.g.
ID  Promotions
1   5
2   19
I am unable to think of an idea to solve it. Can you please help me?
#Corralien and mozway have helped with the solution in Python. But I am unable to implement it in Pyspark because of the huge dataframe size (>1 TB).
You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)

# Output
   ID    Date  Promotion
1   1  10-Jan          5
3   2  10-Jan         19
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max')})
       )
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a dataframe consisting of only one ID.
Assuming data are ordered by Date, I guess that
def agg_fct(df):
    # position of the first converting row
    index_of_conv = df["Converted to customer"].argmax()
    # sum the promotions up to and including that row
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
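Since the question mentions the data is too large for pandas (>1 TB), here is a rough PySpark sketch of the same running-sum-until-conversion idea. It is only an illustration under assumptions: the example frame is rebuilt from the table above, and a monotonically increasing id stands in for the event order (with real data you would order the window by a proper date/timestamp column instead):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

rows = [(1, "2-Jan", 2, 0), (1, "10-Jan", 3, 1), (1, "14-Jan", 3, 0),
        (2, "10-Jan", 19, 1), (2, "10-Jan", 8, 0), (2, "10-Jan", 12, 0)]
sdf = spark.createDataFrame(rows, ["ID", "Date", "Promotions", "Converted to customer"])

# "rn" only stands in for row order; order by a real date column if you have one
sdf = sdf.withColumn("rn", F.monotonically_increasing_id())

w = (Window.partitionBy("ID")
           .orderBy("rn")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

out = (sdf.withColumn("cum_prom", F.sum("Promotions").over(w))
          .where(F.col("Converted to customer") == 1)
          .select("ID", "Date", F.col("cum_prom").alias("Promotion")))

out.show()   # expected rows: (1, 10-Jan, 5) and (2, 10-Jan, 19)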

Counting presence or absence of tallies by rows

Hi, I am not sure how to explain what I need, but I'll try. I need a query (if there is one) that counts whether a tally was present at a point (column), so a species with more than one tally at a point counts only as one.
This is how the data looks:
Sp Site Pnt1 Pnt2 Pnt3 Total
A 1 1 1 1 3
A 2 1 1 2
A 3 1 1
B 1 1 1 1 3
B 2 1 1
C 1 1 1 2
C 2 0
I want to count whether the sites have a tally or not, and if tallies for the same species are repeated at a point I want to count them as one. I would like the resulting table to look like this:
Sp Pnt1 Pnt2 Pnt3 Total
A 1 1 1 3
B 1 1 1 3
C 0 1 1 2
Thanks for all the help you can provide.
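Purely as an illustration, one way the presence/absence logic could be expressed in pandas is sketched below. The blank cells of the partially filled rows above are ambiguous in the pasted table, so the frame here is only a hypothetical reconstruction guessed from the totals:

import pandas as pd

# Hypothetical reconstruction of the input (NaN = no tally at that point)
df = pd.DataFrame({
    "Sp":   ["A", "A", "A", "B", "B", "C", "C"],
    "Site": [1, 2, 3, 1, 2, 1, 2],
    "Pnt1": [1, 1, 1, 1, None, None, None],
    "Pnt2": [1, 1, None, 1, 1, 1, None],
    "Pnt3": [1, None, None, 1, None, 1, None],
})

pts = ["Pnt1", "Pnt2", "Pnt3"]
# Presence/absence per species and point: any tally at any site counts once
presence = df[pts].gt(0).groupby(df["Sp"]).max().astype(int)
presence["Total"] = presence.sum(axis=1)
print(presence.reset_index())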

How to filter out bad values in a data set regarding a matrix in matlab?

I wanted to ask how to "filter out" bad values from a very large data matrix in MATLAB.
e.g. I have a MATLAB data file containing a 2×5000 matrix of doubles which represents x and y coordinates. How is it possible to delete all values above or below a certain limit?
or easier:
(matrix from data file)
1 2 4 134 2
3 5 5 4 2
or
1 2 4 9 2
3 5 5 234 2
setting a certain limit and deleting the offending column:
1 2 4 2
3 5 5 2
Find the "bad" elements, e.g. A < 0 | A > 20
Find the "good" columns, e.g. ~max(A < 0 | A > 20)
Keep the "good" columns / Remove the "bad" columns, e.g. A(:, ~max(A < 0 | A > 20))

Consecutive episode

Good afternoon.
I have data like this
ID Indicator
1 0
1 1
1 0
1 1
1 0
1 1
2 0
2 1
2 1
2 1
2 1
2 1
2 1
2 1
I need to get the IDs that have at least 4 consecutive indicators equal to 1. In this example I should get ID = 2, since it has at least 4 consecutive indicators equal to 1. Please help me with how to do this in SPSS Modeler. Thank you so much for your help. ID 1 has indicators 0, 1, 0, 1, 0, 1; ID 2 has a first indicator of 0 and all the others equal to 1. There are two columns, ID and Indicator; ID 1 has 6 rows and ID 2 has 8 rows.
To be precise: I want to output the ID that has 4 or more indicators set to 1 consecutively.
What you first need is a way to count the number of consecutive Indicator = 1 records for the same ID.
For this, you can use the "Derive" node with the following settings:
Set the 'Derive as' option to Count
Set the 'Increment when' to ID = #OFFSET(ID, 1) and INDICATOR = 1
Set the 'Increment by' to 1
Set the 'Reset when' to INDICATOR = 0
Following the 'Derive' node, you can then use a 'Select' node to only select the records where the number of consecutive 1's is equal to 4, and finally, use a 'Distinct' node to keep only one record for each ID.
I have shared a sample stream that shows the process here.
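For reference, the same consecutive-count idea can be sketched in pandas to sanity-check the logic outside of Modeler (purely illustrative; the Derive/Select/Distinct stream above is the actual Modeler answer):

import pandas as pd

df = pd.DataFrame({
    "ID":        [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "Indicator": [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
})

# Start a new run whenever the indicator drops to 0 or the ID changes
run_id = (df["Indicator"].eq(0) | df["ID"].ne(df["ID"].shift())).cumsum()

# Running count of consecutive Indicator = 1 records within each run
run_len = df.groupby(run_id)["Indicator"].cumsum()

# IDs that reach at least 4 consecutive 1s
print(df.loc[run_len >= 4, "ID"].unique())   # -> [2]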

(q/kdb+) Merge items in a list

I have a list of items and need to merge them into a single column
using the list
list:(1 2;3 4 5 7;0 1 3)
index value
0 1 2
1 3 4 5 7
2 0 1 3
my goal is
select from list2
value
1
2
3
4
5
7
0
1
3
The 'raze' function flattens out one level of the list:
q) raze (1 2;3 4 5 7;0 1 3)
1 2 3 4 5 7 0 1 3
If you have a list with multiple levels of nesting, then use the 'over' adverb with raze:
q) (raze/)(1 2 3;(11 12;33 44);5 6)
To convert that to table column:
q) t:([]c:raze list)
ungroup would also work provided your table doesn't have multiple columns with different nesting (or strings)
q)ungroup ([]list)
list
----
1
2
3
4
5
7
0
1
3
If you just want your list to appear like that, I would do the following.
1 cut raze list
I see that you have used a select statement; however, if you want your column defined as this in your table, do the following:
a:raze list
tab:([] b:a)
Your output from this should look like this
q)tab
b
-
1
2
3
4
5
7
0
1
3
Overall, a more concise way to achieve what you want to do would be
select from ([]raze list)
To avoid any errors, you should not call the column header 'value', as this is a protected keyword in kdb+; when you try to reassign it as a column header, kdb will throw an assign error:
`assign
Hope this helps