Delete redundant entries within groups - PostgreSQL

I want to remove redundant rows in my database within each group (in this case datasource). I define a row as redundant if it contains strictly less information than, or only duplicates the information of, some other row in its group.
For example, in the table below, row 1 is redundant because row 0 in the same group contains exactly the same information but with more data.
For the same reason, row 6 is redundant: rows 3, 4 and 5 in its group all contain more information than it does. However, I keep both rows 4 and 5, as each carries some additional information that the other rows in the group do not.
datasource city country
0 1 Shallotte US
1 1 None US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
6 3 None None
Here is an example with more columns: rows 0, 1 and 4 each carry different information, whereas rows 2 and 3 are redundant (row 2 contains strictly less information than row 0, and row 3 duplicates row 1, so either row 1 or row 3 could be dropped).
datasource city country Count
0 1 None US 11
1 1 austin None None
2 1 None None 11
3 1 austin None None
4 1 None CA None
Expected output
datasource city country Count
0 1 None US 11
1 1 austin None None
4 1 None CA None
Is there a simple way to achieve such logic in pandas or SQL (PostgreSQL) for any number of columns?
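For reference, the rule above can also be spelled out directly (if inefficiently) in pandas for any number of columns. The following is only an illustrative brute-force sketch (the helper name drop_redundant is made up here), applied to a reconstruction of the second example:

import pandas as pd

# Second example from above (None becomes NaN in pandas)
df = pd.DataFrame({
    "datasource": [1, 1, 1, 1, 1],
    "city": [None, "austin", None, "austin", None],
    "country": ["US", None, None, None, "CA"],
    "Count": [11, None, 11, None, None],
})

def drop_redundant(group):
    # within one group, drop rows whose information is contained in another row
    group = group.drop_duplicates()            # exact duplicates: keep one copy
    vals = group.drop(columns="datasource")
    keep = []
    for i, row in vals.iterrows():
        dominated = False
        for j, other in vals.iterrows():
            if i == j:
                continue
            # the other row agrees wherever this row has data...
            agrees = ((row == other) | row.isna()).all()
            # ...and carries strictly more non-null values
            more_info = other.notna().sum() > row.notna().sum()
            if agrees and more_info:
                dominated = True
                break
        keep.append(not dominated)
    return group[keep]

print(df.groupby("datasource", group_keys=False).apply(drop_redundant))

This keeps rows 0, 1 and 4, matching the expected output above.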

Here's a different approach using the same basic strategy as Bharath shetty's solution. This way feels a bit neater to me.
First, construct the example data frame:
import pandas as pd

data = {"datasource": [1, 1, 2, 3, 3, 3, 3],
        "city": ["Shallotte", None, "austin", "Casselberry", None, "Springfield", None],
        "country": ["US", "US", "US", "US", "AU", None, None]}

df = pd.DataFrame(data)
df['null'] = df.isnull().sum(axis=1)   # count of missing values per row
print(df)
city country datasource null
0 Shallotte US 1 0
1 None US 1 1
2 austin US 2 0
3 Casselberry US 3 0
4 None AU 3 1
5 Springfield None 3 1
6 None None 3 2
Now make a boolean mask using groupby and apply - we simply drop the rows with the largest null count in each group:
def null_filter(d):
    # keep everything except the rows with the group's maximum null count
    if len(d) > 1:
        return d.null < d.null.max()
    return d.null == d.null

mask = df.groupby("datasource").apply(null_filter).values
df.loc[mask].drop("null", axis=1)
Output:
city country datasource
0 Shallotte US 1
2 austin US 2
3 Casselberry US 3
4 None AU 3
5 Springfield None 3

One way is based on the None count per row: remove the rows with the maximum number of None values in each group, i.e.
# Count the missing (None) values across each row
df['Null'] = df.isnull().sum(axis=1)
# Get the maximum count within each datasource group
df['Max'] = df.groupby('datasource')['Null'].transform('max')
# Drop rows whose count equals a non-zero group maximum, then drop the helper columns
df = df[~((df['Max'] != 0) & (df['Max'] == df['Null']))].drop(['Null', 'Max'], axis=1)
Output:
datasource city country
0 1 Shallotte US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
Hope it helps

Related

Is there a simple way in PySpark to find out the number of promotions it took to convert someone into a customer?

I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2 + 3) promotions to convert ID 1 into a customer and (19) to convert ID 2.
E.g.
ID  Promotions
1   5
2   19
I am unable to think of an idea to solve it. Can you please help me?
#Corralien and mozway have helped with the solution in Python. But I am unable to implement it in Pyspark because of the huge dataframe size (>1 TB).
You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)

# Output
   ID    Date  Promotion
1   1  10-Jan          5
3   2  10-Jan         19
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max')})
       )
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a dataframe consisting of only one ID.
Assuming data are ordered by Date, I guess that
def agg_fct(df):
    # position of the first converting row
    index_of_conv = df["Converted to customer"].argmax()
    # sum the promotions up to and including that row
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
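Since the question mentions the data is too large for pandas (>1 TB), here is a rough PySpark sketch of the same running-sum-until-conversion idea. It is only an illustration under assumptions: the example frame is rebuilt from the table above, and a monotonically increasing id stands in for the event order (with real data you would order the window by a proper date/timestamp column instead):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

rows = [(1, "2-Jan", 2, 0), (1, "10-Jan", 3, 1), (1, "14-Jan", 3, 0),
        (2, "10-Jan", 19, 1), (2, "10-Jan", 8, 0), (2, "10-Jan", 12, 0)]
sdf = spark.createDataFrame(rows, ["ID", "Date", "Promotions", "Converted to customer"])

# "rn" only stands in for row order; order by a real date column if you have one
sdf = sdf.withColumn("rn", F.monotonically_increasing_id())

w = (Window.partitionBy("ID")
           .orderBy("rn")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

out = (sdf.withColumn("cum_prom", F.sum("Promotions").over(w))
          .where(F.col("Converted to customer") == 1)
          .select("ID", "Date", F.col("cum_prom").alias("Promotion")))

out.show()   # expected rows: (1, 10-Jan, 5) and (2, 10-Jan, 19)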

Counting presence or absence of tallies by rows

Hi, I am not sure how to explain what I need, but I'll try. I need a query (if there is one) that counts whether a tally was present at a point (column), so a species with more than one tally at a point counts only as one.
This is how the data looks:
Sp Site Pnt1 Pnt2 Pnt3 Total
A 1 1 1 1 3
A 2 1 1 2
A 3 1 1
B 1 1 1 1 3
B 2 1 1
C 1 1 1 2
C 2 0
I want to count whether the sites have a tally or not, and if tallies for the same species are repeated at a point I want to count them as one. I would like the resulting table to look like this:
Sp Pnt1 Pnt2 Pnt3 Total
A 1 1 1 3
B 1 1 1 3
C 0 1 1 2
Thanks for all the help you can provide.
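Purely as an illustration, one way the presence/absence logic could be expressed in pandas is sketched below. The blank cells of the partially filled rows above are ambiguous in the pasted table, so the frame here is only a hypothetical reconstruction guessed from the totals:

import pandas as pd

# Hypothetical reconstruction of the input (NaN = no tally at that point)
df = pd.DataFrame({
    "Sp":   ["A", "A", "A", "B", "B", "C", "C"],
    "Site": [1, 2, 3, 1, 2, 1, 2],
    "Pnt1": [1, 1, 1, 1, None, None, None],
    "Pnt2": [1, 1, None, 1, 1, 1, None],
    "Pnt3": [1, None, None, 1, None, 1, None],
})

pts = ["Pnt1", "Pnt2", "Pnt3"]
# Presence/absence per species and point: any tally at any site counts once
presence = df[pts].gt(0).groupby(df["Sp"]).max().astype(int)
presence["Total"] = presence.sum(axis=1)
print(presence.reset_index())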

How to filter out bad values in a data set regarding a matrix in matlab?

I wanted to ask how to "filter out" bad values from a very large data matrix in MATLAB.
e.g. I have a MATLAB data file containing a 2×5000 matrix of doubles which represents x and y coordinates. How is it possible to delete all values above or below a certain limit?
or easier:
(matrix from data file)
1 2 4 134 2
3 5 5 4 2
or
1 2 4 9 2
3 5 5 234 2
setting a certain limit and deleting the offending column:
1 2 4 2
3 5 5 2
Find the "bad" elements, e.g. A < 0 | A > 20
Find the "good" columns, e.g. ~max(A < 0 | A > 20)
Keep the "good" columns / Remove the "bad" columns, e.g. A(:, ~max(A < 0 | A > 20))

Consecutive episode

Good afternoon.
I have data like this
ID Indicator
1 0
1 1
1 0
1 1
1 0
1 1
2 0
2 1
2 1
2 1
2 1
2 1
2 1
2 1
I need to get the IDs that have at least 4 consecutive indicators equal to 1. In this example I should get ID = 2, since it has at least 4 consecutive indicators equal to 1. Please help me with how to do this in SPSS Modeler. Thank you so much for your help. ID 1 has indicators 0, 1, 0, 1, 0, 1; ID 2 has a first indicator of 0 and all the others equal to 1. There are two columns, ID and Indicator; ID 1 has 6 rows and ID 2 has 8 rows.
To be precise: I want to output the ID that has 4 or more indicators set to 1 consecutively.
What you first need is a way to count the number of consecutive Indicator = 1 records for the same ID.
For this, you can use the "Derive" node with the following settings:
Set the 'Derive as' option to Count
Set the 'Increment when' to ID = #OFFSET(ID, 1) and INDICATOR = 1
Set the 'Increment by' to 1
Set the 'Reset when' to INDICATOR = 0
Following the 'Derive' node, you can then use a 'Select' node to only select the records where the number of consecutive 1's is equal to 4, and finally, use a 'Distinct' node to keep only one record for each ID.
I have shared a sample stream that shows the process here.
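For reference, the same consecutive-count idea can be sketched in pandas to sanity-check the logic outside of Modeler (purely illustrative; the Derive/Select/Distinct stream above is the actual Modeler answer):

import pandas as pd

df = pd.DataFrame({
    "ID":        [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "Indicator": [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
})

# Start a new run whenever the indicator drops to 0 or the ID changes
run_id = (df["Indicator"].eq(0) | df["ID"].ne(df["ID"].shift())).cumsum()

# Running count of consecutive Indicator = 1 records within each run
run_len = df.groupby(run_id)["Indicator"].cumsum()

# IDs that reach at least 4 consecutive 1s
print(df.loc[run_len >= 4, "ID"].unique())   # -> [2]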

(q/kdb+) Merge items in a list

I have a list of items and need to merge them into a single column
using the list
list:(1 2;3 4 5 7;0 1 3)
index value
0 1 2
1 3 4 5 7
2 0 1 3
my goal is
select from list2
value
1
2
3
4
5
7
0
1
3
The 'raze' function flattens out one level of the list:
q) raze (1 2;3 4 5 7;0 1 3)
1 2 3 4 5 7 0 1 3
If you have a list with multiple levels of nesting, then use the 'over' adverb with raze:
q) (raze/)(1 2 3;(11 12;33 44);5 6)
To convert that to table column:
q) t:([]c:raze list)
ungroup would also work provided your table doesn't have multiple columns with different nesting (or strings)
q)ungroup ([]list)
list
----
1
2
3
4
5
7
0
1
3
If you just want your list to appear like that, I would do the following.
1 cut raze list
I see that you have used a select statement; however, if you want your column defined as this in your table, do the following:
a:raze list
tab:([] b:a)
Your output from this should look like this
q)tab
b
-
1
2
3
4
5
7
0
1
3
Overall, a more concise way to achieve what you want to do would be
select from ([]raze list)
To avoid any errors, you should not call the column header 'value', as this is a protected keyword in kdb+; when you try to reassign it as a column header, kdb will throw an assign error:
`assign
Hope this helps