Conditional logic on groupBy using pyspark-sql-dataframe

I have a pyspark sql dataframe that looks like this:
| id | code |
-------------
| 1  | 02   |
| 1  | 03   |
| 1  | 06   |
| 2  | 02   |
| 2  | 04   |
| 2  | 02   |
| 3  | 06   |
| 3  | 04   |
And I am trying to get an output like this:
| id | bin |
------------
| 1  | 1   |
| 2  | 0   |
| 3  | 1   |
The logic for bin is: if any row for an id has the code 03 OR 06, that id gets bin=1, else bin=0. For example, id=1 has bin=1 because its codes include 03 and 06; id=2 has bin=0 because none of its codes is 03 or 06; and id=3 has bin=1 because one of its codes is 06.
I have tried using groupBy together with agg, but I can only get as far as countDistinct, sum, or some flavour of those. Any help will be much appreciated.

import pyspark.sql.functions as func
from pyspark.sql import types as T

def bin_fxn(x):
    # Return 1 if the list of codes intersects {'03', '06'}, else 0
    code = set(['03', '06'])
    return 1 - int(set(x).intersection(code) == set())

def main():
    # sc is the SparkContext provided by the PySpark shell
    df = sc.parallelize([[1, '02'],
                         [1, '03'],
                         [1, '06'],
                         [2, '02'],
                         [2, '04'],
                         [2, '02'],
                         [3, '06'],
                         [3, '04']]).toDF(['id', 'code'])
    # Wrap bin_fxn as a UDF and apply it to the per-id list of codes
    bin_def = func.udf(bin_fxn, T.IntegerType())
    df.groupBy('id').agg(bin_def(func.collect_list('code')).alias('bin')).show()
    return
The code above achieves the goal; the key is combining a UDF with pyspark.sql.functions.collect_list inside agg.
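For what it's worth, the same result can also be had without a UDF. A minimal sketch using only built-in functions, assuming the same df and func alias as above:

# Flag rows whose code is 03 or 06, then take the per-id maximum:
# the max is 1 exactly when at least one matching code exists for that id
df.groupBy('id').agg(
    func.max(func.when(func.col('code').isin('03', '06'), 1).otherwise(0)).alias('bin')
).show()

Built-in aggregations like this avoid the serialization overhead of a Python UDF.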


PySpark Return Exact Match from list of strings

I have a dataset as follows:
| id | text               |
----------------------------
| 01 | hello world        |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution with .rlike() (or .contains()), sentences with either partial or exact matches to the list of words are flagged as true. I would like only exact (whole-word) matches to be returned.
Current code:
from pyspark.sql import functions as F

KEYWORDS = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
Current output:
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 1             |
| 02 | this place is hell | 1             |
Expected output:
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
Try the code below; I have only changed the KEYWORDS pattern:
from pyspark.sql.functions import col, when

data = [["01", "hello world"], ["02", "this place is hell"]]
schema = ["id", "text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
    df2
    .select(
        col('id'),
        col('text'),
        when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work:
from pyspark.sql import functions as F

Keywords = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike('(' + Keywords + r')(\s|$)'), 1).otherwise(0).alias('keyword_found')
    )
)
| id | text               | keyword_found |
---------------------------------------------
| 01 | hello world        | 0             |
| 02 | this place is hell | 1             |
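Note that both patterns above anchor keywords to whitespace or end of string, so a keyword followed by punctuation (e.g. "this place is hell.") would be missed. A small sketch using regex word boundaries instead, assuming the same df and the F alias from the question:

from pyspark.sql import functions as F

# \b matches a word boundary, so 'hell' matches as a whole word
# (even before punctuation) while 'hello' does not
KEYWORDS = r'\b(hell|horrible|sucks)\b'
df = df.withColumn(
    'keyword_found',
    F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0)
)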

columns value in order

I have this DataFrame below:
Ref °    | Indice_1 | Indice_2 | rank_1 | rank_2 | echelon_from    | section_from   | echelon_to      | section_to
---------------------------------------------------------------------------------------------------------------------
70574931 | 19       | 37.1     | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["40062"]       | ["14A"]
70574931 | 18       | 36       | 32     | 62     | ["20032"]       | ["13"]         | ["30062,40062"] | ["14,14A"]
I want to merge the rows that have the same Ref ° number: concatenate the echelon_from values, section_from values, echelon_to values and section_to values across those rows, dropping duplicate values, as in the example below, without touching the rest of the columns.
Ref °    | Indice_1 | Indice_2 | rank_1 | rank_2 | echelon_from    | section_from   | echelon_to      | section_to
---------------------------------------------------------------------------------------------------------------------
70574931 | 19       | 37.1     | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["30062,40062"] | ["14,14A"]
70574931 | 18       | 36       | 32     | 62     | ["10032,20032"] | ["11/12","13"] | ["30062,40062"] | ["14,14A"]
Some column values in my original DataFrame are duplicated; I shouldn't touch those, and I should keep their values so that the DataFrame keeps the same number of rows.
Can someone please help me with how to do this?
Thank you!
There are multiple ways of doing this. One way is to explode all the given lists and collect them back again as sets.
from pyspark.sql import functions as F

lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']
columns_not_to_concat = [c for c in df.columns if c not in lists_to_concat]

# Explode each array column into one row per element
for c in lists_to_concat:
    df = df.withColumn(c, F.explode(c))

# Group the other columns back together, collecting the exploded
# values into deduplicated sets
df = (
    df
    .groupBy(*columns_not_to_concat)
    .agg(
        *[F.collect_set(c).alias(c) for c in lists_to_concat]
    )
)
Another, more elegant way (Spark 2.4+) is to skip the explode/collect round-trip and use flatten() on the collected lists directly:
from pyspark.sql import functions as F

lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']
columns_not_to_concat = [c for c in df.columns if c not in lists_to_concat]

# collect_list gathers each group's arrays into an array of arrays;
# flatten merges them and array_distinct removes duplicates
df = df.groupBy(*columns_not_to_concat).agg(
    *[F.array_distinct(F.flatten(F.collect_list(c))).alias(c) for c in lists_to_concat]
)
References:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.flatten
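One caveat against the sample data: grouping by all of the non-list columns only merges rows whose other columns match too, and here Indice_1 differs between the two rows. To reproduce the expected output exactly (every row kept, lists shared across the whole Ref °), a window over Ref ° may be a better fit. A sketch assuming Spark 2.4+ and the column names from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Collect each list column across all rows with the same Ref °,
# flatten the resulting array of arrays, and drop duplicates;
# the row count of the DataFrame is left untouched
w = Window.partitionBy('Ref °')
for c in ['echelon_from', 'section_from', 'echelon_to', 'section_to']:
    df = df.withColumn(c, F.array_distinct(F.flatten(F.collect_list(c).over(w))))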

Tableau - How check if a value equals a value from another row and column

I have the following table:
+------------+--------------+---------+---------+---------+
| Category | Subcategory |FruitName| Date1 | Date2 |
+------------+--------------+---------+---------+---------+
| A | 1 | Foo | 2011 | 2017 |
| | +---------+---------+---------+
| | |Pineapple| 2011 | 2013 |
| | +---------+---------+---------+
| | | Apple | 2017 | 2018 |
| +--------------+---------+---------+---------+
| | 2 | Peach | 2014 | 2015 |
| | +---------+---------+---------+
| | | Orange | 2015 | 2018 |
| | +---------+---------+---------+
| | | Banana | 2009 | 2013 |
+------------+--------------+---------+---------+---------+
I'd like to display the fruit names where Date1 from one row == Date2 from another row, but only if they are equal within the same Subcategory. In the table above, this filter should retrieve the rows based on those criteria.
And the final table would look like this:
+------------+--------------+---------+---------+---------+
| Category | Subcategory |FruitName| Date1 | Date2 |
+------------+--------------+---------+---------+---------+
| A | 1 | Foo | 2011 | 2017 |
| | +---------+---------+---------+
| | | Apple | 2017 | 2018 |
| +--------------+---------+---------+---------+
| | 2 | Peach | 2014 | 2015 |
| | +---------+---------+---------+
| | | Orange | 2015 | 2018 |
+------------+--------------+---------+---------+---------+
How can I possibly achieve this?
The logic you provided does not match the output you provided. If you are after that output, your logic should be:
SELECT f1.*
FROM fruits f1
JOIN fruits f2 ON f1.Subcategory = f2.Subcategory
WHERE f1.Date1 = f2.Date2 OR f1.Date2 = f2.Date1;
If your data source supports custom SQL, you can use the above query directly. If not, you can still achieve it in Tableau with a full outer join and a calculated field (Tableau doesn't support OR conditions in joins):
1. Create a self full outer join on Subcategory.
2. Create a calculation called 'FILTER' (a sketch follows below).
3. Apply a data source filter to keep only 'FILTER' = True.
4. Hide the fields from the right-side connection and you will have the required output.
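The 'FILTER' calculation would be something along these lines; this is only a sketch, and the right-side field names such as [Date1 (fruits1)] are assumptions that depend on how Tableau aliases the second copy of the table:

[Date1] = [Date2 (fruits1)] OR [Date2] = [Date1 (fruits1)]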

Creating a query to subtract different values from a single column into a new column

Poorly worded title, but I can't think of a succinct way to describe my problem.
I have a table with the following columns:
year | month | basin_id | value
I need to take the values for all basin_ids of one year/month and subtract from that the corresponding values for all basin_ids of another year/month, and store the resulting values in such a way that they are still associated with their respective basin_ids.
This seems like it should be a rather simple query/subquery, and I can calculate the difference in values just fine with:
SELECT (val1.value-val2.value)
FROM value_table_1 as val1,
value_table_2 as val2
WHERE val1.basin_id=val2.basin_id
where value_table_1 and value_table_2 are temporary tables I've made by segregating all values associated with year1/month1 and year2/month2 for the sake of simplifying my query.
My problem from here is I get a column with all of the new values, but not with their associated basins. How can I achieve this? I am writing this within a plpgsql stored procedure, if that helps.
Say my table is as follows:
year | month | basin_id | value
-----+-------+----------+-------
2017 | 04 | 123 | 10
2017 | 04 | 456 | 6
2017 | 05 | 123 | 12
2017 | 05 | 456 | 4
and I'm given the inputs:
year1 := 2017
month1 := 04
year2 := 2017
month2 := 05
I want to get the following table as a result:
basin_id | value
---------+-------
     123 |    -2
     456 |     2
I think you want something like this:
CREATE TABLE foo AS
SELECT *
FROM ( VALUES
         ( 2010, 02, 5, 8 ),
         ( 2013, 05, 5, 3 )
     ) AS t( year, month, basinid, value );

CREATE TEMPORARY TABLE bar AS
SELECT basinid,
       f1.year AS f1y, f1.month AS f1m,
       f2.year AS f2y, f2.month AS f2m,
       f1.value - f2.value AS value
FROM foo AS f1
INNER JOIN foo AS f2
USING (basinid);
basinid | f1y | f1m | f2y | f2m | value
---------+------+-----+------+-----+----------
5 | 2010 | 2 | 2010 | 2 | 0
5 | 2010 | 2 | 2013 | 5 | 5
5 | 2013 | 5 | 2010 | 2 | -5
5 | 2013 | 5 | 2013 | 5 | 0
(4 rows)
SELECT *
FROM bar
WHERE f1y = 2013
AND f1m = 5
AND f2y = 2010
AND f2m = 2;
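Applied back to the original problem, the immediate fix is to select basin_id alongside the difference, filtering each side of the self-join to one year/month instead of building two temporary tables. A sketch against a single source table (the name basin_values is an assumption), using the example inputs:

-- val1 is year1/month1, val2 is year2/month2
SELECT val1.basin_id,
       val1.value - val2.value AS value
FROM basin_values AS val1
JOIN basin_values AS val2 USING (basin_id)
WHERE val1.year = 2017 AND val1.month = 4
  AND val2.year = 2017 AND val2.month = 5;

This returns (123, -2) and (456, 2), matching the expected result.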

Cognos Calculate Variance Crosstab (Relational)

I have a simple crosstab such as this:
Trans | Pants | Shirts |
| 2013 | 2014 | 2013 | 2014 |
---------------------------------------
Jan | 33 | 37 | 41 | 53 |
Feb | 31 | 33 | 38 | 43 |
Mar | 26 | 29 | 51 | 56 |
Pants and Shirts belong to the data item: Category
Years belong to the data item: Years
Months belong to the data item: Months
Trans (transactions) belongs to the data item: Trans
Here is what it looks like in Report Studio:
Trans | <#Category#> | <#Category#> |
| <#Years#> | <#Years#> | <#Years#> | <#Years#> |
-----------------------------------------------------------
<#Months#>| <#1234#> | <#1234#> | <#1234#> | <#1234#> |
I want to be able to calculate the variance of pants and shirts between the years. To get something like this:
Trans | Pants | Shirts |
| 2013 | 2014 | YOY Variance | 2013 | 2014 | YOY Variance |
---------------------------------------------------------------------
Jan | 33 | 37 | 12.12 | 41 | 53 | 29.27 |
Feb | 31 | 33 | 6.45 | 38 | 43 | 13.16 |
Mar | 26 | 29 | 11.54 | 51 | 56 | 9.80 |
I've tried inserting a data item for YOY Variance with the expression below, just to see if I can even get the 2014 value, but cannot; for some odd reason it only returns the 2013 values:
Total([Trans] for maximum[Year],[Category],[Months])
Any ideas? Help?
(I'm assuming you don't have a DMR.)
There is no easy/clean way to do this in Cognos. In your query, you'll have to build a calculation for each year in your output. So, something like this for 2013:
total(if ([Years] = 2013) then ([Trans]) else (0))
And basically the same for 2014.
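Spelled out, that 2014 calculation would be:
total(if ([Years] = 2014) then ([Trans]) else (0))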
Cut the Trans piece out of your crosstab. Then you'll nest those two calcs under your years. To get rid of all the zeroes or nulls, select the two columns and, from the menu, select Data, Suppress, Suppress Columns Only.
Finally, you will drop a calc in next to your Years in the crosstab (not under them). The expression will be ([2014 trans] - [2013 trans]) / [2013 trans] (or whatever you end up naming your calcs); dividing by the 2013 value is what matches the YOY Variance figures in your expected output. Format it as a percent, and you should be good to go.
Told you it was a pain!