Aggregation of a data frame based on condition (Pyspark) - pyspark

I need help with the following task: I have a groupBy result containing multiple addresses (here just a cut-out with one address) and the people occupying those addresses. I need to calculate the usage ratio of the app, i.e. divide each [name]'s active count by that [name]'s passive count, and create a new dataframe with [address][name][usage_ratio]. I have never done an aggregation like this and I have no idea where to start or how to loop over it. Can anyone help?
+------------+--------------------+----------------+-----+
| address| name| use_of_app|count|
+------------+--------------------+----------------+-----+
| 33| Mark| active| 35|
| 33| Mark| passive| 4|
| 33| Abbie| active| 30|
| 33| Abbie| passive| 2|
| 33| Anna| passive| 3|
| 33| Anna| active| 32|
| 33| Tom| passive| 38|
| 33| Tom| active| 50|
| 33| Patrick| passive| 40|
| 33| Patrick| active| 57|
+------------+--------------------+----------------+-----+

Here is my code - I sum the count column because I am not sure how many lines of each use_of_app you will have:
from pyspark.sql import functions as F

df = df.groupBy("address", "name").agg(
    (
        F.sum(F.when(F.col("use_of_app") == "active", F.col("count")))
        / F.sum(F.when(F.col("use_of_app") == "passive", F.col("count")))
    ).alias("usage_ratio")
)
df.show()
+-------+-------+------------------+
|address| name| usage_ratio|
+-------+-------+------------------+
| 33| Abbie| 15.0|
| 33| Mark| 8.75|
| 33| Tom|1.3157894736842106|
| 33| Anna|10.666666666666666|
| 33|Patrick| 1.425|
+-------+-------+------------------+
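One thing to watch for (a hypothetical tweak, not part of the original answer): if a person has no "passive" rows at all, the passive sum is null and usage_ratio comes out null. If you prefer a default value instead, you can fill it after the aggregation:
# df here is the aggregated frame produced by the groupBy above
df = df.fillna({"usage_ratio": 0.0})  # 0.0 is only a placeholder default; pick whatever makes sense for your use case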

Another option would be a pivot step:
from pyspark.sql import functions as F

(
    df.groupBy("address", "name")
    .pivot("use_of_app", values=["active", "passive"])
    .agg(F.sum("count"))
    .withColumn("ratio", F.col("active") / F.col("passive"))
    .show()
)
# Output
+-------+-----+------+-------+------------------+
|address| name|active|passive| ratio|
+-------+-----+------+-------+------------------+
| 33|Abbie| 30| 2| 15.0|
| 33| Anna| 32| 3|10.666666666666666|
| 33| Mark| 35| 4| 8.75|
+-------+-----+------+-------+------------------+
Updated according to the suggestion by Steven: .pivot("use_of_app", values=["active", "passive"]).

I found a solution which might work but is really resource-heavy and time-consuming, so I am inviting everyone to post a faster solution. My solution is to take the dataframe above, split it with .filter() on use_of_app into two separate dataframes, activeDF and passiveDF, and then join them back together on the name column, with an extra column dividing the count from activeDF by the count from passiveDF. A sketch of that approach is below.
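A minimal sketch of that filter-and-join idea, joining on both address and name so that people at different addresses stay separate (column names as in the table above):
from pyspark.sql import functions as F

activeDF = (
    df.filter(F.col("use_of_app") == "active")
    .groupBy("address", "name")
    .agg(F.sum("count").alias("active_count"))
)
passiveDF = (
    df.filter(F.col("use_of_app") == "passive")
    .groupBy("address", "name")
    .agg(F.sum("count").alias("passive_count"))
)

# Join the two halves back together and divide the counts
result = (
    activeDF.join(passiveDF, on=["address", "name"], how="inner")
    .withColumn("usage_ratio", F.col("active_count") / F.col("passive_count"))
    .select("address", "name", "usage_ratio")
)
result.show()
The groupBy/pivot answers above get the same result without the join (and the extra shuffle it triggers), which is why they tend to be faster.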

Related

How does Spark DataFrame find out some lines that only appear once?

I want to eliminate the rows whose value in the 'county' column appears only once, since they are not useful for my statistics.
I used groupBy+count to find:
fault_data.groupBy("county").count().show()
The data looks like this:
+----------+-----+
| county|count|
+----------+-----+
| A| 117|
| B| 31|
| C| 1|
| D| 272|
| E| 1|
| F| 1|
| G| 280|
| H| 1|
| I| 1|
| J| 1|
| K| 112|
| L| 63|
| M| 18|
| N| 71|
| O| 1|
| P| 1|
| Q| 82|
| R| 2|
| S| 31|
| T| 2|
+----------+-----+
Next, I want to eliminate the data whose count is 1.
But when I wrote it like this, I got an error:
fault_data.filter("count(county)=1").show()
The result is:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(county) = CAST(1 AS BIGINT))]
Invalid expressions: [count(county)];
Filter (count(county#7) = cast(1 as bigint))
+- Relation [fault_id#0,fault_type#1,acs_way#2,fault_1#3,fault_2#4,province#5,city#6,county#7,town#8,detail#9,num#10,insert_time#11] JDBCRelation(fault_data) [numPartitions=1]
So I want to know the right way, thank you.
fault_data.groupBy("county").count().where(col("count")===1).show()
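If the goal is to actually drop those rows from fault_data rather than just list the rare counties, a pyspark sketch using a left anti join (an extra step beyond the one-liner above):
from pyspark.sql import functions as F

# counties that occur exactly once
rare_counties = fault_data.groupBy("county").count().where(F.col("count") == 1)
# keep only the rows whose county is NOT in that list
cleaned = fault_data.join(rare_counties.select("county"), on="county", how="left_anti")
cleaned.show()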

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add index values to the newly added rows.
The userid values are unique, and the existing data frame will not contain the user ids from Dataframe 2.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with the newly added userids and their index values:
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by taking the max index value of the existing Dataframe and starting the index for the second Dataframe from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df_merge = df1.select('userid').union(df2.select('userid'))
w = Window.orderBy('userid')
df_result = df_merge.withColumn('indexid', F.row_number().over(w))
EDIT: After discussion in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([('a', 100), ('ab', 50), ('ba', 300), ('ced', 60), ('d', 500)], schema=['userid', 'index'])
df1 = sqlContext.createDataFrame([('fgh', 100), ('ff', 50), ('fe', 300), ('er', 60), ('fi', 500)], schema=['userid', 'dummy'])

#%% Merge the two dataframes, with a null column as the index for the new rows
df1 = df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))

#%% Define a window that puts the newly added rows last and orders them by userid;
#%% the user ids, even though they are random strings, can be ordered.
#%% If possible add a partition column here, otherwise all the data comes into one partition; consider salting.
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))

#%% For the newly added rows, define the index as the maximum existing value plus a running row count
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index')).otherwise(
        F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w)
    )
)
#%% If the number of rows in the main dataframe is huge, add an offset in the line above
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
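The question also asks whether this can be done by passing the max index value as an offset. A sketch of that simpler variant (the frame names existing_df and new_users_df are illustrative; existing_df has userid and index, new_users_df only userid):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

offset = existing_df.agg(F.max("index")).collect()[0][0]  # current maximum index in the existing frame
w = Window.orderBy("userid")  # no partition column, so the new rows are numbered on a single partition
new_indexed = new_users_df.withColumn("index", F.row_number().over(w) + F.lit(offset))
df_result = existing_df.unionByName(new_indexed)
This keeps the existing index values untouched and only numbers the new rows, at the cost of a collect() and a single-partition window over the (presumably small) new frame.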

Convert matrix to Pyspark Dataframe

I have a matrix of size 1000*10000. I want to convert this matrix into a pyspark dataframe.
Can someone please tell me how to do it? This post has an example, but my number of columns is large, so assigning column names manually will be difficult.
Thanks!
In order to create a Pyspark Dataframe, you can use the function createDataFrame()
matrix=([11,12,13,14,15],[21,22,23,24,25],[31,32,33,34,35],[41,42,43,44,45])
df=spark.createDataFrame(matrix)
df.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+---+---+---+---+---+
As you can see above, the columns will be named automatically with numbers.
You can also pass your own column names to the createDataFrame() function:
columns=[ 'mycol_'+str(col) for col in range(5) ]
df=spark.createDataFrame(matrix,schema=columns)
df.show()
+-------+-------+-------+-------+-------+
|mycol_0|mycol_1|mycol_2|mycol_3|mycol_4|
+-------+-------+-------+-------+-------+
| 11| 12| 13| 14| 15|
| 21| 22| 23| 24| 25|
| 31| 32| 33| 34| 35|
| 41| 42| 43| 44| 45|
+-------+-------+-------+-------+-------+
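If the 1000*10000 matrix is actually a NumPy array (an assumption; the question does not say), the same pattern applies, with the column names generated instead of typed out:
import numpy as np

matrix = np.arange(20).reshape(4, 5)  # small illustrative array; the same code handles 1000*10000
columns = ['mycol_' + str(i) for i in range(matrix.shape[1])]
df = spark.createDataFrame(matrix.tolist(), schema=columns)
df.show()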

Pyspark Autonumber over a partitioning column

I have a column in my data frame that is sensitive. I need to replace the sensitive values with numbers, but I have to do it so that the distinct count of the column in question stays accurate. I was thinking of a SQL function over a window partition, but couldn't find a way.
A sample dataframe is below.
df = (
    sc.parallelize([
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
    ])
    .toDF()
    .cache()
)
I would like to create a dataframe like the one below. What is a way to get this done?
You are looking for the dense_rank function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "non_sensitive_id",
    # Window.partitionBy() with no columns funnels everything through a single partition
    F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+
This is another way of doing it; it may not be very efficient because join() involves a shuffle.
Creating the DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id'])
Creating a DataFrame of distinct elements and labeling them 1,2,3... and finally joining the two dataframes.
df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+

Spark-Monotonically increasing id not working as expected in dataframe?

I have a dataframe df in Spark which looks something like this:
scala> df.show()
+--------+--------+
|columna1|columna2|
+--------+--------+
| 0.1| 0.4|
| 0.2| 0.5|
| 0.1| 0.3|
| 0.3| 0.6|
| 0.2| 0.7|
| 0.2| 0.8|
| 0.1| 0.7|
| 0.5| 0.5|
| 0.6| 0.98|
| 1.2| 1.1|
| 1.2| 1.2|
| 0.4| 0.7|
+--------+--------+
I tried to include an id column with the following code
val df_id = df.withColumn("id",monotonicallyIncreasingId)
but the id column is not what I expect:
scala> df_id.show()
+--------+--------+----------+
|columna1|columna2| id|
+--------+--------+----------+
| 0.1| 0.4| 0|
| 0.2| 0.5| 1|
| 0.1| 0.3| 2|
| 0.3| 0.6| 3|
| 0.2| 0.7| 4|
| 0.2| 0.8| 5|
| 0.1| 0.7|8589934592|
| 0.5| 0.5|8589934593|
| 0.6| 0.98|8589934594|
| 1.2| 1.1|8589934595|
| 1.2| 1.2|8589934596|
| 0.4| 0.7|8589934597|
+--------+--------+----------+
As you can see, it goes well from 0 to 5 but then the next id is 8589934592 instead of 6 and so on.
So what is wrong here? Why is the id column not properly indexed here?
It works as expected. This function is not intended to generate consecutive values. Instead, it encodes the partition number and the record index within each partition. From the documentation:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
If you want consecutive numbers, use RDD.zipWithIndex.
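For completeness, a sketch of the zipWithIndex route in pyspark (the question above uses Scala, but the equivalent RDD API exists in both languages):
# Attach a consecutive, 0-based id by going through the RDD API
rdd_with_idx = df.rdd.zipWithIndex()
df_id = rdd_with_idx.map(lambda pair: pair[0] + (pair[1],)).toDF(df.columns + ["id"])
df_id.show()
Note that zipWithIndex triggers an extra Spark job to compute partition sizes, and the ids start at 0; add 1 if you need them to start at 1.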