How to reduce a DataFrame wisely - PySpark

I want to transform the following DataFrame structure, where each id and kpi pair has two records: one carrying value_left and the other carrying value_right. I want to group the two records into a single record (as you see in the expected results).
Input DataFrame:
+---+---+----------+-----------+
| id|kpi|value_left|value_right|
+---+---+----------+-----------+
|  1|sum|        10|       null|
|  1|sum|      null|         20|
|  2|avg|        15|       null|
|  2|avg|      null|         15|
+---+---+----------+-----------+
Expected output dataFrame
+---+---+----------+-----------+
| id|kpi|value_left|value_right|
+---+---+----------+-----------+
|  1|sum|        10|         20|
|  2|avg|        15|         15|
+---+---+----------+-----------+
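One way to collapse the pairs (a sketch that is not part of the original post; it assumes the input above is called df) is to group by id and kpi and keep the non-null value on each side with first(..., ignorenulls=True):
from pyspark.sql import functions as F

# Collapse each (id, kpi) pair of rows into one row:
# first(..., ignorenulls=True) picks the single non-null value in each column.
collapsed = (df.groupBy("id", "kpi")
               .agg(F.first("value_left", ignorenulls=True).alias("value_left"),
                    F.first("value_right", ignorenulls=True).alias("value_right")))
collapsed.show()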

Related

Is there any method by which we can limit the rows in repartition function?

In Spark I am trying to limit the number of rows to 100 in each partition. But I don't want to write it to a file yet; I need to perform more operations on the data before overwriting the records.
You can do it using repartition().
To keep n records in each partition, repartition your data into total_data_count / n partitions.
For example: I have 100 records; if I want each partition to hold 10 records, I have to repartition my data into 10 parts with df.repartition(10).
>>> from pyspark.sql.functions import spark_partition_id, asc
>>> df = spark.read.csv("/path to csv/sample2.csv", header=True)
>>> df.count()
100
>>> df1=df.repartition(10)
>>> df1\
... .withColumn("partitionId", spark_partition_id())\
... .groupBy("partitionId")\
... .count()\
... .orderBy(asc("count"))\
... .show()
+-----------+-----+
|partitionId|count|
+-----------+-----+
|          6|   10|
|          3|   10|
|          5|   10|
|          9|   10|
|          8|   10|
|          4|   10|
|          7|   10|
|          1|   10|
|          0|   10|
|          2|   10|
+-----------+-----+
Here you can see that each partition has 10 records.
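To generalize the arithmetic above, a small sketch (it assumes the target of 100 rows per partition from the question; note that repartition spreads rows roughly evenly rather than guaranteeing an exact row count per partition):
import math

rows_per_partition = 100
# total_data_count / n, rounded up so no partition exceeds the target
num_partitions = math.ceil(df.count() / rows_per_partition)
df_limited = df.repartition(num_partitions)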

Pyspark get count in aggregate table

I have a table that looks like this:
+-------------+-----+
| PULocationID| fare|
+-------------+-----+
|            1|    5|
|            1|   15|
|            2|    2|
+-------------+-----+
I want to get a table that looks like this:
+-------------+----------+------+
| PULocationID| avg_fare | count|
+-------------+----------+------+
|            1|        10|     2|
|            2|         2|     1|
+-------------+----------+------+
Here is what I'm trying:
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg"},
        {"PULocationID": "count"}
    )
If I take out the count line, it works fine and produces the avg column. But I also need the count of how many rows had that particular PULocationID.
NOTE: I can't add any imports other than from pyspark.sql.functions import col.
Thanks for the help!
I was so close, I was just formatting it as two dictionaries instead of one.
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg", "PULocationID": "count"}
    )
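A small follow-up, not part of the original answer: the dictionary form produces generated column names such as avg(total_amount) and count(PULocationID), so if you want the avg_fare / count headers from the expected output you can rename them without any extra imports:
result_table = trips.groupBy("PULocationID") \
    .agg({"total_amount": "avg", "PULocationID": "count"}) \
    .withColumnRenamed("avg(total_amount)", "avg_fare") \
    .withColumnRenamed("count(PULocationID)", "count")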
This should be the working solution for you - use avg() and count()
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, 5), (1, 15), (2, 2)], ["PULocationID", "fare"])
df.show()
df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"), F.count("PULocationID").alias("count"))
df_group.show()
Input
+------------+----+
|PULocationID|fare|
+------------+----+
|           1|   5|
|           1|  15|
|           2|   2|
+------------+----+
Output
+------------+--------+-----+
|PULocationID|avg_fare|count|
+------------+--------+-----+
|           1|    10.0|    2|
|           2|     2.0|    1|
+------------+--------+-----+

Spark: adding an index to a dataframe and appending another dataset that doesn't have an index

I have a dataset that has a userid column and index values.
+---------+--------+
|   userid|   index|
+---------+--------+
|    user1|       1|
|    user2|       2|
|    user3|       3|
|    user4|       4|
|    user5|       5|
|    user6|       6|
|    user7|       7|
|    user8|       8|
|    user9|       9|
|   user10|      10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid values are unique, and the existing data frame will not contain the user ids of DataFrame 2.
+----------+
|    userid|
+----------+
|    user11|
|    user21|
|    user41|
|    user51|
|    user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
|   userid|   index|
+---------+--------+
|    user1|       1|
|    user2|       2|
|    user3|       3|
|    user4|       4|
|    user5|       5|
|    user6|       6|
|    user7|       7|
|    user8|       8|
|    user9|       9|
|   user10|      10|
|   user11|      11|
|   user21|      12|
|   user41|      13|
|   user51|      14|
|   user64|      15|
+---------+--------+
Is it possible to achieve this by passing a max index value, so that the index for the second DataFrame starts from that given value?
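For reference, a minimal sketch of the approach the question describes (take the max index from the first frame and offset a row_number on the second one); the names df1 and df2 are assumptions for the two frames shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Continue the index of df1 (userid, index) onto df2 (userid only).
max_index = df1.agg(F.max("index")).collect()[0][0]
df2_indexed = df2.withColumn(
    "index", F.row_number().over(Window.orderBy("userid")) + F.lit(max_index)
)
result = df1.select("userid", "index").union(df2_indexed.select("userid", "index"))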
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
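For the monotonically_increasing_id() alternative mentioned above, a minimal sketch (keep in mind the generated ids are increasing and unique but not consecutive, so they will not form a clean 1..N index):
from pyspark.sql import functions as F

df_result = df_merge.withColumn("indexid", F.monotonically_increasing_id())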
EDIT: After discussions in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null columns as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%% Define a window that places the newly added rows last and orders them by userid
#%% The user ids, even though random strings, can be ordered
w= Window.orderBy(F.col('index').asc_nulls_last(),F.col('userid'))# if possible add a partition column here, otherwise all your data will come in one partition, consider salting
#%% For the newly added rows, define index as the maximum value + increment of number of rows in main dataframe
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w))
)
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
|    ab|   50|       50|
|   ced|   60|       60|
|     a|  100|      100|
|    ba|  300|      300|
|     d|  500|      500|
|    er| null|      506|
|    fe| null|      507|
|    ff| null|      508|
|   fgh| null|      509|
|    fi| null|      510|
+------+-----+---------+

How to find the difference of two dataframes

I am working on some unit-testing Spark code which should be able to generate the difference between two dataframes (raw bucket and curated bucket). Both dataframes (buckets) are the same, and we want to execute this code to capture the possible changes after we copy files from raw to curated. I am aware that I can use the except function as follows:
val difference = CuratedDataFrame.union(RawDataFrame).except(CuratedDataFrame.intersect(RawDataFrame))
+-----------+-------+-------------+---------+---------------+
|     record|    pid|      feetype|     freq|        default|
+-----------+-------+-------------+---------+---------------+
|          1|     45|          FAC|        Y|              T|
|          1|     45|          FAC|        Y|            TTY|
|          1|     47|          FAC|        R|              M|
|          1|     99|          FAC|        R|              M|
+-----------+-------+-------------+---------+---------------+
The except function returns the entire row, but my desired output is as follows:
+-----------+-------+-------------+---------+---------------+
|     record|    pid|      feetype|     freq|        default|
+-----------+-------+-------------+---------+---------------+
|       null|[47,99]|         null|     null|           null|
|       null|   null|         null|     null|       [T, TTY]|
+-----------+-------+-------------+---------+---------------+
That means if there is a change in a column then it should appear; if there is no change then it should be hidden or null.
To do this I am using the following approach:
val mapDiffs = (name: String) =>
  when($"l.$name" === $"r.$name", null)
    .otherwise(array($"l.$name", $"r.$name")).as(name)

val result = difference.as("l")
  .join(RawDataFrame.as("r"), $"l.primaryKey" === $"r.primaryKey", "inner")
  .select($"l.primaryKey" :: cols.map(mapDiffs): _*)
The above approach requires a primary key to be able to join both dataframes and compare them row by row. Neither of the dataframes has a primary key, so I had to combine some of the columns to define one:
+-----------+-------+-------------+---------+---------------+----------+
|     record|    pid|      feetype|     freq|        default|primaryKey|
+-----------+-------+-------------+---------+---------------+----------+
|          1|     40|          FAC|        A|              N|    FAC40A|
|          1|     45|          FAC|        Y|              T|    FAC45Y|
|          1|     47|          FAC|        R|              M|    FAC47R|
+-----------+-------+-------------+---------+---------------+----------+
The problem is that if any changes happen in the target bucket, the derived primary key changes as well, so comparing the two dataframes becomes impossible.

Pyspark Autonumber over a partitioning column

I have a column in my data frame that is sensitive. I need to replace the sensitive value with a number, but I have to do it so that the distinct count of the column in question stays accurate. I was thinking of a SQL function over a window partition, but couldn't find a way.
A sample dataframe is below.
df = (sc.parallelize([
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"}
    ]).toDF()
    .cache()
)
I would like to create a dataframe where each distinct sensitive_id is paired with a numeric non_sensitive_id, as shown in the outputs below.
What is a way to get this done?
You are looking for the dense_rank function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "non_sensitive_id",
    F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
|        1234|               1|
|        1234|               1|
|        1234|               1|
|        2345|               2|
|        2345|               2|
|        6789|               3|
|        6789|               3|
|        6789|               3|
|        6789|               3|
+------------+----------------+
This is another way of doing it; it may not be very efficient because join() involves a shuffle.
Creating the DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id'])
Creating a DataFrame of distinct elements and labeling them 1,2,3... and finally joining the two dataframes.
df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
|        1234|               1|
|        1234|               1|
|        1234|               1|
|        2345|               2|
|        2345|               2|
|        6789|               3|
|        6789|               3|
|        6789|               3|
|        6789|               3|
+------------+----------------+