How to find the size of a PySpark dataframe

I have two PySpark dataframes, as shown below.
df1:
+---+----------+------+
|age| dob|is_fan|
+---+----------+------+
| 30|1990-05-04| true|
| 26|1994-09-24| false|
+---+----------+------+
df2:
+---+----------+------+
|age| dob|is_fan|
+---+----------+------+
| 29|1990-05-03| true|
| 25|1994-09-23| false|
+---+----------+------+
I am doing a union operation on these and creating a new dataframe df3:
+---+----------+------+
|age| dob|is_fan|
+---+----------+------+
| 29|1990-05-03| true|
| 25|1994-09-23| false|
| 30|1990-05-04| true|
| 26|1994-09-24| false|
+---+----------+------+
I want to find the size of the df3 dataframe in MB.
For a single dataframe, df1, I tried the code below and looked into the Statistics part of the plan to find the size. But after the union there are multiple Statistics entries in the plan.
df3.createOrReplaceTempView('test')
spark.sql('explain cost select * from test').show(truncate=False)
Is there any other way to find the size of a dataframe after a union operation?
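One possible alternative (just a sketch that goes through Spark's internal query-execution objects via py4j, so it is not a stable public API and the exact calls may vary by Spark version) is to read sizeInBytes from the root of the optimized plan of df3 directly, rather than parsing the explain cost text:
# Sketch only: _jdf and queryExecution() are Spark internals, not public API.
# The statistics attached to the root of the optimized plan estimate the whole result.
size_in_bytes = df3._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
size_in_mb = int(str(size_in_bytes)) / (1024 * 1024)
print(f"Estimated size of df3: {size_in_mb:.2f} MB")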

Related

How to combine two dataframes df1 and df2 keeping common columns from df2

I have df1
+-----+------------------------+---------------------------+----------------------+---------+
|JobId|TotalRecordType1Count |TotalRecordType2Count |TotalRecordType3Count |JobStatus|
+-----+------------------------+---------------------------+----------------------+---------+
| 100| 0| 0| 0|Success |
+-----+------------------------+---------------------------+----------------------+---------+
and df2:
+---------------------------+----------------------+
|TotalRecordType1Count |TotalRecordType2Count |
+---------------------------+----------------------+
| 800| 900|
+---------------------------+----------------------+
Both df1 and df2 will have only one row.
I want to combine df1 and df2 on the common count columns and keep the counts from df2:
+-----+------------------------+---------------------------+----------------------+---------+
|JobId|TotalRecordType1Count |TotalRecordType2Count |TotalRecordType3Count |JobStatus|
+-----+------------------------+---------------------------+----------------------+---------+
| 100| 800| 900| 0|Success |
+-----+------------------------+---------------------------+----------------------+---------+
You can do a cross join and select the columns as needed:
val cols = df1.columns.map(x => if (df2.columns.contains(x)) df2(x) else df1(x))
val result = df1.crossJoin(df2).select(cols: _*)
result.show
+-----+---------------------+---------------------+---------------------+---------+
|JobId|TotalRecordType1Count|TotalRecordType2Count|TotalRecordType3Count|JobStatus|
+-----+---------------------+---------------------+---------------------+---------+
| 100| 800| 900| 0| Success|
+-----+---------------------+---------------------+---------------------+---------+
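The answer above is Scala; in PySpark a roughly equivalent sketch (assuming the same df1/df2 schemas as shown) would be:
# Sketch: pick each column from df2 when it exists there, otherwise from df1.
cols = [df2[c] if c in df2.columns else df1[c] for c in df1.columns]
result = df1.crossJoin(df2).select(*cols)
result.show()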

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col.
You can join the two tables and specify how='left_semi'.
A left semi-join returns the rows from the left side of the relation that have a match on the right; only the left-hand columns appear in the result, so there is no BID column to drop.
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how='left_semi')
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
In case you have duplicates or multiple values in the second dataframe and you want only the distinct values, the approach below can be useful for such use cases.
Create the two dataframes:
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column in the second dataframe into a list variable, then filter on it:
from pyspark.sql import functions as F
df_lookup_var = df_lookup.agg(F.collect_set("val").alias("val")).collect()[0][0]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too, collecting the BID values to the driver first, since isin() expects plain Python values rather than a column:
table_a.where(col("AID").isin([row.BID for row in table_b.select("BID").collect()]))

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+---+-----+-----+-----+
| id|col_1|col_2|col_n|
+---+-----+-----+-----+
|  1|  7.5|  0.1|  2.0|
|  2|  0.3|  3.5| 10.5|
+---+-----+-----+-----+
Updated DF:
+---+-----+-----+-----+
| id|col_1|col_2|col_n|
+---+-----+-----+-----+
|  1|  1.0|0.013| 0.26|
|  2|0.028| 0.33|  1.0|
+---+-----+-----+-----+
Any help would be appreciated.
You can compute the row-wise maximum over an array of the columns, then loop over the columns and divide each one by that maximum:
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
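A variant of the same idea (just a sketch, not part of the original answer) uses greatest() instead of building an array, and drops the helper column at the end:
from pyspark.sql.functions import col, greatest

# greatest() takes the row-wise maximum across the listed columns.
cols = df.columns[1:]
df2 = df.withColumn('max', greatest(*[col(c) for c in cols]))
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2 = df2.drop('max')
df2.show()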

PySpark - data overwritten in Partition

I am seeing a situation where, when I save a PySpark dataframe to a Hive table partitioned by multiple columns, it overwrites the data in the subpartition too. Or maybe I am wrong to assume it is a subpartition.
I want to treat the column 'month' as a subpartition, so that I can see 4 records in the Hive table instead of 2 when I save df2 to the same table.
mode='append' would work, but if year and month are the same, I want the data to be overwritten. Is there a way to do this when saving a PySpark dataframe?
>>> df1 = spark.sql('select * from test_input')
>>> df1.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 01|
| c| d|2018| 01|
+---+---+----+-----+
>>> df1.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 01|
| c| d|2018| 01|
+---+---+----+-----+
>>> df2 = spark.sql('select * from test_input')
>>> df2.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 02|
| c| d|2018| 02|
+---+---+----+-----+
>>> df2.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 02|
| c| d|2018| 02|
+---+---+----+-----+
It seems like you misunderstand the concept of partitioning.
This is not the window-function PARTITION BY that you would come across in a SQL statement; it instead refers to the way the data is stored and referenced in memory or on the file system.
Changing the partitioning of a Spark dataframe will never alter the number of rows in that dataframe.
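That said, if the intent is to replace only the year/month partitions present in df2 and keep the rest of the table, one option worth checking (a sketch, assuming Spark 2.3+ and that test_output already exists as a table partitioned by year and month) is dynamic partition overwrite together with insertInto:
# Sketch: with partitionOverwriteMode=dynamic, only the partitions present in
# df2 (here year=2018/month=02) are rewritten; other partitions are left alone.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df2.write.insertInto("test_output", overwrite=True)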

Spark (Scala): Insert missing rows to complete sequence in a specific column

I have a data source that comes in with certain rows missing when they are invalid.
In the following DataFrame, each user should have 3 rows, from index 0 to index 2.
val df = sc.parallelize(List(
  ("user1", 0, true),
  ("user1", 1, true),
  ("user1", 2, true),
  ("user2", 1, true),
  ("user3", 0, true),
  ("user3", 2, true)
)).toDF("user_id", "index", "is_active")
For instance:
user1 has all the necessary indexes.
user2 is missing index 0 and index 2.
user3 is missing index 1.
Like this:
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 1| true|
| user3| 0| true|
| user3| 2| true|
+-------+-----+---------+
I'd like to insert the default rows and make them into the following DataFrame.
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 0| false|
| user2| 1| true|
| user2| 2| false|
| user3| 0| true|
| user3| 1| false|
| user3| 2| true|
+-------+-----+---------+
I've seen a separate, similar question where the answer was to pivot the table so that each user has 3 columns. But first, that relies on index 0 to 2 existing for at least some users; in my real case the index spans a very large range, so I cannot guarantee that after the pivot all columns would be complete. It also seems a pretty expensive operation to pivot and then un-pivot to get the second DataFrame.
I also tried to create a new DataFrame like this:
val indexDF = df.select("user_id").distinct.join(sc.parallelize(Seq.range(0, 3)).toDF("index"))
val result = indexDF.join(df, Seq("user_id", "index"), "left").na.fill(false)
This actually works, but when I ran it with real data (millions of records and hundreds of index values) it took very long, so I suspect it could be done more efficiently.
Thanks in advance!