How can I generate the expected value, ExpectedGroup, such that the same value is kept for consecutive rows where cond1 is True, and the value increments by 1 after we run into a False in cond1?
Consider:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame(sc.parallelize([
['A', '2019-01-01', 'P', 'O', 2, None],
['A', '2019-01-02', 'O', 'O', 5, 1],
['A', '2019-01-03', 'O', 'O', 10, 1],
['A', '2019-01-04', 'O', 'P', 4, None],
['A', '2019-01-05', 'P', 'P', 300, None],
['A', '2019-01-06', 'P', 'O', 2, None],
['A', '2019-01-07', 'O', 'O', 5, 2],
['A', '2019-01-08', 'O', 'O', 10, 2],
['A', '2019-01-09', 'O', 'P', 4, None],
['A', '2019-01-10', 'P', 'P', 300, None],
['B', '2019-01-01', 'P', 'O', 2, None],
['B', '2019-01-02', 'O', 'O', 5, 3],
['B', '2019-01-03', 'O', 'O', 10, 3],
['B', '2019-01-04', 'O', 'P', 4, None],
['B', '2019-01-05', 'P', 'P', 300, None],
]),
['ID', 'Time', 'FromState', 'ToState', 'Hours', 'ExpectedGroup'])
# condition statement
cond1 = (df.FromState == 'O') & (df.ToState == 'O')
df = df.withColumn('condition', cond1.cast("int"))
df = df.withColumn('conditionLead', F.lead('condition').over(Window.orderBy('ID', 'Time')))
df = df.na.fill(value=0, subset=["conditionLead"])
df = df.withColumn('finalCondition', ( (F.col('condition') == 1) & (F.col('conditionLead') == 1)).cast('int'))
# working pandas option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.shift(-1) & cond1).cumsum().mask(~cond1)
# other working option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.diff()&cond1).cumsum().where(cond1)
# failing here
windowval = (Window.partitionBy('ID').orderBy('Time').rowsBetween(Window.unboundedPreceding, 0))
df = df.withColumn('ExpectedGroup2', F.sum(F.when(cond1, F.col('finalCondition'))).over(windowval))
Just use the same logic shown in your Pandas code: use the Window lag function to get the previous value of cond1, set the flag to 1 only when the current cond1 is true and the previous cond1 is false, and then do the cumsum based on cond1, see the code below. (BTW, you probably want to add ID to the partitionBy clause of the WindowSpec; in that case the last ExpectedGroup1 should be 1 instead of 3.)
from pyspark.sql import functions as F, Window
w = Window.partitionBy().orderBy('ID', 'time')
df_new = (
    df.withColumn('cond1', (F.col('FromState') == 'O') & (F.col('ToState') == 'O'))
      .withColumn('f', F.when(F.col('cond1') & (~F.lag(F.col('cond1')).over(w)), 1).otherwise(0))
      .withColumn('ExpectedGroup1', F.when(F.col('cond1'), F.sum('f').over(w)))
)
df_new.show()
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
| ID| Time|FromState|ToState|Hours|ExpectedGroup|cond1| f|ExpectedGroup1|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
| A|2019-01-01| P| O| 2| null|false| 0| null|
| A|2019-01-02| O| O| 5| 1| true| 1| 1|
| A|2019-01-03| O| O| 10| 1| true| 0| 1|
| A|2019-01-04| O| P| 4| null|false| 0| null|
| A|2019-01-05| P| P| 300| null|false| 0| null|
| A|2019-01-06| P| O| 2| null|false| 0| null|
| A|2019-01-07| O| O| 5| 2| true| 1| 2|
| A|2019-01-08| O| O| 10| 2| true| 0| 2|
| A|2019-01-09| O| P| 4| null|false| 0| null|
| A|2019-01-10| P| P| 300| null|false| 0| null|
| B|2019-01-01| P| O| 2| null|false| 0| null|
| B|2019-01-02| O| O| 5| 3| true| 1| 3|
| B|2019-01-03| O| O| 10| 3| true| 0| 3|
| B|2019-01-04| O| P| 4| null|false| 0| null|
| B|2019-01-05| P| P| 300| null|false| 0| null|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
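For the partitionBy variant mentioned above, a minimal sketch (assuming the same df) could look like this; with the window partitioned per ID, the counter restarts and B's last ExpectedGroup1 becomes 1:
from pyspark.sql import functions as F, Window

# Same logic as above, but the window is partitioned by ID, so the counter restarts per ID.
w_id = Window.partitionBy('ID').orderBy('Time')
df_per_id = (
    df.withColumn('cond1', (F.col('FromState') == 'O') & (F.col('ToState') == 'O'))
      .withColumn('f', F.when(F.col('cond1') & ~F.lag('cond1', 1, False).over(w_id), 1).otherwise(0))
      .withColumn('ExpectedGroup1', F.when(F.col('cond1'), F.sum('f').over(w_id)))
)
df_per_id.show()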
How to create a group column counter in PySpark?
To create a group column counter in PySpark, we can use the Window function and the row_number function. The Window function allows us to define a partitioning and ordering criteria for the rows, and the row_number function returns the position of the row within the window.
For example, suppose we have a PySpark DataFrame called df with the following data:
+---+----+--------+
| id|name|category|
+---+----+--------+
|  1|   A|       X|
|  2|   B|       X|
|  3|   C|       Y|
|  4|   D|       Y|
|  5|   E|       Z|
+---+----+--------+
We want to create a new column called group_counter that assigns a number to each row within the same category, starting from 1. To do this, we can use the following code:
# Import the required modules
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# Define the window specification
window = Window.partitionBy("category").orderBy("id")
# Create the group column counter
df = df.withColumn("group_counter", row_number().over(window))
# Show the result
df.show()
The output of the code is:
+---+----+--------+-------------+
| id|name|category|group_counter|
+---+----+--------+-------------+
|  1|   A|       X|            1|
|  2|   B|       X|            2|
|  3|   C|       Y|            1|
|  4|   D|       Y|            2|
|  5|   E|       Z|            1|
+---+----+--------+-------------+
As we can see, the group_counter column increments by 1 for each row within the same category, and resets to 1 when the category changes.
Why is creating a group column counter useful?
Creating a group column counter can be useful for various purposes, such as:
Ranking the rows within a group based on some criteria, such as sales, ratings, or popularity.
Assigning labels or identifiers to the rows within a group, such as customer segments, product categories, or order numbers.
Performing calculations or aggregations based on the group column counter, such as cumulative sums, averages, or percentages (a brief sketch follows below).
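For instance, a minimal sketch of the ranking and aggregation uses, assuming the df and group_counter column built in the example above:
from pyspark.sql import functions as F

# Ranking use: keep at most the first two rows of each category.
top2_per_category = df.filter(F.col("group_counter") <= 2)

# Aggregation use: the largest counter per category equals that category's row count.
rows_per_category = (
    df.groupBy("category")
      .agg(F.max("group_counter").alias("rows_in_category"))
)
rows_per_category.show()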
I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>= 3), i.e. row 1 is ok, row 2 is bad on column 3, row 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (>= 3), i.e. row 1 is ok, row 2 is bad on column 3, row 3 is bad on column 2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary? The desired output looks something like this:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+
The real dataframes have 300+ columns and millions of rows, so I cannot hard-code the column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
  .sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
  spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
    s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushed a little further for a better result.
df.select(
    col("*"),
    array( // make an array of column name/value/length structs
      (for { col_name <- df.columns } yield
        struct(
          length(col(col_name)).as("length"),
          lit(col_name).as("col"),
          col(col_name).cast("String").as("col_value")
        )
      ).toSeq: _*
    ).alias("rowInfo")
  )
  .select(
    col("rowId"),
    explode( // explode the array into rows
      expr("filter(rowInfo, x -> x.length >= 3)") // filter the array for the lengths you're interested in
    ).as("rowInfo")
  )
  .select(
    col("rowId"),
    col("rowInfo.*") // turn struct fields into columns
  )
  .sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
    col("*"),
    length( // take the length
      concat( // slap all the columns together
        (for (col_name <- df.columns) yield col(col_name)).toSeq: _*
      )
    ).as("length")
  )
  .sort( // order by total length
    col("length").desc
  ).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] will sort on the first field first and the second field next. This works because we put the size of the string up front. If you re-order the fields you'll get different results. You could easily accept more than one result if you wanted to, but I think discovering which row is problematic is likely enough.
df.select(
    col("*"),
    reverse( // reverse so the largest struct comes first
      sort_array( // sort ascending on (length, column name, value)
        array( // collect each column's length/name/value into an array of structs
          (for (col_name <- df.columns) yield
            struct(length(col(col_name)), lit(col_name), col(col_name).cast("String"))
          ).toSeq: _*
        )
      )
    )(0) // grab the row max
    .alias("rowMax")
  )
  .sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+
Hi everyone!
I have tried to filter a dataset in pyspark. I had to filter the column date (date type) and I have written this code, but there is something wrong: the dataset is empty. Could someone tell me how to fix it?
df = df.filter((F.col("date") > "2018-12-12") & (F.col("date") < "2019-12-12"))
Thanks
You first need to make sure the date column is in date format, then use lit for your filter:
Example df:
df = spark.createDataFrame(
[
('20/12/2018', '1', 50),
('18/01/2021', '2', 23),
('31/01/2022', '3', -10)
], ['date', 'id', 'value']
)
df.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|20/12/2018| 1| 50|
|18/01/2021| 2| 23|
|31/01/2022| 3| -10|
+----------+---+-----+
from pyspark.sql import functions as F
df\
.withColumn('date', F.to_date('date', 'd/M/y'))\
.filter((F.col('date') > F.lit('2018-12-12')) & (F.col("date") < F.lit('2019-12-12')))\
.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|2018-12-20| 1| 50|
+----------+---+-----+
I have PySpark code that effectively groups up rows numerically, and increments when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
df \
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few ways to handle this in pandas, but I'm having difficulty understanding the applications in pyspark and the ways in which grouped data is different in pandas vs spark (e.g. do I need to use something called UDAFs?)
Create a column zero_or_first by checking whether the size is zero or the row is the first row. Then sum.
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here's the output. You can see that column group == grp, where group is the expected result.
+---+---------------+----------+----+-----+---+-------------+---+
| ID| X| date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33| []|2017-01-01| 0| 1| 1| 1| 1|
| 33| [banana]|2017-01-04| 1| 2| 4| 0| 2|
| 33|[apple, orange]|2017-01-02| 2| 1| 2| 0| 1|
| 33| []|2017-01-03| 0| 2| 3| 1| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1| 1| 1|
| 55| [banana]|2017-01-01| 1| 1| 2| 0| 1|
| 55| []|2017-01-03| 0| 2| 3| 1| 2|
+---+---------------+----------+----+-----+---+-------------+---+
I added a window function, and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe - but I am interested in knowing if there is a more efficient way to do this.
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window).alias('index')) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID| X| date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33| []|2017-01-01| 0| 1| 1|
| 33|[apple, orange]|2017-01-02| 2| 2| 1|
| 33| []|2017-01-03| 0| 3| 2|
| 33| [banana]|2017-01-04| 1| 4| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1|
| 55| []|2017-01-03| 0| 2| 2|
+---+---------------+----------+----+-----+---+
I am trying to convert the following Python dict into PySpark DataFrame but I am not getting expected output.
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF() # Result not as expected
df_dict.show()
Is there a way to do this without using Pandas?
Quoting myself:
I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
So the easiest thing is to convert your dictionary into this format. You can easily do this using zip():
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#+-------+-------+
The above assumes that all of the lists are the same length. If this is not the case, you would have to use itertools.izip_longest (python2) or itertools.zip_longest (python3).
from itertools import izip_longest as zip_longest # use this for python2
#from itertools import zip_longest # use this for python3
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30, 40]}
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#| null| 40|
#+-------+-------+
Your dict_lst is not really the format you want to adopt to create a dataframe. It would be better if you had a list of dict instead of a dict of list.
This code creates a DataFrame from your dict of lists:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows)==max(nb_rows) #We must have the same nb of elem for each key
row_lst = []
columns = dict_lst.keys()
for i in range(nb_rows[0]):
    row_values = [lst[i] for lst in values_lst]
    row_dict = {column: value for column, value in zip(columns, row_values)}
    row = Row(**row_dict)
    row_lst.append(row)
df = sqlContext.createDataFrame(row_lst)
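If you prefer the list-of-dicts shape mentioned above, a minimal sketch (assuming all the lists have the same length) is:
# Reshape the dict of lists into a list of dicts, one dict per row.
rows_as_dicts = [
    dict(zip(dict_lst.keys(), row_values))
    for row_values in zip(*dict_lst.values())
]
# rows_as_dicts == [{'letters': 'a', 'numbers': 10}, {'letters': 'b', 'numbers': 20}, {'letters': 'c', 'numbers': 30}]
df = sqlContext.createDataFrame([Row(**d) for d in rows_as_dicts])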
Using pault's answer above I imposed a specific schema on my dataframe as follows:
import pyspark
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dictToDF').getOrCreate()
get data:
dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()
create schema:
from pyspark.sql.types import *
myschema = StructType([
    StructField("letters", StringType(), True),
    StructField("numbers", IntegerType(), True)
])
create df from dictionary - with schema:
df=spark.createDataFrame(zip(*data), schema = myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+
show df schema:
df.printSchema()
root
|-- letters: string (nullable = true)
|-- numbers: integer (nullable = true)
You can also use a Python list to quickly prototype a DataFrame. The idea is based on Databricks's tutorial.
df = spark.createDataFrame(
[(1, "a"),
(1, "a"),
(1, "b")],
("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 1| a|
| 1| b|
+---+-----+
Try this out:
dict_lst = [{'letters': 'a', 'numbers': 10},
{'letters': 'b', 'numbers': 20},
{'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF() # Result as expected
Output:
>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+
The most efficient approach is to use Pandas
import pandas as pd
spark.createDataFrame(pd.DataFrame(dict_lst))