Filter df by date using pyspark

Hi everyone!
I tried to filter a dataset in PySpark on the column date (date type) with the code below, but something is wrong: the resulting dataset is empty. Could someone tell me how to fix it?
df = df.filter((F.col("date") > "2018-12-12") & (F.col("date") < "2019-12-12"))
Thanks

First make sure the date column really is of date type (not a string), then use lit for the filter bounds:
Example df:
df = spark.createDataFrame(
[
('20/12/2018', '1', 50),
('18/01/2021', '2', 23),
('31/01/2022', '3', -10)
], ['date', 'id', 'value']
)
df.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|20/12/2018| 1| 50|
|18/01/2021| 2| 23|
|31/01/2022| 3| -10|
+----------+---+-----+
from pyspark.sql import functions as F
df\
.withColumn('date', F.to_date('date', 'd/M/y'))\
.filter((F.col('date') > F.lit('2018-12-12')) & (F.col("date") < F.lit('2019-12-12')))\
.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|2018-12-20| 1| 50|
+----------+---+-----+
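One possible cause of the empty result (an assumption, since the question does not show the schema) is that the column actually held strings such as 20/12/2018, so the comparison against '2018-12-12' was done character by character. A minimal plain-Python sketch of that effect, using the sample values from the example above and no Spark at all:

```python
from datetime import date, datetime

# Sample values from the example df, stored as dd/MM/yyyy text.
raw = ["20/12/2018", "18/01/2021", "31/01/2022"]
lo, hi = "2018-12-12", "2019-12-12"

# Comparing the raw text is lexicographic, so nothing falls in range.
as_text = [d for d in raw if lo < d < hi]

# Parsing first gives a real chronological comparison.
parsed = [datetime.strptime(d, "%d/%m/%Y").date() for d in raw]
lo_d, hi_d = date.fromisoformat(lo), date.fromisoformat(hi)
as_dates = [d for d in parsed if lo_d < d < hi_d]

print(as_text)   # []
print(as_dates)  # [datetime.date(2018, 12, 20)]
```

This mirrors what F.to_date does in the answer: the values are converted to real dates before the range test is applied.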

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (length >= 3): row 1 is OK, row 2 is bad on col3, row 3 is bad on col2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (length >= 3): row 1 is OK, row 2 is bad on col3, row 3 is bad on col2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
Desired output:
columnName maxLength filename
col2       4         file3
col3       3         file2
The real dataframes have 300+ columns and millions of rows so I cannot hard-type column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushing this a little further gives a more useful result:
df.select(
col("*"),
array( // make array of columns name/value/length
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") // filter the array for the length you're interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array of structs sorts on the first field first and on the second field next. This works because we put the length of the string up front. If you reorder the fields you'll get different results. You could easily keep more than one result if you wanted, but I think discovering that a row is problematic is likely enough.
df.select(
col("*"),
reverse( // reverse the ascending sort so the largest entry comes first
sort_array( // sorts ascending by default
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

Get the first row with positive amount in Pyspark

I have data like this
I want to flag the first positive amount as below
How do I flag the first positive amount for each id as shown above in Active column?
df = spark.createDataFrame(
[
('10/01/2022', '1', None),
('18/01/2022', '1', 50),
('31/01/2022', '1', -100)
], ['Date', 'Id', 'Amount']
)
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy('Id').orderBy('Date')
df\
.withColumn('only_pos', F.when(F.col('Amount')>0, F.col('Amount')).otherwise(F.lit(None)))\
.withColumn('First_pos', F.first('only_pos', True).over(w))\
.withColumn('Active', F.when(F.col('only_pos')==F.col('First_pos'),F.lit('Yes')).otherwise(F.lit(None)))\
.select('Date', 'Id', 'Amount', 'Active')\
.show()
+----------+---+------+------+
| Date| Id|Amount|Active|
+----------+---+------+------+
|10/01/2022| 1| null| null|
|18/01/2022| 1| 50| Yes|
|31/01/2022| 1| -100| null|
+----------+---+------+------+
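Two details of the answer above are worth checking against your data: the window orders by the raw Date string, and the flag is set wherever the amount equals the first positive amount, so a later row with the same amount would also be flagged. A minimal plain-Python sketch of the intended rule (first positive amount per Id, in chronological order) is shown below. The function name flag_first_positive is hypothetical, and the dd/MM/yyyy date format is an assumption taken from the sample rows.

```python
from datetime import datetime

rows = [
    ("10/01/2022", "1", None),
    ("18/01/2022", "1", 50),
    ("31/01/2022", "1", -100),
]

def flag_first_positive(rows):
    """Append 'Yes' to the first positive amount per id, None elsewhere."""
    def sort_key(row):
        # Order by id, then by the parsed date (not the raw text).
        return (row[1], datetime.strptime(row[0], "%d/%m/%Y"))

    seen_ids = set()
    flagged = []
    for day, id_, amount in sorted(rows, key=sort_key):
        is_first = amount is not None and amount > 0 and id_ not in seen_ids
        if is_first:
            seen_ids.add(id_)
        flagged.append((day, id_, amount, "Yes" if is_first else None))
    return flagged

flagged = flag_first_positive(rows)
for row in flagged:
    print(row)
```

In PySpark the same rule would typically be expressed with a window ordered by a parsed date column; the sketch only pins down the expected semantics.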

show all the matched string in pyspark dataframe

I want to show all rows, from both dataframes, whose string column matches.
Code:
# The full setup, including imports and session creation, is shown from the beginning so that future readers can run the example as-is.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np
# Initiate the session
spark = SparkSession\
.builder\
.appName('Operations')\
.getOrCreate()
# sc = SparkContext()
sc =SparkContext.getOrCreate()
# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
(1,2,'3'),
(2,2,'yes')],
['a', 'b', 'c']
)
# Create Dataframe 2
sdataframe_temp2 = spark.createDataFrame([
(4,6,'yes'),
(5,7,'yes')],
['a', 'b', 'c']
)
# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
# Filter out the columns based on respective rules
# I wish to stick with dataframe methods if possible.
sdataframe_temp\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\
.show()
Output:
+---+---+
| a| b|
+---+---+
| 2| 2|
+---+---+
Expected output:
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
Can anyone please give some suggestions or improvement?
Here's a way using unionByName:
df = (sdataframe_temp
.unionByName(sdataframe_temp2)
.where("c == 'yes'")
.drop('c'))
df.show()
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
You should change the last statement of your code. To use the col function, import it from pyspark.sql.functions:
from pyspark.sql.functions import col
# Sticking with dataframe methods, as requested.
sdataframe_union_1_2\
.filter(col('c') == 'yes')\
.select(['a', 'b'])\
.show()
or
sdataframe_union_1_2\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\
.show()
You have to select data from sdataframe_union_1_2, but you are selecting from sdataframe_temp; that is why you are getting only one record.

Adding dictionary keys as column name and dictionary value as the constant value of that column in Pyspark df

I have a dictionary x = {'colA': 20, 'colB': 30} and a pyspark df.
ID Value
1 ABC
1 BCD
1 AKB
2 CAB
2 AIK
3 KIB
I want to create df1 using x as follows:
ID Value colA colB
1 ABC 20.0 30.0
1 BCD 20.0 30.0
1 AKB 20.0 30.0
2 CAB 20.0 30.0
...
Any idea how to do this in PySpark?
I know I can create constant columns like this:
df1 = df.withColumn('colA', lit(20.0))
df1 = df1.withColumn('colB', lit(30.0))
But I am not sure how to do it dynamically from the dictionary.
There are ways to hide the loop, but the execution will be the same. For instance, you can use select:
from pyspark.sql.functions import lit
df2 = df.select("*", *[lit(val).alias(key) for key, val in x.items()])
df2.show()
#+---+-----+----+----+
#| ID|Value|colB|colA|
#+---+-----+----+----+
#| 1| ABC| 30| 20|
#| 1| BCD| 30| 20|
#| 1| AKB| 30| 20|
#| 2| CAB| 30| 20|
#| 2| AIK| 30| 20|
#| 3| KIB| 30| 20|
#+---+-----+----+----+
Or functools.reduce and withColumn:
from functools import reduce
df3 = reduce(lambda df, key: df.withColumn(key, lit(x[key])), x, df)
df3.show()
# Same as above
Or pyspark.sql.functions.struct with select() and the "*" syntax:
from pyspark.sql.functions import struct
df4 = df.withColumn('x', struct([lit(val).alias(key) for key, val in x.items()]))\
.select("ID", "Value", "x.*")
df4.show()
#Same as above
But if you look at the execution plan of these methods, you'll see that they're exactly the same:
df2.explain()
#== Physical Plan ==
#*Project [ID#44L, Value#45, 30 AS colB#151, 20 AS colA#152]
#+- Scan ExistingRDD[ID#44L,Value#45]
df3.explain()
#== Physical Plan ==
#*Project [ID#44L, Value#45, 30 AS colB#102, 20 AS colA#107]
#+- Scan ExistingRDD[ID#44L,Value#45]
df4.explain()
#== Physical Plan ==
#*Project [ID#44L, Value#45, 30 AS colB#120, 20 AS colA#121]
#+- Scan ExistingRDD[ID#44L,Value#45]
Further, if you compare with the loop method in @anil's answer:
df1 = df
for key in x:
    df1 = df1.withColumn(key, lit(x[key]))
df1.explain()
#== Physical Plan ==
#*Project [ID#44L, Value#45, 30 AS colB#127, 20 AS colA#132]
#+- Scan ExistingRDD[ID#44L,Value#45]
You'll see that this is the same as well.
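For readers less familiar with functools.reduce, the fold used above can be illustrated without Spark. The sketch below is only an analogy: a plain dict stands in for a row, and the hypothetical helper with_column mimics the non-mutating style of withColumn. It shows that folding over the dictionary keys and looping over them build the same result.

```python
from functools import reduce

x = {"colA": 20, "colB": 30}
row = {"ID": 1, "Value": "ABC"}  # stand-in for one row of the df

def with_column(record, name, value):
    """Return a copy of record with one extra key (record itself is untouched)."""
    return {**record, name: value}

# Iterating over a dict yields its keys, so reduce threads the
# accumulator through one with_column call per key.
folded = reduce(lambda acc, key: with_column(acc, key, x[key]), x, row)

# The explicit loop does the same thing step by step.
looped = row
for key in x:
    looped = with_column(looped, key, x[key])

print(folded)
print(folded == looped)  # True
```

If the new columns should hold 20.0 and 30.0 as in the question, converting the values with float() before passing them to lit is one option.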
Loop through the dictionary as below:
from pyspark.sql.functions import lit
df1 = df
for key in x:
    df1 = df1.withColumn(key, lit(x[key]))

PySpark Dataframe from Python Dictionary without Pandas

I am trying to convert the following Python dict into PySpark DataFrame but I am not getting expected output.
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF() # Result not as expected
df_dict.show()
Is there a way to do this without using Pandas?
Quoting myself:
I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
So the easiest thing is to convert your dictionary into this format. You can easily do this using zip():
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#+-------+-------+
The above assumes that all of the lists are the same length. If this is not the case, you would have to use itertools.zip_longest (Python 3) or itertools.izip_longest (Python 2), which pads the shorter lists with None.
from itertools import zip_longest # use this for Python 3
#from itertools import izip_longest as zip_longest # use this for Python 2
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30, 40]}
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#| null| 40|
#+-------+-------+
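The transposition step can be checked on its own, without a Spark session. The sketch below (Python 3 only) shows what zip and zip_longest hand to createDataFrame for the unequal-length dictionary above; the names rows_strict and rows_padded are only illustrative.

```python
from itertools import zip_longest

dict_lst = {"letters": ["a", "b", "c"],
            "numbers": [10, 20, 30, 40]}

# Split the dict into a tuple of column names and a tuple of column lists.
column_names, data = zip(*dict_lst.items())

# zip stops at the shortest list, silently dropping the extra value.
rows_strict = list(zip(*data))

# zip_longest keeps every value and pads missing entries with None.
rows_padded = list(zip_longest(*data))

print(column_names)  # ('letters', 'numbers')
print(rows_strict)   # [('a', 10), ('b', 20), ('c', 30)]
print(rows_padded)   # [('a', 10), ('b', 20), ('c', 30), (None, 40)]
```

Each tuple in the resulting list corresponds to one row, which matches the "list of tuples" view of createDataFrame() quoted above.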
Your dict_lst is not really in the format you want for creating a dataframe. It would be better to have a list of dicts instead of a dict of lists.
This code creates a DataFrame from your dict of lists:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows)==max(nb_rows) #We must have the same nb of elem for each key
row_lst = []
columns = dict_lst.keys()
for i in range(nb_rows[0]):
    row_values = [lst[i] for lst in values_lst]
    row_dict = {column: value for column, value in zip(columns, row_values)}
    row = Row(**row_dict)
    row_lst.append(row)
df = sqlContext.createDataFrame(row_lst)
Using pault's answer above, I imposed a specific schema on my dataframe as follows:
import pyspark
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dictToDF').getOrCreate()
get data:
dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()
create schema:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
myschema = StructType([
    StructField("letters", StringType(), True),
    StructField("numbers", IntegerType(), True),
])
create df from dictionary - with schema:
df=spark.createDataFrame(zip(*data), schema = myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+
show df schema:
df.printSchema()
root
|-- letters: string (nullable = true)
|-- numbers: integer (nullable = true)
You can also use a Python list to quickly prototype a DataFrame. The idea is based on a Databricks tutorial.
df = spark.createDataFrame(
[(1, "a"),
(1, "a"),
(1, "b")],
("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 1| a|
| 1| b|
+---+-----+
Try this out :
dict_lst = [{'letters': 'a', 'numbers': 10},
{'letters': 'b', 'numbers': 20},
{'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF() # Result as expected
Output:
>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+
If Pandas is available after all, the most concise approach is to go through a Pandas DataFrame:
import pandas as pd
spark.createDataFrame(pd.DataFrame(dict_lst))