How do I create new columns containing lead values generated from another column in Pyspark? - pyspark

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year over year percent) change and finally contains a loop to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2) all the way out to 12 months (dcoilwtico_12) as new columns:
import pandas_datareader as pdr
start = datetime.datetime (2016, 1, 1)
end = datetime.datetime (2022, 12, 1)
#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'],'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns= df_fred_daily.columns.str.lower()
#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()
#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily=pd.concat([df_fred_daily, df_fred_daily_forecast])
#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy= ((df_fred_test_daily - df_fred_test_daily.shift(12))/df_fred_test_daily.shift(12))*100
#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
for i in range(1,13):
df_fred_test_daily_yoy["%s_%s"%(col,i)] = df_fred_test_daily_yoy[col].shift(i)
df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?

As your code is already ready, I would use koalas, "a pandas spark version", You just need to install https://pypi.org/project/koalas/
see the simple example
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x

Related

Pyspark write stream one column at a time

The source .csv has 414 columns each with a new date:
The count increases by the total number of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream which will increment up as the total deaths to date increases. Iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results. Displaying the total by state as it increments up with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns and I have changed all of the day columns to IntegerType; keeping only the State and County columns as string.
from pyspark.sql import functions as F
for col in temp_df.columns:
temp_df = temp_df.withColumn(
col,
F.col(col).cast("integer")
)
and then
from pyspark.sql.functions import col
temp_df.withColumn("County Name",col("County Name").cast('integer')).withColumn("State",col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult, to stream in the results one column at a time.
Or can I simply PIVOT ? If yes, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)

Pyspark remove columns with 10 null values

I am new to PySpark.
I have read a parquet file. I only want to keep columns that have atleast 10 values
I have used describe to get the count of not-null records for each column
How do I now extract the column names that have less than 10 values and then drop those columns before writing to a new file
df = spark.read.parquet(file)
col_count = df.describe().filter($"summary" == "count")
You can convert it into a dictionary and then filter out the keys(column names) based on their values (count < 10, the count is a StringType() which needs to be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
use #pault's approach if the information can not be retrieved directly from df.describe() or df.summary() etc.
you need to drop() instead of select() columns since describe()/summary() only include numeric and string columns, selecting columns from a list processed by df.describe() will lose columns of TimestampType(), ArrayType() etc

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort-order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can utilize to create a time series model input. Below is a sample of the "train_data":
I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data
.withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w)
)\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data
.withColumn('spp_imp_daily',F.collect_list('spp_imp_daily').over(w))
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
.groupBy('Syscode_Stn')
.agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data.withColumn('spp_imp_daily',
F.collect_list('spp_imp_daily').over(w)
)\
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
F.max('MarchMadInd').alias('MarchMadInd')
)

finding the count, pass and fail percentage of the multiple values from single column while aggregating another column using pyspark

Data
I want to apply groupby for column1 and want to calculate the percentage of passed and failed percentage for each 1 and as well count
Example ouput I am looking for
Using pyspark I am doing the below code but I am only getting the percentage
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
for level in levels]
df = sparkSession.read.json(hdfsPath)
result1 = df1.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages in the counts into a single column.
from pyspark.sql.functions import avg, col, count, concat, lit
levels = ["passed", "failed","blocked"]
# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double")*100).alias('{}_pct'.format(level))
for level in levels]
# count aggregations
count_exprs = [sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
concat(
col('{}_pct'.format(level)).cast('string'),
lit('('),
col('{}_count'.format(level)).cast('string'),
lit(')')
).alias('{}_viz'.format(level))
for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = (
df1
.select('Column1','Column2')
.groupBy("Column1")
.agg(*exprs)
.select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
var=cat_col[j]
count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))
You can use get every different element of each column with
df.stats.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns you a dataframe with the different values, but if you want a dataframe with just the count distinct of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The part of the count, taken from here: check number of unique values in each column of a matrix in spark