Pyspark write stream one column at a time - pyspark

The source .csv has 414 columns each with a new date:
The count increases by the total number of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream which will increment up as the total deaths to date increases. Iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results. Displaying the total by state as it increments up with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns and I have changed all of the day columns to IntegerType; keeping only the State and County columns as string.
from pyspark.sql import functions as F
for col in temp_df.columns:
temp_df = temp_df.withColumn(
col,
F.col(col).cast("integer")
)
and then
from pyspark.sql.functions import col
temp_df.withColumn("County Name",col("County Name").cast('integer')).withColumn("State",col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult, to stream in the results one column at a time.
Or can I simply PIVOT ? If yes, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)

Related

Spark Structure Streaming Differentiate over time

I have a table with a timestamp column (t) and a list of columns for which I would like to compute the difference over time (v), grouped by some key(k): v_diff(t) = v(t)-v(t-1) for each k independently.
Normally I would write:
lag_window = Window.partitionBy(COLS_TO_DIFF).orderBy('timestamp')
for col in COLS_TO_DIFF:
df = df.withColumn(
col + "_diff",
df[col] - F.lag(df[col]).over(lag_window))
Yet for streaming I get:
AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
How do I get around that?
Note: my data is streaming slowly in batches

How do I create new columns containing lead values generated from another column in Pyspark?

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year over year percent) change and finally contains a loop to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2) all the way out to 12 months (dcoilwtico_12) as new columns:
import pandas_datareader as pdr
start = datetime.datetime (2016, 1, 1)
end = datetime.datetime (2022, 12, 1)
#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'],'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns= df_fred_daily.columns.str.lower()
#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()
#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily=pd.concat([df_fred_daily, df_fred_daily_forecast])
#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy= ((df_fred_test_daily - df_fred_test_daily.shift(12))/df_fred_test_daily.shift(12))*100
#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
for i in range(1,13):
df_fred_test_daily_yoy["%s_%s"%(col,i)] = df_fred_test_daily_yoy[col].shift(i)
df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?
As your code is already ready, I would use koalas, "a pandas spark version", You just need to install https://pypi.org/project/koalas/
see the simple example
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x

Appending multiple samples of a column into dataframe in spark

I have n (length) values in a spark column. I want to create a spark dataframe of k columns (where k is number of samples) and m rows (where m is sample size). I tried using withColumn, it is not working. Join by creating unique id will be very inefficient for me.
e.g. Spark column has following values :
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in data frame.
So sampled data frame will be something like
sample1,sample2
320,101
124,2455
2455,11
Let df has a column UNIQUE_ID_D, I need k samples from this column. Here is the sample code for k = 2
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This wont work since withColumn can not access df2 column.
There is only way to join df1 and df2 by adding sequence column(join column) in both df's.
It is very inefficient for my use case since if I want to take 100 samples, I need to join 100 times in a loop for a single column. I need to perform this operation for all columns in original df.
How could I achieve this?

Iterate across columns in spark dataframe and calculate min max value

I want to iterate across the columns of dataframe in my Spark program and calculate min and max value.
I'm new to Spark and scala and not able to iterate over the columns once I fetch it in a dataframe.
I have tried running the below code but it needs column number to be passed to it, question is how do I fetch it from dataframe and pass it dynamically and store the result in a collection.
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating on rows or records. You should be using aggregation function
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First when you do spark.read.parquet you are reading a dataframe.
Next we define the column we want to work on using the col function. The col function translate a column name to a column. You could instead use df("name") where name is the name of the column.
The agg function takes aggregation columns so min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns including min, max, avg, count and stddev

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
var=cat_col[j]
count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))
You can use get every different element of each column with
df.stats.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns you a dataframe with the different values, but if you want a dataframe with just the count distinct of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The part of the count, taken from here: check number of unique values in each column of a matrix in spark