I am new to PySpark and Scala, and I want to convert code written in Scala into PySpark.
I don't know how to translate the where clause and .withColumn("EVENT_YEAR", year('EVENT.cast("date"))) to PySpark, and I am not sure whether the rest of what I implemented is correct.
val df = df1.join(df2, Seq("PA_ID"))
  .where('EVENT >= 'CONTINOUS_START && 'EVENT <= 'CONTINOUS_END) // >= and <= instead of > and <
  .withColumn("EVENT", 'EVENT.cast("date"))
  .withColumn("EVENT_YEAR", year('EVENT.cast("date")))
  .withColumn("AGE_AT_EVENT", 'EVENT_YEAR - 'BIRTH_YEAR)
  .withColumn("LOOK_FORWARD", datediff('CONTINOUS_END, 'EVENT_DATE))
  .withColumn("LOOK_BACK", datediff('EVENT, 'CONTINOUS_START))
My attempt:
df = (df1.join(df2, ["PA_ID"])
      .withColumn("EVENT", col("EVENT").cast("date"))
      .withColumn("birth_year", year(col("EVENT")))
      .withColumn("AGE_AT_EVENT", col("EVENT_YEAR") - col("BIRTH_YEAR"))
      .withColumn("LOOK_FORWARD", datediff(col("CONTINOUS_ENROL_END"), col("EVENT_DATE")))
      .withColumn("LOOK_FORWARD", datediff(col("CONTINOUS_END"), col("EVENT_DATE")))
      .withColumn("LOOK_BACK", datediff(col("EVENT"), col("CONTINOUS_START"))))
I am running the code through Jupyter on EMR, PySpark version 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two dataframes is the same, and the code that generates them is identical (other than the two features not being included in the VectorAssembler inputCols for df_bad).
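For context, here is a minimal sketch of the kind of preprocessing being described; the column names (category_col, numeric_col) are hypothetical stand-ins for the real feature columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Hypothetical columns; df_bad is built the same way minus two assembler inputs.
indexer = StringIndexer(inputCol="category_col", outputCol="category_idx")
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
assembler = VectorAssembler(
    inputCols=["category_vec", "numeric_col"],
    outputCol="features",
)

pipeline = Pipeline(stages=[indexer, encoder, assembler])
# df_good = pipeline.fit(raw_df).transform(raw_df)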
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression
def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below, it works, and each block runs in 10-20 seconds:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.
I have the following code, where I want to get a DataFrame dfDateFiltered from dfBackendInfo containing all rows whose RowCreationTime is greater than the timestamp latestRowCreationTime:
val latestRowCreationTime = dfVersion.agg(max("BackendRowCreationTime")).first.getTimestamp(0)
val dfDateFiltered = dfBackendInfo.filter($"RowCreationTime" > latestRowCreationTime)
The problem I see is that the first line triggers a job on the Databricks cluster, which makes it slower.
Is there a better way to filter (for example, using only a transformation instead of an action)?
Below are the schemas of the 2 Dataframes:
case class Version(BuildVersion: String,
                   MainVersion: String,
                   Hotfix: String,
                   BackendRowCreationTime: Timestamp)

case class BackendInfo(SerialNumber: Integer,
                       NumberOfClients: Long,
                       BuildVersion: String,
                       MainVersion: String,
                       Hotfix: String,
                       RowCreationTime: Timestamp)
The below code worked:
val dfLatestRowCreationTime1 = dfVersion
  .agg(max($"BackendRowCreationTime").as("BackendRowCreationTime"))
  .limit(1)

val latestRowCreationTime = dfLatestRowCreationTime1
  .withColumn("BackendRowCreationTime",
    when($"BackendRowCreationTime".isNull, DefaultTime).otherwise($"BackendRowCreationTime"))

val dfDateFiltered = dfBackendInfo.join(latestRowCreationTime,
  dfBackendInfo.col("RowCreationTime").gt(latestRowCreationTime.col("BackendRowCreationTime")))
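For anyone doing the same thing in PySpark, here is a minimal sketch of the join-based filter; the DataFrame names df_version and df_backend_info and the default_time constant are placeholders, not from the original code:

from pyspark.sql import functions as F

# The max is computed as a one-row DataFrame, so no job is triggered here.
latest = (
    df_version
    .agg(F.max("BackendRowCreationTime").alias("BackendRowCreationTime"))
    .withColumn("BackendRowCreationTime",
                F.coalesce(F.col("BackendRowCreationTime"), F.lit(default_time)))
)

# Keep only rows created after that timestamp; the comparison is part of the join condition.
df_date_filtered = df_backend_info.join(
    latest,
    df_backend_info["RowCreationTime"] > latest["BackendRowCreationTime"],
)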
I want to convert one row of a dataframe into multiple rows. If the start and end hours are the same, the row is not split; if the hours differ, the row is split into multiple rows according to the difference between the hours. I am fine with a solution using dataframe functions or a Hive query.
Input Table or Dataframe
Expected Output Table or Dataframe
Please help me find a way to produce the expected output.
The easiest solution for such a simple schema is to use Dataset.flatMap after defining case classes for the input and output schema.
A simple UDF solution would return a sequence, and then you can use functions.explode. This is far less clean and efficient than using flatMap.
Last but not least, you could create your own table-generating UDF, but that would be extreme overkill for this problem.
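To illustrate the UDF + explode idea in PySpark, here is a sketch (not the original author's code). It assumes the same UserName/Date/start_time/end_time string columns as the sample DataFrame in the answer below, and the split logic is a from-scratch approximation:

from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

FMT = "%d/%m/%Y %H:%M"  # note: strftime zero-pads, unlike the sample data

def split_by_hour(start, end):
    """Return (start, end) string pairs, one per hour boundary crossed."""
    d1 = datetime.strptime(start, FMT)
    d2 = datetime.strptime(end, FMT)
    if d1.replace(minute=0) == d2.replace(minute=0):
        return [(start, end)]  # same hour: keep the row as-is
    pieces, cur = [], d1
    while cur.replace(minute=0) < d2.replace(minute=0):
        nxt = (cur + timedelta(hours=1)).replace(minute=0)
        pieces.append((cur.strftime(FMT), nxt.strftime(FMT)))
        cur = nxt
    pieces.append((cur.strftime(FMT), end))
    return pieces

slot_type = StructType([
    StructField("start_time", StringType()),
    StructField("end_time", StringType()),
])
split_udf = F.udf(split_by_hour, ArrayType(slot_type))

result = (
    df.withColumn("slot", F.explode(split_udf("start_time", "end_time")))
      .select("UserName", "Date", "slot.start_time", "slot.end_time")
)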
You can implement your own logic inside a map operation and use flatMap to achieve this.
The following is a crude way I implemented the solution; you can improve it as needed.
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.time.{Duration, LocalDateTime}
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer
import sparkSession.sqlContext.implicits._
val df = Seq(("john", "2/9/2018", "2/9/2018 5:02", "2/9/2018 5:12"),
("smit", "3/9/2018", "3/9/2018 6:12", "3/9/2018 8:52"),
("rick", "4/9/2018", "4/9/2018 23:02", "5/9/2018 2:12")
).toDF("UserName", "Date", "start_time", "end_time")
val rdd = df.rdd.map(row => {
val result = new ArrayBuffer[Row]()
val formatter1 = DateTimeFormatter.ofPattern("d/M/yyyy H:m")
val formatter2 = DateTimeFormatter.ofPattern("d/M/yyyy H:mm")
val d1 = LocalDateTime.parse(row.getAs[String]("start_time"), formatter1)
val d2 = LocalDateTime.parse(row.getAs[String]("end_time"), formatter1)
if (d1.getHour == d2.getHour) result += row
else {
val hoursDiff = Duration.between(d1, d2).toHours.toInt
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
row.getAs[String]("start_time"),
d1.plus(1, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
for (index <- 1 until hoursDiff) {
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
d1.plus(index, ChronoUnit.HOURS).withMinute(0).format(formatter1),
d1.plus(1 + index, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
}
result += Row.fromSeq(Seq(
row.getAs[String]("UserName"),
row.getAs[String]("Date"),
d2.withMinute(0).format(formatter2),
row.getAs[String]("end_time")))
}
result
}).flatMap(_.toIterator)
rdd.collect.foreach(println)
and finally, your result is as follows:
[john,2/9/2018,2/9/2018 5:02,2/9/2018 5:12]
[smit,3/9/2018,3/9/2018 6:12,3/9/2018 7:00]
[smit,3/9/2018,3/9/2018 7:0,3/9/2018 8:00]
[smit,3/9/2018,3/9/2018 8:00,3/9/2018 8:52]
[rick,4/9/2018,4/9/2018 23:02,5/9/2018 0:00]
[rick,4/9/2018,5/9/2018 0:0,5/9/2018 1:00]
[rick,4/9/2018,5/9/2018 1:0,5/9/2018 2:00]
[rick,4/9/2018,5/9/2018 2:00,5/9/2018 2:12]
I want to do something like this:
df
.withColumn("newCol", <some formula>)
.filter(s"""newCol > ${(math.min(max("newCol").asInstanceOf[Double],10))}""")
Exception I'm getting:
org.apache.spark.sql.Column cannot be cast to java.lang.Double
Can you please suggest a way to achieve what I want?
I assume newCol is already present in df, then:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df
  .withColumn("max_newCol", max($"newCol").over(Window.partitionBy()))
  .filter($"newCol" > least($"max_newCol", lit(10.0)))
Instead of max($"newCol").over(Window.partitionBy()) you can also just write max($"newCol").over().
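For PySpark users, here is a rough equivalent of the window-based approach above (a sketch only; it assumes df already contains a numeric newCol):

from pyspark.sql import Window
from pyspark.sql import functions as F

# An empty partitionBy() puts all rows in one window, so max_newCol is the global maximum.
result = (
    df.withColumn("max_newCol", F.max("newCol").over(Window.partitionBy()))
      .filter(F.col("newCol") > F.least(F.col("max_newCol"), F.lit(10.0)))
)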
I think the DataFrame describe function is what you are looking for.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
I'd separate both steps and either:
val newDF = df
  .withColumn("newCol", <some formula>)

// Spark 2.1 or later
// With 1.x use join
newDF.alias("l").crossJoin(newDF.alias("r"))
  .where($"l.newCol" > least($"r.newCol", lit(10.0)))
or
newDF.where(
  $"newCol" > (newDF.select(max($"newCol")).as[Double].first min 10.0))
The solution has two parts.
Part I
Find the maximum value:
df.select(max($"col1")).first()(0)
Part II
Use that value to filter on:
df.filter($"col1" === df.select(max($"col1")).first()(0)).show
Bonus
To avoid potential errors, you can also retrieve the maximum value in the specific type you need, using the .get family of methods on the Row: df.select(max($"col1")).first.getDouble(0)
In this case col1 is DoubleType, so I retrieved it as a Double. You can get pretty much any other type. The options are:
getBoolean, getClass, getDecimal, getFloat, getJavaMap, getLong, getSeq, getString, getTimestamp, getAs, getByte, getDate, getDouble, getInt, getList, getMap, getShort, getStruct, getValuesMap
The full solution in this case is:
df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show
I am using PySpark v1.6.2 and my code goes like this:
df = sqlContext.sql('SELECT * FROM <lib>.<table>')
df.rdd.getNumPartitions() ## 2496
df = df.withColumn('count', lit(1)) ## up to this point it still has 2496 partitions
df = df.repartition(2496,'trip_id').sortWithinPartitions('trip_id','time')
# This is where the trouble starts
sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id','time') ## Defining a window
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
# Done with window function
df.rdd.getNumPartitions() ## 200
My question is:
Is there a way to tell PySpark how many partitions it should make when using the function Window.partitionBy(*cols)?
Alternatively, is there a way to make PySpark keep the same number of partitions it had before the window function is applied to the DataFrame?
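For reference, the 200 partitions after the window come from Spark's shuffle-partition setting: Window.partitionBy triggers a shuffle, and shuffles produce spark.sql.shuffle.partitions partitions (default 200). Below is a minimal sketch of raising that setting before applying the window, reusing the df and sequenceWS from the snippet above:

# spark.sql.shuffle.partitions controls how many partitions any shuffle
# (including a window's partitionBy) produces; the default is 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "2496")

sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id', 'time')
df = df.withColumn('delta_time',
                   (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))

df.rdd.getNumPartitions()  ## 2496 instead of 200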