I tried the code below and it gives me -55 as the difference between the two timestamps ts1 and ts2 shown in the comments. It should give me 5 minutes. Is there a direct function to get the correct time difference in PySpark?
import pyspark.sql.functions as F
# ts1 := 2019-11-07T22:00:00.000+0000
# ts2 := 2019-11-07T21:55:00.000+0000
df.withColumn("time_diff", F.minute("time_stamp") - F.minute("time_stamp2"))
This doesn't give me the correct answer. Please help.
You can use the following function to get the time difference in seconds:
from pyspark.sql.functions import col
diff_secs_col = col("time_stamp").cast("long") - col("time_stamp2").cast("long")
Then divide by 60 to get the difference in minutes.
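To see why subtracting the minute fields gives -55, here is a minimal plain-Python sketch (no Spark needed) using the two timestamps from the question. The variable names are mine, and it assumes both values are in UTC; the epoch-second subtraction mirrors what casting the timestamp columns to long does.

```python
from datetime import datetime, timezone

# The two timestamps from the question, taken as UTC.
ts1 = datetime(2019, 11, 7, 22, 0, 0, tzinfo=timezone.utc)
ts2 = datetime(2019, 11, 7, 21, 55, 0, tzinfo=timezone.utc)

# Subtracting only the minute fields ignores the hour: 0 - 55.
minute_field_diff = ts1.minute - ts2.minute

# Subtracting epoch seconds and dividing by 60 gives the elapsed minutes.
epoch_diff_minutes = (int(ts1.timestamp()) - int(ts2.timestamp())) // 60

print(minute_field_diff)   # -55
print(epoch_diff_minutes)  # 5
```

The minute field alone is only meaningful when both timestamps fall within the same hour.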
Can you try this:
import pyspark.sql.functions as F
import pyspark.sql.types as Types
df = df.withColumn('t1_unix', F.unix_timestamp(df.t1, "yyyy-MM-dd'T'HH:mm:ss.SSS"))
df = df.withColumn('t2_unix', F.unix_timestamp(df.t2, "yyyy-MM-dd'T'HH:mm:ss.SSS"))
df = df.withColumn('diff', ((df.t1_unix-df.t2_unix)/60).cast(Types.IntegerType()))
I am running the code through Jupyter on EMR, PySpark version 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two dataframes is the same, and the code to generate them is identical (other than two features not being included in the VectorAssembler inputCols for df_bad).
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression
def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)
def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below it works and runs in 10-20 seconds each:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.
I started in the PySpark world some time ago and I'm racking my brain over an algorithm. I want to create a function that calculates the difference in months between two dates. I know there is a function for that (months_between), but it works a bit differently from what I want: I want to extract the months from the two dates and subtract them without taking the days into account, using only the month and the year. I can do this by manipulating the base DataFrame, creating new columns with the months and subtracting them, but I want to do it as a UDF, like below:
from datetime import datetime
import pyspark.sql.functions as f
base_study = spark.createDataFrame([("1", "2009-01-31", "2007-01-31"),("2","2009-01-31","2011-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
    RecentDate = f.month("RecentDate")
    PreviousDate = f.month("PreviousDate")
    months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    return months_diff
intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())
base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A') ))
What am I doing wrong?
Another question: when I pass parameters to a UDF, are they sent one by one, or is the entire column passed? And is that column a series?
Thank you!
I found a solution and extended it to handle missing values too.
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType
base_study = spark.createDataFrame([("1", None, "2015-01-01"),("2","2015-01-31","2015-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
    # Return None when either date is missing
    if RecentDate is not None and PreviousDate is not None:
        months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
        return months_diff
    else:
        return None
intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())
display(base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A'))))
For those who have doubts, as I had: the function processes one record at a time, like a normal Python function. I couldn't use pyspark.sql functions inside the UDF (it raises an error). Those functions only operate on PySpark columns, whereas inside the UDF the transformation happens row by row.
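Since the UDF body is plain Python that receives one pair of values per row, the logic can be checked locally without Spark. A minimal sketch, assuming the values arrive as datetime.date objects or None (the function name here is hypothetical, not from the code above):

```python
from datetime import date

def months_diff_ignoring_days(recent, previous):
    """Month difference using only year and month; None if either date is missing."""
    if recent is None or previous is None:
        return None
    return (recent.year - previous.year) * 12 + (recent.month - previous.month)

# One day apart, but in different months, so the result is 1.
print(months_diff_ignoring_days(date(2015, 2, 1), date(2015, 1, 31)))   # 1
print(months_diff_ignoring_days(date(2011, 1, 31), date(2009, 1, 31)))  # 24
print(months_diff_ignoring_days(None, date(2015, 1, 1)))                # None
```

The first call shows the difference from a day-aware calculation: the dates are one day apart, yet the month difference is 1.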
I am wondering if it's possible to obtain the result of percent_rank using the QuantileDiscretizer transformer in PySpark.
The purpose is to avoid computing percent_rank over the entire column, as it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count() + 1,
                         inputCol="x",
                         outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),
                    F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values["maxval"], agg_values["minval"]
# Declare the return type; without it the UDF returns strings
normalize = F.udf(lambda x: (x - xmin) / (xmax - xmin), "double")
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer")- F.col("perc_rank")),6) )
df.select(F.max("error")).show()
df.show(5)
However, the error seems to grow as the number of data points increases, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percent_rank?
Alternatively, is there an efficient way to compute percent_rank over an entire column?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
Nonetheless, the data would still be moved to a single worker. I don't think there is a better alternative that avoids the shuffle here, since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.
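For reference, percent_rank is defined as (rank - 1) / (n - 1) over the ordered rows, with tied values sharing the lowest rank, which is why the whole column has to be ordered. A minimal plain-Python sketch of that definition (the helper name is hypothetical, and this is not how Spark implements it):

```python
def percent_rank_py(values):
    """Return (rank - 1) / (n - 1) for each value; ties share the lowest rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values)
    # Position of the first occurrence in sorted order equals rank - 1.
    first_pos = {}
    for pos, v in enumerate(ordered):
        first_pos.setdefault(v, pos)
    return [first_pos[v] / (n - 1) for v in values]

ranks = percent_rank_py([10, 20, 20, 40])
print(ranks)  # smallest value maps to 0.0, largest to 1.0, the tied 20s share one value
```

Each result depends on the position of the value in the full sorted order, so no row can be ranked without knowing about all the others.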
I have a dataset like below:
I am grouping by age and averaging the number of friends for each age:
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F
def parseInput(line):
    fields = line.split(',')
    return Row(age = int(fields[2]), numFriends = int(fields[3]))
spark = SparkSession.builder.appName("FriendsByAge").getOrCreate()
lines = spark.sparkContext.textFile("data/fakefriends.csv")
friends = lines.map(parseInput)
friendDataset = spark.createDataFrame(friends)
counts = friendDataset.groupBy("age").count()
total = friendDataset.groupBy("age").sum('numFriends')
res = total.join(counts, "age").withColumn("Friend By Age", (F.col("sum(numFriends)") // F.col("count"))).drop('sum(numFriends)','count')
I got the error below:
TypeError: unsupported operand type(s) for //: 'Column' and 'Column'
Usually I use // in Python 3 and it returns an integer value, as I expected here. However, on a PySpark DataFrame, // doesn't work and only / works. Is there a reason why it doesn't work? Do we have to use the round function to get an integer value?
Not sure about the reason, but you can cast to int or use the floor function:
from pyspark.sql import functions as F
tst= sqlContext.createDataFrame([(1,7,9),(1,8,4),(1,5,10),(5,1,90),(7,6,18),(0,3,11)],schema=['col1','col2','col3'])
tst1 = tst.withColumn("div", (F.col('col1')/F.col('col2')).cast('int'))
tst2 = tst.withColumn("div", F.floor(F.col('col1')/F.col('col2')))
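One thing to watch with the two options above: they agree for non-negative quotients but can differ for negative ones. As far as I know, casting a double to int in Spark truncates toward zero, whereas floor rounds down, which is what Python's // does. A plain-Python sketch of the difference, using int() as a stand-in for the cast (an assumption on my part, not Spark code):

```python
import math

def truncated(a, b):
    # Stand-in for (col_a / col_b).cast('int'): drop the fractional part.
    return int(a / b)

def floored(a, b):
    # Stand-in for F.floor(col_a / col_b): round toward negative infinity.
    return math.floor(a / b)

print(truncated(7, 2), floored(7, 2), 7 // 2)      # 3 3 3
print(truncated(-7, 2), floored(-7, 2), -7 // 2)   # -3 -4 -4
```

For counts and sums of friends the values are non-negative, so either option gives the same result there.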
// (floor division) is not supported on PySpark columns. Try the alternative below:
counts = friendDataset.groupBy("age").count()
total = friendDataset.groupBy("age").agg(F.sum('numFriends').alias('sum'))
res = total.join(counts, "age").withColumn("Friend By Age", F.floor(F.col("sum") / F.col("count"))).drop('sum', 'count')
I conducted a tf-idf transform and now I want to get the keys and values from the result.
I am using the following udf code to get values:
def extract_values_from_vector(vector):
    return vector.values.tolist()
extract_values_from_vector_udf = udf(lambda vector:extract_values_from_vector(vector), ArrayType(DoubleType()))
extract = rescaledData.withColumn("extracted_keys", extract_values_from_vector_udf("features"))
So if the sparsevector looks like:
features=SparseVector(123241, {20672: 4.4233, 37393: 0.0, 109847: 3.7096, 118474: 5.4042}))
extracted_keys in my extract will look like:
[4.4233, 0.0, 3.7096, 5.4042]
My question is, how can I get the keys in the SparseVector dictionary? Such as keys = [20672, 37393, 109847, 118474] ?
I am trying the following code but it won't work
def extract_keys_from_vector(vector):
    return vector.indices.tolist()
extract_keys_from_vector_udf = spf.udf(lambda vector:extract_keys_from_vector(vector), ArrayType(DoubleType()))
The result it gave me is: [null,null,null,null]
Can someone help?
Many thanks in advance!
Since the answer is in the comments above, I thought that I would take this time (while waiting to write a parquet of course) to write down the answer.
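from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import functions as F
def extract_keys_from_vector(vector):
    return vector.indices.tolist()
# The indices are integers, so the return type must be ArrayType(IntegerType());
# declaring ArrayType(DoubleType()) is what produced the nulls.
feature_extract = F.udf(extract_keys_from_vector, ArrayType(IntegerType()))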
df = df.withColumn("features", feature_extract(F.col("features")))
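To illustrate why the declared element type matters, here is a plain-Python sketch with the vector from the question written as a dict (a stand-in for SparseVector, not the real class). The keys are integers and the values are floats, so a UDF returning the keys needs ArrayType(IntegerType()), while one returning the values needs ArrayType(DoubleType()).

```python
# Stand-in for the SparseVector in the question: index -> value.
sparse = {20672: 4.4233, 37393: 0.0, 109847: 3.7096, 118474: 5.4042}

keys = sorted(sparse)                # the indices, in ascending order
values = [sparse[k] for k in keys]   # the matching values

print(keys)    # [20672, 37393, 109847, 118474]
print(values)  # [4.4233, 0.0, 3.7096, 5.4042]

# Element types differ, so the two UDFs need different return types.
key_types = {type(k).__name__ for k in keys}
value_types = {type(v).__name__ for v in values}
print(key_types, value_types)  # {'int'} {'float'}
```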