Filters inside a for loop in PySpark are really slow

I am reading data from an S3 bucket, then running a for loop that applies a few filters and finds a max value. I am running this on an EMR cluster, but it is taking hours to complete.
df has 1.5 million rows and df_new has 50,000 rows. I converted the result to a NumPy array to see whether it would improve the performance of the loop.
Since I am new to PySpark, I am not sure whether this is an efficient way to do it or whether there is a better way.
Thanks in advance.
import os
import numpy as np
from pyspark.sql import functions as f

df = spark.read.format('parquet').load(os.path.join('s3://', bucket_name, bucket_path_exec + date_val, report_name))
df_new = df.filter(f.col("a") == 1)

# Collect the filtered rows to the driver as a NumPy array
arr = np.array(df_new.select("a", "b", "c", "d", "e").collect())
rows = len(arr)

for i in range(rows):
    aaa = arr[i][0]    # value of column "a"
    eee = arr[i][4]    # value of column "e"
    time = arr[i][2]   # value of column "c"
    sub = df_new.filter(f.col("a") == aaa)
    sub = sub.filter(f.col("b") < time)
    max_time = sub.groupby().agg(f.max("e").alias("MaxTime"))
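For reference, this kind of per-row maximum can usually be computed without collecting to the driver or looping. Below is a minimal sketch of a self-join plus aggregate, under the assumption that the goal is, for each row of df_new, the maximum of e over the rows that share the same a and have a smaller b (column names taken from the snippet above):

from pyspark.sql import functions as f

left = df_new.select("a", "b")
right = df_new.select(f.col("a").alias("a2"), f.col("b").alias("b2"), f.col("e").alias("e2"))

# For each (a, b) pair, keep only earlier rows with the same key and take the max of e
max_per_row = (left.join(right, (left["a"] == right["a2"]) & (right["b2"] < left["b"]))
                   .groupBy("a", "b")
                   .agg(f.max("e2").alias("MaxTime")))

A window function (f.max("e") over a window partitioned by a and ordered by b, restricted to preceding rows) gives a similar result without the self-join; either version lets Spark parallelize the work instead of launching 50,000 separate filter jobs.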

Related

Pyspark dataframe filter based on in between values

I have a PySpark dataframe with the below values -
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the values from the DF which have a score between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids - EDFG456 & LMNO1011 - as their scores fall between 36 & 37. I achieved this as follows -
input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1,2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the below output, but it takes too long to process 2 million rows. I was wondering whether there is a better way to do this to reduce the response time.
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor
output_matched = my_df.filter(floor(my_df.score.cast("double")) == input_score_value)
print(output_matched.take(5))
It should be much faster than using substring. Let me know.
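Another option, as a minimal sketch assuming score can be cast to double, is to filter on the half-open range directly instead of truncating the value:

from pyspark.sql.functions import col

score = col("score").cast("double")
output_matched = my_df.filter((score >= input_score_value) & (score < input_score_value + 1))
print(output_matched.take(5))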

PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by the distance between user vectors in PySpark. I'm running this in Databricks on a Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3).
Following the code in the docs, I am using the approxSimilarityJoin() method of the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. Raising it sometimes fixes the issue, but now I've tried a threshold of 100000 and I'm still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure whether changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)
# returns
# threshold of 100000 is clearly overkill
# a DataFrame with the dfA and dfB feature vectors and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results for the second half of the example above by increasing the bucketLength parameter to 15. The threshold could also have been much lower, since the Euclidean distance between the two vectors was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
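For reference, this is roughly the adjustment described above for the dfC/dfD pair; the only intended change from the earlier snippet is the larger bucketLength (the threshold of 50 is an arbitrary value comfortably above the ~34 distance):

# Larger buckets make it more likely that nearby points share a hash bucket
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)
model.approxSimilarityJoin(dfC, dfD, 50, distCol="EuclideanDistance").show()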

Using PySpark Imputer on grouped data

I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to impute the missing Age values with the average Age of each Class group.
I want to do something along the lines of:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter the dataset down to each Class value, impute the mean, and then union the results back together, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)
# Union them together
# If you only have 3 classes you can do this without a loop
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, compute the average value for each group as a DataFrame, join that back onto your original dataframe, and coalesce the missing values.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
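For what it's worth, a minimal window-based sketch of the same per-class mean imputation (column names as above), which avoids the explicit join:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Average Age within each Class, broadcast to every row of that class
w = Window.partitionBy("Class")
imputed_df = df.withColumn("imputedAge", F.coalesce("Age", F.avg("Age").over(w)))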
You need to transform your dataframe with the fitted model, then take the average of the filled data:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)

transformed_df \
    .groupBy('Class') \
    .agg(F.avg('imputed_Age'))

KMeans|| for sentiment analysis on Spark

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization, the algorithm ran for 3 hours! With the random initialization strategy it took about 8 minutes.
What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000
maxIterations was 20
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}",data.count())
log.info("==================Train process started==================");
val clusterSize = modelSize/5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)
time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
And spark context initialization is here:
val conf = new SparkConf()
.setAppName("SparkPreProcessor")
.setMaster("local[4]")
.set("spark.default.parallelism", "8")
.set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
A few updates about running this program: I'm running it inside IntelliJ IDEA and I don't have a real Spark cluster, but I thought a personal machine could act as a Spark cluster.
I saw that the program hangs inside this loop from the Spark source file LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
Initialisation using KMeans.K_MEANS_PARALLEL is more complicated than random. However, it shouldn't make such a big difference. I would recommend investigating whether it is the parallel initialisation algorithm that takes too much time (it should actually be more efficient than the KMeans run itself).
For information on profiling see:
http://spark.apache.org/docs/latest/monitoring.html
If it is not the initialisation which takes up the time, there is something seriously wrong. However, using random initialisation shouldn't be any worse for the final result (just less efficient!).
Actually, when you use KMeans.K_MEANS_PARALLEL to initialise, you should get reasonable results with 0 iterations. If this is not the case, there might be some regularities in the distribution of the data which send KMeans off track. Hence, if you haven't distributed your data randomly, you could also change this. However, such an impact would surprise me given a fixed number of iterations.
I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same: the problem is that the parallel KMeans initialization does its work in N parallel runs, but it's still extremely slow for a small amount of data, so my solution is to continue using random initialization.
Data size, approximately: 4k clusters for 21k 100-dimensional vectors.

Optimizing Spark Operations (Count and Collect)

I have two operations in my Spark code that are taking the bulk of the computational time. The lines of code are:
val d = x.map{case(x1,(x2,x3))=>(x2,x3)}.distinct.count.toDouble
which takes approximately 1.5 minutes on ~35 million rows of data. I have also tried using this command -
val d = x.map{case(x1,(x2,x3))=>(x2,x3)}.distinct.countApprox(timeout = 30, confidence = 0.95)
This does return the correct results in ~24 seconds. The result is -
res42: org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] = (final: [8627.000, 8627.000])
which is correct but I do not know how to access this count to use in my calculations.
The second line of code is -
var num = y.values.toArray
which returns in ~ 1.6 minutes -
res46: Array[Double].
I use this Array to calculate -
y.map{case(k,v)=>(k,(num.filter(_<=v).length.toDouble/num.length.toDouble*100))}
Is there a more efficient way to do these calculations? Any help would be greatly appreciated!
UPDATE:
I have found out how to use the approximate results of count -
x.map{case(x1,(x2,x3))=>((x2,x3),1)}.distinct.countApprox(30,0.95).getFinalValue.high.toDouble
ADDED
var s = x.map { case (x1, (x2, x3)) => ((x1, x2, x3), 1) }.reduceByKey(_ + _)
  .map { case ((x1, x2, x3), x4) => ((x2, x3), (x1, x4)) }
  .join(x.map { case (x1, (x2, x3)) => ((x2, x3), 1) }.reduceByKey(_ + _))
  .map { case ((x2, x3), ((x1, x4), x5)) => (x1, (x4.toDouble / x5.toDouble * x3.toDouble, x3.toDouble)) }
var f = s.mapValues(_._1).reduceByKey(_ + _).join(s.mapValues(_._2).reduceByKey(_ + _))
  .map { case (k, v) => (k, v._1 / v._2) }
val d = x.map { case (x1, (x2, x3)) => ((x2, x3), 1) }.distinct.countApprox(30, 0.95).getFinalValue.high.toDouble
var varietyFrac = x.distinct.mapValues(_ => 1).reduceByKey(_ + _).map { case (x1, x2) => (x1, x2.toDouble / d) }
var y = f.join(varietyFrac)
  .map { case (name, (frac, varietyFrac)) => (name, pow((frac.toDouble * varietyFrac.toDouble), 0.01) / 0.01) }
  .persist