The most performant and proper way to get similarities between large embeddings/vectors - pyspark

I have two data frames with 69 and ~230,000 rows, each with (KEY, Embedding) columns.
The embedding columns are arrays of length 768 obtained from a fine-tuned transformer model (all-distilroberta-v1). What I want to achieve is to calculate the cosine similarity of every embedding pair across the two DFs.
To this end, I calculated the dot products of the embeddings for each row using UDFs:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType, FloatType
import numpy as np

spark = SparkSession.builder \
    .appName('app_similarity') \
    .master('local[*]') \
    .config('spark.sql.execution.arrow.pyspark.enabled', True) \
    .config('spark.sql.session.timeZone', 'UTC') \
    .config('spark.driver.memory', '6G') \
    .config('spark.ui.showConsoleProgress', True) \
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config('spark.executor.cores', "5") \
    .config('spark.executor.memory', "4g") \
    .config('spark.num.executors', "5") \
    .getOrCreate()
# build all cross pairs of keys and their embeddings
dfSim = df_main.crossJoin(df_other).select("KEY1", "KEY2", "Emb", "EMBe")

# the embedding columns arrive as their string representation, so strip the
# brackets and cast them back to array<double>
dfSim = dfSim.withColumn("Emb", F.regexp_replace(F.col("Emb"), "\\[", "")).withColumn("Emb", F.regexp_replace(F.col("Emb"), "]", ""))
dfSim = dfSim.withColumn("EMBe", F.regexp_replace(F.col("EMBe"), "\\[", "")).withColumn("EMBe", F.regexp_replace(F.col("EMBe"), "]", ""))
dfSim = dfSim.withColumn("Emb", F.split(F.col("Emb"), ",").cast(ArrayType(DoubleType())))
dfSim = dfSim.withColumn("EMBe", F.split(F.col("EMBe"), ",").cast(ArrayType(DoubleType())))

# UDFs for the dot product of two embeddings and the L2 norm of one embedding
sim_cos = F.udf(lambda x, y: float(np.array(x).dot(np.array(y))), FloatType())
norm = F.udf(lambda x: float(np.linalg.norm(x)), FloatType())

dfSim = dfSim.withColumn("dot", sim_cos(F.col("Emb"), F.col("EMBe")))
dfSim = dfSim.withColumn("Sim", F.round(F.col("dot"), 5))
As seen, I use a crossJoin, which produces all pairs of keys, i.e., roughly 15 million rows.
With this approach I run into errors such as:
a similarity filter like dfSim.filter(F.col("Sim") > 0.3).show() works, but when I raise the threshold to 0.4 or 0.5 it throws socket.timeout: timed out;
writing the result to a SQL database also fails.
I was wondering whether it would be possible to apply an LSH or MinHash approach directly on the embeddings to get their neighbors and similarities, instead of materializing 15 million pairs?
EDIT:
The code below worked. At first the LSH approach did not work because the Spark session was started with "local[*]", which used all the CPUs.
Restricting it to 5 cores made it work. Bear in mind, however, that this is still a
memory-intensive process; playing around with 'spark.executor.memory' and 'spark.driver.memory' might be necessary.
from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName('app_similarity') \
    .master('local[5]') \
    .config('spark.sql.execution.arrow.pyspark.enabled', True) \
    .config('spark.sql.session.timeZone', 'UTC') \
    .config('spark.driver.memory', '8G') \
    .config('spark.ui.showConsoleProgress', True) \
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config('spark.executor.cores', "3") \
    .config('spark.executor.memory', "2g") \
    .config('spark.num.executors', "3") \
    .getOrCreate()

# convert the array columns to ml DenseVectors, which BucketedRandomProjectionLSH expects
ut_rdd = df_main.drop("_c0").rdd.map(lambda row: row[0])
emb_rdd = df_main.drop("_c0").rdd.map(lambda row: row[1])
dfM = ut_rdd.zip(emb_rdd.map(lambda x: Vectors.dense(x))).toDF(schema=['ut', 'emb'])

uto_rdd = df_other.drop("_c0").rdd.map(lambda row: row[0])
embo_rdd = df_other.drop("_c0").rdd.map(lambda row: row[1])
dfO = uto_rdd.zip(embo_rdd.map(lambda x: Vectors.dense(x))).toDF(schema=['ut', 'emb'])

brp = BucketedRandomProjectionLSH(inputCol="emb", outputCol="hashes", bucketLength=2.0, numHashTables=3)
model = brp.fit(dfM)

# approximate join: only pairs within the given Euclidean distance threshold (1.2) are returned
dfSimLSH = model.approxSimilarityJoin(dfM, dfO, 1.2, distCol="EuclideanDistance") \
    .select(col("datasetA.ut").alias("utm"),
            col("datasetB.ut").alias("uto"),
            col("EuclideanDistance"))

Related

Is it bad to use `GroupBy` multiple times in pyspark?

This is an educational question.
I have a text file containing several records of power consumption of factories, each identified by a unique id. The file contains the following columns
factory_id, city, country, date, consumption
where date is in the format mm/YYYY. I want to find which countries have fewer than 20 cities (including those with 0) in which the factories' consumption decreased over two consecutive years. A city's consumption here means the total yearly consumption of the factories located in that city.
To do this, I used groupBy + agg multiple times, as follows:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = df.withColumn("year", F.split("Date", "/")[1])

# compute for each city the yearly consumption
df_consump = df.groupBy("Country", "City", "year").agg(
    F.sum("consumption").alias("consumption")
)

@F.udf(returnType=T.IntegerType())
def had_a_decrease(structs):
    structs = sorted(structs, key=lambda s: s.year)
    # return 0 if the list is monotonically growing, 1 otherwise
    cur_cons = structs[0].consumption
    for struct in structs[1:]:
        cons = struct.consumption
        if cons <= cur_cons:
            return 1
        cur_cons = cons
    return 0

df_cons_decrease = df_consump.groupBy("Country", "City").agg(
    # here I collect a list of structs containing (year, consumption),
    # which is needed because collect_list doesn't guarantee the order
    # is respected, so I keep the info on the year to sort this (small)
    # list first in the udf "had_a_decrease" defined above.
    # Eventually this yields a column with a 1 if we had a decrease, 0 otherwise,
    # which I sum afterwards.
    had_a_decrease(F.collect_list(F.struct("year", "consumption"))).alias("had_decrease")
)

df_cons_decrease.groupBy("Country").agg(
    F.sum("had_decrease").alias("num_cities_with_decrease")
).filter("num_cities_with_decrease < 20") \
    .write.csv(outputFolder)
however I was wondering:
is this bad practice (e.g., inefficient)?
are DataFrames better suited than RDDs for this?
would you recommend a better approach than grouping this many times?
Compare the consumption with the consumption 1 year and 2 years ago by using Window and the lag function, without a UDF, and then group by:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

data = [
    [1, 1, 1, '01/2022', 100],
    [1, 1, 1, '01/2021', 90],
    [1, 1, 1, '01/2020', 80],
    [1, 1, 2, '01/2022', 100],
    [1, 1, 2, '01/2021', 110],
    [1, 1, 2, '01/2020', 120]
]
cols = ['factory_id', 'city', 'country', 'date', 'consumption']
df = spark.createDataFrame(data, cols) \
    .withColumn('year', f.split('date', '/')[1])

w = Window.partitionBy('country', 'city').orderBy('year')

df.groupBy('country', 'city', 'year') \
  .agg(f.sum('consumption').alias('consumption')) \
  .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
  .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
  .withColumn('is_decreased', f.expr('if(`consumption` < `consumption-1` and `consumption-1` < `consumption-2`, true, false)')) \
  .filter('is_decreased = true') \
  .select('country', 'city').distinct() \
  .groupBy('country').count() \
  .filter('count < 20') \
  .select('country') \
  .show()
+-------+
|country|
+-------+
| 2|
+-------+

Performance of PySpark DataFrames vs Glue DynamicFrames

So I recently started using Glue and PySpark for the first time. The task was to create a Glue job that does the following:
Load data from parquet files residing in an S3 bucket
Apply a filter to the data
Add a column, the value of which is derived from 2 other columns
Write the result to S3
Since the data is going from S3 to S3, I assumed that Glue DynamicFrames should be a decent fit for this, and I came up with the following code:
def AddColumn(r):
    if r["option_type"] == 'S':
        r["option_code_derived"] = 'S' + r["option_code_4"]
    elif r["option_type"] == 'P':
        r["option_code_derived"] = 'F' + r["option_code_4"][1:]
    elif r["option_type"] == 'L':
        r["option_code_derived"] = 'P' + r["option_code_4"]
    else:
        r["option_code_derived"] = None
    return r
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": [source_path], "recurse" : True}, format = source_format, additional_options = {"useS3ListImplementation":True})
filtered_gdf = Filter.apply(frame = inputGDF, f = lambda x: x["my_filter_column"] in ['50','80'])
additional_column_gdf = Map.apply(frame = filtered_gdf, f = AddColumn)
gdf_mapped = ApplyMapping.apply(frame = additional_column_gdf, mappings = mappings, transformation_ctx = "gdf_mapped")
glueContext.purge_s3_path(full_target_path_purge, {"retentionPeriod": 0})
outputGDF = glueContext.write_dynamic_frame.from_options(frame = gdf_mapped, connection_type = "s3", connection_options = {"path": full_target_path}, format = target_format)
This works but takes a very long time (just short of 10 hours with 20 G1.X workers).
Now, the dataset is quite large (almost 2 billion records, over 400 GB), but this was still unexpected (to me at least).
Then I gave it another try, this time with PySpark DataFrames instead of DynamicFrames.
The code looks like the following:
glueContext = GlueContext(create_spark_context(role_arn=args['role_arn'], source_bucket=args['s3_source_bucket'], target_bucket=args['s3_target_bucket']))
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.parquet(full_source_path)
df_filtered = df.filter( (df.model_key_status == '50') | (df.model_key_status == '80') )
df_derived = df_filtered.withColumn('option_code_derived',
when(df_filtered.option_type == "S", concat(lit('S'), df_filtered.option_code_4))
.when(df_filtered.option_type == "P", concat(lit('F'), df_filtered.option_code_4[2:42]))
.when(df_filtered.option_type == "L", concat(lit('P'), df_filtered.option_code_4))
.otherwise(None))
glueContext.purge_s3_path(full_purge_path, {"retentionPeriod": 0})
df_reorderered = df_derived.select(target_columns)
df_reorderered.write.parquet(full_target_path, mode="overwrite")
This also works, but with otherwise identical settings (20 workers of type G1.X, same dataset), this takes less than 20 minutes.
My question is: Where does this massive difference in performance between DynamicFrames and DataFrames come from? Was I doing something fundamentally wrong in the first try?

Scala Apache Spark Filter DF Using Arbitrary Number of Bounding Boxes Read From File

Is there a way to perform rectangle bounding box filters, in a manner that scales, on a large data set without additional frameworks like Apache Sedona or GeoMesa?
Suppose a toy data set of:
val data = Seq((1,1646113023,34.073071,-118.257962),(2,1646199423,34.074715, -118.263144),(3,1646285823, 34.032621, -118.224268),(4,1646285823,33.718508, -117.808853),(5,1646372223,33.716589,-117.804304))
val df = spark.sparkContext.parallelize(data).toDF()
val new_col = Seq("id", "time", "lat", "long")
val columnList = df.columns.zip(new_col).map(f=>{col(f._1).as(f._2)})
val df2 = df.select(columnList:_*)
df2.show()
+---+----------+---------+-----------+
| id| time| lat| long|
+---+----------+---------+-----------+
| 1|1646113023|34.073071|-118.257962|
| 2|1646199423|34.074715|-118.263144|
| 3|1646285823|34.032621|-118.224268|
| 4|1646285823|33.718508|-117.808853|
| 5|1646372223|33.716589|-117.804304|
+---+----------+---------+-----------+
I know that I can use filter to subset a df using multiple conditions to create a single bounding box:
df2.filter(($"long" >= -117.858217) && ($"lat" >= 33.711760) && ($"long" <= -117.765176) && ($"lat" <= 33.759725)).show()
+---+----------+---------+-----------+
| id| time| lat| long|
+---+----------+---------+-----------+
| 4|1646285823|33.718508|-117.808853|
| 5|1646372223|33.716589|-117.804304|
+---+----------+---------+-----------+
But what about if there are an arbitrary number of bounding boxes read in from a file such that:
val bb = Seq((-117.858217, 33.711760, -117.765176, 33.759725), (-118.294084, 34.054451, -118.211515, 34.104072))
val bbDF = spark.sparkContext.parallelize(bb).toDF()
bbDF.show()
+-----------+---------+-----------+---------+
| _1| _2| _3| _4|
+-----------+---------+-----------+---------+
|-117.858217| 33.71176|-117.765176|33.759725|
|-118.294084|34.054451|-118.211515|34.104072|
+-----------+---------+-----------+---------+
How can I filter df2 using bbDF?
Open to pyspark answers, but first trying it out in Scala.
The first option is to try the combination of all the filters; if the performance is not good enough, you can try some strategy.
If the amount of data is too big and the filter itself is not the problem, first filter with the outermost bounds of all the boxes to discard the positions that for sure won't fall in any of them, something like:
val minLong = bb.map(_._1).min
val minLat = bb.map(_._2).min
val maxLong = bb.map(_._3).max
val maxLat = bb.map(_._4).max
// one OR-ed predicate covering all the bounding boxes
val filterBB: Column = bb.map { case (lo1, la1, lo2, la2) =>
  $"long" >= lo1 && $"lat" >= la1 && $"long" <= lo2 && $"lat" <= la2
}.reduce(_ || _)

df2.filter($"long" >= minLong && $"long" <= maxLong && $"lat" >= minLat && $"lat" <= maxLat)
  .filter(filterBB)
The first filter can easily be pushed down to the source file and, if the file format allows it, only the chunks of the data that can contain elements in that area will be read.
If the problem is the number of bounding boxes, you can try a strategy of clustering them into bigger boxes and, when an element falls inside one of them, checking only the smaller boxes of that cluster. For example, taking the previous code, we can divide the box (minLong, minLat, maxLong, maxLat) into 4 quadrants and work out which bb's fall inside each of them (some pseudocode):
val halfLong = (minLong + maxLong) / 2
val halfLat = (minLat + maxLat) / 2
val leftDownBB: Column = bb.filter(...).map(...).reduce(_ || _) // simpler predicate built only from the boxes in that quadrant
df2.filter($"long" >= minLong && $"long" <= maxLong && $"lat" >= minLat && $"lat" <= maxLat)
  .filter(
    when($"long" < halfLong && $"lat" < halfLat, leftDownBB)
    .when($"long" >= halfLong && $"lat" < halfLat, rightDownBB)
    .when($"long" < halfLong && $"lat" >= halfLat, leftUpBB)
    .when($"long" >= halfLong && $"lat" >= halfLat, rightUpBB)
  )
This is an example with 4 clusters, but you can make as many as you need. You can also make the code recursive and divide each square into another 4 if you need more layers of the tree; the when calls can be nested.
Hope I've explained myself.
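Since the question is open to PySpark answers as well, the brute-force variant can also be expressed as a non-equi join against the bounding-box DataFrame, which keeps everything in DataFrame operations and lets Spark broadcast the small box table. A minimal sketch in PySpark, assuming df2 and bbDF exist as PySpark DataFrames with the columns from the example above (id, time, lat, long; the four unnamed bbDF columns are renamed for readability):

from pyspark.sql import functions as F

# rename the generated _1.._4 columns of the box DataFrame
boxes = bbDF.toDF("min_long", "min_lat", "max_long", "max_lat")

# non-equi join: keep a row if it falls inside at least one box;
# broadcasting the tiny box table avoids shuffling the large df2
matched = (df2.join(
        F.broadcast(boxes),
        (df2["long"] >= boxes["min_long"]) & (df2["long"] <= boxes["max_long"]) &
        (df2["lat"] >= boxes["min_lat"]) & (df2["lat"] <= boxes["max_lat"]),
        "inner")
    .select("id", "time", "lat", "long")
    .distinct())  # a row matching several boxes would otherwise appear more than once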

Implement Louvain in pyspark using dataframes

I'm trying to implement the Louvain algorithm in pyspark using dataframes. The problem is that my implementation is really slow. This is how I do it:
I collect all vertices and communityIds into simple python lists
For each vertex - communityId pair I calculate the modularity gain using dataframes (just a fancy formula involving edge weight sums/differences)
Repeat until no change
What am I doing wrong?
I suppose that if I could somehow parallelize the for-each loop the performance would increase, but how can I do that?
LATER EDIT:
I could use vertices.foreach(changeCommunityId) instead of the for each loop, but then I'd have to compute the modularity gain (that fancy formula) without dataframes.
See the code sample below:
def louvain(self):
    oldModularity = 0  # since initially each node represents a community
    graph = self.graph
    # retrieve graph vertices and edges dataframes
    vertices = verticesDf = self.graph.vertices
    aij = edgesDf = self.graph.edges

    canOptimize = True
    allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
    verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]
    allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
    m = allEdgesSum[0]['sum(weight)'] / 2

    def computeModularityGain(vertexId, newCommunityId):
        # the sum of all weights of the edges within C
        sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
            .select('weight', 'src', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
            .select('weight', 'dst', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        k_in = sourceNodesNewCommunity.join(destinationNodesNewCommunity,
                                            sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId) \
            .count()
        # the rest of the formula computation goes here, I just wanted to show you an example
        # just return some value for the modularity
        return 0.9

    def changeCommunityId(vertexId, currentCommunityId):
        maxModularityGain = 0
        maxModularityGainCommunityId = None
        for newCommunityId in allCommunityIds:
            if newCommunityId != currentCommunityId:
                modularityGain = computeModularityGain(vertexId, newCommunityId)
                if modularityGain > maxModularityGain:
                    maxModularityGain = modularityGain
                    maxModularityGainCommunityId = newCommunityId
        if maxModularityGain > 0:
            return maxModularityGainCommunityId
        return currentCommunityId

    while canOptimize:
        while self.changeInModularity:
            self.changeInModularity = False
            for vertexCommunityIdPair in verticesIdsCommunityIds:
                vertexId = vertexCommunityIdPair[0]
                currentCommunityId = vertexCommunityIdPair[1]
                newCommunityId = changeCommunityId(vertexId, currentCommunityId)
                self.changeInModularity = False
        canOptimize = False

XML parsing using Scala

I have the following XML file which I want to parse using Scala:
<infoFile xmlns="http://latest/nmc-omc/cmNrm.doc#info" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://latest/nmc-omc/cmNrm.doc#info schema\pmResultSchedule.xsd">
<fileHeader fileFormatVersion="123456" operator="ABCD">
<fileSender elementType="MSC UTLI"/>
<infoCollec beginTime="2011-05-15T00:00:00-05:00"/>
</fileHeader>
<infoCollecData>
<infoMes infoMesID="551727">
<mesPeriod duration="TT1234" endTime="2011-05-15T00:30:00-05:00"/>
<mesrePeriod duration="TT1235"/>
<mesTypes>5517271 5517272 5517273 5517274 </measTypes>
<mesValue mesObj="RPC12/LMI_ANY:Label=BCR-1232_1111, ANY=1111">
<mesResults>149 149 3 3 </mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551728">
<mesTypes>6132413 6132414 6132415</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64446, CllID=64446">
<mesResults>0 0 6</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64447, CllID=64447">
<mesResults>0 1 6</mesResults>
</mesValue>
</infoMes>
<infoMes infoMesID="551729">
<mesTypes>6132416 6132417 6132418 6132419</mesTypes>
<mesValue measObjLdn="RPC12/LMI_ANY:Label=BCR-1232_64448, CllID=64448">
<mesResults>1 4 6 8</mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64449, CllID=64449">
<mesResults>1 2 4 5 </mesResults>
</mesValue>
<mesValue measObjLdn="RPC13/LMI_ANY:Label=BCR-1232_64450, CllID=64450">
<mesResults>1 7 8 5 </mesResults>
</mesValue>
</infoMes>
</infoCollecData>
I want the file to be parsed as follows:
From the fileHeader I want to extract the operator name, then the beginTime.
Next scenario: extract the information which contains CllID, then get its mesTypes and mesResults respectively.
As the file contains a number of entries with different CllIDs, I want the final result to look like this:
CllID date time mesTypes mesResults
64446 2011-05-15 00:00:00 6132413 0
64446 2011-05-15 00:00:00 6132414 0
64446 2011-05-15 00:00:00 6132415 6
64447 2011-05-15 00:00:00 6132413 0
64447 2011-05-15 00:00:00 6132414 1
64447 2011-05-15 00:00:00 6132415 6
How could I achieve this? Here is what I have tried so far:
import java.io._
import scala.xml.Node

object xml_parser {

  def main(args: Array[String]) = {
    val input_xmlFile = scala.xml.XML.loadFile("C:/Users/ss.xml")
    val fileHeader = input_xmlFile \ "fileHeader"
    val vendorName = input_xmlFile \ "fileHeader" \ "#operator"
    val dateTime = input_xmlFile \ "fileHeader" \ "infoCollec" \ "#beginTime"
    val date = dateTime.text.split("T")(0)
    val time = dateTime.text.split("T")(1).split("-")(0)
    val CcIds = input_xmlFile \ "infoCollecData" \ "infoMes" \\ "mesTypes"
    val cids = CcIds.text.split("\\s+").toList
    val CounterValues = input_xmlFile \ "infoCollecData" \\ "infoMes" \\ "mesValue" \\ "#meaObj"
    println(date); println(time); print(cids)
  }
}
May I suggest kantan.xpath? It seems like it should sort your problem rather easily.
Assuming your XML data is available in file data, you can write:
import kantan.xpath.implicits._
val xml = data.asUnsafeNode
// Date format to parse dates. Put in the right format.
// Note that this uses java.util.Date, you could also use the joda time module.
implicit val format = ???
// Extract the header data
xml.evalXPath[java.util.Date](xp"//fileheader/infocollec/#begintime")
xml.evalXPath[String](xp"//fileheader/#operator")
// Get the required infoMes nodes as a list, turn each one into whatever data type you need.
xml.evalXPath[List[Node]](xp"//infomes/mesvalue[contains(#measobjldn, 'CllID')]/..").map { node =>
...
}
Extracting the CllID bit is not terribly complicated with the right regular expression, you could either use the standard Scala Regex class or kantan.regex for something a bit more type safe but that might be overkill here.
The following code implements what you want, given your xml format:
def main(args: Array[String]): Unit = {
  val inputFile = xml.XML.loadFile("C:/Users/ss.xml")
  val fileHeader = inputFile \ "fileHeader"
  val beginTime = fileHeader \ "infoCollec"
  val res = beginTime.map(_.attribute("beginTime")).apply(0).get.text
  val dateTime = res.split("T")
  val date = dateTime(0)
  val time = dateTime(1).split("-").apply(0)

  val title = ("CllID", "date", "time", "mesTypes", "mesResults")
  println(s"${title._1}\t${title._2}\t\t${title._3}\t\t${title._4}\t${title._5}")

  val infoMesNodeList = (inputFile \\ "infoMes").filter { node =>
    (node \ "mesValue").exists(_.attribute("measObjLdn").nonEmpty)
  }

  infoMesNodeList.foreach { node =>
    val mesTypesList = (node \ "mesTypes").text.split(" ").map(_.toInt)
    (node \ "mesValue").foreach { node =>
      val mesResultsList = (node \ "mesResults").text.split(" ").map(_.toInt)
      val CllID = node.attribute("measObjLdn").get.text.split(",").apply(1).split("=").apply(1).toInt
      val res = (mesTypesList zip mesResultsList).map(item => (CllID, date, time, item._1, item._2))
      res.foreach(item => println(s"${item._1}\t${item._2}\t${item._3}\t${item._4}\t\t${item._5}"))
    }
  }
}
Notes: your xml file is not well-formed:
1) the closing </infoFile> tag is missing at the end of the file
2) line 11 has a wrong closing tag (</measTypes>), which should be </mesTypes>