Spark word2vec findSynonyms on Dataframes - scala

I am trying to use the findSynonyms operation without collecting (an action). Here is an example. I have a DataFrame which holds vectors.
df.show()
+--------------------+
| result|
+--------------------+
|[-0.0081423431634...|
|[0.04309031420520...|
|[0.03857229948043...|
+--------------------+
I want to use findSynonyms on this DataFrame. I tried
df.map{case Row(vector:Vector) => model.findSynonyms(vector)}
but it throws a null pointer exception. I have since learned that Spark does not support nested transformations or actions. One possible way would be to collect this DataFrame and then run findSynonyms on the driver. How can I do this operation at the DataFrame level?

If I have understood correctly, you want to perform a function on each row in the DataFrame. To do that, you can declare a User Defined Function (UDF). In your case the UDF will take a vector as input.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// findSynonyms also needs the number of synonyms to return
val func = udf((vector: Vector) => model.findSynonyms(vector, 5))
df.withColumn("synonymes", func($"result"))
A new column "synonymes" will be created using the results from the func function.
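For illustration, here is a minimal sketch of that pattern. It assumes the word/vector pairs have already been pulled out of the model (for example via model.getVectors) and broadcast as a plain Map, so the per-row lookup never touches the SparkContext; the vocabulary contents, the top-3 cut-off and the findSynonymsUdf name are made up for the example, and the existing spark session and df are reused:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical vocabulary (word -> embedding), e.g. collected from model.getVectors
// and broadcast so that executors can use it without the SparkContext.
val vocab: Map[String, Vector] = Map(
  "spark" -> Vectors.dense(0.1, 0.9),
  "scala" -> Vectors.dense(0.2, 0.8)
)
val vocabBc = spark.sparkContext.broadcast(vocab)

// Plain cosine similarity between two vectors.
def cosine(a: Vector, b: Vector): Double = {
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  dot / (Vectors.norm(a, 2) * Vectors.norm(b, 2))
}

// UDF returning the three closest vocabulary words for a given vector.
val findSynonymsUdf = udf { (vector: Vector) =>
  vocabBc.value.toSeq
    .map { case (word, vec) => (word, cosine(vector, vec)) }
    .sortBy(-_._2)
    .take(3)
    .map(_._1)
}

df.withColumn("synonymes", findSynonymsUdf($"result")).show(truncate = false)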

Related

How to parallelize operations on partitions of a dataframe

I have a dataframe df =
+--------------------+
| id|
+-------------------+
|113331567dc042f...|
|5ffbbd1adb4c413...|
|08c782a4ae854e8...|
|24cee418e04a461...|
|a65f47c2aecc455...|
|a74355ef35d442d...|
|86f1a9b7ffc843b...|
|25c8abd6895e445...|
|b89ce33788f4484...|
.....................
with a million elements.
I want to repartition the dataframe into multiple partitions and pass each partition's elements as a list to a database API call that returns a Spark dataset.
Something like this.
val df2 = df.repartition(10)
df2.foreachPartition { partition =>
  val result = spark.read
    .format("custom.databse")
    .where(__key in partition.toList)
    .load
}
And at the end I would like to do a union of all the result datasets returned for each partition.
The expected output will be a final dataset of strings.
+--------------------+
| customer names|
+-------------------+
|eddy |
|jaman |
|cally |
|sunny |
|adam |
.....................
Can anyone help me convert this into real code in Spark/Scala?
From what I see in the documentation it could be possible to do something like this. You'll have to use the RDD API and the SparkContext, so you could use parallelize to split your data into n partitions. After that you can call foreachPartition, which already gives you an iterator over your data directly, with no need to collect the data.
Conceptually what you are asking is not really possible in Spark.
Your API call is a SparkContext-dependent function (i.e. spark.read), and one cannot use a SparkContext inside a partition function. In simpler words, you cannot pass the spark object to the executors (see SPARK-5063 for reference).
To put it even more simply: think of a Dataset where each row is itself a Dataset. Is that even possible? No.
In your case there are two ways to solve this:
Case 1: One by one, then union
Convert the keys to a list and split them evenly.
For each split, call the spark.read API and keep unioning the results.
// split the collected keys into lists of 10000
val listOfListOfKeys: List[List[String]] =
  df.collect().map(_.getString(0)).grouped(10000).map(_.toList).toList

// bring the Dataset for the first 10000 keys (first list)
var resultDf = spark.read.format("custom.databse")
  .where(__key in listOfListOfKeys.head).load

// bring the rest of them, unioning as we go
// (note: drop(1) returns a new list, it does not mutate listOfListOfKeys)
for (listOfKeys <- listOfListOfKeys.drop(1)) {
  val tempDf = spark.read.format("custom.databse")
    .where(__key in listOfKeys).load
  resultDf = resultDf.union(tempDf)
}
There will be scaling issues with this approach because of the data collected on the driver. But if you want to use the spark.read API, then this might be the only easy way.
Case 2: foreachPartition + a normal DB call which returns an iterator
If you can find another way to get the data from your DB which returns an iterator or any single-threaded, Spark-independent object, then you can achieve exactly what you want by applying what Filip has answered, i.e. df.repartition.rdd.foreachPartition(yourDbCallFunction()).
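As a rough illustration of Case 2, here is a sketch of that foreachPartition pattern. The MyDbClient object below is hypothetical and stands in for a plain (non-Spark) client, such as a JDBC or REST lookup, that can be created on the executors:
import org.apache.spark.sql.SparkSession

// Hypothetical single-threaded client; replace with a real JDBC/REST lookup.
object MyDbClient {
  def fetchCustomerNames(keys: Seq[String]): Seq[String] =
    keys.map(k => s"customer-for-$k")
}

val spark = SparkSession.builder().appName("partition-lookup").getOrCreate()
import spark.implicits._

val df = Seq("113331567dc042f", "5ffbbd1adb4c413", "08c782a4ae854e8").toDF("id")

// One DB call per partition; only the plain client runs on the executors.
df.repartition(10).rdd.foreachPartition { partition =>
  val keys = partition.map(_.getString(0)).toList
  val names = MyDbClient.fetchCustomerNames(keys)
  names.foreach(println) // in practice, write to a table/file instead of printing
}
If the results need to come back as a single Dataset (as in the expected output above), the usual variant is mapPartitions instead of foreachPartition, followed by toDF, with no union needed at all.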

Run SparkSQL query for each line in PySpark dataframe

I have a dataframe that contains parameters of a SQL query I need to run. I ultimately need the results of all of these SQL queries to be stored in a separate dataframe. Currently, I am mapping over each row of my parameter dataframe, then using a custom function to create the SQL query that needs to be run, like so:
# Example df
df = spark.createDataFrame(
    [
        ("contract", 123),
        ("customer", 223),
    ],
    ["id_type", "ids"]
)
df.show()
+--------+----+
| id_type| ids|
+--------+----+
|contract| 123|
|customer| 223|
+--------+----+
________________________________________________________________________________________
# Create custom function that will write a sql query
def query_writer(id_type, ids):
    qry = f'''
        SELECT * FROM table
        WHERE {id_type}_id = '{ids}'
    '''
    return sqlContext.sql(qry)  # I also tried saving the results as a dictionary and outputting that
# Apply this function to each row of the dataframe
rdd1 = df.rdd.map(lambda x: (x[0], x[1], query_writer(x[0], x[1])))
qry = rdd1.take(1)
But, I get this error:
Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers.
For more information, see SPARK-5063.
Is there any way to run a SQL query for each row of a dataframe in PySpark?
I happen to know that the query will always return 4 rows, if that is helpful.

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it, as it needs to hold the entire data on a single machine, which is not feasible on the production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives an error value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], which is why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected the entire data into a single JVM's memory, so using parallelize is not the issue.
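If the goal of the original .map was to end up with (DeviceId, ButtonName) pairs, the same mapping can be applied to the RDD of Rows without ever collecting; a small sketch, reusing the query from the question:
// RDD[Row] straight from the DataFrame, no collect involved
val bidRdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd

// Map each Row to a (DeviceId, ButtonName) tuple, as in the original attempt
val bidPairs = bidRdd.map(row => (row.get(0).toString, row.get(1).toString))
bidPairs.take(5).foreach(println)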

How to display the results brought from column functions using Spark/Scala, like what show() does for a dataframe

I just started learning how to use DataFrame and Column in Spark/Scala. I know that if I want to show something on the screen, I can just do df.show(). But how can I do this for a column? For example,
scala> val dfcol = df.apply("sgan")
dfcol: org.apache.spark.sql.Column = sgan
this finds a column called "sgan" in the dataframe df and assigns it to dfcol, so dfcol is a Column. Then, if I do
scala> abs(dfcol)
res29: org.apache.spark.sql.Column = abs(sgan)
I just got the result shown on the screen like above. How can I show the result of this function on the screen like df.show() does? Or, in other words, how can I know the results of the functions like abs, min and so forth?
You should always work with a dataframe; Column objects are not meant to be inspected this way. You can use select to create a dataframe with the column you're interested in, and then use show():
df.select(functions.abs(df("sgan"))).show()
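The same select-then-show pattern works for the other column functions mentioned (min and so forth); a small sketch, assuming the sgan column is numeric and reusing the df from the question:
import org.apache.spark.sql.functions.{abs, min, max}

// Element-wise function: one output row per input row
df.select(abs(df("sgan")).as("abs_sgan")).show()

// Aggregate functions: a single-row DataFrame holding the result
df.select(min(df("sgan")).as("min_sgan"), max(df("sgan")).as("max_sgan")).show()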

Spark - extracting single value from DataFrame

I have a Spark DataFrame query that is guaranteed to return single column with single Int value. What is the best way to extract this value as Int from the resulting DataFrame?
You can use head
df.head().getInt(0)
or first
df.first().getInt(0)
Check the DataFrame Scala docs for more details.
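If the single result column has a name, Row.getAs can be used instead of a positional index; a small sketch (the column name "value" is made up for the example):
// Same idea as head().getInt(0), but by column name
val result: Int = df.head().getAs[Int]("value")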
This could solve your problem.
df.map { row => row.getInt(0) }.first()
In PySpark, you can simply get the first element with df.head()[0] if the dataframe is a single-row, single-column response; otherwise a whole row is returned and you have to index dimension-wise, e.g. df.head(1)[0][0] when head(n) is given a count and returns a list of rows.
df.head()[0]
If we have the Spark dataframe:
+----------+
|_c0 |
+----------+
|2021-08-31|
+----------+
x = df.first()[0]
print(x)
2021-08-31