Fastest way to create Dictionary from pyspark DF - pyspark

I'm using SnappyData with pyspark to run my SQL queries and convert the output DF into dictionaries to bulk insert them into Mongo.
I've gone through many similar questions to test the conversion of a Spark DF to a dictionary.
Currently I'm using map(lambda row: row.asDict(), x.collect()) to convert my bulk DF to dictionaries, and it takes 2-3 seconds for 10K records.
I've stated below how I implement my idea:
x = snappySession.sql("select * from test")
df = map(lambda row: row.asDict(), x.collect())
db.collection.insert_many(df)
Is there any faster way?

I'd recommend using foreachPartition:
(snappySession
    .sql("select * from test")
    .foreachPartition(insert_to_mongo))
where insert_to_mongo:
def insert_to_mongo(rows):
    client = ...
    db = ...
    db.collection.insert_many((row.asDict() for row in rows))
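For reference, a minimal sketch of insert_to_mongo with the elided parts filled in; the connection URI, database name, and the empty-partition guard are assumptions, not part of the original answer:
from pymongo import MongoClient

def insert_to_mongo(rows):
    # One client per partition; the URI and database name are placeholders.
    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]
    docs = [row.asDict() for row in rows]
    if docs:  # partitions can be empty, and insert_many rejects an empty list
        db.collection.insert_many(docs)
    client.close()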

I would look into whether you can directly write to Mongo from Spark, as that will be the best method.
Failing that, you can use this method:
x = snappySession.sql("select * from test")
dictionary_rdd = x.rdd.map(lambda row: row.asDict())
for d in dictionary_rdd.toLocalIterator():
    db.collection.insert_one(d)
This will create all the dictionaries in Spark in a distributed manner. The rows will be returned to the driver and inserted into Mongo one at a time so that you don't run out of memory.
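For the direct-write suggestion above, a hedged sketch using the MongoDB Spark Connector; the connector must be on the classpath, the URI/database/collection values are placeholders, and option names differ between connector versions (this follows the 10.x naming; older 2.x/3.x connectors use format("mongo") and a "uri" option):
x = snappySession.sql("select * from test")
(x.write
    .format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb://localhost:27017")  # placeholder URI
    .option("database", "mydb")                             # placeholder database
    .option("collection", "test")                           # placeholder collection
    .save())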

Related

Pyspark - iterate on a big dataframe

I'm using the following code
events_df = []
for i in df.collect():
    v = generate_event(i)
    events_df.append(v)
events_df = spark.createDataFrame(events_df, schema)
to go over each dataframe item and add an event header calculated in the generate_event function
def generate_event(delta_row):
    header = {
        "id": 1,
        ...
    }
    row = Row(Data=delta_row)
    return EntityEvent(header, row)

class EntityEvent:
    def __init__(self, _header, _payload):
        self.header = _header
        self.payload = _payload
It works fine locally for a df with few items (even with 1,000,000 items), but when we have more than 6 million the AWS Glue job fails.
Note: with rdd it seems to be better, but I can't use it because I have a problem with dates < 1900-01-01 (issue).
Is there a way to chunk the dataframe and consolidate at the end?
The best solution that we can propose is to use Spark's built-in functions, adding the new columns using the struct and create_map functions...
events_df = (
    df
    .withColumn(
        "header",
        f.create_map(
            f.lit("id"),
            f.lit(1)
        )
    )
    ...
)
So we can create as many columns as we need and apply transformations to get the required header structure.
PS: this solution (adding new columns to the dataframe rather than iterating over it) avoids using rdd and brings a big performance advantage!
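To make the column-based approach concrete, a minimal sketch; the header fields and the payload struct below are illustrative placeholders for whatever generate_event actually computes:
from pyspark.sql import functions as f

# Build the header as a struct column instead of constructing Python objects row by row.
events_df = (
    df
    .withColumn(
        "header",
        f.struct(
            f.lit(1).alias("id"),                       # placeholder header field
            f.current_timestamp().alias("created_at"),  # placeholder header field
        ),
    )
    # Wrap the original row into a "payload" struct, analogous to EntityEvent's payload.
    .withColumn("payload", f.struct(*[f.col(c) for c in df.columns]))
)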

How to loop over every row of streaming query dataframe in pyspark

I have a streaming query (shown in the picture below); for every row I need to loop over the dataframe, do some transformation, and save the result to ADLS. Can anyone help me with how to loop over a streaming df? I'm stuck.
Please check the link for details on foreach and foreachBatch:
using-foreach-and-foreachbatch
You can perform operations inside the function process_row() when passing it to the pyspark.sql.DataFrame.writeStream interface:
def process_row(row):
    # Write the row to storage (in your case ADLS)
    pass

ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)

df = spark.readStream \
    .format("eventhubs").options(**ehConf).load()

df_new = df.writeStream.foreach(process_row).start()
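Alternatively, if the per-row work can be expressed on the whole micro-batch, foreachBatch is usually simpler. A minimal sketch; the ADLS paths and the append-to-Parquet choice are placeholders, not from the original answer:
def process_batch(batch_df, batch_id):
    # Transform the micro-batch as a regular DataFrame, then write it out.
    # The path below is a placeholder for your ADLS container/folder.
    (batch_df
        .write
        .mode("append")
        .parquet("abfss://container@account.dfs.core.windows.net/output/"))

query = (df
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/")
    .start())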

Can we convert data frame in Databricks to string and why do we get error Queries with streaming sources must be executed with writeStream.start()

I'm selecting a column from a dataframe. I would like to cast it as a string so that it can be used to build a Cosmos DB dynamic query. The function collect() on the dataframe complains that queries with streaming sources must be executed with writeStream.start();;
val DF = AppointmentDF
    .select("*")
    .filter($"xyz" === "abc")

DF.createOrReplaceTempView("MyTable")

val column1DF = spark.sql("SELECT column1 FROM MyTable")

// This is not getting resolved
val sql = "select c.abc from c where c.column = \"" + String.valueOf(column1DF) + "\""
println(sql)
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`column1DF`' given input columns: []; line 1 pos 12;
DF.collect().foreach { row =>
    println(row.mkString(","))
}
Error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
A dataframe is a distributed data structure, not a structure located on your machine that can be printed. The values DF and column1DF are themselves dataframes. To bring all the data of your queries to the driver node, you can use the dataframe method collect and extract your value from the returned Array of rows.
Collect can be harmful if you are bringing gigabytes of data to the memory of your driver node.
You can use collect and take head to get the first row of the DataFrame:
val column1DF = spark.sql("SELECT column1 FROM MyTable").collect().head.getAs[String](0)
val sql="select c.abc from c where c.column = \"" + column1DF + "\""

Looping a dataframe from the column from the same table in scala

I have a DataFrame that contains table names along with data. I need to loop over the DataFrame using the table name column. Is there a better way to do it than with a collect first?
val tablename: Array[String] = df1.select("msgname").distinct().rdd.map(row => row.getString(0).trim).collect

tablename.foreach { table =>
    //print(table)
    //val columns: Array[String] = df1.filter(s"msgname = '$table'").select("columns").distinct().rdd.map(row => row.toString()).collect
    df1.filter(s"msgname = '$table'").select("record_data").write.saveAsTable(s"$table")
    //.toDF(columns:_*).show()
}
Two ideas to improve performance: cache df1 and/or fire parallel Spark jobs, e.g. using parallel collections, like this:
df1.cache()

val tablename: Array[String] = df1.select(trim("msgname")).distinct().as[String].collect

tablename
    .par // enable parallel execution
    .foreach { table =>
        df1.filter(s"msgname = '$table'").select("record_data").write.saveAsTable(s"$table")
    }

How to get the row top 1 in Spark Structured Streaming?

I have an issue with Spark Streaming (Spark 2.2.1). I am developing a real-time pipeline where I first get data from Kafka, then join the result with another table, then send the Dataframe to an ALS model (Spark ML), which returns a streaming Dataframe with one additional column, prediction. The problem is that when I tried to get the row with the highest score, I couldn't find a way to do it.
I tried:
Applying SQL functions like limit, take, sort
The dense_rank() function
Searching on StackOverflow
I read Unsupported Operations, but there doesn't seem to be much there.
Additionally, I would send the row with the highest score to a Kafka queue.
My code is as follows:
val result = lines.selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", mySchema).as("data"))
    //.select("data.*")
    .selectExpr("cast(data.largo as int) as largo","cast(data.stock as int) as stock","data.verificavalormax","data.codbc","data.ide","data.timestamp_cli","data.tef_cli","data.nombre","data.descripcion","data.porcentaje","data.fechainicio","data.fechafin","data.descripcioncompleta","data.direccion","data.coordenadax","data.coordenaday","data.razon_social","data.segmento_app","data.categoria","data.subcategoria")

result.printSchema()

val model = ALSModel.load("ALSParaTiDos")
val fullPredictions = model.transform(result)

// fullPredictions is a streaming dataframe with an extra column "prediction"; here I need the code to get the first row
val query = fullPredictions.writeStream.format("console").outputMode(OutputMode.Append()).option("truncate", "false").start()
query.awaitTermination()
Update
Maybe I was not clear, so I'm attaching an image with my problem. I also wrote a simpler piece of code to complement it: https://gist.github.com/.../9193c8a983c9007e8a1b6ec280d8df25
detailing what I need. I will appreciate any help :)
TL;DR Use stream-stream inner joins (Spark 2.3.0) or use a memory sink (or a Hive table) as temporary storage.
I think that the following sentence describes your case very well:
The problem is when I tried to get the row with the highest score, I couldn't find a way to resolve it.
Machine learning aside (it just gives you a streaming Dataset with predictions), finding the maximum value in a column of a streaming Dataset is the real problem here.
The first step is to calculate the max value as follows (copied directly from your code):
streaming.groupBy("idCustomer").agg(max("score") as "maxscore")
With that, you have two streaming Datasets that you can join as of Spark 2.3.0 (which was released a few days ago):
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames.
Inner joins on any kind of columns along with any kind of join conditions are supported.
Inner join the streaming Datasets and you're done.
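For the memory-sink fallback mentioned in the TL;DR, a minimal sketch (shown in PySpark for brevity, although the question is in Scala; the query name and the "prediction" column are assumptions mirrored from the question):
# Stage the streaming predictions in an in-memory table, then query it as a batch table.
query = (fullPredictions
    .writeStream
    .format("memory")
    .queryName("predictions_tbl")
    .outputMode("append")
    .start())

# Once some micro-batches have been processed, read back the top row so far:
top_row = spark.sql("SELECT * FROM predictions_tbl ORDER BY prediction DESC LIMIT 1")
top_row.show()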
Try this:
Implement a function that extracts the max value of the column and then filters your dataframe by that max.
def getDataFrameMaxRow(df: DataFrame, col: String): DataFrame = {
    val gson = new Gson() // not shown in the original snippet
    // get the maximum value
    val list_prediction = df.select(col).toJSON.rdd
        .collect()
        .toList
        .map { x => gson.fromJson[JsonObject](x, classOf[JsonObject]) }
        .map { x => x.get(col).getAsString.toInt }
    val max = getMaxFromList(list_prediction)

    // filter dataframe by the maximum value
    val df_filtered = df.filter(df(col) === max.toString())
    df_filtered
}

def getMaxFromList(xs: List[Int]): Int = xs match {
    case List(x: Int) => x
    case x :: y :: rest => getMaxFromList((if (x > y) x else y) :: rest)
}
And in the body of your code add:
import com.google.gson.JsonObject
import com.google.gson.Gson
import org.apache.spark.sql.DataFrame
val fullPredictions = model.transform(result)
val df_with_row_max = getDataFrameMaxRow(fullPredictions, "prediction")
Good Luck !!