How to loop over every row of a streaming query dataframe in pyspark

I have a streaming query (shown in the picture in the original post). For every row I need to loop over the dataframe, do some transformation, and save the result to ADLS. Can anyone help me loop over a streaming df? I am stuck.

Please check the link for details on foreach and foreachBatch:
using-foreach-and-foreachbatch
You can perform operations inside the function process_row() and hook it up via the pyspark.sql.DataFrame.writeStream interface:
def process_row(row):
    # Write the row to storage (in your case ADLS)
    pass

ehConf = {}
# Newer versions of the Event Hubs connector require an encrypted connection string
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)

df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

query = df.writeStream.foreach(process_row).start()  # returns a StreamingQuery
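If you would rather write each micro-batch to ADLS in bulk instead of row by row, foreachBatch is usually the better fit. A minimal sketch, assuming the same df as above; the ADLS container/account and checkpoint path below are placeholders, not values from the original post:

def write_batch(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame, so any batch writer works here
    batch_df.write.mode("append").parquet(
        "abfss://<container>@<account>.dfs.core.windows.net/output/")  # placeholder ADLS path

query = (df.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/eventhubs")  # placeholder checkpoint location
    .start())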

Related

How to drop first row from parquet file?

I have a parquet file which contains two columns (id, feature). The file consists of 14348 rows.
How do I drop the first row (id, feature) from the file?
Code:
val df = spark.read.format("parquet").load("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
val header = df.first()
val data = df.filter(row => row != header)
data.show()
The result appears as in the output shown in the original post.
If you are trying to "ignore" the schema defined in the file, that is done implicitly once you read your file with Spark, like:
spark.read.format("parquet").load(your_file)
If you are only trying to skip the first row of your DF and you already know the id, you can do: val filteredDF = originalDF.filter(s"id != '${excludeID}' "). If you don't know the id, you can use monotonically_increasing_id to tag the rows and then filter, similar to: filter spark dataframe based on maximum value of a column
You need to drop the first row based on its id if you know it; otherwise go for the indexing approach, i.e., assign a row number and delete the first row (see the sketch below).
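A minimal PySpark sketch of that indexing approach (the Scala version is analogous); the input path is a placeholder, and note that monotonically_increasing_id only guarantees increasing ids, so "first" is only well defined under an explicit ordering:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.parquet("path/to/part-r-00000.snappy.parquet")  # placeholder path

# Tag each row with an increasing id, then derive a row number from it
indexed = df.withColumn("_mid", F.monotonically_increasing_id())
indexed = indexed.withColumn("_rn", F.row_number().over(Window.orderBy("_mid")))

# Drop the first row and the helper columns
result = indexed.filter(F.col("_rn") > 1).drop("_mid", "_rn")
result.show()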
I'm using Spark 2.4.0, and you could use the header option on the DataFrameReader call like so -
spark.read.format("csv").option("header", true).load(<path_to_file>)
A reference for the other DataFrameReader options is here

Apache Spark SQL: how to optimize chained join for dataframe

I have to make a left join between a principal data frame and several reference data frames, i.e. a chained join computation, and I wonder how to make this efficient and scalable.
Method 1 is easy to understand and is also the current method, but I'm not satisfied with it: all the transformations are chained and wait for the final action to trigger the computation. If I keep adding transformations and the data volume grows, Spark will eventually fail at the end, so this method is not scalable.
Method 1:
def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame,
             refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {
  val transformations: List[DataFrame => DataFrame] = List(
    castColumnsFromStringToLong(ColumnsToCastToLong),
    castColumnsFromStringToFloat(ColumnsToCastToFloat),
    renameColumns(RenameMapping),
    filterAndDropColumns,
    joinRefDF1(refDF1),
    joinRefDF2(refDF2),
    joinRefDF3(refDF3),
    joinRefDF4(refDF4),
    joinRefDF5(refDF5),
    calculate()
  )
  // Compose all steps into a single DataFrame => DataFrame function
  transformations.reduce(_ andThen _)
}
pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
Method 2: I haven't found a real way to achieve my idea, but I hope to trigger the computation of each join immediately.
According to my tests, count() is too heavy for Spark and useless for my application, but I don't know how to trigger the join computation with a more efficient action. That kind of action is, in fact, the answer to this question.
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache() // joinedDF is not always used multiple times, but for some data frame, it is, so I add cache() to indicate the usage
joinedDF_1.count()
val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()
val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
joinedDF_3.cache()
joinedDF_3.count()
val joinedDF_4 = filterAndDropColumns(joinedDF_3)
joinedDF_4.cache()
joinedDF_4.count()
...
When you want to force the computation of a given join (or any transformation that is not final) in Spark, you can use a simple show or count on your DataFrame. Such terminal operations force the computation of the result, because otherwise the action simply could not be executed.
Only after this will your DataFrame be effectively stored in your cache.
Once you're finished with a given DataFrame, don't hesitate to unpersist it. This frees the cached data when your cluster needs more room for further computation.
You need to repartition your datasets on the join columns before calling the join transformation.
Example:
val df1Repart = df1.repartition(col("col1"), col("col2"))
val df2Repart = df2.repartition(col("col1"), col("col2"))
val joinDF = df1Repart.join(df2Repart, df1Repart("col1") === df2Repart("col1") && ...)
Try creating a new dataframe based on it.
Ex:
val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
dfTest.storageLevel.useMemory // the result should be true

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done it in pandas in the past with the function iterrows(), but I need to find something similar for pyspark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe using a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
Then, inside the select, you can choose what you want to do with each column.
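If you really do need a driver-side loop over rows, similar to pandas iterrows(), one option is toLocalIterator, which streams partitions back to the driver one at a time. A minimal sketch; "some_column" is a placeholder column name:

# Iterate rows on the driver without collecting the whole DataFrame at once
for row in myDF.toLocalIterator():
    # each row is a pyspark.sql.Row; access fields by name or convert with row.asDict()
    print(row["some_column"])  # placeholder column name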

what is the pysparkic way of doing for loop on spark df

In pyspark, what is the best way to do an operation per id when a groupby does not apply? Here is a sample code:
for id in [int(i.id) for i in df.select('id').distinct().collect()]:
    temp = df.where("id == {}".format(id))
    temp = temp.sort("date")
    my_window = Window.partitionBy().orderBy("id")
    temp = temp.withColumn("prev_transaction", lag(temp['date']).over(my_window))
    temp = temp.withColumn("diff", temp['date'] - temp["prev_transaction"])
    temp = temp.where('diff > 0')
    # select a row and so on
What's the best way to optimize this?
I suppose you have a transactions data frame and you want to add a previous-transactions column to it.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy('id').orderBy('date')
df = df.withColumn('pre_transactions', F.collect_list('id')
                   .over(windowSpec.rangeBetween(Window.unboundedPreceding, 0)))
The code above will add an array column with the previous transactions up to and including the current transaction. Since the current transaction is also added to the array, you can remove it with a simple custom udf like:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

remove_element = lambda l: l[:-1]
remove_udf = udf(remove_element, ArrayType(StringType()))
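Alternatively, if the original loop is only meant to compute the previous transaction date and the gap per id, a window with lag avoids both the Python loop and the collect_list/udf step. A minimal sketch using the column names from the question; datediff assumes date is a date or timestamp column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('date')

result = (df
    .withColumn('prev_transaction', F.lag('date').over(w))
    .withColumn('diff', F.datediff('date', 'prev_transaction'))
    .where(F.col('diff') > 0))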

Fastest way to create Dictionary from pyspark DF

I'm using Snappydata with pyspark to run my sql queries and convert the output DF into a dictionary to bulk insert it into mongo.
I've gone through many similar questions to test the conversion of a Spark DF to a dictionary.
Currently I'm using map(lambda row: row.asDict(), x.collect()) to convert my bulk DF to dictionaries, and it takes 2-3 seconds for 10K records.
I've shown below how I implement my idea:
x = snappySession.sql("select * from test")
df = map(lambda row: row.asDict(), x.collect())
db.collection.insert_many(df)
Is there any faster way?
I'd recommend using foreachPartition:
(snappySession
    .sql("select * from test")
    .foreachPartition(insert_to_mongo))
where insert_to_mongo:
def insert_to_mongo(rows):
    client = ...
    db = ...
    db.collection.insert_many((row.asDict() for row in rows))
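A more concrete version of insert_to_mongo, assuming pymongo; the URI, database, and collection names are placeholders. Creating the client inside the function means each partition opens its own connection on the executor:

from pymongo import MongoClient

def insert_to_mongo(rows):
    client = MongoClient("mongodb://host:27017")  # placeholder URI
    db = client["mydb"]                           # placeholder database name
    docs = [row.asDict() for row in rows]
    if docs:  # insert_many raises on an empty document list
        db["mycollection"].insert_many(docs)      # placeholder collection name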
I would look into whether you can directly write to Mongo from Spark, as that will be the best method.
Failing that, you can use this method:
x = snappySession.sql("select * from test")
dictionary_rdd = x.rdd.map(lambda row: row.asDict())
for d in dictionary_rdd.toLocalIterator():
    db.collection.insert_one(d)
This will create all the dictionaries in Spark in a distributed manner. The rows will be returned to the driver and inserted into Mongo one row at a time so that you don't run out of memory.
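For the direct-write route mentioned above, a minimal sketch assuming the MongoDB Spark Connector is on the classpath; the format name ("mongodb" in connector 10.x, "mongo" in older versions) and the option keys and values below are assumptions about your setup, not from the original answer:

(snappySession
    .sql("select * from test")
    .write
    .format("mongodb")                                  # "mongo" for older connector versions
    .mode("append")
    .option("connection.uri", "mongodb://host:27017")   # placeholder URI
    .option("database", "mydb")                         # placeholder database
    .option("collection", "mycollection")               # placeholder collection
    .save())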