I'm trying to write data pulled from Kafka to a BigQuery table every 120 seconds.
I would like to do some additional operations which, according to the documentation, should be possible inside the .foreach() or .foreachBatch() methods.
As a test I wanted to print a simple message every time data gets pulled from Kafka and written to BigQuery.
batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreachBatch(print("do i get printed every batch?")) \
    .format("bigquery").outputMode("append") \
    .option("temporaryGcsBucket", path1) \
    .option("checkpointLocation", path2) \
    .option("table", table_kafka) \
    .start()
batch_job.awaitTermination()
I would expect this message to be printed every 120 seconds in the JupyterLab output cell; instead it gets printed only once, and the query just keeps writing to BigQuery.
If I try to use .foreach() instead of .foreachBatch():
batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreach(print("do i get printed every batch?")) \
    .format("bigquery").outputMode("append") \
    .option("temporaryGcsBucket", path1) \
    .option("checkpointLocation", path2) \
    .option("table", table_kafka) \
    .start()
batch_job.awaitTermination()
it prints the message once and immediately after gives the following error, which I could not debug/understand:
/usr/lib/spark/python/pyspark/sql/streaming.py in foreach(self, f)
1335
1336 if not hasattr(f, 'process'):
-> 1337 raise Exception("Provided object does not have a 'process' method")
1338
1339 if not callable(getattr(f, 'process')):
Exception: Provided object does not have a 'process' method
Am I doing something wrong? How can I simply do some operations every 120 seconds other than those performed directly on the evaluated dataframe df_alarmsFromKafka?
Additional operations are allowed, but only on the output data of the streaming query. Here you are trying to print a plain string that is not related to the output data itself; the print(...) call is evaluated once when the query is defined, so it can only be printed once.
For example, if you write a foreachBatch function like below:
def write_to_cassandra(target_df, batch_id):
    target_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "tweet_db") \
        .option("table", "tweet2") \
        .mode("append") \
        .save()
    target_df.show()
It will print target_df on every batch, since the .show() call works on the output data itself.
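Applied to the original question, a minimal sketch (reusing path1, path2, table_kafka and df_alarmsFromKafka from the question, and assuming the spark-bigquery connector accepts the same options for batch writes) would move both the print and the BigQuery write inside the batch function, and pass that function by name instead of calling it:

def write_to_bigquery(batch_df, batch_id):
    # runs once per micro-batch, i.e. every 120 seconds
    print("do i get printed every batch?")
    batch_df.write \
        .format("bigquery") \
        .option("temporaryGcsBucket", path1) \
        .option("table", table_kafka) \
        .mode("append") \
        .save()

batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .option("checkpointLocation", path2) \
    .foreachBatch(write_to_bigquery) \
    .start()

Note that with foreachBatch the sink write happens inside the function, so .format("bigquery") is no longer set on the writeStream itself.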
For your second question: the foreach sink expects you to provide an object implementing the open, process and close methods (the ForeachWriter pattern), which you did not do there.
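For reference, a minimal sketch of such a writer in PySpark (hypothetical, it only prints each row) could look like this:

class RowPrinter:
    def open(self, partition_id, epoch_id):
        # returning True tells Spark to call process() for this partition
        return True

    def process(self, row):
        # runs on the executors, once per row
        print(row)

    def close(self, error):
        pass

batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .option("checkpointLocation", path2) \
    .foreach(RowPrinter()) \
    .start()

Keep in mind that with .foreach() the prints end up in the executor logs rather than in the notebook output.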
Related
Say that I have an AWS Glue job that looks like this:
import threading

def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('./output_{0}'.format(id))

def main():
    ...
    threads = [threading.Thread(target=thread_worker, args=(df, id))
               for id in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
But instead of having 2 threads I have 60,000, and the output goes to a single partition in S3, like so:
import threading

def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('s3://bucket/partition_name=x/')

def main():
    ...
    threads = [threading.Thread(target=thread_worker, args=(df, id))
               for id in range(60000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
This fails with various FileNotFound Java exceptions, which, as I have come to learn, is caused by the _temporary directory created in S3: each thread needs its own if they all write to the same partition.
So, my questions are:
Can I pass an argument somewhere on df.write to use a custom name that is not _temporary?
We are talking about a lot of data here, so it's either threading or several hours for the data to load. Is there a way to safely implement threads here?
It seems that your FileNotFound Java exception happens because you are passing a partition name (a folder-like structure) in 's3://bucket/partition_name=x/'.
My suggestion would be to use the id argument of the thread_worker function to create a filename convention like filename-{id}.ext, so your function would be modified like this:
def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('s3://bucket/partition_name/filename-{0}.ext'.format(id))
Optionally, you could create a filename based on a UUID if you think there is a chance of the same name being repeated, as in the sketch below.
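A sketch of that UUID variant (keeping in mind that Spark still writes a directory of part files under the given path, so this is effectively a per-thread prefix) could be:

import uuid

def thread_worker(df, id):
    # each thread writes under its own uuid-based prefix,
    # so the _temporary directories no longer collide
    df.write.mode('overwrite') \
        .save('s3://bucket/partition_name/output-{0}'.format(uuid.uuid4()))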
I am creating a Spark session in my Scala code in the following way:
sparkContext.getSparkSession()
  .readStream
  .format("kafka")
  .option("kafka.prop1", "true")
  .option("kafka.prop2", "1")
  .option("kafka.prop3", "ddd")
  .option("kafka.prop4", "rttr")
  .load()
  .writeStream
  .option("checkpointLocation", checkpointLocation)
  .foreachBatch(forEachFunction.arceusForEachFunction(_, _))
  .start()
I want to remove the hard-coded values and make the code generic. I have a config with all the parameters and can create a Map with the key-value pairs for the .option fields.
But I am not sure how to build the .option parameters when sending them to Spark.
What I intend is something like
Map("kafka.prop1" -> "true", "kafka.prop2" -> "SASL_SSL")
and then use this to populate the options. But I am not sure how to pass this to .option.
Also, any other better way is also welcome.
I'm using Spark Streaming to read data from Kafka and insert it into MongoDB. I'm using PySpark 2.4.4. I'm trying to make use of a ForeachWriter, because just using the foreach method means the connection gets established for every row.
from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)
        pass

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        # self.coll = None -> used this to test whether I get an exception when it is set, but I don't
        self.coll.insert_one(row.asDict())
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
        pass
df_w = df7 \
    .writeStream \
    .foreach(ForeachWriter()) \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .start()

df_w = df7 \
    .writeStream \
    .foreach(ForeachWriter()) \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .start()
My problem is that it's not inserting into MongoDB and I can't find a solution for this. If I comment it out I get an error, but the process method is not executing. Does anyone have any ideas?
You set the collection to None in the first line of the process function, therefore you are inserting the row into nowhere.
Also, I don't know if it is only here or in your code as well, but you have the writeStream part two times.
This is probably not documented in the Spark docs, but if you look at the definition of foreach in PySpark, it has the following lines of code:
# Check if the data should be processed
should_process = True
if open_exists:
    should_process = f.open(partition_id, epoch_id)
Therefore, whenever we open a new connection, open must return True. In the actual documentation they have used pass, which results in process() never getting called. (This answer is for future reference for anybody facing the same issue.)
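So the fix is simply to return True from open(); a sketch of the corrected writer from the question (connection details unchanged):

from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        return True  # returning None (via pass) makes Spark skip process()

    def process(self, row):
        self.coll.insert_one(row.asDict())

    def close(self, error):
        self.connection.close()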
I am just learning Spark; I started with RDDs and am now moving on to DataFrames. In my current PySpark project, I am reading an S3 file into an RDD and running some simple transformations on it. Here is the code.
segmentsRDD = sc.textFile(fileLocation). \
    filter(lambda line: line.split(",")[6] in INCLUDE_SITES). \
    filter(lambda line: line.split(",")[2] not in EXCLUDE_MARKETS). \
    filter(lambda line: "null" not in line). \
    map(splitComma). \
    filter(lambda line: line.split(",")[5] == '1')
splitComma is a function that does some date calculations on the row data and returns 10 comma-delimited fields. Once I have that, I run the last filter as shown to only pick up rows where the value in field [5] is '1'. So far everything is fine.
Next, I would like to convert segmentsRDD to a DF with the schema shown below.
interim_segmentsDF = segmentsRDD.map(lambda x: x.split(",")).toDF("itemid","market","itemkey","start_offset","end_offset","time_shifted","day_shifted","tmsmarketid","caption","itemstarttime")
But I get an error about being unable to convert a pyspark.rdd.PipelinedRDD to a DataFrame. Can you please explain the difference between a pyspark.rdd.PipelinedRDD and a "row RDD"? I am attempting to convert to a DF with a schema as shown. What am I missing here?
Thanks
You have to add the following lines to your code:
from pyspark.sql import SparkSession
spark = SparkSession(sc)
The method .toDF() is not an original method of the RDD.
If you take a look at the Spark source code, you will see that .toDF() is a monkey patch.
So, once the SparkSession is initialized, you can call this monkey-patched method; in other words, when you run rdd.toDF() you are directly running the .toDF() method from the DataFrame API.
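Put together for this pipeline, a minimal sketch (assuming sc and segmentsRDD as defined above) would be:

from pyspark.sql import SparkSession

spark = SparkSession(sc)  # registers the .toDF() monkey patch on RDDs

columns = ["itemid", "market", "itemkey", "start_offset", "end_offset",
           "time_shifted", "day_shifted", "tmsmarketid", "caption", "itemstarttime"]

# in PySpark, toDF takes a list of column names (or a full schema),
# not the varargs form used in Scala
interim_segmentsDF = segmentsRDD.map(lambda x: x.split(",")).toDF(columns)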
I use Hortonworks 2.6 with 5 nodes. I spark-submit to YARN (with 16GB RAM and 4 cores).
I have an RDD transformation that runs fine in local mode but not with the yarn master URL.
rdd1 has values like:
id name date
1 john 10/05/2001 (dd/mm/yyyy)
2 steve 11/06/2015
I'd like to change the date format from dd/mm/yyyy to mm/dd/yy, so I wrote a method transformations.transform that I use in an RDD.map function as follows:
rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
The transformations.transform method is as follows:
object transformations {
  def transform(t: String): String = {
    val msg = s">>> transformations.transform($t)"
    println(msg)
    msg
  }
}
The above code works fine in local mode but not in the cluster. The method just returns output as if the map looked as follows:
rdd2 = rdd1.map { rec => (rec.split(",")(0), rec) }
rec does not seem to be passed to the transformations.transform method.
I do use an action to trigger the transformations.transform() method, but with no luck:
val rdd3 = rdd2.count()
println(rdd3)
println prints the count, but the transformations.transform method does not appear to be called. Why?
tl;dr Enable Log Aggregation in Hadoop and use yarn logs -applicationId to see the logs (with println in the logs of the two default Spark executors). Don't forget to bounce the YARN cluster using sbin/stop-yarn.sh followed by sbin/start-yarn.sh (or simply sbin/stop-all.sh and sbin/start-all.sh).
The reason why you don't see the println's output in the logs in YARN is that when a Spark application is spark-submit'ed to a YARN cluster, there are three YARN containers launched, i.e. one container for the ApplicationMaster and two containers for Spark executors.
RDD.map is a transformation that always runs on a Spark executor (as a set of tasks, one per RDD partition). That means the println output goes to the executor logs.
NOTE: In local mode, a single JVM runs both the driver and the single executor (as a thread).
To my surprise, you won't be able to find the output of println in the ResourceManager web UI at http://localhost:8088/cluster for the Spark application either.
What worked for me was to enable log aggregation using yarn.log-aggregation-enable YARN property (that you can read about in the article Enable Log Aggregation):
<!-- etc/hadoop/yarn-site.xml -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>
With that configuration change, you simply spark-submit --master yarn to submit a Spark application followed by yarn logs -applicationId (I used yarn logs -applicationId application_ID > output.txt and reviewed output.txt).
You should find >>> transformations.transform(1,john,10/05/2001) there.
The Code
The code I used was as follows:
import org.apache.spark.SparkContext

object HelloRdd extends App {

  object transformations {
    def transform(t: String): String = {
      val msg = s">>> transformations.transform($t)"
      println(msg)
      msg
    }
  }

  val sc = SparkContext.getOrCreate()
  val rdd1 = sc.textFile(args(0))
  val rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
  rdd2.count()
}
The following is the spark-submit I used for testing.
$ HADOOP_CONF_DIR=/tmp ~/dev/apps/spark/bin/spark-submit \
--master yarn \
target/scala-2.11/spark-project_2.11-0.1.jar `pwd`/hello.txt
You really don't provide enough information, and
Yes, I did; in local mode it's working fine and executes the if branch, but in cluster mode the else branch is executed
is contradictory to
the method inside the map is not accessible while running in cluster
If it's executing the else branch, it doesn't have any reason to call the method in the if branch, so it doesn't matter whether it's accessible.
And if the problem was that the method is inaccessible, you'd see exceptions being thrown, e.g. ClassNotFoundException or AbstractMethodError; Scala wouldn't just decide to ignore the method call instead.
But given your code style, I am going to guess that transformation is a var. Then it's likely that the code which sets it isn't executed on the driver (where the if is executed). In local mode that doesn't matter, but in cluster mode it just sets the copy of transformation on the node where it's executed.
This is the same issue described at https://spark.apache.org/docs/latest/rdd-programming-guide.html#local-vs-cluster-modes:
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode.
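A minimal PySpark rendering of the counter example from that page (hypothetical values) shows the effect:

counter = 0
rdd = sc.parallelize(range(10))

def increment_counter(x):
    global counter
    counter += x  # each executor mutates its own copy of counter

rdd.foreach(increment_counter)

# the driver's counter is not updated by the executors,
# so in cluster mode this still prints 0
print("Counter value:", counter)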
Why is the code inside RDD.map not executed with count?
I want to change the date format from (dd/mm/yyyy) to (mm/dd/yy), so I am using a method called transform inside the transformations object in the map() function
If you are only looking to change the date format, then I would suggest not going through such complexities, as it is very difficult to analyze the cause of the issue. I would suggest using DataFrames instead of RDDs, as there are many built-in functions to meet your needs. For your specific requirement, the to_date and date_format built-in functions should do the trick.
First of all, read the data into a dataframe:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("path to the data file")
Then just apply the to_date and date_format functions as
import org.apache.spark.sql.functions._
df.withColumn("date2", date_format(to_date(col("date"), "dd/MM/yyyy"), "MM/dd/yy")).show(false)
and you should get
+---+-----+----------+--------+
|id |name |date |date2 |
+---+-----+----------+--------+
|1 |john |10/05/2001|05/10/01|
|2 |steve|11/06/2015|06/11/15|
+---+-----+----------+--------+
Simple, isn't it?