I use Hortonworks 2.6 with 5 nodes. I spark-submit to YARN (with 16GB RAM and 4 cores).
I have an RDD transformation that runs fine in local mode but not with the yarn master URL.
rdd1 has values like:
id name date
1 john 10/05/2001 (dd/mm/yyyy)
2 steve 11/06/2015
I'd like to change the date format from dd/mm/yyyy to mm/dd/yy, so I wrote a method transformations.transform that I use in the RDD.map function as follows:
rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
transformations.transform method is as follows:
object transformations {
  def transform(t: String): String = {
    val msg = s">>> transformations.transform($t)"
    println(msg)
    msg
  }
}
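As an aside, the actual dd/mm/yyyy to mm/dd/yy conversion the question describes could be written roughly like the sketch below using java.time; the comma-separated layout and the date being the last field are assumptions taken from the sample records and the split(",") call.

import java.time.LocalDate
import java.time.format.DateTimeFormatter

object transformations {
  private val inFormat  = DateTimeFormatter.ofPattern("dd/MM/yyyy")
  private val outFormat = DateTimeFormatter.ofPattern("MM/dd/yy")

  // rec is assumed to be a whole comma-separated line, e.g. "1,john,10/05/2001",
  // with the date as the last field
  def transform(rec: String): String = {
    val fields = rec.split(",")
    val newDate = LocalDate.parse(fields.last.trim, inFormat).format(outFormat)
    (fields.init :+ newDate).mkString(",")
  }
}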
The above code works fine in local mode but not in the cluster. The method just returns output as if the map looked like this:
rdd2 = rdd1.map { rec => (rec.split(",")(0), rec) }
rec does not seem to be passed to the transformations.transform method.
I do use an action to trigger the transformations.transform method, but no luck:
val rdd3 = rdd2.count()
println(rdd3)
println prints the count, but transformations.transform does not appear to be called. Why?
tl;dr Enable Log Aggregation in Hadoop and use yarn logs -applicationId to see the logs (the println output ends up in the logs of the two default Spark executors). Don't forget to bounce the YARN cluster using sbin/stop-yarn.sh followed by sbin/start-yarn.sh (or simply sbin/stop-all.sh and sbin/start-all.sh).
The reason you don't see println's output in the logs in YARN is that when a Spark application is spark-submit'ed to a YARN cluster, three YARN containers are launched: one container for the ApplicationMaster and two containers for Spark executors.
RDD.map is a transformation that always runs on a Spark executor (as a set of tasks, one per RDD partition). That means the println output goes to the executors' logs.
NOTE: In local mode, a single JVM runs both the driver and the single executor (as a thread).
To my surprise, you won't be able to find the output of println in the ResourceManager web UI at http://localhost:8088/cluster for the Spark application either.
What worked for me was to enable log aggregation using the yarn.log-aggregation-enable YARN property (which you can read about in the article Enable Log Aggregation):
// etc/hadoop/yarn-site.xml
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
<value>3600</value>
</property>
With that configuration change, you simply spark-submit --master yarn to submit a Spark application and then run yarn logs -applicationId (I used yarn logs -applicationId application_ID > output.txt and reviewed output.txt).
You should find >>> transformations.transform(1,john,10/05/2001) there.
The Code
The code I used was as follows:
import org.apache.spark.SparkContext

object HelloRdd extends App {

  object transformations {
    def transform(t: String): String = {
      val msg = s">>> transformations.transform($t)"
      println(msg)
      msg
    }
  }

  val sc = SparkContext.getOrCreate()
  val rdd1 = sc.textFile(args(0))
  val rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
  rdd2.count()
}
The following is the spark-submit I used for testing.
$ HADOOP_CONF_DIR=/tmp ~/dev/apps/spark/bin/spark-submit \
--master yarn \
target/scala-2.11/spark-project_2.11-0.1.jar `pwd`/hello.txt
You really don't provide enough information, and
Yes, I did; in local mode it's working fine and executes the if branch, but in the cluster the else branch is executed
is contradictory to
the method inside the map is not accessible while running in cluster
If it's executing the else branch, it doesn't have any reason to call the method in the if branch, so it doesn't matter whether it's accessible.
And if the problem was that the method is inaccessible, you'd see exceptions being thrown, e.g. ClassNotFoundException or AbstractMethodError; Scala wouldn't just decide to ignore the method call instead.
But given your code style, I am going to guess that transformation is a var. Then it's likely that the code which sets it isn't executed on the driver (where the if is executed). In local mode that doesn't matter, but in cluster mode it just sets the copy of transformation on the node where it runs.
This is the same issue described at https://spark.apache.org/docs/latest/rdd-programming-guide.html#local-vs-cluster-modes:
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode.
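A minimal sketch of that pitfall, essentially the counter example from the linked guide:

var counter = 0
val data = sc.parallelize(1 to 10)

// Each executor gets its own serialized copy of counter, so in cluster mode the
// driver-side counter is never updated; in local mode it may appear to work by accident.
data.foreach(x => counter += x)

println(s"Counter value: $counter")  // stays 0 when run on a cluster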
Why is the code inside RDD.map not executed with count?
I want to change the date format from dd/mm/yyyy to mm/dd/yy, so I am using a method called transform (inside the transformations object) in the map() function
If you are only looking to change the date format, then I would suggest you not go through such complexity, as it is very difficult to analyze the cause of the issue. I would suggest you use DataFrames instead of RDDs, as there are many built-in functions to meet your needs. For your specific requirement, the to_date and date_format built-in functions should do the trick.
First of all, read the data into a dataframe as
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("path to the data file")
Then just apply the to_date and date_format functions as
import org.apache.spark.sql.functions._
df.withColumn("date2", date_format(to_date(col("date"), "dd/MM/yyyy"), "MM/dd/yy")).show(false)
and you should get
+---+-----+----------+--------+
|id |name |date |date2 |
+---+-----+----------+--------+
|1 |john |10/05/2001|05/10/01|
|2 |steve|11/06/2015|06/11/15|
+---+-----+----------+--------+
Simple, isn't it?
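(As a side note, on Spark 2.x and later the CSV reader is built in, so the same read could be done without the Databricks package, roughly as below; the path is a placeholder.)

val df = spark.read
  .option("header", "true")
  .csv("path to the data file")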
Related
I am new both to Scala and to Databricks streaming. I am reading streamed events into a dataframe and I want to use an if-else statement to trigger a different notebook based on whether the dataframe is empty or not. The simple code below (and variations of it)
if (finalDF.isEmpty) {
  print("0")
} else {
  print("1")
}
persistently results in the following error
AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
eventhubs
How can I incorporate writeStream.start() into the above code? Or, how can I evaluate the dataframe content and based on that, take one or another action, given that the dataframe is populated by streaming events into it?
A streaming DataFrame can't be checked for emptiness by design: a stream is infinite, and even if you have no data right now, something new can arrive in the next second. So your code won't work.
You can use foreachBatch to process the "current" snapshot of the data, which you can work with as a "normal", non-streaming dataframe, but then you may not be able to trigger a notebook from inside it, so the code for both conditions should live in the same function rather than in different notebooks.
I tested this code and it works as a way to introduce an if-else and decide actions based on event contents.
df.writeStream.foreachBatch((df: org.apache.spark.sql.DataFrame, batchID: Long) => myfunc(df)).start()

def myfunc(df: org.apache.spark.sql.DataFrame): Unit = {
  // $"col" requires spark.implicits._, which Databricks notebooks import by default
  val test1 = df.filter($"col" === "test1")
  val test2 = df.filter($"col" === "test2")
  if (test1.count() > 0) {
    dbutils.notebook.run("some_notebook", 60)
  }
  if (test2.count() > 0) {
    dbutils.notebook.run("another_notebook", 60)
  }
}
I have an application that processes records in an rdd and puts them into a cache. I put a couple of Spark Accumulators in my application to keep track of processed and failed records. These stats are sent to statsD before the application closes. Here is some simple sample code:
val sc: SparkContext = new SparkContext(conf)
val jdbcDF: DataFrame = sqlContext.read.format("jdbc").options(Map(...)).load().persist(StorageLevel.MEMORY_AND_DISK)
logger.info("Processing table with " + jdbcDF.count + " rows")
val processedRecords = sc.accumulator(0L, "processed records")
val erroredRecords = sc.accumulator(0L, "errored records")
jdbcDF.rdd.foreachPartition(iterator => {
  processedRecords += iterator.length // Problematic line
  val cache = getCacheInstanceFromBroadcast()
  processPartition(iterator, cache, erroredRecords) // updates cache with iterator documents
})
submitStats(processedRecords, erroredRecords)
I built and ran this in my cluster and it appeared to function correctly; the job was marked as a SUCCESS by Spark. I queried the stats using Grafana and both counts were accurate.
However, when I queried my cache, Couchbase, none of the documents were there. I've combed through both driver and executor logs to see if any errors were thrown, but I couldn't find anything. My thinking is that this is some memory issue, but would a couple of long accumulators be enough to cause a problem?
I was able to get this code snippet working by commenting out the line that increments processedRecords - see the line in the snippet noted with Problematic line.
Does anyone know why commenting out that line fixes the issue? Also why is Spark failing silently and not marking the job as FAILURE?
The application isn't "failing" per se. The main problem is that Iterators can only be iterated through once.
Calling iterator.length actually goes through and exhausts the iterator. Thus, when processPartition receives the iterator, it is already exhausted and looks empty (so no records are processed).
Reference the Scala docs to confirm that size is "the number of elements returned by it. Note: it will be at its end after this operation!" -- you can also view the source code to confirm this.
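A quick illustration of this one-pass behaviour in plain Scala (no Spark involved):

val it = Iterator("a", "b", "c")
println(it.length)   // 3 -- but computing the length traverses the iterator...
println(it.hasNext)  // false -- ...so it is now exhausted and looks empty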
Workaround
If you rewrite processPartition to return a long value, that can be fed into the accumulator.
Also, sc.accumulator is deprecated in recent versions of Spark.
The workaround could look something like:
val acc = sc.longAccumulator("total processed records")
...
df.rdd.foreachPartition(iterator => {
  val cache = getCacheInstanceFromBroadcast()
  acc.add(processPartition(iterator, cache, erroredRecords))
})
...
// do something else
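A hypothetical processPartition with that shape might look like the sketch below; CacheClient and cache.upsert are placeholders standing in for whatever getCacheInstanceFromBroadcast actually returns, since that part isn't shown in the question.

def processPartition(iterator: Iterator[org.apache.spark.sql.Row],
                     cache: CacheClient,                               // placeholder type
                     errored: org.apache.spark.util.LongAccumulator): Long = {
  var processed = 0L
  iterator.foreach { row =>
    try {
      cache.upsert(row)        // placeholder call standing in for the real cache write
      processed += 1
    } catch {
      case _: Exception => errored.add(1L)
    }
  }
  processed                    // fed into the processed-records accumulator by the caller
}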
I'm using Spark Streaming to read data from Kafka and insert it into MongoDB. I'm using pyspark 2.4.4. I'm trying to make use of ForeachWriter because just using the foreach method means the connection is established for every row.
from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)
        pass

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        #self.coll=None -> used this to test, if I'm getting an exception if it is there but I'm not getting one
        self.coll.insert_one(row.asDict())
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
        pass
df_w=df7\
.writeStream\
.foreach(ForeachWriter())\
.trigger(processingTime='1 seconds') \
.outputMode("update") \
.option("truncate", "false")\
.start()
df_w=df7\
.writeStream\
.foreach(ForeachWriter())\
.trigger(processingTime='1 seconds') \
.outputMode("update") \
.option("truncate", "false")\
.start()
My problem is that it's not inserting into MongoDB and I can't find a solution for this. If I comment it out, I get an error. But the process method is not executing. Does anyone have any ideas?
You set the collection to None in the first line of the process function, so you are inserting the row into nothing.
Also, I don't know if it's just here or in your code as well, but you have the writeStream part twice.
This is probably not documented in the Spark docs, but if you look at the definition of foreach in pyspark, it has the following lines of code:
# Check if the data should be processed
should_process = True
if open_exists:
    should_process = f.open(partition_id, epoch_id)
Therefore, whenever we open a new connection, open must return True. The actual documentation uses pass, which results in process() never getting called, so make sure your open method ends with return True. (This answer is for future reference for anybody facing the same issue.)
This is a follow up question on
Pyspark filter operation on Dstream
To keep a count of how many error/warning messages have come through in, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e: e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This however has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about it, and the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error/dump anywhere to look into further (spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity; another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword,0) + 1 //add the keyword if not present and increase the counter
    print dictionary //or send this dictionary to else where
The last part, sending or printing the dictionary, requires switching out of the Spark streaming context. Can someone explain the concept, please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e: e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print statement will execute on the workers and not the driver, so if you're running this code on a cluster you will only see it in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
I have developed a Hadoop-based solution that processes a binary file. This uses classic Hadoop MR techniques. The binary file is about 10 GB and divided into 73 HDFS blocks, and the business logic written as the map process operates on each of these 73 blocks. We have developed a customInputFormat and CustomRecordReader in Hadoop that return a key (IntWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now I would like to port this code to Spark. I am a beginner in Spark and could run simple examples (wordcount, the pi example), but I could not find a straightforward example for processing binary files in Spark. I see two solutions for this use case. In the first, avoid using the custom input format and record reader: find a method (approach) in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic. If this is not possible, I would like to re-use the custom input format and custom reader via methods such as HadoopAPI, HadoopRDD, etc. My problem: I do not know whether the first approach is possible or not. If it is possible, can anyone please provide some pointers that contain examples? I was trying the second approach but have been highly unsuccessful. Here is the code snippet I used:
package org {

  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {

    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me where I am going wrong here? I think I am not using the API the right way, but I failed to find documentation/usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
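On the question's first approach: Spark does have sc.binaryFiles, which gives an RDD of (path, PortableDataStream) with one record per file, so a sketch like the one below may be enough if the business logic can consume a stream. Note that it does not split a file into its HDFS blocks the way the custom InputFormat does, so it is not a drop-in replacement for the per-block processing.

val binRdd = sc.binaryFiles("hdfs:///user/hadoop/myBin.dat")
val count = binRdd.map { case (path, stream) =>
  val in = stream.open()       // DataInputStream over the file contents
  try {
    // feed `in` to the existing business logic here
    1
  } finally {
    in.close()
  }
}.reduce(_ + _)
println("The count is " + count)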
I have made some progress on this issue. I am now using the function below, which does the job:
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration()
)

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) => myfuncPart(split, iter) }.collect()
However, I landed up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)