Say that I have an AWS Glue job that looks like this:
import threading

def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('./output_{0}'.format(id))

def main():
    ...
    threads = [threading.Thread(target=thread_worker, args=(df, id))
               for id in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
But instead of 2 threads I have 60,000, and the output goes to a single partition in S3, like so:
import threading

def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('s3://bucket/partition_name=x/')

def main():
    ...
    threads = [threading.Thread(target=thread_worker, args=(df, id))
               for id in range(60000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
This fails with various FileNotFound Java exceptions. From what I have come to learn, the cause is the _temporary directory that gets created in S3: each thread needs its own if they all write to the same partition.
So, my questions are:
Can I pass an argument somewhere on df.write to use a custom name that is not _temporary?
We are talking about a lot of data here, so it's either threading or several hours for the data to load. Is there a way to safely implement threads here?
It seems that your FileNotFound Java exception happens because you are passing a partition name (a folder-like structure) in 's3://bucket/partition_name=x/'.
My suggestion would be to use the id argument in the thread_worker function to create a filename convention like filename-{id}.ext, so your function would be modified like this:
def thread_worker(df, id):
    df.write.mode('overwrite') \
        .save('s3://bucket/partition_name/filename-{0}.ext'.format(id))
Optionally you could create a filename based on uuid, if you think there is a chance of the same name getting repeated.
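As a rough sketch of the uuid idea, the helper below builds an output path that no other thread will reuse. The helper name unique_output_path and the "parquet" extension are assumptions made for illustration, not part of the original job. Note that Spark's save() treats the path as a directory, so each unique path ends up as its own output directory.

```python
import uuid

def unique_output_path(base, partition, ext="parquet"):
    """Build an output path whose last component no other writer will reuse."""
    # uuid4() is random, so two calls colliding is practically impossible
    return "{0}/{1}/filename-{2}.{3}".format(base, partition, uuid.uuid4().hex, ext)

# 1000 paths under the same partition, all distinct
paths = [unique_output_path("s3://bucket", "partition_name=x") for _ in range(1000)]
```

Inside thread_worker this would be used as df.write.mode('overwrite').save(unique_output_path(...)), so that no two threads share an output location.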
I have an application that processes records in an rdd and puts them into a cache. I put a couple of Spark Accumulators in my application to keep track of processed and failed records. These stats are sent to statsD before the application closes. Here is some simple sample code:
val sc: SparkContext = new SparkContext(conf)
val jdbcDF: DataFrame = sqlContext.read.format("jdbc").options(Map(...)).load().persist(StorageLevel.MEMORY_AND_DISK)
logger.info("Processing table with " + jdbcDF.count + " rows")
val processedRecords = sc.accumulator(0L, "processed records")
val erroredRecords = sc.accumulator(0L, "errored records")
jdbcDF.rdd.foreachPartition(iterator => {
  processedRecords += iterator.length // Problematic line
  val cache = getCacheInstanceFromBroadcast()
  processPartition(iterator, cache, erroredRecords) // updates cache with iterator documents
})
submitStats(processedRecords, erroredRecords)
I built and ran this in my cluster and it appeared to be functioning correctly; the job was marked as a SUCCESS by Spark. I queried the stats using Grafana and both counts were accurate.
However, when I queried my cache, Couchbase, none of the documents were there. I've combed through both driver and executor logs to see if any errors were being thrown, but I couldn't find anything. My thinking is that this is some memory issue, but are a couple of long accumulators really enough to cause a problem?
I was able to get this code snippet working by commenting out the line that increments processedRecords - see the line in the snippet noted with Problematic line.
Does anyone know why commenting out that line fixes the issue? Also why is Spark failing silently and not marking the job as FAILURE?
The application isn't "failing" per se. The main problem is that an Iterator can only be traversed once.
Calling iterator.length walks through and exhausts the iterator. Thus, when processPartition receives iterator, it is already exhausted and looks empty (so no records are processed).
The Scala docs confirm that size (length behaves the same way) is "the number of elements returned by it. Note: it will be at its end after this operation!" You can also view the source code to confirm this.
Workaround
If you rewrite processPartition to return a long value, that can be fed into the accumulator.
Also, sc.accumulator is deprecated in recent versions of Spark.
The workaround could look something like:
val acc = sc.longAccumulator("total processed records")
...
df.rdd.foreachPartition(iterator => {
  val cache = getCacheInstanceFromBroadcast()
  acc.add(processPartition(iterator, cache, erroredRecords))
})
...
// do something else
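The one-pass behaviour and the workaround are not specific to Scala. The plain-Python sketch below shows the same effect, assuming a stand-in process_partition that returns how many records it handled. It is an analogy only, not the Spark code itself.

```python
def process_partition(records):
    """Process every record once and return how many were handled."""
    processed = 0
    for record in records:
        # real per-record work (e.g. a cache write) would go here
        processed += 1
    return processed

# Counting first exhausts the iterator, so nothing is left to process.
it = iter(["a", "b", "c"])
counted_first = sum(1 for _ in it)   # consumes all three elements
left_over = process_partition(it)    # the iterator is already empty

# Counting while processing needs only a single pass.
single_pass = process_partition(iter(["a", "b", "c"]))
```

Here counted_first is 3 but left_over is 0, which mirrors the accurate stats and the empty cache in the question. single_pass is 3 with every record actually processed.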
I use Hortonworks 2.6 with 5 nodes. I spark-submit to YARN (with 16GB RAM and 4 cores).
I have an RDD transformation that runs fine in local mode but not with the YARN master URL.
rdd1 has values like:
id  name   date
1   john   10/05/2001 (dd/mm/yyyy)
2   steve  11/06/2015
I'd like to change the date format from dd/mm/yyyy to mm/dd/yy, so I wrote a method transformations.transform that I use in RDD.map function as follows:
rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
The transformations.transform method is as follows:
object transformations {
  def transform(t: String): String = {
    val msg = s">>> transformations.transform($t)"
    println(msg)
    msg
  }
}
The above code works fine in local mode but not on the cluster. The method just returns an output as if the map looked as follows:
rdd2 = rdd1.map { rec => (rec.split(",")(0), rec) }
rec does not seem to be passed to transformations.transform method.
I do use an action to trigger transformations.transform() method but no luck.
val rdd3 = rdd2.count()
println(rdd3)
println prints the count but does not call transformations.transform method. Why?
tl;dr Enable Log Aggregation in Hadoop and use yarn logs -applicationId to see the logs (with println in the logs of the two default Spark executors). Don't forget to bounce the YARN cluster using sbin/stop-yarn.sh followed by sbin/start-yarn.sh (or simply sbin/stop-all.sh and sbin/start-all.sh).
The reason why you don't see the println's output in the logs in YARN is that when a Spark application is spark-submit'ed to a YARN cluster, there are three YARN containers launched, i.e. one container for the ApplicationMaster and two containers for Spark executors.
RDD.map is a transformation that always runs on a Spark executor (as a set of tasks, one per RDD partition). That means that the println output goes to the executors' logs.
NOTE: In local mode, a single JVM runs both the driver and the single executor (as a thread).
To my surprise, you won't be able to find the output of println in the ResourceManager web UI at http://localhost:8088/cluster for the Spark application either.
What worked for me was to enable log aggregation using yarn.log-aggregation-enable YARN property (that you can read about in the article Enable Log Aggregation):
<!-- etc/hadoop/yarn-site.xml -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>
With that configuration change, you simply spark-submit --master yarn to submit a Spark application followed by yarn logs -applicationId (I used yarn logs -applicationId application_ID > output.txt and reviewed output.txt).
You should find >>> transformations.transform(1,john,10/05/2001) there.
The Code
The code I used was as follows:
import org.apache.spark.SparkContext

object HelloRdd extends App {
  object transformations {
    def transform(t: String): String = {
      val msg = s">>> transformations.transform($t)"
      println(msg)
      msg
    }
  }
  val sc = SparkContext.getOrCreate()
  val rdd1 = sc.textFile(args(0))
  val rdd2 = rdd1.map { rec => (rec.split(",")(0), transformations.transform(rec)) }
  rdd2.count()
}
The following is the spark-submit I used for testing.
$ HADOOP_CONF_DIR=/tmp ~/dev/apps/spark/bin/spark-submit \
    --master yarn \
    target/scala-2.11/spark-project_2.11-0.1.jar `pwd`/hello.txt
You really don't provide enough information, and
"Yes, I did in local its working fine its executing the if loop but in cluster else is executed"
is contradictory to
"the method inside the map is not accessible while running in cluster"
If it's executing the else branch, it doesn't have any reason to call the method in the if branch, so it doesn't matter whether it's accessible.
And if the problem was that the method is inaccessible, you'd see exceptions being thrown, e.g. ClassNotFoundException or AbstractMethodError; Scala wouldn't just decide to ignore the method call instead.
But given your code style I am going to guess that transformation is a var. Then it's likely that the code which sets it isn't executed on the driver (where the if is executed). In local mode it doesn't matter, but in cluster mode it just sets the copy of transformation on the node it's executed on.
This is the same issue described at https://spark.apache.org/docs/latest/rdd-programming-guide.html#local-vs-cluster-modes:
"In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode."
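To make the copy semantics concrete, here is a small plain-Python simulation. It does not use Spark; the pickle round trip is only an assumed stand-in for shipping a closure's captured state to an executor, and the function names are invented for illustration.

```python
import pickle

counter = {"value": 0}

def run_on_executor(state_bytes):
    # Simulates a remote executor: it only ever sees a deserialized copy
    state = pickle.loads(state_bytes)
    state["value"] += 1
    return state["value"]

def run_locally(state):
    # Simulates local mode: same process, so the very same object is mutated
    state["value"] += 1

# "Cluster mode": the state is serialized and shipped, the driver's copy is untouched
remote_result = run_on_executor(pickle.dumps(counter))
driver_value_after_cluster = counter["value"]

# "Local mode": the mutation happens to be visible to the driver
run_locally(counter)
driver_value_after_local = counter["value"]
```

The remote copy reaches 1 while the driver still sees 0. Only the in-process call changes the driver's value, which matches the "works in local mode by accident" behaviour the guide warns about.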
Why is the code inside RDD.map not executed with count?
"I want to change the date format from (dd/mm/yyyy) to (mm/dd/yy), so using a method called transform inside transformations(object) in map() function"
If you only want to change the date format, I would suggest not going through such complexity, as it makes the cause of the issue very difficult to analyze. I would suggest using DataFrames instead of RDDs, as there are many built-in functions to meet your needs. For your specific requirement, the built-in to_date and date_format functions should do the trick.
First of all, read the data into a DataFrame as follows:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("path to the data file")
Then just apply the to_date and date_format functions as
import org.apache.spark.sql.functions._
df.withColumn("date2", date_format(to_date(col("date"), "dd/MM/yyyy"), "MM/dd/yy")).show(false)
and you should get
+---+-----+----------+--------+
|id |name |date      |date2   |
+---+-----+----------+--------+
|1  |john |10/05/2001|05/10/01|
|2  |steve|11/06/2015|06/11/15|
+---+-----+----------+--------+
Simple isn't it?
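If a per-record function is still wanted (as in the original map-based attempt), the conversion itself is small. The sketch below uses only the Python standard library; reformat_date is a hypothetical helper name, and the two sample rows come from the question.

```python
from datetime import datetime

def reformat_date(value):
    """Convert a dd/mm/yyyy string to mm/dd/yy."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%m/%d/%y")

rows = ["1,john,10/05/2001", "2,steve,11/06/2015"]
converted = []
for rec in rows:
    rec_id, name, date = rec.split(",")
    converted.append((rec_id, name, reformat_date(date)))
```

The converted dates agree with the date2 column above: 05/10/01 and 06/11/15.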
This is a follow-up to the question "Pyspark filter operation on Dstream".
How does one design a job that keeps a count of how many error or warning messages have come through over, say, a day or an hour?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e : e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; from what I have read, the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error or dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity. Another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword,0) + 1  # add the keyword if not present and increase the counter
    print dictionary  # or send this dictionary elsewhere
The last part of sending or printing dictionary requires switching out of spark streaming context - Can someone explain the concept please?
"print doesn't work"
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e : e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print statement will execute on the workers and not the driver, so if you're running this code on a cluster you will only see it in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
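As a rough illustration of what updateStateByKey asks for, the sketch below defines an update function in the shape used by the Spark Streaming guide (new values for a key, plus the previous state or None) and drives it with a plain-Python stand-in for two micro-batches. apply_batch and the sample keys are assumptions made for this example, not Spark APIs.

```python
def update_count(new_values, running_count):
    """Update function in the shape updateStateByKey expects:
    new_values is the list of values seen for a key in this batch,
    running_count is the previous state (None the first time)."""
    return sum(new_values, running_count or 0)

def apply_batch(state, batch):
    """Plain-Python stand-in for one micro-batch of (key, 1) pairs."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    # Keys absent from this batch simply keep their previous state here
    for key, values in grouped.items():
        state[key] = update_count(values, state.get(key))
    return state

state = {}
apply_batch(state, [("error", 1), ("warning", 1), ("error", 1)])
apply_batch(state, [("error", 1)])
```

In a real job the same function would be passed to updateStateByKey on a DStream of (severity, 1) pairs, and the streaming context needs a checkpoint directory configured for stateful operations.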
I have developed a Hadoop-based solution that processes a binary file. It uses the classic Hadoop MR technique. The binary file is about 10GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a CustomInputFormat and CustomRecordReader in Hadoop that return a key (IntWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now I would like to port this code to Spark. I am a beginner in Spark and could run simple examples (wordcount, the pi example). However, I could not find a straightforward example of processing binary files in Spark. I see two solutions for this use case.
In the first, avoid using the custom input format and record reader: find a method in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic.
If this is not possible, I would like to reuse the custom input format and custom reader using methods such as the Hadoop API, HadoopRDD etc.
My problem: I do not know whether the first approach is possible or not. If it is, can anyone please provide some pointers to examples? I was trying the second approach but was highly unsuccessful. Here is the code snippet I used:
package org {
  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.spark.{SparkConf, SparkContext}
  // RandomAccessInputFormat is the custom input format described above

  object Driver {
    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me what I am doing wrong here? I think I am not using the API the right way, but I failed to find any documentation or usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
I have made some progress on this issue. I am now using the code below, which does the job:
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, I ran into another error, the details of which I have posted in "Issue in accessing HDFS file inside spark map function":
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
I have a simple mongo application that happens to be async (using Akka).
I send a message to an actor, which in turn writes 3 records to a database.
I'm using WriteConcern.SAFE because I want to be sure the write happened (also tried WriteConcern.FSYNC_SAFE).
I pause for a second to let the writes happen then do a read--and get nothing.
So my write code might be:
collection.save( myObj, WriteConcern.SAFE )
println("--1--")
collection.save( myObj, WriteConcern.SAFE )
println("--2--")
collection.save( myObj, WriteConcern.SAFE )
println("--3--")
then in my test code (running outside the actor--in another thread) I print out the # of records I find:
println( collection.findAll(...) )
My output looks like this:
--1--
--2--
--3--
(pauses)
0
Indeed if I look in the database I see no records. Sometimes I actually do see data there and the test works. Async code can be tricky and it's possible the test code is being hit before the writes happen, so I also tried printing out timestamps to ensure these are being executed in the order presented--they are. The data should be there. Sample output below w/timestamps:
Saved: brand_1 / dev 1375486024040
Saved: brand_1 / dev2 1375486024156
Saved: brand_1 / dev3 1375486024261
1375486026593 0 found
So the 3 saves clearly happened (and should have written) a full 2 seconds before the read was attempted.
I understand for more liberal WriteConcerns you could get this behavior, but I thought the two safest ones would assure me the write actually happened before proceeding.
Subtle but simple problem. I was using a def to create my connection... which I then proceeded to call twice as if it was a val. So I actually had 2 different writers so that explained the sometimes-difference in my results. Refactored to a val and all was predictable. Agonizing to identify, easy to understand/fix.
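The def-versus-val difference can be shown in a few lines of Python, as an analogy only. Connection is a hypothetical in-memory stand-in and does not model MongoDB; the point is just that a function body runs again on every call, while a cached value is created once.

```python
import functools

class Connection(object):
    """Hypothetical stand-in for a database connection."""
    def __init__(self):
        self.saved = []

def new_connection():
    # Like a Scala `def`: the body runs again on every call
    return Connection()

@functools.lru_cache(maxsize=None)
def shared_connection():
    # Like a Scala `val`: evaluated once, then the same object is reused
    return Connection()

new_connection().saved.append("brand_1 / dev")
found_with_def = len(new_connection().saved)     # a different object each time

shared_connection().saved.append("brand_1 / dev")
found_with_val = len(shared_connection().saved)  # the same object both times
```

With the def-style accessor the second call sees nothing that the first one saved (found_with_def is 0), whereas the cached accessor sees the record (found_with_val is 1).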