I'm running a simple Spark project on an EMR YARN cluster to:
read a textfile on S3 into an RDD[String]
define a schema and convert that RDD into a DF
I am doing a mapPartitions on the RDD to convert that RDD[String] into an RDD[Row].
My problem - I get a java.lang.NullPointerException and I can't figure out what the problem is.
The stacktrace lists these 2 line numbers in the source code -
the line of rdd1.mapPartitions
within the anonymous function, the line with the match case that matches the regular expression
Here's the stacktrace excerpt -
Caused by: java.lang.NullPointerException
at packageA.Herewego$$anonfun$3.apply(Herewego.scala:107)
at packageA.Herewego$$anonfun$3.apply(Herewego.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I've tried -
The error occurs when running in YARN cluster mode - and not in Local mode (in my IDE). This made me think that something isn't defined on the Executor? I moved the createrow function def into the anonymous function def - it didn't work though.
Here's the code block
val rdd4: RDD[Row] = rdd1.mapPartitions((it:Iterator[String]) => {
def createrow(a: List[String]): Row = {
val format = new java.text.SimpleDateFormat("dd/MMM/yyyy HH:mm:ss Z")
val re1: Row = Row.apply(a.head)
val d: Date = format.parse(a.tail.mkString(" "))
val t = new Timestamp(d.getTime)
val re2: Row = Row.apply(t)
Row.merge(re1, re2)
}
var output: List[Row] = List()
while (it.hasNext) {
val data: String = it.next()
val res = data match {
case rx(ipadd, date, time) => createrow(List(ipadd, date, time))
case _ => createrow(List("0.0.0.0", "00/Jan/0000", "00:00:00 0"))
}
output = output :+ res
}
output.toIterator
}).persist(MEMORY_ONLY)
// Collect and Persist the RDD in Memory
val tmp = rdd4.collect()
Do I need to broadcast any variables or functions used within the mapPartitions call?
Any pointers in the right direction will be more than appreciated.
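One thing worth ruling out, as a guess only since the definition of rx is not shown: a value defined on the enclosing object can be null on the executors, for example when that object extends scala.App, which the Spark documentation warns may not work correctly. A minimal sketch that sidesteps this builds the regex and the date format inside the partition function. The log layout, the pattern, and the names LogLineParser and parsePartition are hypothetical, and the function returns tuples rather than Row so that it runs without Spark:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

object LogLineParser {
  // Assumed line layout: <ip> [<dd/MMM/yyyy> <HH:mm:ss Z>]. The real pattern is not shown in the question.
  def parsePartition(it: Iterator[String]): Iterator[(String, Option[Timestamp])] = {
    // Created once per partition, inside the function, so nothing is captured from the driver.
    val rx = """(\S+) \[(\S+) (\S+ \S+)\]""".r
    val format = new SimpleDateFormat("dd/MMM/yyyy HH:mm:ss Z", Locale.ENGLISH)
    // Map lazily over the iterator instead of appending to a List row by row.
    it.map {
      case rx(ip, date, time) =>
        (ip, Try(new Timestamp(format.parse(s"$date $time").getTime)).toOption)
      case _ =>
        ("0.0.0.0", None) // unparseable line: placeholder address, no timestamp
    }
  }
}
```

In the job this could be passed as rdd1.mapPartitions(LogLineParser.parsePartition), with the conversion to Row done afterwards.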
Related
When I try to execute 'spark.sql' inside a UDF, I get a java.lang.NullPointerException. Is there any way I can execute it?
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
// Define the UDF
def myUdf(spark: SparkSession): UserDefinedFunction = udf((col1: String, col2: String) => {
// Execute the SQL query
val result = spark.sql(s"SELECT 'Hello World!' as text")
// Return the result as a string
result.toString()
})
// Use the UDF in a DataFrame transformation
def transform(df: DataFrame, col1: Column, col2: Column): DataFrame = {
df.withColumn("result", myUdf(spark)(col1, col2))
}
val res = transform(df, col("salary"), col("gender"))
res.show()
Above code is throwing below exception
22/12/07 11:10:32 ERROR Executor: Exception in task 0.0 in stage 1679.0 (TID 19329)
org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$82b5b23cea489b2712a1db46c77e458$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$4802/591483562: (string, string) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:152)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
I am afraid this won't work.
From a technical standpoint, UDFs generally run on executors, while the Spark session can only be accessed, and used to schedule further work, on the driver. The null pointer exception you see is most likely caused by an attempt to use parts of the Spark session that are not available on the executor.
From a semantic standpoint, if this were permitted, then processing of each row would create a new query potentially processing lots of rows. Imagine a dataframe with 10M records, and creating 10M queries. That would not be feasible to implement.
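If the per-row query was meant to look something up, the usual alternative is a single join planned on the driver. Here is a minimal sketch, assuming a hypothetical lookup table labels keyed by gender; the object name, the data, and the label column are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinInsteadOfPerRowQuery {
  // One join for the whole DataFrame, rather than one query per row inside a UDF.
  def enrich(people: DataFrame, labels: DataFrame): DataFrame =
    people.join(labels, Seq("gender"), "left")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("join-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical sample data standing in for the real tables.
    val people = Seq((1000, "F"), (2000, "M")).toDF("salary", "gender")
    val labels = Seq(("F", "female"), ("M", "male")).toDF("gender", "label")
    enrich(people, labels).show()
    spark.stop()
  }
}
```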
I have a dataframe df like the following
+--------+--------------------+--------+------+
| id| path|somestff| hash1|
+--------+--------------------+--------+------+
| 1|/file/dirA/fileA.txt| 58| 65161|
| 2|/file/dirB/fileB.txt| 52| 65913|
| 3|/file/dirC/fileC.txt| 99|131073|
| 4|/file/dirF/fileD.txt| 46|196233|
+--------+--------------------+--------+------+
One note: the directories under /file differ. Not all files are stored in the same directory. In fact, there are hundreds of files in various directories.
What I want to accomplish here is to read the file in the column path and count the records within the files and write the result of the row count into a new column of a dataframe.
I tried the following function and udf:
def executeRowCount(fileCount: String): Long = {
val rowCount = spark.read.format("csv").option("header", "false").load(fileCount).count
rowCount
}
val execUdf = udf(executeRowCount _)
df.withColumn("row_count", execUdf (col("path"))).show()
This results in the following error
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
... 19 more
I tried to iterate through the column when collected like
val te = df.select("path").as[String].collect()
te.foreach(executeRowCount)
and here it works just fine, but I want to store the result within the df...
I've tried several solutions, but I'm facing a dead end here.
That does not work as the data frames can only be created in the driver JVM but the UDF code is run in executor JVMs. What you can do is to load the CSVs into a separate data frame and enrich the data with a file name column:
val csvs = spark
.read
.format("csv")
.load("/file/dir/")
.withColumn("filename", input_file_name())
and then join the original df on the filename column.
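A minimal sketch of that join, assuming local temporary files as stand-ins for the real paths; the object and helper names are hypothetical. Since input_file_name() returns a full URI (for example with a file: or hdfs: prefix), the sketch matches on the path suffix rather than on equality, which can be ambiguous if one path happens to be a suffix of another:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, input_file_name}

object RowCountPerFile {
  // Count the records of every CSV under `dir`, keyed by the file each record came from.
  def countsPerFile(spark: SparkSession, dir: String): DataFrame =
    spark.read.format("csv").option("header", "false").load(dir)
      .withColumn("filename", input_file_name())
      .groupBy("filename")
      .agg(count("filename").as("row_count"))

  // Attach the counts to the original rows by matching the URI suffix against `path`.
  def withRowCounts(df: DataFrame, counts: DataFrame): DataFrame =
    df.join(counts, counts("filename").endsWith(df("path")), "left").drop("filename")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("row-count-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical sample files standing in for the files listed in the path column.
    val dir: Path = Files.createTempDirectory("rowcount")
    val fileA = Files.write(dir.resolve("fileA.csv"), "a,1\nb,2\nc,3\n".getBytes(StandardCharsets.UTF_8))
    val fileB = Files.write(dir.resolve("fileB.csv"), "x,9\n".getBytes(StandardCharsets.UTF_8))
    val df = Seq((1, fileA.toString), (2, fileB.toString)).toDF("id", "path")
    withRowCounts(df, countsPerFile(spark, dir.toString)).show(false)
    spark.stop()
  }
}
```

Because the files in the question live in several directories, load would need a glob or several paths instead of a single directory.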
I fixed this issue in the following way:
val queue = df.select("path").as[String].collect()
val countResult = for (item <- queue) yield {
val rowCount = (item, spark.read.format("csv").option("header", "false").load(item).count)
rowCount
}
val df2 = spark.createDataFrame(countResult)
Afterwards I joined the df with df2...
As @ollik1 mentioned, the problem lies in the driver/worker architecture of UDFs: the UDF runs on the executors, where the SparkSession that spark.read needs is not usable.
What about this?
def executeRowCount = udf((fileCount: String) => {
spark.read.format("csv").option("header", "false").load(fileCount).count
})
df.withColumn("row_count", executeRowCount(col("path"))).show()
Maybe something like this?
sqlContext
.read
.format("csv")
.load("/tmp/input/")
.withColumn("filename", input_file_name())
.groupBy("filename")
.agg(count("filename").as("record_count"))
.show()
I have two columns in a Spark SQL DataFrame with each entry in either column as an array of strings.
val ngramDataFrame = Seq(
(Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")
I want to merge the arrays in each row to make a single array in a new column. My code is as follows:
def concat_array(firstarray: Array[String],
secondarray: Array[String]) : Array[String] =
{ (firstarray ++ secondarray).toArray }
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))
I can successfully use the concat_array function on two arrays. However when I run the above code, I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array, array) => array)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:76)
... 13 more
Driver stacktrace:
In Spark 2.4 or later you can use concat (if you want to keep duplicates):
ngramDataFrame.withColumn(
"full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
| filtered_words| ngrams_array| full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+
or array_union (if you want to drop duplicates):
ngramDataFrame.withColumn(
"full_array",
array_union($"filtered_words", $"ngrams_array")
)
These can be also composed from the other higher order functions, for example
ngramDataFrame.withColumn(
"full_array",
flatten(array($"filtered_words", $"ngrams_array"))
)
with duplicates, and
ngramDataFrame.withColumn(
"full_array",
array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)
without.
On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead you should expect the guaranteed interface, which is Seq. So the udf should use a function with the following signature:
(Seq[String], Seq[String]) => Seq[String]
Please refer to SQL Programming Guide for details.
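For illustration, a minimal sketch of the original UDF rewritten against that Seq signature, reusing the question's sample data; the object name SeqUdfSketch and the local master setting are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SeqUdfSketch {
  // Accept Seq, the interface Spark guarantees for ArrayType columns, instead of Array.
  val concatSeqs: (Seq[String], Seq[String]) => Seq[String] = (first, second) => first ++ second

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("seq-udf-sketch").getOrCreate()
    import spark.implicits._
    val ngramDataFrame = Seq(
      (Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
    ).toDF("filtered_words", "ngrams_array")
    val concatUDF = udf(concatSeqs)
    ngramDataFrame
      .withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))
      .show(false)
    spark.stop()
  }
}
```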
Arjun, there is an error in the UDF you created. When you pass array-type columns, the data type is not Array[String] but WrappedArray[String]. Below is the modified UDF along with its output.
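import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import scala.collection.mutable
val SparkCtxt = new SparkContext(sparkConf)
val sqlContext = new SQLContext(SparkCtxt)
import sqlContext.implicits._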
val temp=SparkCtxt.parallelize(Seq(Row(Array("String1","String2"),Array("String3","String4"))))
val df= sqlContext.createDataFrame(temp,
StructType(List(
StructField("Col1",ArrayType(StringType),true),
StructField("Col2",ArrayType(StringType),true)
)
) )
def concat_array(firstarray: mutable.WrappedArray[String],
secondarray: mutable.WrappedArray[String]) : mutable.WrappedArray[String] =
{
(firstarray ++ secondarray)
}
val concatUDF = udf(concat_array _)
val df2=df.withColumn("udftest",concatUDF(df.col("Col1"), df.col("Col2")))
df2.select("udftest").foreach(each=>{println("***********")
println(each(0))})
df2.show(true)
OUTPUT:
+------------------+------------------+--------------------+
| Col1| Col2| udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+
WrappedArray(String1, String2, String3, String4)
I am unable to parallelize a list in scala, getting java.lang.NullPointerException
messages.foreachRDD( rdd => {
for(avroLine <- rdd){
val record = Injection.injection.invert(avroLine.getBytes).get
val field1Value = record.get("username")
val jsonStrings=Seq(record.toString())
val newRow = sqlContext.sparkContext.parallelize(Seq(record.toString()))
}
})
output
jsonStrings...List({"username": "user_118", "tweet": "tweet_218", "timestamp": 18})
Exception
Caused by: java.lang.NullPointerException
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:83)
at com.capitalone.AvroConsumer$$anonfun$main$1$$anonfun$apply$1.apply(AvroConsumer.scala:74)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
Thanks in Advance!!
You're trying to create an RDD in the spark worker context. While foreachRDD operates in the driver, the foreach operation you perform on each RDD is distributed to the workers. It seems unlikely that you actually want to create a new RDD for each line of the input stream.
Update after comments:
It's hard to have this discussion in a comment thread where there is no formatting for code. My basic question is why aren't you doing something like this:
val messages: ReceiverInputDStream[String] = RabbitMQUtils.createStream(ssc, rabbitParams)
def toJsonString(message: String): String = SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get
val jsonStrings: DStream[String] = messages map toJsonString
I haven't bothered to figure out and track down all the libraries you're using (please, next time, submit a MCVE), so I haven't tried to compile that. But it looks like all you want is to map each input message to a JSON string. Maybe you want to do something fancy with the resulting DStream of Strings but that might be a different question.
def toJsonString(message: String): String =
SparkUtils.getRecordInjection(QUEUE_NAME).invert(message.getBytes()).get.toString
dStreams.foreachRDD( rdd => {
val jsonStrings = rdd.map(stream => toJsonString(stream))
val df = sqlContext.read.json(jsonStrings)
df.write.mode("Append").csv("/Users/Documents/kafka-poc/consumer-out/def/")
})
Hi, I am trying to generate the output of the Salt examples, but without using Docker, which is what its documentation describes. I found the Scala code that generates the output, which is Main.scala. I modified Main.scala into a more convenient version:
package BinExTest
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import software.uncharted.salt.core.projection.numeric._
import software.uncharted.salt.core.generation.request._
import software.uncharted.salt.core.generation.Series
import software.uncharted.salt.core.generation.TileGenerator
import software.uncharted.salt.core.generation.output.SeriesData
import software.uncharted.salt.core.analytic.numeric._
import java.io._
import scala.util.parsing.json.JSONObject
object Main {
// Defines the tile size in both x and y bin dimensions
val tileSize = 256
// Defines the output layer name
val layerName = "pickups"
// Creates and returns an Array of Double values encoded as 64bit Integers
def createByteBuffer(tile: SeriesData[(Int, Int, Int), (Int, Int), Double, (Double, Double)]): Array[Byte] = {
val byteArray = new Array[Byte](tileSize * tileSize * 8)
var j = 0
tile.bins.foreach(b => {
val data = java.lang.Double.doubleToLongBits(b)
for (i <- 0 to 7) {
byteArray(j) = ((data >> (i * 8)) & 0xff).asInstanceOf[Byte]
j += 1
}
})
byteArray
}
def main(args: Array[String]): Unit = {
val jarFile = "/home/kesava/Studies/BinExTest/BinExTest.jar";
val inputPath = "/home/kesava/Downloads/taxi_micro.csv"
val outputPath = "/home/kesava/SoftWares/salt/salt-examples/bin-example/Output"
val conf = new SparkConf().setAppName("salt-bin-example").setJars(Array(jarFile))
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(s"file://$inputPath")
.registerTempTable("taxi_micro")
// Construct an RDD of Rows containing only the fields we need. Cache the result
val input = sqlContext.sql("select pickup_lon, pickup_lat from taxi_micro")
.rdd.cache()
// Given an input row, return pickup longitude, latitude as a tuple
val pickupExtractor = (r: Row) => {
if (r.isNullAt(0) || r.isNullAt(1)) {
None
} else {
Some((r.getDouble(0), r.getDouble(1)))
}
}
// Tile Generator object, which houses the generation logic
val gen = TileGenerator(sc)
// Break levels into batches. Process several higher levels at once because the
// number of tile outputs is quite low. Lower levels done individually due to high tile counts.
val levelBatches = List(List(0, 1, 2, 3, 4, 5, 6, 7, 8), List(9, 10, 11), List(12), List(13), List(14))
// Iterate over sets of levels to generate.
val levelMeta = levelBatches.map(level => {
println("------------------------------")
println(s"Generating level $level")
println("------------------------------")
// Construct the definition of the tiling jobs: pickups
val pickups = new Series((tileSize - 1, tileSize - 1),
pickupExtractor,
new MercatorProjection(level),
(r: Row) => Some(1),
CountAggregator,
Some(MinMaxAggregator))
// Create a request for all tiles on these levels, generate
val request = new TileLevelRequest(level, (coord: (Int, Int, Int)) => coord._1)
val rdd = gen.generate(input, pickups, request)
// Translate RDD of Tiles to RDD of (coordinate,byte array), collect to master for serialization
val output = rdd
.map(s => pickups(s).get)
.map(tile => {
// Return tuples of tile coordinate, byte array
(tile.coords, createByteBuffer(tile))
})
.collect()
// Save byte files to local filesystem
output.foreach(tile => {
val coord = tile._1
val byteArray = tile._2
val limit = (1 << coord._1) - 1
// Use standard TMS path structure and file naming
val file = new File(s"$outputPath/$layerName/${coord._1}/${coord._2}/${limit - coord._3}.bins")
file.getParentFile.mkdirs()
val output = new FileOutputStream(file)
output.write(byteArray)
output.close()
})
// Create map from each level to min / max values.
rdd
.map(s => pickups(s).get)
.map(t => (t.coords._1.toString, t.tileMeta.get))
.reduceByKey((l, r) => {
(Math.min(l._1, r._1), Math.max(l._2, r._2))
})
.mapValues(minMax => {
JSONObject(Map(
"min" -> minMax._1,
"max" -> minMax._2
))
})
.collect()
.toMap
})
// Flatten array of maps into a single map
val levelInfoJSON = JSONObject(levelMeta.reduce(_ ++ _)).toString()
// Save level metadata to filesystem
val pw = new PrintWriter(s"$outputPath/$layerName/meta.json")
pw.write(levelInfoJSON)
pw.close()
}
}
I created a separate folder for this scala with another folder in it named lib that had the jars required and I compiled it with scalac as follows,
scalac -cp "lib/salt.jar:lib/spark.jar" Main.scala
This ran successfully and generated classes under a folder BinExTest.
Now, the project's build.gradle had the following lines, from which I identified that this is the command that generates the output dataset,
task run(overwrite: true, type: Exec, dependsOn: [assemble]) {
executable = 'spark-submit'
args = ["--class","software.uncharted.salt.examples.bin.Main","/opt/salt/build/libs/salt-bin-example-${version}.jar", "/opt/data/taxi_one_day.csv", "/opt/output"]
}
Seeing this, I made the following command,
spark-submit --class BinExTest.Main lib/salt.jar
When I do this, I get the following error,
java.lang.ClassNotFoundException: Main.BinExTest
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Can somebody help me out in this? I am completely new to this and came this far just by exploration.
[Update 1]
Taking in YoYo's suggestion,
spark-submit --class BinExTest.Main --jars "BinExTest.jar" "lib/salt.jar"
The ClassNotFoundException is gone, but I now get a new error, as follows,
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 1 times, most recent failure: Lost task 1.0 in stage 3.0 (TID 6, localhost): java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;
at BinExTest.Main$.createByteBuffer(Main.scala:29)
at BinExTest.Main$$anonfun$2$$anonfun$6.apply(Main.scala:101)
at BinExTest.Main$$anonfun$2$$anonfun$6.apply(Main.scala:99)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any idea what's going on?
[Update 2]
Building Spark from source with Scala2.11 support solved my previous issue. However I got a new error and it is,
6/05/10 18:39:15 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, localhost): java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at software.uncharted.salt.core.util.SparseArray.(SparseArray.scala:37)
at software.uncharted.salt.core.util.SparseArray.(SparseArray.scala:57)
at software.uncharted.salt.core.generation.rdd.RDDSeriesWrapper.makeBins(RDDTileGenerator.scala:224)
at software.uncharted.salt.core.generation.rdd.RDDTileGeneratorCombiner.createCombiner(RDDTileGenerator.scala:128)
at software.uncharted.salt.core.generation.rdd.RDDTileGenerator$$anonfun$3.apply(RDDTileGenerator.scala:100)
at software.uncharted.salt.core.generation.rdd.RDDTileGenerator$$anonfun$3.apply(RDDTileGenerator.scala:100)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:187)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:186)
at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:148)
at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
Is this because scala2.11 does not have the mentioned class?
[Final Update]
Adding scala2.10 to the spark-submit did the trick.
spark-submit --class "BinExTest.Main" --jars "BinExTest.jar,lib/scala210.jar" "lib/salt.jar"
For a Spark job to run, it needs to replicate its code across the nodes that make up your Spark cluster. It does that by literally copying the jar file to the other nodes.
That means that you need to make sure that your class files are packaged in a .jar file. In my typical solutions, I build an uber jar that packages the class files and the dependent jar files together in a single .jar file. For that I use the Maven Shade plugin. That doesn't have to be your solution, but at least you should build a .jar file out of your generated classes.
To manually provide additional jar files, add them using the --jars option, which expects a comma-delimited list.
Update 1
Actually, even for me there is a lot of confusion about all the available options, specifically regarding the jar files, how they are distributed, and how to modify the classpath in Spark. See another topic I just posted.
Update 2
The second part of your question is already answered in another thread.