I have an RDD named index: RDD[(String, String)], and I want to use index to process my file.
This is the code:
val get = file.map({x =>
val tmp = index.lookup(x).head
tmp
})
The problem is that I cannot use index inside the file.map function. When I ran this program, it gave me the following error:
14/12/11 16:22:27 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 602, spark2): scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:770)
com.ynu.App$$anonfun$12.apply(App.scala:270)
com.ynu.App$$anonfun$12.apply(App.scala:265)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1080)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1080)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
I don't know why. And if I want to implement this functionality, what can I do?
Thanks
You should see RDDs as virtual collections. An RDD reference only points to where the data is; in itself it holds no data, so there is no point in using it inside a closure.
You will need to use functions that combine RDDs together in order to achieve the desired functionality. Also, lookup as defined here is a very sequential process that requires all the lookup data to be available in the memory of each worker - this will not scale.
To resolve all elements of the file RDD to their value in index, you should join both RDDs:
val resolvedFileRDD = file.keyBy(identity).join(index) // this will have the form of (key, (key,index of key))
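If you only need the looked-up values, here is a minimal sketch of how the joined RDD can be unpacked (assuming file: RDD[String] and index: RDD[(String, String)]; note that the inner join silently drops keys missing from index, whereas the original lookup(x).head would have thrown):
import org.apache.spark.rdd.RDD
// keyBy(identity) turns each element x into (x, x); join pairs it with its value in index;
// mapValues then keeps only the value taken from index.
val resolved: RDD[(String, String)] = file
  .keyBy(identity)
  .join(index)
  .mapValues { case (_, indexValue) => indexValue }
// equivalent of the original "get": just the looked-up values
val get: RDD[String] = resolved.values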
I have a Hive table partitioned by timestamp on top of Parquet files with Snappy compression.
Basically the paths look like:
s3:/bucketname/project/flowtime=0/
s3:/bucketname/project/flowtime=1/
s3:/bucketname/project/flowtime=2/
...
I have detected some inconsistency in this table. The problem is that one field is a LongType in some Parquet schemas and a StringType in others, so running queries throws a ClassCastException.
So what I am trying to do now is to read all my Parquet files and check their schemas so I can recreate them. I want to map each filename to the schema of the associated Parquet file, so that I have:
filename | schema
s3:/bucketname/project/flowtime |StructField(StructField(Id,StringType,True),
|StructField(Date,StringType,True)
So I tried to use Spark with Scala and the input_file_name function from org.apache.spark.sql.functions, which I wrap in a UDF. That part works fine.
val filename = (path: String) => path
val filenameUDF = udf(filename)
val df=sqlContext.parquetFile("s3a://bucketname/").select(filenameUDF(input_file_name())).toDF()
df.map(lines => (lines.toString, sqlContext.read.parquet(lines.toString.replace("[","").replace("]","")).schema.toString))
This is supposed to give an RDD[(String, String)].
Only it seems that the part that reads the Parquet file within my map throws a NullPointerException.
ERROR scheduler.TaskSetManager: Task 0 in stage 14.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 35, CONFIDENTIAL-SERVER-NAME, executor 13): java.lang.NullPointerException
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
If you have any idea why reading the Parquet file does not seem to work inside the map, please let me know, because both parts of the pair I want to create (the filename AND the schema) seem to work fine on their own, but combining them doesn't.
Also, if you have a better idea of how to solve the inconsistency among my Parquet files that corrupts my Hive table, please share it, because I don't see another way to do it: Parquet files are immutable, and changing the Hive metadata doesn't change the Parquet metadata embedded in each file.
Thank you for your attention.
Renaud
Let me suggest another approach: get your bucket listing and loop over it.
First you can read and store the S3 paths by using listStatus,
then loop over each path.
import java.net.URI
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import java.io._
val file = new File("/home/.../fileName.txt")
val path = "s3:/bucketname/project/"
val fileSystem = FileSystem.get(URI.create(path), new Configuration())
val folders = fileSystem.listStatus(new Path(path))
val bw = new BufferedWriter(new FileWriter(file))
for (folder <- folders) {
  bw.write(folder.getPath.toString().split("/")(6) + " => " +
    spark.read.parquet(folder.getPath.toString()).select("myColum").schema.toString() + "\n")
}
bw.close
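If you would rather keep the mapping in memory than write it to a local file, here is a rough sketch along the same lines (assuming the same folders listing as above; treeString is just a readable rendering of the schema):
// Build an in-memory map from partition folder to schema string.
val schemasByFolder: Map[String, String] = folders.map { folder =>
  val p = folder.getPath.toString
  p -> spark.read.parquet(p).schema.treeString
}.toMap
// The inconsistent field can then be spotted by comparing the distinct schemas.
schemasByFolder.values.toSet.foreach(println)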
Hope it will help you.
Regards.
Steven
Below I provide my code. I iterate over the DataFrame prodRows and for each product_PK I find some matching sub-list of product_PKs from prodRows.
val numRecProducts = 10
var listOfProducts: Map[Long,Array[(Long, Int)]] = Map()
prodRows.foreach{ row : Row =>
val product_PK = row.get(row.fieldIndex("product_PK")).toString.toLong
val gender = row.get(row.fieldIndex("gender_PK")).toString
val selection = prodRows.filter($"gender_PK" === gender || $"gender_PK" === "UNISEX").limit(numRecProducts).select($"product_PK")
var productList: Array[(Long, Int)] = Array()
if (!selection.rdd.isEmpty()) {
productList = selection.rdd.map(x => (x(0).toString.toLong,1)).collect()
}
listOfProducts = listOfProducts + (product_PK -> productList)
}
But when I execute it, it gives me the following error. It looks like selection is empty in some iterations. However, I do not understand how I can handle this error:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2325)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2325)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2325)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2324)
at org.test.ComputeNumSim.run(ComputeNumSim.scala:69)
at org.test.ComputeNumSimRunner$.main(ComputeNumSimRunner.scala:19)
at org.test.ComputeNumSimRunner.main(ComputeNumSimRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:170)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:61)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2877)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1304)
at org.test.ComputeNumSim$$anonfun$run$1.apply(ComputeNumSim.scala:74)
at org.test.ComputeNumSim$$anonfun$run$1.apply(ComputeNumSim.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
What does it mean and how can I handle it?
You cannot access any of Spark's "driver-side" abstractions (RDDs, DataFrames, Datasets, SparkSession...) from within a function passed on to one of Spark's DataFrame/RDD transformations. You also cannot update driver-side mutable objects from within these functions.
In your case - you're trying to use prodRows and selection (both are DataFrames) within a function passed to DataFrame.foreach. You're also trying to update listOfProducts (a local driver-side variable) from within that same function.
Why?
DataFrames, RDDs, and SparkSession only exist on your Driver application. They serve as a "handle" to access data distributed over the cluster of worker machines.
Functions passed to RDD/DataFrame transformations get serialized and sent to that cluster, to be executed on the data partitions on each of the worker machines. When the serialized DataFrames/RDDs get deserialized on those machines, they are useless: they can no longer represent the data on the cluster, as they are just hollow copies of the ones created on the driver application, which is what actually maintains the connection to the cluster machines.
For the same reason, attempting to update driver-side variables will fail: the variables (starting out as empty, in most cases) will be serialized, deserialized on each of the workers, get updated locally on the workers, and stay there... the original driver-side variable will remain unchanged.
How can you solve this?
When working with Spark, especially with DataFrames, you should try to avoid "iteration" over the data, and use DataFrame's declarative operations instead. In most cases, when you want to reference data of another DataFrame for each record in your DataFrame, you'd want to use join to create a new DataFrame with records combining data from the two DataFrames.
In this specific case, here's a roughly equivalent solution that does what you're trying to do, if I understood it correctly. Try to use this and read the DataFrame documentation to figure out the details:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val numRecProducts = 10
val result = prodRows.as("left")
// self-join by gender:
.join(prodRows.as("right"), $"left.gender_PK" === $"right.gender_PK" || $"right.gender_PK" === "UNISEX")
// limit to 10 results per record:
.withColumn("rn", row_number().over(Window.partitionBy($"left.product_PK").orderBy($"right.product_PK")))
.filter($"rn" <= numRecProducts).drop($"rn")
// group and collect_list to create products column:
.groupBy($"left.product_PK" as "product_PK")
.agg(collect_list(struct($"right.product_PK", lit(1))) as "products")
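If you then really need the driver-side Map[Long, Array[(Long, Int)]] afterwards, and the result is small enough to collect, a rough sketch could look like this (assuming product_PK is a Long/bigint column; adjust the getters to your actual types):
// Collect the joined result back to the driver and rebuild the original map shape.
val listOfProducts: Map[Long, Array[(Long, Int)]] = result
  .collect()
  .map { row =>
    val productPK = row.getAs[Long]("product_PK")
    val products = row.getAs[Seq[Row]]("products")
      .map(r => (r.getLong(0), r.getInt(1)))
      .toArray
    productPK -> products
  }
  .toMap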
The problem is that you try to access prodRows from within prodRows.foreach. You cannot use a DataFrame within a transformation; DataFrames only exist on the driver.
My Kafka topics have millions of messages in them; unfortunately, Spark Streaming seems to queue all of the messages no matter what time limit I set. I have set up my standalone server with 16 cores and 64 GB RAM, and I've given the driver 12 GB and the executor 12 GB of memory.
16/09/14 19:27:45 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 7, 10.206.41.172): java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.twitter.chill.Tuple4Serializer.read(TupleSerializers.scala:68)
at com.twitter.chill.Tuple4Serializer.read(TupleSerializers.scala:59)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:396)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:307)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:159)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:515)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:535)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:1004)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:332)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$5.apply(ExternalAppendOnlyMap.scala:316)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$5.apply(ExternalAppendOnlyMap.scala:314)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:314)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:288)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:91)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.shuffle.FetchFailedException: /tmp/spark-e4238a07-bf89-4a7d-9de3-176cba0a076d/executor-93a11b25-1cb1-4e13-b1b6-b1d64d3a9602/blockmgr-c11cd046-7c37-429c-9137-936d391d3cbc/30/shuffle_0_0_0.index (No such file or directory)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:154)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:91)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /tmp/spark-e4238a07-bf89-4a7d-9de3-176cba0a076d/executor-93a11b25-1cb1-4e13-b1b6-b1d64d3a9602/blockmgr-c11cd046-7c37-429c-9137-936d391d3cbc/30/shuffle_0_0_0.index (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:192)
at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:278)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:258)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:292)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
... 9 more
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Any recommendations on what I can do to handle this? Is my solution just going to be to feed it more memory? I'm pointing at a shuffle issue because one executor has:
Shuffle Read Size/Records: 815.5 MB / 27445419
Shuffle Spill (memory): 13GB
Shuffle Spill (Disk): 697.4 MB
I have no idea what it might be shuffling.
val messageStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  ssc,
  getKafkaBrokers(),
  getKafkaTopics("raw"),
  (mmd: MessageAndMetadata[String, String]) => {
    (mmd.topic, mmd.partition, mmd.offset, mmd.message)
  })
//first step is to take our RDD of messages and group them by the topic (x._1) and partition (x._2)
messageStream.foreachRDD(x => x.groupBy(x => (x._1, x._2)).foreach(x => {
//Populate parameters based on the grouping and get the instruction sets to run on each message
val rawTopic: String = x._1._1
val partitionID: Int = x._1._2
val sourceSystemName: String = rawTopic.split("_")(0)
val cleanTopic = sourceSystemName + "_clean"
val topicID = getTopicID(rawTopic)
val schemaID = getPartitionByTopic(topicID).filter(x => x._1 == partitionID)(0)._2
val classpath = getClasspath(schemaID.toInt)
val classPath = bcClassMap.value.get(classpath)
val cleanerObj = classPath.newInstance()
val cleanMethod = classPath.getMethod("clean", Class.forName("java.lang.String"))
//For each message run the instruction sets we've gotten above
x._2.foreach(kafkaMessage => {
val offset: Long = kafkaMessage._3
val rawRecord: String = kafkaMessage._4
val cleanRecord = cleanMethod.invoke(cleanerObj, rawRecord).asInstanceOf[String]
if (cleanRecord != null) {
sendKafka(cleanRecord, cleanTopic, partitionID.toString)
}
offsetUpdate(schemaID, offset.toInt, topicID)
})
}))
An obvious suspect here is the initial groupBy.
While it can be manageable in a batch application, where you have some idea about the data distribution, it is simply not acceptable in a streaming application, where a single skewed batch can break the whole pipeline. Even if it doesn't go that far, it can cripple overall performance.
Since your code doesn't really depend on the grouping, and it is only an optimization attempt, you're in a pretty good situation. You can, for example:
Drop groupBy completely.
Use shared cache for each partition or executor to handle expensive API calls.
If the number of API calls is still too high, you can generate salted partitioning keys (a fuller sketch is given at the end of this answer):
_.keyBy(x => (x._1, x._2, smallRandomInteger))
and repartition each RDD.
You could use salting for groupBy as well, but used like this it only addresses data skew, not the cost of grouping and maintaining large local buffers.
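A rough sketch of the salted-grouping idea (one way to read the suggestion above; numSalts is an illustrative value and the loop body is a placeholder for the per-message work in the original code):
import scala.util.Random
val numSalts = 8 // illustrative; tune to your parallelism
messageStream.foreachRDD { rdd =>
  // salt the (topic, partition) key so a single hot partition is split across several tasks
  val salted = rdd.keyBy(x => (x._1, x._2, Random.nextInt(numSalts)))
  salted.groupByKey().foreach { case ((rawTopic, partitionID, _), messages) =>
    // resolve the expensive per-(topic, partition) lookups once per salted group here,
    // then clean and forward each message as in the original loop
    ()
  }
}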
Hi, I am trying to generate the output of the Salt Examples, but without using Docker as mentioned in its documentation. I found the Scala code that generates the output, which is Main.scala. I modified Main.scala into a more convenient form:
package BinExTest
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import software.uncharted.salt.core.projection.numeric._
import software.uncharted.salt.core.generation.request._
import software.uncharted.salt.core.generation.Series
import software.uncharted.salt.core.generation.TileGenerator
import software.uncharted.salt.core.generation.output.SeriesData
import software.uncharted.salt.core.analytic.numeric._
import java.io._
import scala.util.parsing.json.JSONObject
object Main {
// Defines the tile size in both x and y bin dimensions
val tileSize = 256
// Defines the output layer name
val layerName = "pickups"
// Creates and returns an Array of Double values encoded as 64bit Integers
def createByteBuffer(tile: SeriesData[(Int, Int, Int), (Int, Int), Double, (Double, Double)]): Array[Byte] = {
val byteArray = new Array[Byte](tileSize * tileSize * 8)
var j = 0
tile.bins.foreach(b => {
val data = java.lang.Double.doubleToLongBits(b)
for (i <- 0 to 7) {
byteArray(j) = ((data >> (i * 8)) & 0xff).asInstanceOf[Byte]
j += 1
}
})
byteArray
}
def main(args: Array[String]): Unit = {
val jarFile = "/home/kesava/Studies/BinExTest/BinExTest.jar";
val inputPath = "/home/kesava/Downloads/taxi_micro.csv"
val outputPath = "/home/kesava/SoftWares/salt/salt-examples/bin-example/Output"
val conf = new SparkConf().setAppName("salt-bin-example").setJars(Array(jarFile))
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(s"file://$inputPath")
.registerTempTable("taxi_micro")
// Construct an RDD of Rows containing only the fields we need. Cache the result
val input = sqlContext.sql("select pickup_lon, pickup_lat from taxi_micro")
.rdd.cache()
// Given an input row, return pickup longitude, latitude as a tuple
val pickupExtractor = (r: Row) => {
if (r.isNullAt(0) || r.isNullAt(1)) {
None
} else {
Some((r.getDouble(0), r.getDouble(1)))
}
}
// Tile Generator object, which houses the generation logic
val gen = TileGenerator(sc)
// Break levels into batches. Process several higher levels at once because the
// number of tile outputs is quite low. Lower levels done individually due to high tile counts.
val levelBatches = List(List(0, 1, 2, 3, 4, 5, 6, 7, 8), List(9, 10, 11), List(12), List(13), List(14))
// Iterate over sets of levels to generate.
val levelMeta = levelBatches.map(level => {
println("------------------------------")
println(s"Generating level $level")
println("------------------------------")
// Construct the definition of the tiling jobs: pickups
val pickups = new Series((tileSize - 1, tileSize - 1),
pickupExtractor,
new MercatorProjection(level),
(r: Row) => Some(1),
CountAggregator,
Some(MinMaxAggregator))
// Create a request for all tiles on these levels, generate
val request = new TileLevelRequest(level, (coord: (Int, Int, Int)) => coord._1)
val rdd = gen.generate(input, pickups, request)
// Translate RDD of Tiles to RDD of (coordinate,byte array), collect to master for serialization
val output = rdd
.map(s => pickups(s).get)
.map(tile => {
// Return tuples of tile coordinate, byte array
(tile.coords, createByteBuffer(tile))
})
.collect()
// Save byte files to local filesystem
output.foreach(tile => {
val coord = tile._1
val byteArray = tile._2
val limit = (1 << coord._1) - 1
// Use standard TMS path structure and file naming
val file = new File(s"$outputPath/$layerName/${coord._1}/${coord._2}/${limit - coord._3}.bins")
file.getParentFile.mkdirs()
val output = new FileOutputStream(file)
output.write(byteArray)
output.close()
})
// Create map from each level to min / max values.
rdd
.map(s => pickups(s).get)
.map(t => (t.coords._1.toString, t.tileMeta.get))
.reduceByKey((l, r) => {
(Math.min(l._1, r._1), Math.max(l._2, r._2))
})
.mapValues(minMax => {
JSONObject(Map(
"min" -> minMax._1,
"max" -> minMax._2
))
})
.collect()
.toMap
})
// Flatten array of maps into a single map
val levelInfoJSON = JSONObject(levelMeta.reduce(_ ++ _)).toString()
// Save level metadata to filesystem
val pw = new PrintWriter(s"$outputPath/$layerName/meta.json")
pw.write(levelInfoJSON)
pw.close()
}
}
I created a separate folder for this Scala file, with another folder in it named lib that held the required jars, and I compiled it with scalac as follows:
scalac -cp "lib/salt.jar:lib/spark.jar" Main.scala
This ran successfully and generated classes under a folder BinExTest.
Now, the project's build.gradle had the following lines of code, from which I identified the command that would generate the output dataset:
task run(overwrite: true, type: Exec, dependsOn: [assemble]) {
executable = 'spark-submit'
args = ["--class","software.uncharted.salt.examples.bin.Main","/opt/salt/build/libs/salt-bin-example-${version}.jar", "/opt/data/taxi_one_day.csv", "/opt/output"]
}
Seeing this, I constructed the following command:
spark-submit --class BinExTest.Main lib/salt.jar
When I do this, I get the following error:
java.lang.ClassNotFoundException: Main.BinExTest at
java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:278) at
org.apache.spark.util.Utils$.classForName(Utils.scala:174) at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Can somebody help me out with this? I am completely new to this and came this far just by exploration.
[Update 1]
Taking in YoYo's suggestion,
spark-submit --class BinExTest.Main --jars "BinExTest.jar" "lib/salt.jar"
The ClassNotFoundException went away, but it generated a new error, as follows:
Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 1 in stage 3.0 failed 1 times, most
recent failure: Lost task 1.0 in stage 3.0 (TID 6, localhost):
java.lang.NoSuchMethodError:
scala.runtime.IntRef.create(I)Lscala/runtime/IntRef; at
BinExTest.Main$.createByteBuffer(Main.scala:29) at
BinExTest.Main$$anonfun$2$$anonfun$6.apply(Main.scala:101) at
BinExTest.Main$$anonfun$2$$anonfun$6.apply(Main.scala:99) at
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
scala.collection.Iterator$class.foreach(Iterator.scala:727) at
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any idea what's going on?
[Update 2]
Building Spark from source with Scala 2.11 support solved my previous issue. However, I got a new error:
6/05/10 18:39:15 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1
times; aborting job Exception in thread "main"
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0
in stage 2.0 (TID 3, localhost): java.lang.NoClassDefFoundError:
scala/collection/GenTraversableOnce$class at
software.uncharted.salt.core.util.SparseArray.<init>(SparseArray.scala:37)
at
software.uncharted.salt.core.util.SparseArray.<init>(SparseArray.scala:57)
at
software.uncharted.salt.core.generation.rdd.RDDSeriesWrapper.makeBins(RDDTileGenerator.scala:224)
at
software.uncharted.salt.core.generation.rdd.RDDTileGeneratorCombiner.createCombiner(RDDTileGenerator.scala:128)
at
software.uncharted.salt.core.generation.rdd.RDDTileGenerator$$anonfun$3.apply(RDDTileGenerator.scala:100)
at
software.uncharted.salt.core.generation.rdd.RDDTileGenerator$$anonfun$3.apply(RDDTileGenerator.scala:100)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:187)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:186)
at
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:148)
at
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745) Caused by:
java.lang.ClassNotFoundException:
scala.collection.GenTraversableOnce$class at
java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
java.lang.ClassLoader.loadClass(ClassLoader.java:358)
Is this because Scala 2.11 does not have the mentioned class?
[Final Update]
Adding Scala 2.10 to the spark-submit command did the trick.
spark-submit --class "BinExTest.Main" --jars
"BinExTest.jar,lib/scala210.jar" "lib/salt.jar"
For a Spark job to run, it needs to replicate its code over the different nodes that make up your Spark cluster. It does that by literally copying the jar file over to the other nodes.
That means that you need to make sure that your class files are packaged in a .jar file. In my typical solutions, I would build an uber jar that packages the class files and the dependent jar files together in a single .jar file. For that I use the Maven Shade plugin. That doesn't have to be your solution, but at least you should build a .jar file out of your generated classes.
To manually provide additional jar files, you will need to add them using the --jars option, which expects a comma-delimited list.
Update 1
Actually, even for me there is a lot of confusion about all the available options, specifically around jar files, how they are distributed, and how to modify the classpath in Spark. See another topic I just posted.
Update 2
The second part of your question is already answered on another thread.