HDFS MiniCluster error on Windows machine - Scala

I am trying to mock HDFS with the following code, but I always get this particular error.
test("some test") {
val testDataPath = new File(PathUtils.getTestDir(getClass()), "miniclusters")
//Configuration conf;
//MiniDFSCluster cluster;
//testDataPath = new File(PathUtils.getTestDir(getClass()), miniclusters");
System.clearProperty(MiniDFSCluster.PROP_TEST_BUILD_DATA)
val confMini = new HdfsConfiguration()
val testDataCluster1 = new File(testDataPath, "CLUSTER_1")
println(testDataCluster1)
val c1Path = testDataCluster1.getAbsolutePath()
println(c1Path)
confMini.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, c1Path)
val cluster = new MiniDFSCluster.Builder(confMini).build()
val fs = FileSystem.get(confMini);
println(fs)
assert(true)
}
The error is the following:
An exception or error caused a run to abort: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
I am not sure what this error means or what is causing it.
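As far as I can tell, this UnsatisfiedLinkError on NativeIO$Windows.access0 usually means the Hadoop native Windows binaries (winutils.exe and hadoop.dll) are not visible to the test JVM. A minimal sketch of the kind of setup that gets suggested, assuming the binaries are unpacked under C:\hadoop\bin (that path is only an example):
// Untested sketch: point Hadoop at a local installation that contains
// bin\winutils.exe and bin\hadoop.dll, before the MiniDFSCluster is built.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
// hadoop.dll must also be loadable by the JVM, e.g. by putting C:\hadoop\bin
// on PATH (or passing it via -Djava.library.path) before starting the tests.
Is this really the cause here, or is something else going on?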

Related

Trying an alternative to Spark broadcast variable because of NullPointerException ONLY in Local mode

I was using my Spark app in cluster mode and everything went well. Now I need to do some tests on my local installation (on my laptop), and I get a NullPointerException on the following line:
val brdVar = spark.sparkContext.broadcast(rdd.collectAsMap())
EDIT: This is the full stacktrace:
Exception in thread "main" java.lang.NullPointerException
at learner.LearnCh$.learn(LearnCh.scala:81)
at learner.Learner.runLearningStage(Learner.scala:166)
at learner.Learner.run(Learner.scala:29)
at Driver$.runTask(Driver.scala:26)
at Driver$.main(Driver.scala:19)
at Driver.main(Driver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have read a lot and couldn't find the answer to my problem (EDIT: I'm using def main(args: Array[String]): Unit = ...). The use case for this brdVar is to get a numerical id value from a string one:
val newRdd: RDD[(Long, Map[Byte, Int])] = origRdd
  .mapPartitions { partition => partition.map(r => (r.idString, r)) }
  .aggregateByKey // this line doesn't affect my problem ....
  .mapPartitions { partition => partition.map { case (idString, listIndexes) => (brdVar.value(idString), .....) } }
So, in order to move on and not get stuck on the broadcast in local mode, I changed the idea: I wanted to simulate brdVar by saving its data to a file, then reading it and looking up the key through a function, replacing the brdVar.value(idString) part with a call like getNumericalID(id). To do so, I've written this function:
def getNumericalID(strID: String): Long = {
  val pathToRead = ....
  val file = spark.sparkContext.textFile(pathToRead)
  val process = file.map { line =>
    val l = line.split(",")
    (l(0), l(1))
  }.filter(e => e._1 == strID).collect()
  process(0)._2.toLong
}
But I'm still getting a NullPointerException, this time on the val file = .... line. I've checked, and the file has content. I think maybe I'm misunderstanding something. Any ideas?
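One direction I am considering, sketched below and not tested, is to build the lookup as a plain Scala Map on the driver (using scala.io.Source instead of sparkContext.textFile, so nothing Spark-specific is needed inside executor code) and let the closure capture it. The file layout ("id,number" per line) and the names here are just my assumptions:
// Untested sketch: load the lookup once on the driver with plain Scala IO.
import scala.io.Source

def loadIdLookup(pathToRead: String): Map[String, Long] = {
  val source = Source.fromFile(pathToRead)
  try {
    source.getLines().map { line =>
      val l = line.split(",")
      l(0) -> l(1).toLong
    }.toMap
  } finally {
    source.close()
  }
}

// On the driver:
// val idLookup = loadIdLookup(pathToRead)
// ... partition.map { case (idString, listIndexes) => (idLookup(idString), .....) }
Would that be a reasonable way around the problem?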

Kryo Serialization not registering even after registering the class in conf

I made a class Person and registered it, but at runtime it shows the class is not registered. Why is that?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException: Class is not registered: KyroExample$Person[]
Note: To register this class use: kryo.register(KyroExample$Person[].class);
Here is the sample code :
val conf = new SparkConf().setAppName("kyroExample").setMaster("local")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Person], classOf[String])) // registered the class
conf.set("spark.kryo.registrationRequired", "true")
val sparkContext = new SparkContext(conf)

case class Person(name: String, age: Int) // this is the class

val personList: immutable.Seq[Person] = (1 to 100000).map(value => Person(value + "", value))
val rdd: RDD[Person] = sparkContext.parallelize(personList)
val evenAge: RDD[Person] = rdd.filter(_.age % 2 == 0)
evenAge.persist(StorageLevel.MEMORY_ONLY_SER)
evenAge.count()
evenAge.persist(StorageLevel.MEMORY_ONLY_SER)
evenAge.count()
Thread.sleep(200000)
It worked after registering both Person and Array[Person]:
.registerKryoClasses(
  Array(classOf[Person], classOf[Array[Person]])
)
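Putting it together, a minimal sketch of the working configuration (same classes and settings as in the sample above):
val conf = new SparkConf().setAppName("kyroExample").setMaster("local")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "true")
// Register both the element class and its array class: persisting with
// MEMORY_ONLY_SER serializes whole partitions, i.e. arrays of Person,
// which is exactly the Person[] the exception complains about.
conf.registerKryoClasses(Array(classOf[Person], classOf[Array[Person]]))
val sparkContext = new SparkContext(conf)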

How to connect to Neo4j in Spark worker nodes?

I need to get a small subgraph in a Spark map function. I have tried AnormCypher and neo4j-spark-connector, but neither works. AnormCypher leads to a java IOException (I build the connection in a mapPartitions function, testing against a localhost server), and neo4j-spark-connector causes the Task not serializable exception below.
Is there a good way to get a subgraph (or connect to a graph database like Neo4j) from a Spark worker node?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
at ....
My code snippet using neo4j-spark-connector 2.0.0-m2:
val neo = Neo4j(sc) // this runs on the driver
// this runs inside a map function
def someFunctionToBeMapped(p: List[Long]) = {
  val metaGraph = neo.cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace)" +
      "return a.id ,r.distance, b.id")
    .loadRowRdd
    .map(row => ((row(0).asInstanceOf[Long], row(2).asInstanceOf[Long]), row(1).asInstanceOf[Double]))
    .collect()
    .toList
The AnormCypher code is:
def partitionMap(partition: Iterator[List[Long]]) = {
  import org.anormcypher._
  import play.api.libs.ws._

  // Provide an instance of WSClient
  val wsclient = ning.NingWSClient()

  // Set up the REST client.
  // The Neo4jConnection type annotation is needed so that the default
  // Neo4jConnection -> Neo4jTransaction conversion is in the implicit scope.
  implicit val connection: Neo4jConnection = Neo4jREST("127.0.0.1", 7474, "neo4j", "000000")(wsclient)

  // Provide an ExecutionContext
  implicit val ec = scala.concurrent.ExecutionContext.global

  val res = partition.filter( placeList => {
    val startPlace = Cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace)" +
      "return p")().flatMap( row => row.data )
  })
  wsclient.close()
  res
}
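What I am currently considering as a workaround, sketched below and untested, is to skip the Spark-side helpers inside the workers entirely and open a plain session with the official Neo4j Java driver per partition, so nothing non-serializable is captured from the driver program. The Bolt URL and credentials are placeholders, and rdd stands for the RDD of List[Long] from above:
// Untested sketch with the Neo4j Java driver (org.neo4j.driver.v1):
import org.neo4j.driver.v1.{AuthTokens, GraphDatabase}

rdd.mapPartitions { partition =>
  // Created on the worker, so it never needs to be serialized.
  val neoDriver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "000000"))
  val session = neoDriver.session()
  val mapped = partition.map { placeList =>
    val result = session.run(
      "match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace) return a.id, r.distance, b.id")
    // ... consume result for this placeList ...
    placeList
  }.toList // force evaluation before the session is closed
  session.close()
  neoDriver.close()
  mapped.iterator
}
Is this a sensible pattern, or is there a better way?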
I have used Spark standalone mode and was able to connect to the Neo4j database.
Versions used:
spark 2.1.0
neo4j-spark-connector 2.1.0-m2
My code:
val sparkConf = new SparkConf().setAppName("Neo$j").setMaster("local")
val sc = new SparkContext(sparkConf)
println("***Getting Started ****")
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (n) RETURN id(n) as id").loadDataFrame
println(rdd.count)
Spark submit:
spark-submit --class package.classname --jars pathofneo4jsparkconnectoryJAR --conf spark.neo4j.bolt.password=***** targetJarFile.jar

Spark Scala Exception in thread "main" java.lang.UnsupportedOperationException: empty collection

When launching spark-submit, I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
Spark Code Class 1:
var vertices = sc.textFile("hdfs:///user/cloudera/dstlf.txt")
  .flatMap { line => line.split("\\s+") }
  .distinct()

vertices.map {
  vertex => vertex.replace("-", "") + "\t" + vertex.saveAsTextFile("hdfs:///user/cloudera/metadata-lookup-tlf")
}
Spark Code Class 2:
val sc = new SparkContext(conf)
val graph = GraphLoader.edgeListFile(sc,"hdfs:///user/cloudera/metadata-processed-tlf")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
It looks like, for some reason, it is not working in the RDD.
The input file dstlf.txt contains two columns of numbers:
94-92250 94-92174
94-92889 94-91869
94-94172 94-91682
...
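For reference, the direction I am considering (rough and untested) is to keep each pair on one line when writing the intermediate file, since GraphLoader.edgeListFile expects one whitespace-separated "srcId dstId" edge per line; the output path below is a placeholder based on the snippets above:
// Untested sketch: strip the dashes but keep the src/dst pair on one line,
// and save the whole RDD once instead of calling saveAsTextFile inside map.
val edges = sc.textFile("hdfs:///user/cloudera/dstlf.txt")
  .map(line => line.split("\\s+").map(_.replace("-", "")).mkString("\t"))
edges.saveAsTextFile("hdfs:///user/cloudera/metadata-processed-tlf")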

Apache Spark - java.lang.NoSuchMethodError: breeze.linalg.DenseVector

I am having issues running Apache Spark 1.0.1 within a Play! app. Currently, I am trying to run Spark inside the Play! application and use some of the basic machine learning in Spark.
Here's my app creation:
def sparkFactory: SparkContext = {
  val logFile = "public/README.md" // Should be some file on your system
  val driverHost = "localhost"
  val conf = new SparkConf(false) // skip loading external settings
    .setMaster("local[4]") // run locally with enough threads
    .setAppName("firstSparkApp")
    .set("spark.logConf", "true")
    .set("spark.driver.host", s"$driverHost")
  new SparkContext(conf)
}
And here's the error when I try to do some basic computations on a tall-and-skinny matrix:
[error] o.a.s.e.ExecutorUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.NoSuchMethodError: breeze.linalg.DenseVector$.dv_v_ZeroIdempotent_InPlaceOp_Double_OpAdd()Lbreeze/linalg/operators/BinaryUpdateRegistry;
at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$5.apply(RowMatrix.scala:313) ~[spark-mllib_2.10-1.0.1.jar:1.0.1]
at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$5.apply(RowMatrix.scala:313) ~[spark-mllib_2.10-1.0.1.jar:1.0.1]
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) ~[scala-library-2.10.4.jar:na]
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) ~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
The error above is triggered by the following:
def computePrincipalComponents(datasetId: String) = Action {
  val datapoints = DataPoint.listByDataset(datasetId)
  // load the data into spark
  val rows = datapoints.map(_.data).map { row =>
    row.map(_.toDouble)
  }
  val RDDRows = WorkingSpark.context.makeRDD(rows).map { line =>
    Vectors.dense(line)
  }
  val mat = new RowMatrix(RDDRows)
  val result = mat.computePrincipalComponents(mat.numCols().toInt)
  Ok(result.toString)
}
It looks like a dependency issue, but I have no idea where it starts. Any ideas?
Ah, this was indeed caused by a dependency conflict. Apparently the newer Spark uses Breeze methods that were not available in the version I had pulled in. By removing Breeze from my Play! build file, I was able to run the function above just fine.
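For anyone hitting the same conflict, a minimal sketch of what the build change looks like, assuming an sbt-based Play! build (versions shown are illustrative):
// build.sbt sketch: rely on spark-mllib's transitive Breeze version
// instead of declaring Breeze explicitly.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.1",
  "org.apache.spark" %% "spark-mllib" % "1.0.1"
  // no explicit "org.scalanlp" %% "breeze" entry here
)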
For those interested, here's the output:
-0.23490049167080018 0.4371989078912155 0.5344916752692394 ... (6 total)
-0.43624389448418854 0.531880914138611 0.1854269324452522 ...
-0.5312372137092107 0.17954211389001487 -0.456583286485726 ...
-0.5172743086226219 -0.2726152326516076 -0.36740474569706394 ...
-0.3996400343756039 -0.5147253632175663 0.303449047782936 ...
-0.21216780828347453 -0.39301803119012546 0.4943679121187219 ...