Issue accessing an HDFS file inside a Spark map function - Scala

My use case requires accessing a file stored in HDFS from inside a Spark map function. This use case uses a custom input format that does not provide any data to the map function; instead, the map function obtains the input split and accesses the data itself. I am using the code below to do this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.rdd.NewHadoopRDD

val hConf: Configuration = sc.hadoopConfiguration
hConf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hConf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)

val job = new Job(hConf)
FileInputFormat.setInputPaths(job, new Path("hdfs:///user/bala/MyBinaryFile"))

val hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration())

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) => myfuncPart(split, iter) }.collect()
As of now, I am not doing anything inside myfuncPart. It simply returns a map as below:
iter.map { tpl ⇒ (tpl._1, tpl._2.getCapacity) }
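For completeness, a minimal sketch of how such a myfuncPart could be declared to match mapPartitionsWithInputSplit; the signature here is an assumption, and only the body is the one-liner above:
import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.InputSplit

// Assumed signature: mapPartitionsWithInputSplit hands each task its InputSplit
// together with the record iterator produced by the input format.
def myfuncPart(split: InputSplit, iter: Iterator[(IntWritable, BytesWritable)]): Iterator[(IntWritable, Int)] =
  iter.map { tpl => (tpl._1, tpl._2.getCapacity) }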
When I submit the job along with the dependencies, I get the error below:
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
At first glance, it seems like a small error related to the Spark jars, but I could not crack it. Any help will be greatly appreciated.

It turned out to be a mistake on my side in the way I was launching the job. The command I was using was missing an option; hence the issue. I was using the command below:
spark-submit --class org.myclass --jars myjar spark://myhost:7077 myjob.jar
Below is the correct one:
spark-submit --class org.myclass --jars myjar --master spark://myhost:7077 myjob.jar
This is a small mistake, but somehow I missed it. Now it is working. Without --master, spark-submit apparently treated spark://myhost:7077 as the application resource, which would explain why Hadoop's FileSystem complained about the unknown scheme "spark".

Related

Pass opt arguments in an application executed as a .jar through spark-submit --class and use the existing context

I am writing a Scala project whose classes I want to be executable from spark-submit as a jar (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the Spark context configuration that the user sets when doing a spark-submit and optionally overwrite some parameters like the application name. Example: spark-submit --num-executors 6 --class org.project should set the number of executors to 6 in the Spark context configuration.
I want to be able to pass optional parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (possibly by avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args input of the main method of class org.project.
My progress on those problems is the following:
I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
in my main method, but I am not sure if this does things as expected.
Spark considers those optional arguments to be arguments of spark-submit itself and outputs an error.
Note 1: My class currently does not inherit from any other class.
Note 2: I am new to the world of Spark and couldn't find anything relevant from a basic search.
You will have to handle parameter parsing yourself. Here we use Scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them using your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
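As an illustration, a minimal sketch of that pattern using scopt 3.x; only the --inputFile and --verbose option names come from the question, the rest (object name, config fields) is an assumption:
import org.apache.spark.sql.SparkSession
import scopt.OptionParser

// Hypothetical application options; only inputFile and verbose are taken from the question
case class AppConfig(inputFile: String = "", verbose: Boolean = false)

object Project {
  def main(args: Array[String]): Unit = {
    val parser = new OptionParser[AppConfig]("project") {
      opt[String]("inputFile").action((x, c) => c.copy(inputFile = x)).text("path to the input file")
      opt[Unit]("verbose").action((_, c) => c.copy(verbose = true)).text("enable verbose output")
    }

    parser.parse(args, AppConfig()) match {
      case Some(config) =>
        // getOrCreate picks up whatever spark-submit already set (e.g. --num-executors);
        // appName here only overrides the application name
        val spark = SparkSession.builder().appName("project").getOrCreate()
        // ... run the job using config.inputFile and config.verbose ...
        spark.stop()
      case None =>
        sys.exit(1) // scopt has already printed the usage message
    }
  }
}
Note that the application's own options must come after the application jar on the spark-submit command line, e.g. spark-submit --num-executors 6 --class org.project myjob.jar --inputFile ./data/mystery.txt (the jar name is a placeholder), so that spark-submit does not try to interpret them itself.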

Spark Pipe function throws No such file or directory

I am running the Spark pipe function on the EMR master node in the REPL just to test out the pipe functionality. I am using the following examples:
https://stackoverflow.com/a/32978183/8876462
http://blog.madhukaraphatak.com/pipe-in-spark/
http://hadoop-makeitsimple.blogspot.com/2016/05/pipe-in-spark.html
This is my code:
import org.apache.spark._
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
I have tried different things, like making the file executable and placing the file in /usr/lib/spark/bin as suggested in another post. I also changed distScript to
"file:///home/hadoop/PipeEx.sh"
I always get "No such file or directory" for the tmp/spark*/userFiles* location. I have tried to access and run the shell script from that tmp location, and it runs fine.
My shell script is the same as http://blog.madhukaraphatak.com/pipe-in-spark/
Here is the first part of the log:
[Stage 9:> (0 + 2) / 2]
18/03/19 19:58:22 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 72, ip-172-31-42-11.ec2.internal, executor 9): java.io.IOException: Cannot run program "/mnt/tmp/spark-bdd582ec-a5ac-4bb1-874e-832cd5427b18/userFiles-497f6051-6f49-4268-b9c5-a28c2ad5edc6/PipeEx.sh": error=2, No such file or directory
Does anyone have any idea? I am using Spark 2.2.1 and Scala 2.11.8.
Thanks
I was able to solve this once I removed the SparkFiles.get(distScriptName) call.
So my final code looks like this:
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "./PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(distScriptName)
opData.collect().foreach(println)
I am not very sure why removing SparkFiles.get() solved the problem. Most likely, SparkFiles.get() was evaluated on the driver and returned a driver-local path that does not exist on the executors, whereas addFile ships the script to each executor's working directory, so the relative path ./PipeEx.sh resolves there.
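A small sketch of that (assumed) explanation, reusing the names from the snippets above:
import org.apache.spark.SparkFiles

sc.addFile("/home/hadoop/PipeEx.sh")

// Evaluated eagerly on the driver: this is the *driver's* local copy of the file,
// which the executors cannot see under the same path.
val driverLocalPath = SparkFiles.get("PipeEx.sh")

// Resolved on each executor: addFile also places PipeEx.sh in the executor's
// working directory, so a relative path works from inside pipe().
val opData = ipData.pipe("./PipeEx.sh")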

Flink Read AvroInputFormat PROCESS_CONTINUOUSLY

I am new to Flink (v1.3.2) and trying to read Avro records continuously in Scala on EMR. I know that for text files you can use something like the following, and it will keep running and scanning the directory:
val stream = env.readFile(
inputFormat = textInputFormat,
filePath = path,
watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
interval = Time.seconds(10).toMilliseconds
)
Is there a similar way in Flink for Avro records? I have the following code:
val textInputFormat = new AvroInputFormat(new Path(path), classOf[User])
textInputFormat.setNestedFileEnumeration(true)
val avroInputStream = env.createInput(textInputFormat)
val output = avroInputStream.map(line => (line.getUserID, 1))
.keyBy(0)
.timeWindow(Time.seconds(10))
.sum(1)
output.print()
I am able to see the output, but then Flink switches to FINISHED. I still want the job to keep running and waiting for any new files that arrive in the future. Is there something like FileProcessingMode.PROCESS_CONTINUOUSLY for this? Please suggest!
I figured this out by setting up a Flink YARN session on EMR and making it run with PROCESS_CONTINUOUSLY:
env.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
Create a new Flink YARN session using flink-yarn-session -n 2 -d
Get the application_id using yarn application -list; for example, it is application_0000000000_0002
Attach the flink run job to that application_id: flink run -m yarn-cluster -yid application_0000000000_0002 xxx.jar
More detail can now be found in the EMR documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html
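Putting the question's AvroInputFormat together with the readFile call above, a sketch could look like the following. Note that the AvroInputFormat import path varies by Flink version (org.apache.flink.api.java.io.AvroInputFormat in older releases, org.apache.flink.formats.avro.AvroInputFormat in newer ones); User, path, and the 100 ms scan interval come from the snippets above.
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.windowing.time.Time
// plus the version-specific AvroInputFormat import noted above

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Avro input format from the question, with nested directory enumeration enabled
val avroFormat = new AvroInputFormat(new Path(path), classOf[User])
avroFormat.setNestedFileEnumeration(true)

// Keep scanning `path` for new files instead of finishing after one pass
val avroStream = env.readFile(avroFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)

avroStream
  .map(record => (record.getUserID, 1))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)
  .print()

env.execute("avro-continuous-read")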

Spark Interactive/Adhoc Job which can take Dynamic Arguments for Spark Context

I am looking for a solution for an interactive/ad hoc Spark job. I have some arguments that I need to pass to my Spark job. This is fine, but I want to pass these arguments as selected by the user from a dropdown menu.
So, for example, the spark-submit job looks something like the one below, with the following arguments: "prod /opt/var/var-spark/application.conf vc FX yes yes yes".
$SPARK_HOME/bin/spark-submit \
--class main.gmr.SparkGMR \
--deploy-mode client \
--verbose \
--driver-java-options "-Dlog4j.configuration=file:///opt/var/spark-2.1.0-bin-hadoop2.7/conf/log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/var/spark-2.1.0-bin-hadoop2.7/conf/log4j.properties" \
file:///opt/var/var-spark/var-spark-assembly-1.0.jar \
prod /opt/var/var-spark/application.conf vc FX yes yes yes
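For reference, those positional arguments after the jar arrive in the args array of the main method. A minimal sketch — the class name comes from the --class option above, the local variable names are hypothetical:
package main.gmr

object SparkGMR {
  def main(args: Array[String]): Unit = {
    // args(0) = "prod", args(1) = "/opt/var/var-spark/application.conf",
    // args(2) = "vc", args(3) = "FX", args(4), args(5), args(6) = "yes"
    val env        = args(0)
    val configPath = args(1)
    // ... build the SparkSession and run the cached-DataFrame queries with these values ...
  }
}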
Now I want to keep this job always running, because it caches many DataFrames in memory that can be used for later analysis. But the problem is that the job dies, and the in-memory DataFrames/views are not there any more.
Also, I want to submit different arguments to this job next time, e.g.
"prod /opt/var/var-spark/application.conf sc dx yes yes yes".
Approaches I tried: I tried to use the Livy API /batches endpoint to submit my job with arguments, but the job starts, does its processing, and then dies. The /sessions endpoint would be the ideal choice, but it does not allow me to submit the class name and the argument parameters in the request.
I also tried to use Spark Structured Streaming with the following code to get the arguments from the DataFrame, but it fails with the error below:
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 4042)
.load();
import spark.implicits._;
val words = lines.as[String].flatMap(_.split(",")).collect;
var a = words(0);
var b = words(1);
var c = words(2);
var d = words(3);
var e = words(4);
val wordCounts = words.groupBy("value").count();
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
ERROR:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; textSocket
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
I want to get the arguments into a, b, c, d, and e from the code above, so that I can then pass those dynamic arguments to the queries that run in my job.
Any clues or other approaches would be appreciated.
Thanks

reading compressed file in spark with scala

I am trying to read the content of a .gz file in Spark/Scala into a DataFrame/RDD using the following code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println)
The .gz file is 28 MB, and when I do the spark-submit using this command:
spark-submit --class sample --master local[*] target\spark.jar
it gives me a Java heap space error in the console.
Is this the best way of reading a .gz file, and if yes, how could I solve the Java heap space issue?
Thanks
Disclaimer: That code and description will purely read in a small compressed text file using Spark, collect it back to the driver as an array, and print the entire contents of the file to the console. The number of ways and reasons to do this outside Spark far outnumber the reasons to do it in Spark.
1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats); see the sketch after this list.
2) Or at least use sc.textFile() instead of wholeTextFiles().
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running local, it is not network bound). Add the --driver-memory option to spark-submit to increase driver memory if you MUST do the collect.
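A minimal sketch of option 1 (the path is the same placeholder as in the question, the app name is arbitrary):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sample").getOrCreate()

// spark.read.text decompresses .gz files transparently; each line becomes a row
val df = spark.read.text("path to gz file")

// Inspect a few rows instead of collecting the whole file onto the driver
df.show(20, truncate = false)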