Spark word count in Scala (running in Apache Sandbox)

I am trying to do a word count lab in Spark with Scala. I am able to load the text file into an RDD, but when I chain .flatMap, .map, and .reduceByKey, I receive the attached error message. I am new to this, so any help would be greatly appreciated.

Your program is failing because it could not find the file on Hadoop. You need to specify the file in the following format:
sc.textFile("hdfs://namenodedetails:8020/input.txt")

You need to give the fully qualified path of the file. Since Spark builds a dependency graph and only evaluates it lazily when an action is called, the error surfaces at the point where you call an action.
It is better to debug right after reading the file from HDFS, using the .first or .take(n) methods.
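Once the path is right, a rough sketch of the whole lab in the spark-shell might look like this (the namenode host, port, and file name are placeholders for your own sandbox); the early .take(5) forces an action right after the read, so a bad path fails there rather than deep inside the word-count pipeline:

// Fully qualified HDFS path -- replace the namenode host/port and file name with your cluster's values.
val lines = sc.textFile("hdfs://namenodedetails:8020/input.txt")

// Force an action immediately: if the path is wrong, the failure shows up here.
lines.take(5).foreach(println)

// The actual word count, evaluated lazily until an action is called.
val counts = lines
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.take(10).foreach(println)        // another action to inspect the result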

Related

Spark giving multiple datasource error on saving parquet file

I am trying to learn Spark and Scala. When I try to write the resulting DataFrame to a parquet file by calling the parquet method, I get the error below.
Code that fails:
df2.write.mode(SaveMode.Overwrite).parquet(outputPath)
This also fails:
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).parquet(outputPath)
Error log:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:707)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:967)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:304)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
However, if I call save instead, the code works properly.
This works fine:
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).save(outputPath)
Although I have a workaround for the issue, I'd like to understand why the first approach does not work and how I can fix it.
The versions I am using are:
Scala 2.12.9
Java 1.8
Spark 2.4.4
P.S. This issue is only seen with spark-submit.
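For context, df2.write.parquet(outputPath) is shorthand for format("parquet").save(outputPath), so Spark resolves the short name "parquet" through its DataSourceRegister lookup, and the exception says that lookup found two registered implementations (the V1 ParquetFileFormat and the V2 ParquetDataSourceV2). That usually means two different Spark builds end up on the runtime classpath, which fits the fact that the problem only appears with spark-submit: an assembled jar can bundle its own spark-sql alongside the cluster's. A hedged sketch of the usual build-side fix, assuming an sbt build (the versions below are illustrative, not taken from the question), is to mark the Spark artifacts as provided so they stay out of the assembly:

// build.sbt -- illustrative sketch only.
scalaVersion := "2.12.9"

libraryDependencies ++= Seq(
  // "provided" keeps spark-core/spark-sql out of the assembled jar, so
  // spark-submit sees only the cluster's own Spark classes and the
  // "parquet" short name resolves to a single implementation.
  "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.4" % "provided"
)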

MS Azure, Zeppelin: load Scala file

I am following the O'Reilly book "Advanced Analytics with Spark". It seems that they expect you to use the command shell (PuTTY) to follow the examples in the book, but I don't want to use that; I'd prefer to use Zeppelin, so that I can create notebooks and put my own comments into the code, etc.
So, using an Azure subscription, I spin up a Spark cluster and go into Zeppelin. I am able to follow the guide fine for the most part, but there is one bit that trips me up, and it's probably pretty basic.
You are asked to create a Scala file called "StatsWithMissing.scala" with code in it. I do that and upload it to blob storage at //user/Zeppelin (this is where I expect the Zeppelin user directory to be).
Then it asks you to run the following:
":load StatsWithMissing.scala"
At this point it gives the error:
:1: error: illegal start of definition
My first question is: where exactly is this Scala file supposed to be in blob storage for Zeppelin to see it? How do I determine that? Is where I am putting it correct?
And second, what does this message mean? Does it not like the :load statement?
I believe the interpreter set at the top of the page is Livy, and that covers Scala.
Any help would be great.
Regards
Conor
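One observation that may help with the second question: :load is a command of the interactive Scala REPL (it is what spark-shell understands), not Scala source code, so if the Livy interpreter hands the line to the compiler as ordinary Scala it will trip over the leading colon, which is consistent with the "illegal start of definition" message. Under that assumption, a common workaround is to skip :load and paste the file's definitions directly into a notebook paragraph, for example:

// Zeppelin paragraph using the same Livy/Spark interpreter as the rest of the note.
// Paste the real contents of StatsWithMissing.scala from the book here; the body
// below is only a placeholder to show the shape.
object StatsWithMissing {
  // def statsWithMissing(...) = ...
}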

Scala + Spark: ways to pass parameters in a program. Is it possible to use the Context for this?

I am wondering if it is possible to pass parameters around in a Scala Spark program using the context or something similar. I mean, I read some parameters from spark-submit inside my app, but those parameters are only needed "at the end", so to speak. So I have to pass them from the driver to another file, and then to another file, and so on, and my method calls end up with a huge list of parameters.
Thank you in advance!
The key point to understand is that you provide spark-submit with the application jar file and any command-line parameters you wish spark-submit to pass along when invoking the jar.
My understanding is that you only need some of those parameters at the very end of execution and you do not want to carry all those arguments through nested function calls. I would say there is definite scope for refactoring the design.
In any case, one trick you can employ is to write those parameters to a JSON file and make it available to be read by your Spark application when necessary (I would write those parameters to AWS S3 and read them when needed).
Or you can create an implicit variable and carry it throughout the code, which I believe would not be a good design.
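As a rough sketch of that implicit-parameter idea (all names here are illustrative, not from the question): parse the spark-submit arguments once in the driver into a small config case class, mark it implicit, and let the intermediate methods forward it without ever naming the individual fields:

// Hypothetical config -- the field names are examples only.
case class JobConfig(outputPath: String, threshold: Double)

object Pipeline {
  // Intermediate steps just forward the implicit without touching its fields.
  def step1(data: Seq[Int])(implicit cfg: JobConfig): Seq[Int] =
    step2(data.map(_ * 2))

  def step2(data: Seq[Int])(implicit cfg: JobConfig): Seq[Int] =
    finalStep(data)

  // Only the final step actually uses the parameters.
  def finalStep(data: Seq[Int])(implicit cfg: JobConfig): Seq[Int] =
    data.filter(_ > cfg.threshold)
}

object Main {
  def main(args: Array[String]): Unit = {
    // Parse once from the spark-submit arguments, then let the implicit flow through.
    implicit val cfg: JobConfig = JobConfig(outputPath = args(0), threshold = args(1).toDouble)
    println(Pipeline.step1(Seq(1, 2, 3)))
  }
}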

Spark cannot find case class on classpath

I have an issue where Spark is failing to generate code for a case class. Here is the Spark error:
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 52, Column 43: Identifier expected instead of '.'
Here is the referenced line in the generated code:
/* 052 */ private com.avro.message.video.public.MetricObservation MapObjects_loopValue34;
It should be noted that com.avro.message.video.public.MetricObservation is a nested case class that is part of a larger hierarchy, and it is used fine elsewhere in the code. It should also be noted that this pipeline works fine if I use the RDD API, but I want to use the Dataset API because I want to write the Dataset out as parquet. Has anyone seen this issue before?
I'm using Scala 2.11 and Spark 2.1.0. I was able to upgrade to Spark 2.2.1, and the issue is still there.
Do you think that SI-7555 or something like it has any bearing on this? I have noticed in the past that Scala reflection has had issues generating TypeTags for statically nested classes. Do you think something like that is going on, or is this strictly a Catalyst issue in Spark? You might want to file a Spark ticket too.
So it turns out that changing the package name of the affected class "fixes" the problem (i.e., makes it go away). I really have no idea why this is, or even how to reproduce it in a small test case. What worked for me was just creating a higher-level package that works, specifically com.avro.message.video.public -> com.avro.message.publicVideo.
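A plausible explanation, not confirmed in the thread: public is not a keyword in Scala, so the original package compiles fine, but it is a reserved word in Java, and the Dataset encoders generate Java source (compiled by Janino) that spells out the fully qualified class name, so com.avro.message.video.public.MetricObservation is not a legal identifier chain in the generated code. That would also fit the RDD API working (no Java code generation involved) and the rename making the error disappear. A minimal sketch of the rename, with a hypothetical class body since the real fields are not shown in the question:

// Before (legal Scala, but 'public' is a reserved word in the generated Java):
//   package com.avro.message.video.public
// After the rename described above:
package com.avro.message.publicVideo

// Hypothetical fields -- the real MetricObservation's members are not shown.
case class MetricObservation(metricName: String, value: Double)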

How can we subscribe to resources in Scala?

How can we subscribe to a file present as a resource in a Scala project, so that any live changes to the file can be detected by the service?
For example, say there is Scala code that calculates the sum of the numbers in a text file: how do we subscribe to that file in the code so that the program can react immediately when new numbers are added to the file?
In Scala you can use Java classes and APIs.
You can use the Java Watch Service API in java.nio.file.
You can read about it here.
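A minimal sketch of that approach, assuming the numbers live in a plain text file on disk (the directory and file name below are placeholders): register the file's parent directory with a java.nio.file.WatchService and recompute the sum whenever a modify event for that file arrives.

import java.nio.file.{FileSystems, Paths, StandardWatchEventKinds}
import scala.collection.JavaConverters._
import scala.io.Source

object FileSumWatcher {
  // Placeholder location of the numbers file -- adjust to your project layout.
  private val dir      = Paths.get("data")
  private val fileName = "numbers.txt"

  // Re-read the file and sum one number per line.
  private def currentSum(): Long = {
    val src = Source.fromFile(dir.resolve(fileName).toFile)
    try src.getLines().map(_.trim).filter(_.nonEmpty).map(_.toLong).sum
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    println(s"Initial sum: ${currentSum()}")

    // WatchService watches directories, not individual files, so register the
    // parent directory and filter the events down to our file name.
    val watcher = FileSystems.getDefault.newWatchService()
    dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY)

    while (true) {
      val key = watcher.take()                  // blocks until a change occurs
      key.pollEvents().asScala.foreach { event =>
        if (event.context().toString == fileName)
          println(s"File changed, new sum: ${currentSum()}")
      }
      key.reset()                               // re-arm the key for further events
    }
  }
}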