Is it possible to read a modified json file without rebuilding the project in Scala?

After I change the configuration in the app.json file, the new changes are not applied when the project starts, and I have to rebuild the project (build a new jar file). Is it possible to make this code read the modified app.json without rebuilding the whole project?
Below is the package that reads data from the app.json file:
import net.liftweb.json._
import scala.io._

case class KafkaConfiguration(bootstrap_servers: String,
                              topic_extractor: String,
                              topic_vertica: String,
                              message_max_bytes: String,
                              group_id: String)
case class HbaseConfiguration(table_name: String = "test", batch_size: Int)
case class SparkConfiguration(id: String, frequency: Int)
case class VerticaConfiguration(vertica_delimiter: String, vertica_qv: String)

case class Configuration(kafka: KafkaConfiguration,
                         spark: SparkConfiguration,
                         hbase: HbaseConfiguration,
                         vertica: VerticaConfiguration)

object Configuration {
  def getConfiguration(filePath: String = "app.json"): Configuration = {
    implicit val formats = DefaultFormats
    val json = Source.fromURL(getClass.getClassLoader.getResource(filePath), "utf-8").mkString
    val jValue = parse(json)
    val kafkaConf = jValue.\("kafka").extract[KafkaConfiguration]
    val sparkConfig = jValue.\("spark").extract[SparkConfiguration]
    val hbaseConfig = jValue.\("hbase").extract[HbaseConfiguration]
    val verticaConfig = jValue.\("vertica").extract[VerticaConfiguration]
    val configuration = new Configuration(kafkaConf, sparkConfig, hbaseConfig, verticaConfig)
    configuration
  }
}
Script to run my application:
. /etc/spark2/conf/spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_201-amd64
export PATH=$PATH:$HOME/bin
export SPARK_KAFKA_VERSION=0.10
CONFIG_FILE="app.json"
COMMAND="spark2-submit \
--master yarn \
--num-executors 4 \
--driver-memory 1g \
--executor-memory 2g \
--deploy-mode cluster \
--name "app1" \
--class DataPointStreaming ./app1.jar $CONFIG_FILE"
echo $COMMAND
exec $COMMAND

Since you're using Spark, taking advantage of HOCON's environment variable substitution or even JVM properties might not be workable, as the JAR will be shipped across a cluster.
It is possible to replace a file in a JAR (or, more formally, to build a new JAR based on a given JAR with a file added). Something along these lines in bash should work:
#!/bin/sh
# $1 = path to the original JAR, $2 = path to the JSON config to patch in
base=$( basename -- "$1" )
filename="${base%.*}"
config=$( basename -- "$2" )
confignoext="${config%.*}"
tmpdir=$( mktemp -d )
tmpdest="$tmpdir/$base"
confdest="$tmpdir/app.json"
cp "$1" "$tmpdest"
cp "$2" "$confdest"
# add/replace app.json at the root of the copied JAR (a JAR is just a ZIP archive)
zip -jur "$tmpdest" "$confdest"
mkdir -p build/configured
dest="build/configured/$filename-$confignoext.jar"
cp "$tmpdest" "$dest"
echo "$dest"
Then you don't have to recompile anything to build a jar; you just keep a set of JSON files to patch into the jar for deployment (e.g. app-dev.json, app-prod.json, etc.).
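For illustration, a hypothetical invocation (assuming the script above is saved as patch-config.sh and your per-environment configs live under conf/, both of which are just placeholder names) might look like:
./patch-config.sh app1.jar conf/app-prod.json
# prints: build/configured/app1-app-prod.jar
spark2-submit --master yarn --deploy-mode cluster \
  --class DataPointStreaming build/configured/app1-app-prod.jar app.json
The application jar itself is never rebuilt; only the copy gets its app.json replaced, so switching environments is just a matter of patching in a different JSON file.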

Related

Dataproc job Java API method setLoggingConfig has no effect

I'm using a Groovy script with the dependency com.google.cloud:google-cloud-dataproc:2.3.2 and trying to set the logging config with code like this:
import com.google.cloud.dataproc.v1.*
...
final def LOGGING_LEVELS = ['com.example.Myclass': LoggingConfig.Level.DEBUG,'org.apache.spark': LoggingConfig.Level.WARN]
final def args = 'programArg1 programArg2'
def sparkJob = SparkJob
    .newBuilder()
    .addJarFileUris(jarLocation)
    .setMainClass(className)
    .addAllArgs(args.split(" ") as Iterable<String>)
    .setLoggingConfig(LoggingConfig.newBuilder().putAllDriverLogLevels(LOGGING_LEVELS).build())
    .build()
def job = Job.newBuilder().setPlacement(jobPlacement).setSparkJob(sparkJob).build()
...
and it has no effect.
However, when I submit the job via the gcloud utility, it works fine:
gcloud dataproc jobs submit spark \
--driver-log-levels com.example.Myclass=DEBUG,org.apache.spark=WARN \
--project=my-project \
--cluster=my-cluster \
--region=us-central1 \
--class=Myclass \
--jars=gs://mybucket/Myclass-0.1-SNAPSHOT.jar \
-- programArg1 programArg2

Executing Spark scala program after compilation

I have compiled a Spark Scala program on the command line, but now I want to execute it. I don't want to use Maven or sbt.
I have used this command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error:
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf

/** Create a RDD of lines from a text file, and keep count of
 *  how often each word appears.
 */
object wordcount1 {
  def main(args: Array[String]) {
    // Set up a SparkContext named WordCount that runs locally using
    // all available cores.
    println("before conf")
    val conf = new SparkConf().setAppName("WordCount")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("after the textfile")
    // Create a RDD of lines of text in our book
    val input = sc.textFile("book.txt")
    println("after the textfile")
    // Use flatMap to convert this into an rdd of each word in each line
    val words = input.flatMap(line => line.split(' '))
    // Convert these words to lowercase
    val lowerCaseWords = words.map(word => word.toLowerCase())
    // Count up the occurrence of each unique word
    println("before text file")
    val wordCounts = lowerCaseWords.countByValue()
    // Print the first 20 results
    val sample = wordCounts.take(20)
    for ((word, count) <- sample) {
      println(word + " " + count)
    }
    sc.stop()
  }
}
It is showing that the error is at this line:
val conf = new SparkConf().setAppName("WordCount")
Any help?
Starting from Spark 2.0, the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("App Name")
  .getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
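As a minimal sketch (not your exact program; book.txt and the local master are placeholders taken from your code, and the master setting should be dropped or overridden when submitting to a cluster), the word count could be rewritten against SparkSession like this:

import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("WordCount")
      .master("local[*]")   // placeholder: remove when submitting to a cluster
      .getOrCreate()

    // split the text into lowercase words and count each one
    val wordCounts = spark.sparkContext
      .textFile("book.txt")
      .flatMap(_.split(' '))
      .map(_.toLowerCase)
      .countByValue()

    // print the first 20 results
    wordCounts.take(20).foreach { case (word, count) => println(word + " " + count) }

    spark.stop()
  }
}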
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar

Spark MLLib unable to write out to S3 : path already exists

I have data in an S3 bucket under the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 under the directory /data/spark.
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}
object Vw2SparkLdaFormatConverter {
def repeater(s: String): String = {
val ssplit = s.split(':')
(ssplit(0) + ' ') * ssplit(1).toInt
}
def main(args: Array[String]) {
val inputPath = args(0)
val outputPath = args(1)
val conf = new SparkConf().setAppName("FormatConverter")
val sc = new SparkContext(conf)
val vwdata = sc.textFile(inputPath)
val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
val coalescedSparkData = sparkdata.coalesce(100)
coalescedSparkData.saveAsTextFile(outputPath)
sc.stop()
}
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that they do not exist before the step is run. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark, so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
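As a slightly fuller sketch (assuming a SparkSession named spark is in scope, since toDF needs its implicits; the column name "line" is arbitrary):

import spark.implicits._

coalescedSparkData
  .toDF("line")
  .write
  .mode("overwrite")   // silently replaces s3a://mybucket/data/spark if it already exists
  .text(outputPath)    // or .csv(outputPath)

Note that overwrite deletes whatever is already at the output path, so only use it if losing the previous contents is acceptable.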
Or if you insist on using RDD methods, you can do as described already in this answer

Apache Flink ALS with ids in Long instead of Int

I am trying the code of ALS in Flink version 1.1.3 using:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala \
-DarchetypeVersion=1.1.3 \
-DgroupId=org.apache.flink.quickstart \
-DartifactId=flink-scala-project \
-Dversion=0.1 \
-Dpackage=org.apache.flink.quickstart \
-DinteractiveMode=false
I am following the example code at https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/als.html and changed Int to Long in the DataSet:
val env = ExecutionEnvironment.getExecutionEnvironment
val csvInput: DataSet[(Long, Long, Double)] = env.readCsvFile[(Long, Long, Double)]("tmp-contactos.csv")

// Setup the ALS learner
val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(100)

// Set the other parameters via a parameter map
val parameters = ParameterMap()
  .add(ALS.Lambda, 0.9)
  .add(ALS.Seed, 42L)

// Calculate the factorization
als.fit(csvInput, parameters)
But it throws at runtime:
Exception in thread "main" java.lang.RuntimeException: There is no FitOperation defined for org.apache.flink.ml.recommendation.ALS which trains on a DataSet[(Long, Int, Double)]
at org.apache.flink.ml.pipeline.Estimator$$anon$4.fit(Estimator.scala:85)
at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
at org.apache.flink.quickstart.BatchJob$.main(BatchJob.scala:119)
at org.apache.flink.quickstart.BatchJob.main(BatchJob.scala)
Is it possible to use Longs instead of Ints?
I searched and found this for version 0.9, but nothing for 1.1.3:
https://issues.apache.org/jira/browse/FLINK-2211
So far it is not officially supported, but I've created a branch where I've fixed this limitation. You can try out this branch. I'll contribute it to Flink so that it should become part of master soon.
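In the meantime, if your Long IDs happen to fit into the Int range (or can be re-indexed so that they do), a hypothetical workaround is to downcast them before calling fit, since the released ALS is typed on Int IDs:

// Hypothetical workaround: only valid if every ID fits into an Int
val intInput: DataSet[(Int, Int, Double)] =
  csvInput.map { case (user, item, rating) => (user.toInt, item.toInt, rating) }

als.fit(intInput, parameters)

This loses nothing as long as the IDs are small enough; for genuinely 64-bit IDs you would need the branch mentioned above.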

Run Scala Program with Spark on Hadoop

I have created a Scala program that searches for a word in a text file.
I created the Scala file with Eclipse, then compiled and built a jar with sbt and sbt assembly. After that I ran the jar with Spark locally and it works correctly.
Now I want to try running this program using Spark on Hadoop. I have 1 master and 2 worker machines.
Do I have to change the code? And what command do I run in the shell of the master?
I have created a bucket and I have put the text file in Hadoop.
This is my code:
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object wordcount {
  def main(args: Array[String]) {
    // set spark context
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val distFile = sc.textFile("bible.txt")
    print("Enter word to look for in the HOLY BIBLE: ")
    val word = Console.readLine
    var count = 0
    var finalCount = 0
    println("You entered " + word)
    val input = sc.textFile("bible.txt")
    val splitedLines = input.flatMap(line => line.split(" "))
      .filter(x => x.equals(word))
    System.out.println("The word " + word + " appears " + splitedLines.count())
  }
}
Thanks all
Just change the following line,
val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
to
val conf = new SparkConf().setAppName("wordcount")
This way you do not have to modify the code whenever you want to switch from local mode to cluster mode. The master option can be passed via the spark-submit command, with the application jar as the last argument:
spark-submit --class wordcount --master <master-url> wordcount.jar
and if you want to run your program locally, use the following command:
spark-submit --class wordcount --master local[*] wordcount.jar
Here is the list of master options that you can set while running the application.
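Separately (this is a suggestion, not part of the answer above), reading the search word with Console.readLine is unlikely to work in cluster deploy mode, since the driver does not run in your terminal. A common pattern is to pass the word and the input path as program arguments; a minimal sketch, with hypothetical argument values:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical adaptation: the word and the input path come from spark-submit arguments
object wordcount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wordcount")   // master is supplied by spark-submit
    val sc = new SparkContext(conf)
    val word = args(0)                                   // e.g. love
    val path = args(1)                                   // e.g. hdfs:///user/me/bible.txt
    val occurrences = sc.textFile(path)
      .flatMap(_.split(" "))
      .filter(_ == word)
      .count()
    println("The word " + word + " appears " + occurrences + " times")
    sc.stop()
  }
}

The arguments then go after the jar on the command line, for example: spark-submit --class wordcount --master <master-url> wordcount.jar love hdfs:///user/me/bible.txt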