Dataproc job Java API method setLoggingConfig has no effect - google-cloud-dataproc

I'm using a Groovy script with the dependency com.google.cloud:google-cloud-dataproc:2.3.2 and I'm trying to set the logging config with code like this:
import com.google.cloud.dataproc.v1.*
...
final def LOGGING_LEVELS = ['com.example.Myclass': LoggingConfig.Level.DEBUG,'org.apache.spark': LoggingConfig.Level.WARN]
final def args = 'programArg1 programArg2'
def sparkJob = SparkJob
.newBuilder()
.addJarFileUris(jarLocation)
.setMainClass(className)
.addAllArgs(args.split(" ") as Iterable<String>)
.setLoggingConfig(LoggingConfig.newBuilder().putAllDriverLogLevels(LOGGING_LEVELS).build())
.build()
def job = Job.newBuilder().setPlacement(jobPlacement).setSparkJob(sparkJob).build()
...
and it has no effect.
However, when I submit the job via the gcloud utility, it works fine:
gcloud dataproc jobs submit spark \
--driver-log-levels com.example.Myclass=DEBUG,org.apache.spark=WARN \
--project=my-project \
--cluster=my-cluster \
--region=us-central1 \
--class=Myclass \
--jars=gs://mybucket/Myclass-0.1-SNAPSHOT.jar \
-- programArg1 programArg2
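For completeness, the snippet above elides how the built Job is actually submitted. A minimal Scala-syntax sketch of that step, assuming the standard v1 JobControllerClient and reusing the project and region from the gcloud command above (the submission path is an assumption, since it is not shown in the question):
import com.google.cloud.dataproc.v1.{JobControllerClient, JobControllerSettings}

// Hypothetical submission step, not taken from the question; the endpoint, project
// and region mirror the gcloud command above.
val settings = JobControllerSettings.newBuilder()
  .setEndpoint("us-central1-dataproc.googleapis.com:443")
  .build()
val jobClient = JobControllerClient.create(settings)
try {
  // submitJob(projectId, region, job) returns the submitted Job resource
  val submitted = jobClient.submitJob("my-project", "us-central1", job)
  println(submitted.getReference.getJobId)
} finally {
  jobClient.close()
}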

Related

Is it possible to read a modified json file without rebuilding the project in Scala?

After I change the configuration in the app.json file, the new changes are not applied when the project starts, and I have to rebuild the project (build a new jar file). Is it possible to make this code read the modified app.json without rebuilding the whole project?
Below is the package that reads data from the app.json file:
import net.liftweb.json._
import scala.io._
case class KafkaConfiguration(bootstrap_servers: String,
                              topic_extractor: String,
                              topic_vertica: String,
                              message_max_bytes: String,
                              group_id: String)
case class HbaseConfiguration(table_name: String = "test", batch_size: Int)
case class SparkConfiguration(id: String, frequency: Int)
case class VerticaConfiguration(vertica_delimiter: String, vertica_qv: String)
case class Configuration(kafka: KafkaConfiguration,
                         spark: SparkConfiguration,
                         hbase: HbaseConfiguration,
                         vertica: VerticaConfiguration)

object Configuration {
  def getConfiguration(filePath: String = "app.json"): Configuration = {
    implicit val formats = DefaultFormats
    val json = Source.fromURL(getClass.getClassLoader.getResource(filePath), "utf-8").mkString
    val jValue = parse(json)
    val kafkaConf = jValue.\("kafka").extract[KafkaConfiguration]
    val sparkConfig = jValue.\("spark").extract[SparkConfiguration]
    val hbaseConfig = jValue.\("hbase").extract[HbaseConfiguration]
    val verticaConfig = jValue.\("vertica").extract[VerticaConfiguration]
    val configuration: Configuration = new Configuration(kafkaConf, sparkConfig, hbaseConfig, verticaConfig)
    configuration
  }
}
Script to run my application:
. /etc/spark2/conf/spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_201-amd64
export PATH=$PATH:$HOME/bin
export SPARK_KAFKA_VERSION=0.10
CONFIG_FILE="app.json"
COMMAND="spark2-submit \
--master yarn \
--num-executors 4 \
--driver-memory 1g \
--executor-memory 2g \
--deploy-mode cluster \
--name "app1" \
--class DataPointStreaming ./app1.jar $CONFIG_FILE"
echo $COMMAND
exec $COMMAND
Since you're using Spark, taking advantage of HOCON's environment variable substitution or even JVM properties might not be workable, as the JAR will be shipped across a cluster.
It is possible to replace a file in a JAR (or, more formally, to build a new JAR based on a given JAR with a file added). Something along these lines in bash can work:
#!/bin/sh
base=$( basename -- "$1" )
filename="${base%.*}"
config=$( basename -- "$2" )
confignoext="${config%.*}"
tmpdir=$( mktemp -d )
tmpdest="$tmpdir/$base"
confdest="$tmpdir/app.json"
cp $1 $tmpdest
cp $2 $confdest
zip -jur $tmpdest $confdest
mkdir -p build/configured
dest=build/configured/$filename-$confignoext.jar
cp $tmpdest $dest
echo $dest
Then you don't have to recompile everything to build a jar; you just keep a set of json files to patch into the jar for deployment (e.g. app-dev.json, app-prod.json, etc.). Invoke the script with the original jar as the first argument and the chosen json file as the second; it prints the path of the patched jar it produces.

Issues in passing application configuration parameters to spark application

I created an object using Spark/Scala to load data from an Oracle source into a Hive table. The database password is passed through an application.conf file via Typesafe's ConfigFactory.
I tried placing application.conf in my user folder, and in another attempt on the classpath, with the spark-submit below.
Every attempt fails with the error "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'",
which suggests the properties are not reaching the ConfigFactory methods.
Can someone help me figure out what I'm missing?
//my object snippet
object LoadFromOracleToHive {
  def SaveToHive(spark: SparkSession): Unit = {
    try {
      val appConf = ConfigFactory.load(s"application.conf").getConfig("my.config")
      val sparkConfig = appConf.getConfig("spark") // config.getConfig("spark")
      val df = spark
        .read
        .format("jdbc")
        .options(Map("password" -> sparkConfig.getString("password"), "driver" -> "oracle.jdbc.driver.OracleDriver"))
//my application.conf
my.config {
  spark {
    password = "password"
  }
}
//my spark-submit
spark-submit --class LoadFromOracleToHive \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--executor-memory 8g \
--num-executors 15 \
--executor-cores 5 \
--conf spark.kryoserializer.buffer.max=512m \
--queue csg \
--jars /home/myuserfolder/ojdbc7.jar /home/myuserfolder/SandeepTest-1.0-SNAPSHOT-jar-with-dependencies.jar \
--queue /home/myuserfolder/application.conf \
--conf spark.driver.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf \
--conf spark.executor.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf
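Note that spark.driver.extraClassPath and spark.executor.extraClassPath take classpath entries, not JVM flags, so the -Dconfig.file values passed that way are most likely ignored. A minimal sketch (hypothetical helper, assuming Typesafe Config is on the classpath) that loads an explicitly given file and falls back to the application.conf bundled in the jar:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical helper: prefer an explicitly supplied file (e.g. a path shipped to the
// driver or an absolute path), otherwise fall back to the application.conf found on
// the classpath.
def loadAppConfig(externalPath: Option[String]): Config = {
  val external = externalPath
    .map(path => ConfigFactory.parseFile(new File(path)))
    .getOrElse(ConfigFactory.empty())
  external.withFallback(ConfigFactory.load()).resolve()
}

// e.g. pass the path as a program argument or via -Dconfig.file in extraJavaOptions
val appConf = loadAppConfig(sys.props.get("config.file")).getConfig("my.config")
val password = appConf.getConfig("spark").getString("password")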

Need a solution on connecting Teradata using Pyspark

I have the below code, which is used to connect the Hadoop environment to Teradata.
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options(url="jdbc:teradata://teradata-dns-sysa.fg.rbc.com",driver="com.teradata.jdbc.TeraDriver",dbtable="table",user="userid",password="xxxxxxxx").load()
Now, the user id and password are different for different users. Hence I'm looking for a solution where the credentials can be stored in a file in a secure location and the code simply refers to the user id and password in that file.
You can use a properties file to store the required user id and password. You can reference the properties file with the --properties-file file_name argument when running the spark-submit command. Below is sample code for this.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Teradata Connect") \
.getOrCreate()
sc = spark.sparkContext
DB_DRIVER = sc._conf.get('spark.DB_DRIVER')
JDBC_URL = sc._conf.get('spark.JDBC_URL')
DB_USER = sc._conf.get('spark.DB_USER')
DB_PASS = sc._conf.get('spark.DB_PASS')
jdbcDF = (spark.read.format("jdbc").option("driver", DB_DRIVER)
.option("url", JDBC_URL)
.option("dbtable", "sql_query")
.option("user", DB_USER)
.option("password", DB_PASS)
.load())
jdbcDF.show(10)
Sample Properties file
spark.DB_DRIVER com.teradata.jdbc.TeraDriver
spark.JDBC_URL jdbc:teradata://teradata-dns-sysa.fg.rbc.com
spark.DB_USER userid
spark.DB_PASS password
Spark submit command
spark2-submit --master yarn \
--deploy-mode cluster \
--properties-file $CONF_FILE \
pyspark_script.py

Executing Spark scala program after compilation

I have compiled a Spark Scala program on the command line. Now I want to execute it. I don't want to use Maven or sbt.
I have used this command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error:
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf
/** Create a RDD of lines from a text file, and keep count of
  * how often each word appears.
  */
object wordcount1 {
  def main(args: Array[String]) {
    // Set up a SparkContext named WordCount that runs locally using
    // all available cores.
    println("before conf")
    val conf = new SparkConf().setAppName("WordCount")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("after the textfile")
    // Create a RDD of lines of text in our book
    val input = sc.textFile("book.txt")
    println("after the textfile")
    // Use flatMap to convert this into an rdd of each word in each line
    val words = input.flatMap(line => line.split(' '))
    // Convert these words to lowercase
    val lowerCaseWords = words.map(word => word.toLowerCase())
    // Count up the occurrence of each unique word
    println("before text file")
    val wordCounts = lowerCaseWords.countByValue()
    // Print the first 20 results
    val sample = wordCounts.take(20)
    for ((word, count) <- sample) {
      println(word + " " + count)
    }
    sc.stop()
  }
}
It shows that the error is at this line:
val conf = new SparkConf().setAppName("WordCount").
Any help?
Starting from Spark 2.0 the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.appName("App Name")
.getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
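Putting that together with the word-count logic from the question, a rough sketch could look like this (the local[*] master is hard-coded purely for illustration; normally it would be set via spark-submit):
import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WordCount")
      .master("local[*]") // illustration only; prefer passing --master to spark-submit
      .getOrCreate()

    // Same logic as the original program, driven through the SparkSession
    val input = spark.sparkContext.textFile("book.txt")
    val wordCounts = input
      .flatMap(_.split(' '))
      .map(_.toLowerCase)
      .countByValue()

    wordCounts.take(20).foreach { case (word, count) => println(s"$word $count") }

    spark.stop()
  }
}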
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar

Apache Flink ALS with ids in Long instead of Int

I am trying the ALS code in Flink version 1.1.3, using a project generated with:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala \
-DarchetypeVersion=1.1.3 \
-DgroupId=org.apache.flink.quickstart \
-DartifactId=flink-scala-project \
-Dversion=0.1 \
-Dpackage=org.apache.flink.quickstart \
-DinteractiveMode=false
I am following the example code at https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/als.html and changed Int to Long in the DataSet:
val env = ExecutionEnvironment.getExecutionEnvironment
val csvInput: DataSet[(Long, Long, Double)] = env.readCsvFile[(Long, Long, Double)]("tmp-contactos.csv")
// Setup the ALS learner
val als = ALS()
.setIterations(10)
.setNumFactors(10)
.setBlocks(100)
// Set the other parameters via a parameter map
val parameters = ParameterMap()
.add(ALS.Lambda, 0.9)
.add(ALS.Seed, 42L)
// Calculate the factorization
als.fit(csvInput, parameters)
But it throws at runtime:
Exception in thread "main" java.lang.RuntimeException: There is no FitOperation defined for org.apache.flink.ml.recommendation.ALS which trains on a DataSet[(Long, Int, Double)]
at org.apache.flink.ml.pipeline.Estimator$$anon$4.fit(Estimator.scala:85)
at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
at org.apache.flink.quickstart.BatchJob$.main(BatchJob.scala:119)
at org.apache.flink.quickstart.BatchJob.main(BatchJob.scala)
Is it possible to use Longs instead of Ints?
I searched and found this for the 0.9 version but nothing for 1.1.3:
https://issues.apache.org/jira/browse/FLINK-2211
So far it is not officially supported, but I've created a branch where I've fixed this limitation. You can try out that branch. I'll contribute it to Flink so that it should become part of master soon.
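Until that fix lands, one possible interim workaround (not part of the original answer, and assuming the ids actually fit into the Int range) is to narrow the Long ids before calling fit, since ALS in this version trains on DataSet[(Int, Int, Double)]:
import org.apache.flink.api.scala._

// Assumption: the Long ids fit into Int; otherwise a proper re-indexing step is needed.
val intInput: DataSet[(Int, Int, Double)] =
  csvInput.map(t => (t._1.toInt, t._2.toInt, t._3))

als.fit(intInput, parameters)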