How to set Spark configuration properties using Apache Livy? - scala

I don't know how to pass SparkSession parameters programmatically when submitting a Spark job to Apache Livy.
This is the Test Spark job:
class Test extends Job[Int] {
  override def call(jc: JobContext): Int = {
    val spark = jc.sparkSession()
    // ...
  }
}
This is how this Spark job is submitted to Livy:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()

try {
  client.uploadJar(new File(testJarPath)).get()
  client.submit(new Test())
} finally {
  client.stop(true)
}
How can I pass the following configuration parameters to SparkSession?
.config("es.nodes","1localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")

You can do that easily through the LivyClientBuilder like this:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("key", "value")
  .build()

Configuration parameters can be set on the LivyClientBuilder using
public LivyClientBuilder setConf(String key, String value)
so that your code would start with:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("es.port", "9200")
  .setConf("es.nodes.wan.only", "true")
  .setConf("es.index.auto.create", "true")
  .build()

LivyClientBuilder.setConf will not work, I think, because Livy rewrites every config key that does not start with spark., and Spark cannot read the rewritten keys. See here:
private static File writeConfToFile(RSCConf conf) throws IOException {
  Properties confView = new Properties();
  for (Map.Entry<String, String> e : conf) {
    String key = e.getKey();
    if (!key.startsWith(RSCConf.SPARK_CONF_PREFIX)) {
      key = RSCConf.LIVY_SPARK_PREFIX + key;
    }
    confView.setProperty(key, e.getValue());
  }
  ...
}
Because the es.* keys do not start with spark., Livy rewrites them and elasticsearch-spark never sees them. So the answer is quite simple: add the spark. prefix to all es configs, like this:
.setConf("spark.es.nodes", "1localhost")
.setConf("spark.es.port", "9200")
.setConf("spark.es.nodes.wan.only", "true")
.setConf("spark.es.index.auto.create", "true")
I don't know whether it is elasticsearch-spark or Spark itself that handles the spark. prefix, but it just works.
PS: I've verified this with the REST API, but not with the programmatic API.
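If you want to verify that the values actually reach the session, a small job sketch (my addition, assuming the org.apache.livy packages; not part of the original answer) can read them back from the Spark conf:

import org.apache.livy.{Job, JobContext}
import org.apache.spark.sql.SparkSession

class ConfCheck extends Job[String] {
  override def call(jc: JobContext): String = {
    val spark: SparkSession = jc.sparkSession()
    // Keys passed via setConf("spark.es.*", ...) arrive under those exact names;
    // elasticsearch-spark also accepts the spark.es.* form of its settings.
    spark.conf.getOption("spark.es.nodes").getOrElse("spark.es.nodes not set")
  }
}

Submit it the same way as Test above, e.g. client.submit(new ConfCheck()).get().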

Related

How to dispose database connections when Flink restarts

I use dbcp2.BasicDataSource as the database connection pool. Database queries run inside some map functions to fetch additional sensor info. I found that when the Flink job restarts due to exceptions, the old DB connections are still active on the database server.
Flink version: 1.7
The BasicDataSource construction code is here:
object DbHelper extends Lazing with Logging {
  private lazy val connectionPool: BasicDataSource = createDataSource()

  private def createDataSource(): BasicDataSource = {
    val conn_str = props.getProperty("db.url")
    val conn_user = props.getProperty("db.user")
    val conn_pwd = props.getProperty("db.pwd")
    val initialSize = props.getProperty("db.initial.size", "3").toInt

    val bds = new BasicDataSource
    bds.setDriverClassName("org.postgresql.Driver")
    bds.setUrl(conn_str)
    bds.setUsername(conn_user)
    bds.setPassword(conn_pwd)
    bds.setInitialSize(initialSize)
    bds
  }
}
Change your map function to a RichMapFunction. Override the close() method of the RichMapFunction and put the code to close your database connection there. You should likely be putting the code to open the connection in the open() method as well.
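A minimal sketch of that idea (class name, type parameters, and the lookup body are hypothetical; the point is the open/close wiring):

import org.apache.commons.dbcp2.BasicDataSource
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class EnrichSensorFn(dbUrl: String, dbUser: String, dbPwd: String)
    extends RichMapFunction[String, String] {

  @transient private var pool: BasicDataSource = _

  override def open(parameters: Configuration): Unit = {
    // Create the pool when the task (re)starts on the task manager.
    pool = new BasicDataSource
    pool.setDriverClassName("org.postgresql.Driver")
    pool.setUrl(dbUrl)
    pool.setUsername(dbUser)
    pool.setPassword(dbPwd)
  }

  override def map(sensorId: String): String = {
    val conn = pool.getConnection
    try {
      // ... run the lookup query and build the enriched record here ...
      sensorId
    } finally conn.close()
  }

  override def close(): Unit = {
    // Called when the job is cancelled or fails, so the pool and its
    // server-side connections are released instead of leaking.
    if (pool != null) pool.close()
  }
}

You would then use it as stream.map(new EnrichSensorFn(url, user, pwd)) instead of a plain function that reaches into the shared DbHelper singleton.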

Quartz Cluster recovery mechanism

I run a simple controller with Spring to test Quartz's capabilities.
@PostMapping(path = ["/api/v1/start/{jobKey}/{jobGroup}"])
fun start(@PathVariable jobKey: String, @PathVariable jobGroup: String): ResponseEntity<String> {
    val simpleJob = JobBuilder
        .newJob(SampleJob::class.java)
        .requestRecovery(true)
        .withIdentity(JobKey.jobKey(jobKey, jobGroup))
        .build()
    val sampleTrigger = TriggerBuilder
        .newTrigger()
        .withIdentity(jobKey, jobGroup)
        .withSchedule(
            SimpleScheduleBuilder
                .repeatSecondlyForever(5)
                .withMisfireHandlingInstructionIgnoreMisfires())
        .build()

    val scheduler = factory.scheduler
    if (scheduler.jobGroupNames.contains(jobGroup)) {
        return ResponseEntity.ok("Scheduler exists.")
    }
    scheduler.scheduleJob(simpleJob, sampleTrigger)
    scheduler.start()
    return ResponseEntity.ok("Scheduler started.")
}

@PostMapping(path = ["/api/v1/stop/{jobKey}/{jobGroup}"])
fun stop(@PathVariable jobKey: String, @PathVariable jobGroup: String): String {
    val scheduler = factory.scheduler
    scheduler.interrupt(JobKey.jobKey(jobKey, jobGroup))
    val jobGroupNames = scheduler.jobGroupNames
    logger.info("Existing jobGroup names: {}", jobGroupNames)
    return scheduler.deleteJob(JobKey.jobKey(jobKey, jobGroup)).toString()
}
#PostMapping(path = ["/api/v1/stop/{jobKey}/{jobGroup}"])
fun stop(#PathVariable jobKey: String, #PathVariable jobGroup: String): String {
val scheduler = factory.scheduler
scheduler.interrupt(JobKey.jobKey(jobKey, jobGroup))
val jobGroupNames = scheduler.jobGroupNames
logger.info("Existing jobGroup names: {}", jobGroupNames)
return scheduler.deleteJob(JobKey.jobKey(jobKey, jobGroup)).toString()
}
Then I start two applications on different ports with the same code and start playing with it. Let's call them APP1 and APP2.
I use PostgreSQL as the JobStore.
So I ran several scenarios:
1) Create the job with group1 and key1 in APP1.
2) Try to create a job with group1 and key1 in APP2. It gives an error that the job has already been started, which is the behavior I expected.
3) Stop APP1. I expect the job to be executed in APP2, since it still exists in the JobStore, but it isn't. Do I need to provide some additional configuration?
4) Start APP1 again; still nothing happens. Furthermore, the record for group1 and key1 is still present in the DB and cannot be started.
Do I need to modify the shutdown behavior to remove the job on application shutdown and start it in the other application, or do I just need to configure the trigger in some other, correct way?
My bad, that was a silly problem. I forgot to start the scheduler in my application:
@Bean
open fun schedulerFactory(): SchedulerFactory {
    val factory = StdSchedulerFactory()
    factory.initialize(ClassPathResource("quartz.properties").inputStream)
    factory.scheduler.start() // this line was missing
    return factory
}
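For the failover scenarios (3 and 4) to work, both instances also need to share a clustered JDBC JobStore. A typical quartz.properties for PostgreSQL looks roughly like this (instance name, data source name, and credentials are placeholders):

# clustered JDBC JobStore (placeholder values)
org.quartz.scheduler.instanceName = ClusteredScheduler
org.quartz.scheduler.instanceId = AUTO

org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.dataSource = quartzDS
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000

org.quartz.dataSource.quartzDS.driver = org.postgresql.Driver
org.quartz.dataSource.quartzDS.URL = jdbc:postgresql://localhost:5432/quartz
org.quartz.dataSource.quartzDS.user = quartz
org.quartz.dataSource.quartzDS.password = quartz

With isClustered = true plus requestRecovery(true) on the job, the surviving instance should pick up work from a node that goes down after the cluster check-in interval.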

Error with AWS EMR Client: java.lang.NoSuchFieldError: SIGNING_REGION

I'm suddenly running into this error when I run my AWS EMR client on our deployed server. It does not happen locally, where everything runs fine. Basically, I have an EMR client that I use to build and execute steps, as such:
class EMRClient(emrClusterId: String) extends LazyLogging {
  val accessKey = ... // access key
  val secretKey = ... // secret key
  val credentials = new BasicAWSCredentials(accessKey, secretKey)
  val REGION = <my region>
  println(">>>>>>>>>>>>>>>>>>>>Initializing EMR client for clusterId " + emrClusterId + " . The region is " + REGION)

  val emr = AmazonElasticMapReduceClientBuilder
    .standard()
    .withCredentials(new AWSStaticCredentialsProvider(credentials))
    .withRegion(REGION)
    .build()

  def executeHQLStep(hqlScriptPath: String, stepName: String, args: String = ""): AddJobFlowStepsResult = {
    val hqlScriptStep = buildHQLStep(hqlScriptPath, stepName, args)
    val stepSet = new java.util.HashSet[StepConfig]()
    //stepSet.add(enableDebugging)
    stepSet.add(hqlScriptStep)
    executeJobFlowSteps(stepSet)
  }

  /**
    * Builds a StepConfig to be executed in a job flow for a given .hql file from S3
    * @param hqlScriptPath the location in S3 of the script file containing the script to run
    * @param args optional field for arguments for the hive script
    * @param stepName an identifier to give to EMR to name your Step
    * @return
    */
  private def buildHQLStep(hqlScriptPath: String, stepName: String, args: String = ""): StepConfig = {
    new StepConfig()
      .withName(stepName)
      .withActionOnFailure(ActionOnFailure.CANCEL_AND_WAIT)
      .withHadoopJarStep(stepFactory.newRunHiveScriptStep(hqlScriptPath, args))
  }

  private def executeJobFlowSteps(steps: java.util.Set[StepConfig]): AddJobFlowStepsResult = {
    emr.addJobFlowSteps(new AddJobFlowStepsRequest()
      .withJobFlowId(emrClusterId)
      .withSteps(steps)) // where the error is thrown
  }
}
However, when this class is instantiated on the server, none of the println statements at the top show up, and when my executeJobFlowSteps method is called it throws this error:
java.lang.NoSuchFieldError: SIGNING_REGION
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.executeAddJobFlowSteps(AmazonElasticMapReduceClient.java:439)
at com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient.addJobFlowSteps(AmazonElasticMapReduceClient.java:421)
at emrservices.EMRClient.executeJobFlowSteps(EMRClient.scala:64)
at emrservices.EMRClient.executeHQLStep(EMRClient.scala:44)
This project is composed of several sub-projects, and similar issues to mine have pointed to mismatched AWS dependencies, but across the board all of the projects have this in their build.sbt library dependencies: "com.amazonaws" % "aws-java-sdk" % "1.11.286"
Any idea what the issue is?
This looks like you are mixing the 1.11.286 version of aws-java-sdk with an older version (1.11.149) of aws-java-sdk-core. The newer client uses a field that was added to the core module, but since your core module is out of date you are seeing the NoSuchFieldError. Can you ensure all of your dependencies are in sync with one another?
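One way to check and enforce that in sbt (a sketch; the split-module artifact names are illustrative, adjust to what your sub-projects actually pull in):

// build.sbt
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-emr"  % "1.11.286",
  "com.amazonaws" % "aws-java-sdk-core" % "1.11.286"
)

// If a transitive dependency drags in an older core, force the version:
dependencyOverrides += "com.amazonaws" % "aws-java-sdk-core" % "1.11.286"

The dependencyTree task from the sbt-dependency-graph plugin will show which module is pulling in the stale aws-java-sdk-core.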

Raise alerts through Apache Spark

I am using Apache Spark to consume real-time data from Apache Kafka, produced by sensors in JSON format.
Example of the data format:
{
  "meterId" : "M1",
  "meterReading" : "100"
}
I want to apply rules to raise alerts in real time, e.g. if no data has arrived for meter "M1" in the last 2 hours, or a meter reading exceeds some limit, an alert should be created.
So how can I achieve this in Scala?
I will respond here as an answer, since it is too long for a comment.
As I said, the JSON in Kafka should be one message per line, so send this instead: {"meterId":"M1","meterReading":"100"}
If you are using Kafka, there is KafkaUtils, with which you can create a stream:
JavaPairDStream<String, String> input = KafkaUtils.createStream(jssc, zkQuorum, group, topics);
The pair means <kafkaTopicName, jsonMessage>, so you can basically look only at the JSON message if you don't need the Kafka topic name.
For input you can use the many methods described in the JavaPairDStream documentation, e.g. map to extract just the messages into a simple JavaDStream.
And of course you can use a JSON parser like Gson, Jackson, or org.json; it depends on your use case, performance requirements, and so on.
So you need to do something like this:
JavaDStream<String> messagesOnly = input.map(
  new Function<Tuple2<String, String>, String>() {
    public String call(Tuple2<String, String> message) {
      return message._2();
    }
  }
);
Now you have only the messages, without the Kafka topic name, and you can apply the logic you described in your question.
JavaDStream<String> alerts = messagesOnly.filter(
  new Function<String, Boolean>() {
    public Boolean call(String message) {
      // here use a JSON parser, e.g. Gson
      // filter out messages whose meterReading doesn't exceed the limit
      // return true or false based on your logic
    }
  }
);
And here you have only the alert messages, which you can send on somewhere else.
-- AFTER EDIT
Below is the example in Scala:
// batch every 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

def filterLogic(message: String): Boolean = {
  // here your logic for filtering
}

// map(_._2) takes your json messages
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// filtered data after the filter transformation
val filtered = messages.filter(m => filterLogic(m))
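To make the filter concrete, here is one possible filterLogic sketch of my own, using org.json and an arbitrary limit of 150, matching the JSON layout from the question:

import org.json.JSONObject

def filterLogic(message: String): Boolean = {
  try {
    val json = new JSONObject(message)
    // keep (i.e. alert on) readings that exceed the limit
    json.getString("meterReading").toInt > 150
  } catch {
    case _: Exception => false // drop malformed messages
  }
}

Note that the "no data from a meter for 2 hours" rule cannot be expressed as a per-message filter; it needs state across batches, e.g. mapWithState or a windowed count keyed by meterId.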

Avro with Kafka - Deserializing with a changing schema

Based on an Avro schema, I generated a class (Data) to work with the data matching that schema.
After that I encode the data and send it to another application "A" using Kafka:
Data data; // <- the object was initialized before; here it is only the declaration, for example
EncoderFactory encoderFactory = EncoderFactory.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = encoderFactory.directBinaryEncoder(out, null);
DatumWriter<Data> writer = new SpecificDatumWriter<Data>(Data.class);
writer.write(data, encoder);
byte[] avroByteMessage = out.toByteArray();
On the other side (in application "A") I deserialize the data by implementing Deserializer:
class DataDeserializer implements Deserializer<Data> {
  private String encoding = "UTF8";

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    // nothing to do
  }

  @Override
  public Data deserialize(String topic, byte[] data) {
    try {
      if (data == null) {
        return null;
      } else {
        DatumReader<Data> reader = new SpecificDatumReader<Data>(Data.class);
        DecoderFactory decoderFactory = DecoderFactory.get();
        BinaryDecoder decoder = decoderFactory.binaryDecoder(data, null);
        Data decoded = reader.read(null, decoder);
        return decoded;
      }
    } catch (Exception e) {
      throw new SerializationException("Error when deserializing byte[] due to unsupported encoding " + encoding);
    }
  }
}
The problem is that this approach requires SpecificDatumReader, i.e. the Data class has to be integrated into the application code. This could be problematic: the schema could change, and then the Data class would have to be regenerated and integrated all over again.
Two questions:
Should I use GenericDatumReader in the application? How do I do that correctly? (I can simply store the schema in the application.)
Is there a simple way to keep working with SpecificDatumReader if Data changes? How could it be integrated without much trouble?
Thanks
I use GenericDatumReader -- well, actually I derive my reader class from it, but you get the point. To use it, I keep my schemas in a special Kafka topic -- called Schema, surprisingly enough. On startup, both consumers and producers read from this topic and configure their respective parsers.
If you do it like this, you can even have your consumers and producers update their schemas on the fly, without having to restart them. This was a design goal for me -- I didn't want to have to restart my applications in order to add or change schemas. Which is why SpecificDatumReader doesn't work for me, and honestly why I use Avro in the first place instead of something like Thrift.
Update
The normal way to do Avro is to store the schema in the file with the records. I don't do it that way, primarily because I can't. I use Kafka, so I can't store the schemas directly with the data -- I have to store the schemas in a separate topic.
The way I do it, first I load all of my schemas. You can read them from a text file; but like I said, I read them from a Kafka topic. After I read them from Kafka, I have an array like this:
val schemaArray: Array[String] = Array(
"""{"name":"MyObj","type":"record","fields":[...]}""",
"""{"name":"MyOtherObj","type":"record","fields":[...]}"""
)
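The loading step itself is not shown; a purely hypothetical sketch of reading those strings from the Schema topic at startup (my code, not the author's; a real loader would poll until it has caught up rather than issuing a single poll):

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

def loadSchemas(bootstrapServers: String): Array[String] = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  props.put("group.id", "schema-loader")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("auto.offset.reset", "earliest")

  val consumer = new KafkaConsumer[String, String](props)
  try {
    consumer.subscribe(Collections.singletonList("Schema"))
    // single poll kept for brevity; loop until empty in real code
    consumer.poll(5000).asScala.map(_.value).toArray
  } finally consumer.close()
}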
Apologies for the Scala, BTW, but it's what I've got.
At any rate, you then need to create a parser and, for each schema, parse it, create readers and writers, and save them off to Maps:
val parser = new Schema.Parser()
val schemas = Map(schemaArray.map{s => parser.parse(s)}.map(s => (s.getName, s)):_*)
val readers = schemas.map(s => (s._1, new GenericDatumReader[GenericRecord](s._2)))
val writers = schemas.map(s => (s._1, new GenericDatumWriter[GenericRecord](s._2)))
var decoder: BinaryDecoder = null
I do all of that before I parse an actual record -- that's just to configure the parser. Then, to decode an individual record I would do:
val byteArray: Array[Byte] = ... // <-- Avro encoded record
val schemaName: String = ... // <-- name of the Avro schema
val reader = readers.get(schemaName).get
decoder = DecoderFactory.get.binaryDecoder(byteArray, decoder)
val record = reader.read(null, decoder)
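The write side is symmetric; a short sketch using the writers map from above (record and schemaName assumed to be in scope):

import java.io.ByteArrayOutputStream
import org.apache.avro.io.EncoderFactory

val writer = writers.get(schemaName).get
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(record, encoder) // record is a GenericRecord built against that schema
encoder.flush()
val avroBytes: Array[Byte] = out.toByteArray // <-- send this to Kafka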