Spark: how to read from multiple Elasticsearch clusters (Scala)

I need to read data from two different Elasticsearch clusters, one for logs and one for product data. I tried creating a separate SparkConf for each SparkSession, but it seems only the first SparkSession I created is actually used:
val config1 = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")
  .set("es.nodes", s"$esNode1:$esPort1")
val config2 = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")
  .set("es.nodes", s"$esNode2:$esPort2")

val session1 = SparkSession.builder.master("local").config(config1).getOrCreate()
val session2 = SparkSession.builder.master("local").config(config2).getOrCreate()

session1.read.format("org.elasticsearch.spark.sql").load(path)
session2.read.format("org.elasticsearch.spark.sql").load(path)
It seems Spark does not support multiple sessions using the same format; I use the same SparkSession with MySQL (JDBC) as well and that works fine. Is there an alternative way to get data from multiple Elasticsearch clusters?

Create only one SparkSession per Spark application, then read the two DataFrames like this:
val config = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")

val session = SparkSession.builder.master("local").config(config).getOrCreate()

val df1 = session.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", s"$esNode1:$esPort1").load(path)
val df2 = session.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", s"$esNode2:$esPort2").load(path)

Related

Except and ExceptAll functions for Apache Spark's Dataset are giving an empty DataFrame during streaming

The except and exceptAll functions on Apache Spark Datasets give an empty DataFrame during streaming.
I am using two Datasets, both batch; the left one has a few rows that are not present in the right one.
Running this gives the correct output, i.e. the rows that are in the left Dataset but not in the right one.
Now I am repeating the same thing in streaming: the left side is streaming while the right side is a batch source. In this scenario I get an empty DataFrame, and dataframe.writeStream then fails with a None.get error.
package exceptPOC

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object exceptPOC {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("pocExcept")
      .setMaster("local")
      .set("spark.driver.host", "localhost")
    val sparkSession =
      SparkSession.builder().config(sparkConf).getOrCreate()

    val schema = new StructType()
      .add("sepal_length", DoubleType, true)
      .add("sepal_width", DoubleType, true)
      .add("petal_length", DoubleType, true)
      .add("petal_width", DoubleType, true)
      .add("species", StringType, true)

    val df1 = sparkSession.readStream
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "some/path/exceptPOC/streamIris" // contains iris1
      )

    val df2 = sparkSession.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "/some/path/exceptPOC/iris2.csv"
      )

    val exceptDF = df1.except(df2)

    exceptDF.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
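One possible workaround to sketch here (not from the original post): the same "rows only in the stream" result can be approximated with a stream-static left outer join followed by a null filter, which is supported when the streaming Dataset is on the left. Column names come from the iris schema in the question; the "matched" helper column is an assumption.

import org.apache.spark.sql.functions.{col, lit}

// Tag every static row, left-outer-join the stream against it,
// and keep only the streaming rows that found no match.
val joinCols = Seq("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
val tagged = df2.withColumn("matched", lit(true)) // "matched" is an assumed helper column
val onlyInStream = df1
  .join(tagged, joinCols, "left_outer")
  .filter(col("matched").isNull)
  .drop("matched")

onlyInStream.writeStream
  .format("console")
  .start()
  .awaitTermination()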

How to checkpoint many sources in Spark streaming

I have many CSV spark.readStream sources in different locations, and I have to checkpoint all of them in Scala. I specified a query for every stream, but when I run the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name "query1" as a query with that name is already active
I solved my problem by creating the streaming queries like this:
val spark = SparkSession
  .builder
  .appName("test")
  .config("spark.local", "local[*]")
  .getOrCreate()

spark.sparkContext.setCheckpointDir(path_checkpoint)

val event1 = spark
  .readStream
  .schema(schema_a)
  .option("header", "true")
  .option("sep", ",")
  .csv(path_a)

val query = event1.writeStream
  .outputMode("append")
  .format("console")
  .start()

spark.streams.awaitAnyTermination()
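For completeness, a minimal sketch of running two sources as separately named queries, each with its own checkpoint location (schema_b, path_b and the checkpoint sub-paths are placeholders, not names from the original post):

val eventA = spark.readStream.schema(schema_a).option("header", "true").option("sep", ",").csv(path_a)
val eventB = spark.readStream.schema(schema_b).option("header", "true").option("sep", ",").csv(path_b)

// Each query gets a unique name and its own checkpointLocation,
// which avoids the "query with that name is already active" error.
eventA.writeStream
  .queryName("query_a")
  .option("checkpointLocation", s"$path_checkpoint/query_a")
  .outputMode("append")
  .format("console")
  .start()

eventB.writeStream
  .queryName("query_b")
  .option("checkpointLocation", s"$path_checkpoint/query_b")
  .outputMode("append")
  .format("console")
  .start()

// Block on all running queries instead of a single one.
spark.streams.awaitAnyTermination()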

Checkpoint for many streaming sources

I'm working with Zeppelin, and I read many files from many sources with Spark streaming like this:
val var1 = spark
  .readStream
  .schema(var1_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var1)

val chekpoint_var1 = var1
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var1)
  .option("path", path_checkpoint)
  .option("header", true)
  .outputMode("append")
  .queryName("var1_backup")
  .start()
  .awaitTermination()

val var2 = spark
  .readStream
  .schema(var2_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var2)

val chekpoint_var2 = var2
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var2)
  .option("path", path_checkpoint_2)
  .option("header", true)
  .outputMode("append")
  .queryName("var2_backup")
  .start()
  .awaitTermination()
When I re-run the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name var1_backup as a query with that name is already active
The solution:
val spark = SparkSession
  .builder
  .appName("test")
  .config("spark.local", "local[*]")
  .getOrCreate()

spark.sparkContext.setCheckpointDir(path_checkpoint)

and after that I call the checkpoint function on the DataFrame.
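A related note, not from the original post: in Zeppelin the previous run's queries can still be active in the same Spark application, which is exactly what the error message reports. A minimal sketch of stopping any active queries before starting new ones, assuming the shared SparkSession is called spark:

// Stop every streaming query still running in this session,
// so re-running the paragraph can reuse the same query names.
spark.streams.active.foreach(_.stop())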

Spark DataFrame write to file using Scala

I am trying to read a file and add two extra columns: 1. a sequence number and 2. the file name.
When I run the Spark job in the Scala IDE the output is generated correctly, but when I run it in PuTTY in local or cluster mode the job gets stuck at stage 2 (save at File_Process). There is no progress even if I wait for an hour. I am testing on 1 GB of data.
Below is the code I am using:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object File_Process {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .master("yarn")
    .appName("File_Process")
    .getOrCreate()

  def main(arg: Array[String]): Unit = {
    // SEED and filename are defined elsewhere in the original job
    val FileDF = spark.read
      .csv("/data/sourcefile/")
    val rdd = FileDF.rdd.zipWithIndex()
      .map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
    val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
    val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
    val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
    val query = dataframefinal.write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .save("/data/text_file/")
    spark.stop()
  }
}
If I remove the logic that adds the sequence number, the code works fine. The code for creating the sequence number is:
val rdd = FileDF.rdd.zipWithIndex()
  .map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
Thanks in advance.
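As a side note (not part of the original post): if the identifiers only need to be unique and increasing rather than strictly consecutive, a sketch using monotonically_increasing_id avoids the round trip through the RDD API entirely; filename is assumed to be defined as in the original code.

import org.apache.spark.sql.functions.{lit, monotonically_increasing_id}

// IDs are unique and increasing, but not consecutive across partitions.
val dataframefinal2 = FileDF
  .withColumn("UniqueRowIdentifier", monotonically_increasing_id())
  .withColumn("Filetag", lit(filename))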

Spark: Why does my accumulator work in a non-deterministic way?

My accumulator behaves non-deterministically. If I run this code:
val acc0 = sc.accumulator(0L)
sc.range(0, 20000, step = 25).foreach { l => acc0 += 1 }
println(acc0.value)
I get a different value every time. Here is my Spark configuration:
val sparkConf = new SparkConf()
  .setAppName("SparkDemoApp")
  .setMaster("local[4]")
  .set("spark.executor.memory", "1g")
  .set("spark.eventLog.enabled", "true")
  .set("spark.files.overwrite", "true")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(sparkConf)
Why is that? Why is the result different from time to time?
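For reference, a minimal sketch of the same count using the non-deprecated AccumulatorV2 API available since Spark 2.0; this only swaps the accumulator API and is not an explanation of the behaviour described above.

// Named long accumulator registered with the SparkContext.
val acc = sc.longAccumulator("rowCount")
sc.range(0, 20000, step = 25).foreach { _ => acc.add(1) }
println(acc.value)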