I need to read data from two different Elasticsearch clusters, one for logs and one for product data. I tried passing a different SparkConf() when creating each SparkSession, but it seems only the first SparkSession I created takes effect.
val config1 = new SparkConf().setAppName("test")
.set("spark.driver.allowMultipleContexts", "true")
.set("es.index.auto.create", "true")
.set("es.nodes.discovery", "false")
.set("es.nodes.wan.only", "true")
.set("es.nodes.client.only", "false")
.set("es.nodes", s"$esNode1:$esPort1")
val config2 = new SparkConf().setAppName("test")
.set("spark.driver.allowMultipleContexts", "true")
.set("es.index.auto.create", "true")
.set("es.nodes.discovery", "false")
.set("es.nodes.wan.only", "true")
.set("es.nodes.client.only", "false")
.set("es.nodes", s"$esNode2:$esPort2")
val session1 = SparkSession.builder.master("local").config(config1).getOrCreate()
val session2 = SparkSession.builder.master("local").config(config2).getOrCreate()
session1.read.format("org.elasticsearch.spark.sql").load(path)
session2.read.format("org.elasticsearch.spark.sql").load(path)
It seems Spark does not support multiple sessions with the same data source format; I am using the same SparkSession with MySQL (JDBC) too and that works well. Is there an alternative way to get data from multiple Elasticsearch clusters?
Create only one SparkSession per Spark application, then read the two DataFrames this way:
val config = new SparkConf().setAppName("test")
.set("spark.driver.allowMultipleContexts", "true")
.set("es.index.auto.create", "true")
.set("es.nodes.discovery", "false")
.set("es.nodes.wan.only", "true")
.set("es.nodes.client.only", "false")
val session = SparkSession.builder.master("local").config(config).getOrCreate
val df1 = session.read.format("org.elasticsearch.spark.sql")
.option("es.nodes", s"$esNode1:$esPort1").load(path)
val df2 = session.read.format("org.elasticsearch.spark.sql")
.option("es.nodes", s"$esNode2:$esPort2").load(path)
My code works fine in Spark 1.6, whereas the same code throws a null pointer exception when running on Spark 2.2.
I am currently running everything locally via IntelliJ:
val sparkConf = new SparkConf()
.setAppName("HbaseSpark")
.setMaster("local[*]")
.set("spark.hbase.host", "localhost")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val df = sqlContext
.read
.format("com.databricks.spark.csv")
.option("delimiter", "\001")
.load("/Users/11130/small")
val df1 = df.withColumn("row_key", concat(col("C3"), lit("_"), col("C5"), lit("_"), col("C0")))
df1.registerTempTable("mytable")
val newDf = sqlContext.sql("Select row_key, C0, C1, C2, C3, C4, C5, C6, C7," +
"C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19 from mytable")
val rdd = newDf.rdd
val finalRdd = rdd.map(row => (row(0).toString, row(1).toString, row(2).toString, row(3).toString, row(4).toString, row(5).toString, row(6).toString,
row(7).toString, row(8).toString, row(9).toString, row(10).toString, row(11).toString, row(12).toString, row(13).toString,
row(14).toString, row(15).toString, row(16).toString, row(17).toString, row(18).toString, row(19).toString, row(20).toString))
finalRdd.toHBaseTable("mytable")
.toColumns("event_id", "device_id", "uidx", "session_id", "server_ts", "client_ts", "event_type", "data_set_name",
"screen_name", "card_type", "widget_item_whom", "widget_whom", "widget_v_position", "widget_item0_h_position",
"publisher_tag", "utm_medium", "utm_source", "utmCampaign", "referrer_url", "notificationClass")
.inColumnFamily("mycf")
.save()
Whereas the same code, rewritten for Spark 2.2, gives a null pointer exception when converting rdd to finalRdd:
val spark = SparkSession
.builder
.appName("FunnelSpark")
.master("local[*]")
.config("spark.hbase.host", "localhost")
.getOrCreate
val sc = spark.sparkContext
sc.hadoopConfiguration.set("spark.hbase.host", "localhost")
val df = spark
.read
.option("delimiter", "\001")
.csv("/Users/11130/small")
val df1 = df.withColumn("row_key", concat(col("_c3"), lit("_"), col("_c5"), lit("_"), col("_c0")))
df1.createOrReplaceTempView("mytable")
val newDf = spark.sql("Select row_key, _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7," +
"_c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19 from mytable")
val rdd = newDf.rdd
val finalRdd = rdd.map(row => (row(0).toString, row(1).toString, row(2).toString, row(3).toString, row(4).toString, row(5).toString, row(6).toString,
row(7).toString, row(8).toString, row(9).toString, row(10).toString, row(11).toString, row(12).toString, row(13).toString,
row(14).toString, row(15).toString, row(16).toString, row(17).toString, row(18).toString, row(19).toString, row(20).toString))
println(finalRdd.first())
spark.stop()
Stacktrace: https://jpst.it/15srX
This happens because your code is extremely unsafe. When you call:
row(i).toString
it is bound to throw an NPE every time you encounter a null value.
You should use:
row.getString(i)
Your 1.6 program also uses a different source than the 2.2 one: spark-csv is similar to, but not the same as, the built-in csv format. The former keeps empty strings as empty strings, the latter reads them as nulls.
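For illustration, a minimal null-safe variant of the mapping step (a sketch only, not the full HBase pipeline; replacing nulls with an empty string is an arbitrary choice):
import org.apache.spark.sql.Row
// Return column i as a String, substituting "" for nulls, so toString is never called on a null reference.
def cell(row: Row, i: Int): String = if (row.isNullAt(i)) "" else row.getString(i)
// Build the 21 columns null-safely as a Seq[String] instead of a tuple.
val safeRdd = newDf.rdd.map(row => (0 until 21).map(i => cell(row, i)))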
I am trying to read a file and add two extra columns: 1. a sequence number and 2. the filename.
When I run the Spark job in the Scala IDE the output is generated correctly, but when I run it through PuTTY in local or cluster mode the job gets stuck at stage 2 (save at File_Process). There is no progress even if I wait for an hour. I am testing on 1 GB of data.
Below is the code I am using:
object File_Process
{
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("yarn")
.appName("File_Process")
.getOrCreate()
def main(arg:Array[String])
{
val FileDF = spark.read
.csv("/data/sourcefile/")
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
val query = dataframefinal.write
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.save("/data/text_file/")
spark.stop()
}
}
If I remove the logic that adds the sequence number, the code works fine.
The code for creating the sequence number is:
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
Thanks in advance.
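For reference, a self-contained sketch of the sequence-number logic described above (the SEED value, input path and app name are placeholders, not taken from the question):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
val spark = SparkSession.builder().master("local[*]").appName("SeqNoSketch").getOrCreate()
val SEED = 0L // placeholder seed
val fileDF = spark.read.csv("/tmp/sample.csv") // placeholder path
// Prepend a unique, increasing identifier to every row via zipWithIndex.
val indexedRdd = fileDF.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq((idx + SEED + 1) +: row.toSeq) }
val schemaWithSeqNo = StructType(StructField("UniqueRowIdentifier", LongType) +: fileDF.schema.fields)
val withSeqNo = spark.createDataFrame(indexedRdd, schemaWithSeqNo)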
I am trying to set up a Spark Streaming job which reads lines from a Kafka server but processes them using rules written in another local file. I am creating a StreamingContext for the streaming data and a SparkContext for applying all the other Spark features, like string manipulation, reading local files, etc.
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")
val ssc = new StreamingContext(sparkConf, Seconds(15))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val sentence = lines.toString
val conf = new SparkConf().setAppName("Bi Gram").setMaster("local[2]")
val sc = new SparkContext(conf)
val stringRDD = sc.parallelize(Array(sentence))
But this throws the following error
Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
One application can only have ONE SparkContext. A StreamingContext is created on top of a SparkContext, so just create the ssc StreamingContext from the existing SparkContext:
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(15))
If you use the following constructor:
StreamingContext(conf: SparkConf, batchDuration: Duration)
it internally creates another SparkContext:
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
The SparkContext can be obtained from the StreamingContext via:
ssc.sparkContext
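Putting this together for the question's setup, a rough sketch (the rules-file path is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
// One SparkContext, shared by the StreamingContext and by ordinary RDD work.
val conf = new SparkConf().setMaster("local[*]").setAppName("ReadLine")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(15))
// The same context serves non-streaming work, e.g. loading the local rules file.
val rulesRdd = ssc.sparkContext.textFile("/path/to/rules.txt") // placeholder path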
Yes, you can do it: first start a SparkSession and then use its context to start any number of streaming contexts:
val spark = SparkSession.builder().appName("someappname").
config("spark.sql.warehouse.dir",warehouseLocation).getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
Simple!!!
I have the following code:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
new conf.registerKryoClasses(new Class<?>[]{
Class.forName("org.apache.hadoop.io.LongWritable"),
Class.forName("org.apache.hadoop.io.Text")
});
But I am running into the following error:
')' expected but '[' found.
[error] new conf.registerKryoClasses(new Class<?>[]{
How can I solve this problem?
You're mixing Scala and Java. In Scala, you can define an Array[Class[_]] (instead of a Class<?>[]):
val conf = new SparkConf()
.setAppName("MyApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array[Class[_]](
Class.forName("org.apache.hadoop.io.LongWritable"),
Class.forName("org.apache.hadoop.io.Text")
))
val sc = new SparkContext(conf)
We can do even better: rather than risk getting class names wrong in string literals, we can import the classes themselves and use classOf to obtain their class types:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
val conf = new SparkConf()
.setAppName("MyApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array[Class[_]](
classOf[LongWritable],
classOf[Text]
))
val sc = new SparkContext(conf)
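As an optional follow-up (not something the question asks for), Kryo can be told to fail fast when it meets an unregistered class, which helps catch missing registrations early. Like the serializer itself, this must be set before the SparkContext is created:
// Optional: throw on unregistered classes instead of silently writing their full class names.
val strictConf = new SparkConf()
.setAppName("MyApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.registerKryoClasses(Array[Class[_]](classOf[LongWritable], classOf[Text]))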