Spark NegativeArraySizeException - scala

In a Spark job I join two RDDs:
val data: RDD[(Long, (String, String))] = sc.objectFile[(Long, scala.collection.mutable.HashMap[String, Object])](outputFile)
.leftOuterJoin(attributionData)
Here, outputFile is the output of another Spark job that processes data from Hive. One of the Hive tables has 40 million records; when I limit the read to fetch only 10 million records, the code works fine. However, with the full data (if I remove the limit()), the following error occurs:
10:43:27 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, buysub.com): java.lang.NegativeArraySizeException
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:409)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:227)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:228)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:221)
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:117)
at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:23)
at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:598)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:566)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
I am using Spark 1.6. The following is the Spark configuration:
conf.set("spark.driver.memory", "4G")
conf.set("spark.executor.memory", "30G")
conf.set("spark.rdd.compress", "true")
conf.set("spark.storage.memoryFraction", "0.3")
conf.set("spark.shuffle.consolidateFiles", "true")
conf.set("spark.shuffle.memoryFraction", "0.5")
conf.set("spark.akka.frameSize", "384")
conf.set("spark.io.compression.codec", "lz4")
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

I found some info pointing to this being a bug in Kryo serialization:
https://github.com/EsotericSoftware/kryo/issues/382
It's fixed in Kryo 4, but Spark is not yet using that version:
https://issues.apache.org/jira/browse/SPARK-20389
As a temporary workaround, it sounds like this might help:
spark.executor.extraJavaOptions -XX:hashCode=0
spark.driver.extraJavaOptions -XX:hashCode=0
(From https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808)
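If you build the configuration in code, as in the question, a rough sketch of applying the executor-side flag on the same SparkConf might look like this (note that the driver option generally has to be supplied at launch, e.g. via spark-submit, since the driver JVM is already running by the time the SparkConf is built):
// Sketch: passing the workaround JVM flag through the SparkConf from the question.
// -XX:hashCode=0 switches HotSpot's identity-hashCode generation scheme, which is
// what the linked comment suggests to sidestep the Kryo IdentityObjectIntMap failure.
conf.set("spark.executor.extraJavaOptions", "-XX:hashCode=0")
// Driver side, usually passed at submit time instead:
//   --conf "spark.driver.extraJavaOptions=-XX:hashCode=0"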
Or you could simply use a different serializer, though that might slow things down.

This happens when Kryo's reference table exceeds the maximum integer value (integer overflow).
To solve this, set spark.kryo.referenceTracking to false.
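As a sketch of how that setting would sit next to the serializer configuration already shown in the question:
// Sketch: turn off Kryo reference tracking so the reference table is never built.
// This also disables support for circular/shared references during serialization.
conf.set("spark.kryo.referenceTracking", "false")
// Alternatively, fall back to Java serialization entirely (slower, but avoids Kryo):
// conf.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")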

Related

Re-partitioning using pyspark failing with error

I have parquet in an S3 folder with the columns below. The size of the parquet is around 40 MB.
org_id, device_id, channel_id, source, col1, col2
Right now the partitioning is on 3 columns: org_id, device_id, channel_id.
I want to change the partitioning to source, org_id, device_id, channel_id.
I am using pyspark to read the file from S3 and write it back to an S3 bucket.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(appName="parquet_ingestion1").getOrCreate()
spark = SparkSession(sc)
file_path = "s3://some-bucket/some_folder"
print("Reading parquet from s3:{}".format(file_path))
spark_df = spark.read.parquet(file_path)
print("Converting to parquet")
file_path_re = "s3://other_bucket/re-partition"
partition_columns = ["source", "org_id", "device_id", "channel_id"]
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I am getting an error and the parquet file is not generated.
[Stage 1:> (0 + 8) / 224]20/04/29 13:29:44 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, ip-172-31-43-0.ap-south-1.compute.internal, executor 3): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Then I tried
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
[Stage 3:> (0 + 8) / 224]20/04/29 13:32:11 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 23, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Stage 3:==> (8 + 8) / 224]20/04/29 13:32:22 WARN TaskSetManager: Lost task 0.2 in stage 3.0 (TID 40, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
In the 2nd case it reports failures, but it also creates the parquet. Now I am not sure whether it is correctly writing all the data to the new partitions.
Let me know the correct way of re-partitioning the parquet.
UPDATE 1:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

for col1 in partition_columns:
    spark_df = spark_df.withColumn(col1, col(col1).cast(dataType=StringType()))
I tried both:
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 20, ip-172-31-42-4.ap-south-1.compute.internal, executor 4): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
UPDATE 2:
Now I found that there is a schema mismatch in one of the columns: in some rows it is a string and in others a float. I have depicted the scenario below.
Here you can see that the col1 column is a string in one row and a float in the other:
org_id, device_id, channel_id, source, col1, col2
"100" "device1" "channel" "source1" 10 0.1
"100" "device1" "channel" "source2" "10" 0.1
I tried casting the col1 column to float; it didn't work.
Any suggestions?
Try force-casting all the partition_columns to StringType.
The root cause of the issue is mentioned in UPDATE 2. In my case we have 4 apps (part of different pipelines, based on source) that write to the parquet store. Two apps, APP1 and APP2, don't use col1, and APP3 used to write it as a float.
Recently APP4 started receiving col1 in its data and stored it as a string in the parquet; parquet doesn't complain while writing.
While reading such parquet files:
I tried casting; it didn't work.
Merge schema failed with a mismatch in data type.
I tried filtering the data based on source type. It worked partially, in the sense that if I filter out the APP4 data it works, but if I filter out the APP3 data it doesn't.
This may not be a good solution, but I had to be content with it for now.
Solutions:
1. Filter out the APP4 source data, create a dataframe and write it to parquet; then read only the APP4 source data into a dataframe, remove col1, and write it back to parquet.
2. Or remove the column from the whole dataframe and write it to parquet:
df1 = df.select([c for c in df.columns if c != 'col1'])

Why can't a dataframe be accessed inside a UDF? [Apache Spark Scala] [duplicate]

This question already has answers here:
Why accessing DataFrame from UDF results in NullPointerException?
(2 answers)
Closed 3 years ago.
I am currently doing a streaming project using Apache Spark. I have 2 data sources: from the first I get news data from Kafka, and this data is updated continuously. From the second I get a masterWord dictionary; this variable contains a dataframe of words and the unique key of each word.
I want to process the news data, converting it from a Seq of words into a Seq of words_id by matching the data against the masterWord dictionary. But I have problems accessing the masterWord dataframe in my UDF. When I try to access the dataframe inside the UDF, Spark returns this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.NullPointerException
Why can't a dataframe be accessed inside a UDF?
What is the best practice for getting values from another dataframe?
This is my code:
// read data stream from Kafka
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", PropertiesLoader.kafkaBrokerUrl)
  .option("subscribe", PropertiesLoader.kafkaTopic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "100")
  .load()

// Transform data stream to Dataframe
val kafkaDF = kafka.selectExpr("CAST(value AS STRING)").as[(String)]
  .select(from_json($"value", ColsArtifact.rawSchema).as("data"))
  .select("data.*")
  .withColumn("raw_text", concat(col("title"), lit(" "), col("text"))) // add column aggregate title and text

// read master word dictionary
val readConfig = ReadConfig(Map("uri" -> "mongodb://10.252.37.112/prayuga", "database" -> "prayuga", "collection" -> "master_word_2"))
var masterWord = MongoSpark.load(spark, readConfig)

// call UDF
val aggregateDF = kafkaDF.withColumn("text_aggregate", aggregateMongo(col("text_selected")))

// UDF
val aggregateMongo = udf((content: Seq[String]) => {
  masterWord.show()
  ...
  // code for query masterWord whether var content exist or not in masterWord dictionary
})
The dataframe lives in the Spark context, and it is only available as such inside the driver.
Each of the tasks sees a fraction (partition) of the data and can work with that. If you want to make the data in the dataframe available inside a UDF, you have to collect it back to the driver and then broadcast it (or pass it as a parameter, which essentially does the same thing) to the UDF, in which case Spark will send the whole thing to each instance of the UDF that runs.
If you want to use DataFrames inside UDFs, you must create a Broadcast:
import spark.implicits._
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

val df_name = Seq("Raphael").toDF("name")
val bc_df_name: Broadcast[DataFrame] = spark.sparkContext.broadcast(df_name)

// use df_name inside udf
val udf_doSomething = udf(() => bc_df_name.value.as[String].first())

Seq(1, 2, 3)
  .toDF("i")
  .withColumn("test", udf_doSomething())
  .show()
gives
+---+-------+
| i| test|
+---+-------+
| 1|Raphael|
| 2|Raphael|
| 3|Raphael|
+---+-------+
This at least works in local mode; I am not sure whether it also works on clusters. Anyway, I would not recommend this approach. Better to collect the content of the dataframe into a Scala data structure on the driver (e.g. a Map) and broadcast that variable, or use a join instead.
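A minimal sketch of that last suggestion, assuming masterWord has columns named word and word_id (illustrative names, not taken from the question):
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Collect the (small) dictionary to the driver as a plain Map, then broadcast it.
val wordIds: Map[String, Long] = masterWord
  .select("word", "word_id")   // assumed column names
  .as[(String, Long)]
  .collect()
  .toMap
val bcWordIds = spark.sparkContext.broadcast(wordIds)

// The UDF closes over the broadcast value instead of the DataFrame.
val wordsToIds = udf((content: Seq[String]) => content.flatMap(w => bcWordIds.value.get(w)))

val aggregateDF = kafkaDF.withColumn("text_aggregate", wordsToIds(col("text_selected")))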

dataframe map and hivecontext issue

Env: Spark 1.6 and Scala
Hi,
I have a dataframe and tried to run:
val configTable = hivecontext.table("mydb.myTable")
configTable.rdd.map(row => {
  val abc = hivecontext.sql("select count(*) as num_rows from mydb2.mytable2")
}).collect()
I am getting this exception:
17/03/28 22:47:04 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
Is it not allowed to use Spark SQL inside rdd.map? Is there any workaround for this?
Thanks

Trying to execute a spark sql query from a UDF

I am trying to write an inline function in the Spark framework, using Scala, which will take a string input, execute a SQL statement, and return a String value.
val testFunc: (String => String) = (arg1: String) => {
  val k = sqlContext.sql("""select c_code from r_c_tbl where x_nm = "something" """)
  k.head().getString(0)
}
I am registering this Scala function as a UDF:
val testFunc_test = udf(testFunc)
I have a dataframe over a Hive table:
val df = sqlContext.table("some_table")
Then I call the UDF in a withColumn and try to save the result in a new dataframe:
val new_df = df.withColumn("test", testFunc_test($"col1"))
But every time I try to do this I get an error:
16/08/10 21:17:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.0.1.5): java.lang.NullPointerException
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:41)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.foreach(DataFrame.scala:1434)
I am relatively new to Spark and Scala, and I am not sure why this code should not run. Any insights or a workaround would be highly appreciated.
Please note that I have not pasted the whole error stack; please let me know if it is required.
You can't use sqlContext in your UDF: UDFs must be serializable so they can be shipped to executors, and the context (which can be thought of as a connection to the cluster) can't be serialized and sent to a worker node; only the driver application (where the UDF is defined, but not executed) can use the sqlContext.
It looks like your use case (performing a select from table X for each record in table Y) would be better accomplished with a join.
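A rough sketch of the join-based alternative; the join key is an assumption for illustration (the original UDF filters on a constant x_nm), so adjust it to your actual matching logic:
// Load the lookup table once as a DataFrame instead of querying it per row.
val lookup = sqlContext.table("r_c_tbl").select("x_nm", "c_code")

// Assumes df.col1 should be matched against r_c_tbl.x_nm.
val new_df = df
  .join(lookup, df("col1") === lookup("x_nm"), "left_outer")
  .withColumnRenamed("c_code", "test")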

Read CSV as dataframe and convert to JSON string

I'm trying to aggregate a CSV file via Spark SQL and then show the result as JSON:
val people = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("/tmp/people.csv")
people.registerTempTable("people")
val result = sqlContext.sql("select country, count(*) as cnt from people group by country")
That's where I'm stuck. I can do a result.schema().prettyJson(), which works flawlessly, but I can't find a way to return the result as JSON.
I was assuming that result.toJSON.collect() would do what I want, but it fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 101.0 failed 1 times, most recent failure: Lost task 1.0 in stage 101.0 (TID 159, localhost): java.lang.NegativeArraySizeException
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:171)
at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:162)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Can somebody guide me?
The error you're getting is odd; it sounds like result is probably empty?
You might want to try this command on the dataframe to get each line printed out instead:
result.toJSON.foreach(println)
See the DataFrame API documentation for a little more information.
Turns out this error was because of a "malformed" CSV file. It contained some rows which had more columns than others (with no header field name)... Strange error message though.
Try:
val people = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/tmp/people.csv")
people.registerTempTable("people")
val result = sqlContext.sql("select country, count(*) as cnt from people group by country")
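With the malformed rows dropped, the original toJSON call should then behave as expected; for example (a sketch):
// Collect the aggregated rows as JSON strings on the driver and print them.
val json: Array[String] = result.toJSON.collect()
json.foreach(println)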