java.lang.IllegalArgumentException: Illegal sequence boundaries Spark - scala

I am using Azure Databricks and Scala. I want to show() a DataFrame, but I get an error that I cannot understand and would like to solve. The lines of code I have are:
println("----------------------------------------------------------------Printing schema")
df.printSchema()
println("----------------------------------------------------------------Printing dataframe")
df.show()
println("----------------------------------------------------------------Error before")
The standard output is shown below; the message "----------------------------------------------------------------Error before" never appears.
> ----------------------------------------------------------------Printing schema
> root
> |-- processed: integer (nullable = false)
> |-- processDatetime: string (nullable = false)
> |-- executionDatetime: string (nullable = false)
> |-- executionSource: string (nullable = false)
> |-- executionAppName: string (nullable = false)
>
> ----------------------------------------------------------------Printing dataframe
> 2020-02-18T14:19:00.069+0000: [GC (Allocation Failure) [PSYoungGen: 1497248K->191833K(1789440K)] 2023293K->717886K(6063104K),
> 0.0823288 secs] [Times: user=0.18 sys=0.02, real=0.09 secs]
> 2020-02-18T14:19:40.823+0000: [GC (Allocation Failure) [PSYoungGen: 1637209K->195574K(1640960K)] 2163262K->721635K(5914624K),
> 0.0483384 secs] [Times: user=0.17 sys=0.00, real=0.05 secs]
> 2020-02-18T14:19:44.843+0000: [GC (Allocation Failure) [PSYoungGen: 1640950K->139092K(1809920K)] 2167011K->665161K(6083584K),
> 0.0301711 secs] [Times: user=0.11 sys=0.00, real=0.03 secs]
> 2020-02-18T14:19:50.910+0000: Track exception: Job aborted due to stage failure: Task 59 in stage 62.0 failed 4 times, most recent
> failure: Lost task 59.3 in stage 62.0 (TID 2672, 10.139.64.6, executor
> 1): java.lang.IllegalArgumentException: Illegal sequence boundaries:
> 1581897600000000 to 1581811200000000 by 86400000000
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.processNext(Unknown
> Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$15$$anon$2.hasNext(WholeStageCodegenExec.scala:659)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
> at org.apache.spark.scheduler.Task.run(Task.scala:112)
> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Driver stacktrace:.
> 2020-02-18T14:19:50.925+0000: Track message: Process finished with exit code 1. Metric: Writer. Value: 1.0.

It's hard to know exactly without seeing your code, but I had a similar error and the other answer (about int being out of range) led me astray.
The java.lang.IllegalArgumentException you are getting is confusing but is actually quite specific:
Illegal sequence boundaries: 1581897600000000 to 1581811200000000 by 86400000000
This error is complaining that you are using the sequence() Spark SQL function and telling it to go from 1581897600000000 to 1581811200000000 by 86400000000. It's easy to miss because of the big numbers, but this is an instruction to go from a larger number to a smaller number by a positive increment. E.g., from 12 to 6 by 3.
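You can reproduce the same shape of failure with small numbers (a quick spark-shell sketch; assumes Spark 2.4+, where sequence() exists):

// Fails with java.lang.IllegalArgumentException: Illegal sequence boundaries: 12 to 6 by 3
spark.sql("SELECT sequence(12, 6, 3)").show()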
This is not allowed according to the Databricks documentation:
start - an expression. The start of the range.
stop - an expression. The end of the range (inclusive).
step - an optional expression. The step of the range. By default step is 1 if start is less than or equal to stop, otherwise -1. For the temporal sequences it's 1 day and -1 day respectively. If start is greater than stop then the step must be negative, and vice versa.
Additionally, I believe the other answer's focus on the int column is misleading. The large numbers mentioned in the illegal sequence error look like they are coming from a date column. You don't have any DateType columns but your string columns are named like date columns; presumably you are using them in a sequence function and they are getting coerced into dates.
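For illustration, a minimal sketch (column names borrowed from your schema, data invented; assumes Spark 2.4+) that hits the same exception once a reversed date range meets an explicit one-day step:

import org.apache.spark.sql.functions._
import spark.implicits._

// processDatetime is after executionDatetime on this row, so a +1 day step is illegal
val sample = Seq(("2020-02-17", "2020-02-16")).toDF("processDatetime", "executionDatetime")

sample.select(
  sequence(
    to_date($"processDatetime"),
    to_date($"executionDatetime"),
    expr("interval 1 day")
  )
).show()  // throws IllegalArgumentException: Illegal sequence boundaries: ... by 86400000000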

You can also get this error when you apply
sequence(start_date, end_date, [interval])
to a table in which some rows have start_date earlier than end_date and others have it later.
When applying this function, every row's date range should run in the same direction as its step (all positive or all negative), not mixed.
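If some rows genuinely run backwards, one possible workaround (just a sketch; df, start_date and end_date stand for your own dataframe and columns) is to normalise the boundaries so every range runs forward before calling sequence():

import org.apache.spark.sql.functions._

// least/greatest force start <= stop on every row, so the +1 day step is always legal
val withRange = df.withColumn(
  "date_range",
  sequence(
    least(to_date(col("start_date")), to_date(col("end_date"))),
    greatest(to_date(col("start_date")), to_date(col("end_date"))),
    expr("interval 1 day")
  )
)

Alternatively, filter out or correct the rows where end_date comes before start_date, if those are data errors.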

Your schema is expecting an int, and an int in Java has a range of [-2,147,483,648 to +2,147,483,647].
So I would change the schema from int to long.
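For example, a small sketch of what that widening could look like (the column name is just taken from the schema above for illustration):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Widen the 32-bit int column to a 64-bit long
val widened = df.withColumn("processed", col("processed").cast(LongType))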

Related

Repartitioning using pyspark failing with error

I have parquet data in an S3 folder with the columns below. The size of the parquet is around 40 MB.
org_id, device_id, channel_id, source, col1, col2
Right now the data is partitioned on 3 columns: org_id, device_id, channel_id.
I want to change the partitioning to source, org_id, device_id, channel_id.
I am using pyspark to read the files from S3 and write them to an S3 bucket.
sc = SparkContext(appName="parquet_ingestion1").getOrCreate()
spark = SparkSession(sc)
file_path = "s3://some-bucket/some_folder"
print("Reading parquet from s3:{}".format(file_path))
spark_df = spark.read.parquet(file_path)
print("Converting to parquet")
file_path_re = "s3://other_bucket/re-partition"
partition_columns = ["source", "org_id", "device_id", "channel_id "]
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I am getting an error and the parquet file is not generated.
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
[Stage 1:> (0 + 8) / 224]20/04/29 13:29:44 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, ip-172-31-43-0.ap-south-1.compute.internal, executor 3): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Then I tried
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
[Stage 3:> (0 + 8) / 224]20/04/29 13:32:11 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 23, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Stage 3:==> (8 + 8) / 224]20/04/29 13:32:22 WARN TaskSetManager: Lost task 0.2 in stage 3.0 (TID 40, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
In the 2nd case it reports a failure, but it also creates parquet. Now I am not sure whether it correctly writes all the data to the new partitions.
Let me know the correct way of repartitioning the parquet.
UPDATE 1:
from pyspark.sql.types import StringType
from pyspark.sql.functions import col  # needed for col()

for col1 in partition_columns:
    spark_df = spark_df.withColumn(col1, col(col1).cast(dataType=StringType()))
Tried both
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 20, ip-172-31-42-4.ap-south-1.compute.internal, executor 4): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
UPDATE 2:
Now I have found that there is a schema mismatch in one of the columns: in some rows it is a string and in others a float. I have depicted the scenario below.
Here you can see that the col1 column is a string in one row and a float in the other:
org_id, device_id, channel_id, source, col1, col2
"100" "device1" "channel" "source1" 10 0.1
"100" "device1" "channel" "source2" "10" 0.1
I tried casting the col1 column to float; it didn't work.
Any suggestions?
Try force type casting all partition_columns to StringType
The root cause of the issue is described in UPDATE 2. In my case we have 4 apps (parts of different pipelines, based on source) that write to the parquet store. Two apps, APP1 and APP2, don't use col1, and APP3 used to write it as a float.
Recently APP4 started receiving col1 in its data and stored it as a string in the parquet; parquet doesn't complain while writing.
While reading such parquet, though:
I tried casting; it didn't work.
Merging schemas failed with a data type mismatch.
I tried filtering the data based on source type. It worked partially, in the sense that if I filter out the APP4 data it works, but if I filter out the APP3 data it doesn't.
This may not be a good solution, but I had to be content with it for now.
Solutions:
1. Filter out the APP4 source data, create a dataframe and write it to parquet; then read only the APP4 source data into a dataframe, drop col1, and write that to parquet as well.
2. Or drop the column from the whole dataframe and write it to parquet:
df1 = df.select([c for c in df.columns if c != 'col1'])

Spark job which writes data frame to hdfs is aborted FileFormatWriter.scala:196

I am trying to store a data frame to HDFS using the following Spark Scala code.
All the columns in the data frame are nullable = true
Intermediate_data_final.coalesce(100).write
.option("header", value = true)
.option("compression", "bzip2")
.mode(SaveMode.Append)
.csv(path)
But I am getting this error:
2019-08-08T17:22:21.108+0000: [GC (Allocation Failure) [PSYoungGen: 979968K->34277K(1014272K)] 1027111K->169140K(1473536K), 0.0759544 secs] [
Times: user=0.61 sys=0.18, real=0.07 secs]
2019-08-08T17:22:32.032+0000: [GC (Allocation Failure) [PSYoungGen: 1014245K->34301K(840192K)] 1149108K->263054K(1299456K), 0.0540687 secs] [
Times: user=0.49 sys=0.13, real=0.05 secs]
Job aborted.
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
Can anybody please help me with this?
Not the fix for your problem, I'm afraid, but if anyone is getting this issue using pyspark, I managed to solve it by separating the query execution and the writing execution into separate commands:
df.select(foo).filter(bar).map(baz).write.parquet(out_path)
Would fail with this error message (for a 3.5 GB dataframe), but the following worked fine:
x = df.select(foo).filter(bar).map(baz)
x.write.parquet(out_path)

How to partition a table in scala with the proper name

I have a large DataFrame in Spark 2.4.0 (Scala) that looks like this:
+--------------------+--------------------+--------------------+-------------------+--------------+------+
| cookie| updated_score| probability| date_last_score|partition_date|target|
+--------------------+--------------------+--------------------+-------------------+--------------+------+
|00000000000001074780| 0.1110987111481027| 0.27492987342938174|2019-03-29 16:00:00| 2019-04-07_10| 0|
|00000000000001673799| 0.02621894072693878| 0.2029688362968775|2019-03-19 08:00:00| 2019-04-07_10| 0|
|00000000000002147908| 0.18922034021212567| 0.3520678649755828|2019-03-31 19:00:00| 2019-04-09_12| 1|
|00000000000004028302| 0.06803669083452231| 0.23089047208736854|2019-03-25 17:00:00| 2019-04-07_10| 0|
and this schema:
root
|-- cookie: string (nullable = true)
|-- updated_score: double (nullable = true)
|-- probability: double (nullable = true)
|-- date_last_score: string (nullable = true)
|-- partition_date: string (nullable = true)
|-- target: integer (nullable = false)
Then I create a partitioned table and insert the data into database.table_name. But when I look at the Hive database and type show partitions database.table_name, I only get partition_date=0 and partition_date=1, and 0 and 1 are not values from the partition_date column.
I don't know if I wrote something wrong, if there are some Scala concepts that I don't understand, or if the dataframe is too large.
I've tried different ways to do this, looking at similar questions such as:
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
or
result_df.write.mode(SaveMode.Overwrite).saveAsTable("table_name")
In case it helps, I provide some INFO messages from the logs.
Looking at these messages, I think my result_df is partitioned properly:
19/07/31 07:53:57 INFO TaskSetManager: Starting task 11.0 in stage 2822.0 (TID 123456, ip-xx-xx-xx.aws.local.somewhere, executor 45, partition 11, PROCESS_LOCAL, 7767 bytes)
19/07/31 07:53:57 INFO TaskSetManager: Starting task 61.0 in stage 2815.0 (TID 123457, ip-xx-xx-xx-xyz.aws.local.somewhere, executor 33, partition 61, NODE_LOCAL, 8095 bytes)
Then it starts saving the partitions as a Vector(0, 1, 2...), but maybe only 0 and 1 get saved? I don't really know.
19/07/31 07:56:02 INFO DAGScheduler: Submitting 35 missing tasks from ShuffleMapStage 2967 (MapPartitionsRDD[130590] at insertInto at evaluate_decay_factor.scala:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/31 07:56:02 INFO YarnScheduler: Adding task set 2967.0 with 35 tasks
19/07/31 07:56:02 INFO DAGScheduler: Submitting ShuffleMapStage 2965 (MapPartitionsRDD[130578] at insertInto at evaluate_decay_factor.scala:165), which has no missing parents
My code looks like this:
val createTableSQL = s"""
CREATE TABLE IF NOT EXISTS table_name (
cookie string,
updated_score float,
probability float,
date_last_score string,
target int
)
PARTITIONED BY (partition_date string)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY')
"""
spark.sql(createTableSQL)
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
Given a dataframe like this:
val result = Seq(
(8, "123", 1.2, 0.5, "bat", "2019-04-04_9"),
(64, "451", 3.2, -0.5, "mouse", "2019-04-04_12"),
(-27, "613", 8.2, 1.5, "horse", "2019-04-04_10"),
(-37, "513", 4.33, 2.5, "horse", "2019-04-04_11"),
(45, "516", -3.3, 3.4, "bat", "2019-04-04_10"),
(12, "781", 1.2, 5.5, "horse", "2019-04-04_11")
I want to run show partitions table_name on the Hive command line and get:
partition_date=2019-04-04_9
partition_date=2019-04-04_10
partition_date=2019-04-04_11
partition_date=2019-04-04_12
Instead, my output is:
partition_date=0
partition_date=1
In this simple example case it works perfectly, but with my large dataframe I get the previous output.
To change the number of partitions, use repartition(numOfPartitions)
To change the column you partition by when writing, use partitionBy("col")
example used together: final_df.repartition(40).write.partitionBy("txnDate").mode("append").parquet(destination)
Two helpful hints:
Make your repartition size equal to the number of worker cores for quickest write / repartition. In this example, I have 10 executors, each with 4 cores (40 cores total). Thus, I set it to 40.
When you are writing to a destination, don't specify anything more than the sub-bucket -- let Spark handle the partition directories.
good destination: "s3a://prod/subbucket/"
bad destination: s"s3a://prod/subbucket/txndate=$txndate"
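Applied to the dataframe in the question, something like the following sketch (it assumes you let Spark create and own the table via saveAsTable instead of pre-creating it with CREATE TABLE, since partitionBy() cannot be combined with insertInto()):

import org.apache.spark.sql.SaveMode

// Let Spark derive one directory per distinct partition_date value
result_df
  .repartition(40)                      // roughly: total executor cores
  .write
  .partitionBy("partition_date")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("database.table_name")

show partitions database.table_name should then list one entry per distinct partition_date value.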

Parquet error when joining two dataframes. UnsupportedOperationException: ...PlainValuesDictionary$PlainLongDictionary

I have 2 dataframes with the following schemas:
root
|-- pid: long (nullable = true)
|-- lv: timestamp (nullable = true)
root
|-- m_pid: long (nullable = true)
|-- vp: double (nullable = true)
|-- created: timestamp (nullable = true)
If I try to show() either of these 2 dataframes, everything is OK; the top 20 rows are displayed.
If I try to join these 2 dataframes and show the result (the error does not appear at the join transformation, only when calling the "show" action):
var joined = df1.join(df2, df2("pid") === df1("m_pid")).drop("m_pid")
joined.show()
I get an error which I do not understand. It is related to parquet. One of the dataframes is read from parquet (the other one from text), but if it were a problem related to reading the data, then why does the problem appear only when joining the dataframes and not when showing them individually?
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 403.0 failed 4 times, most recent failure: Lost task 51.3 in stage 403.0 : java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
...
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:354)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Does anyone have an idea what causes the error and how to work around it?

ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data

I'm attempting to use spark-avro with Google Analytics avro data files, from one of our clients. Also I'm new to spark/scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.
I'm experimenting with the data in the spark-shell which I'm kicking off like this:
spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
Then I'm running the following commands:
import com.databricks.spark.avro._
import scala.collection.mutable._
val gadata = sqlContext.avroFile("[client]/data")
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string,medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem:string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled:boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...
val gaIds = gadata.map(ga => ga.getString(11)).collect()
I get the following error:
[Stage 2:=> (8 + 4) / 430]15/05/14 11:14:04 ERROR Executor: Exception in task 12.0 in stage 2.0 (TID 27)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 ERROR TaskSetManager: Task 12 in stage 2.0 failed 1 times; aborting job
15/05/14 11:14:04 WARN TaskSetManager: Lost task 11.0 in stage 2.0 (TID 26, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 25, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 9.0 in stage 2.0 (TID 24, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 13.0 in stage 2.0 (TID 28, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 1 times, most recent failure: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I thought this might be to do with the index I was using, but the following statement works OK.
scala> gadata.first().getString(11)
res12: String = 29456309767885
So I thought that maybe some of the records might be empty or have a different number of columns, so I attempted to run the following statement to get a list of all the record lengths:
scala> gadata.map(ga => ga.length).collect()
But I get a similar error:
[Stage 4:=> (8 + 4) / 430]15/05/14 11:20:04 ERROR Executor: Exception in task 12.0 in stage 4.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 ERROR TaskSetManager: Task 12 in stage 4.0 failed 1 times; aborting job
15/05/14 11:20:04 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 41, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 ERROR Executor: Exception in task 13.0 in stage 4.0 (TID 43)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 39, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 40, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 4.0 failed 1 times, most recent failure: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Is this an issue with Spark-Avro or Spark?
Not sure what the underlying issue was, but I've managed to fix the error by breaking up my data into monthly sets. I had 4 months' worth of GA data in a single folder and was operating on all the data. The data ranged from 70 MB to 150 MB per day.
After creating 4 folders for January, February, March & April and loading them up individually, the map succeeds without any issues. Once loaded, I can join the data sets together (only tried two so far) and work on them without issue.
I'm using Spark on a pseudo-distributed Hadoop setup; not sure if this makes a difference to the volume of data Spark can handle.
UPDATE:
Found the root issue with the error. I loaded up each month's data and printed out the schema. January and February are identical, but after that a field goes walkabout in March and April's schemas:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- totals: struct (nullable = true)
| |-- visits: long (nullable = true)
| |-- hits: long (nullable = true)
| |-- pageviews: long (nullable = true)
| |-- timeOnSite: long (nullable = true)
| |-- bounces: long (nullable = true)
| |-- transactions: long (nullable = true)
| |-- transactionRevenue: long (nullable = true)
| |-- newVisits: long (nullable = true)
| |-- screenviews: long (nullable = true)
| |-- uniqueScreenviews: long (nullable = true)
| |-- timeOnScreen: long (nullable = true)
| |-- totalTransactionRevenue: long (nullable = true)
(snipped)
After February, the totalTransactionRevenue field at the bottom is not present anymore. So I assume this is causing the error and is related to this issue.
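For anyone hitting the same kind of schema drift, a small sketch of the per-month check described above (the folder names are assumptions; avroFile comes from spark-avro, as in the question):

import com.databricks.spark.avro._

// Load each month separately and print its schema so a disappearing field
// (here totals.totalTransactionRevenue) stands out immediately
val months = Seq("january", "february", "march", "april")
months.foreach { m =>
  val monthly = sqlContext.avroFile(s"[client]/data/$m")
  println(s"--- $m ---")
  monthly.printSchema()
}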