How to partition a table in scala with the proper name

I have a large DataFrame in Spark 2.4.0 (Scala) that looks like this:
+--------------------+--------------------+--------------------+-------------------+--------------+------+
| cookie| updated_score| probability| date_last_score|partition_date|target|
+--------------------+--------------------+--------------------+-------------------+--------------+------+
|00000000000001074780| 0.1110987111481027| 0.27492987342938174|2019-03-29 16:00:00| 2019-04-07_10| 0|
|00000000000001673799| 0.02621894072693878| 0.2029688362968775|2019-03-19 08:00:00| 2019-04-07_10| 0|
|00000000000002147908| 0.18922034021212567| 0.3520678649755828|2019-03-31 19:00:00| 2019-04-09_12| 1|
|00000000000004028302| 0.06803669083452231| 0.23089047208736854|2019-03-25 17:00:00| 2019-04-07_10| 0|
and this schema:
root
|-- cookie: string (nullable = true)
|-- updated_score: double (nullable = true)
|-- probability: double (nullable = true)
|-- date_last_score: string (nullable = true)
|-- partition_date: string (nullable = true)
|-- target: integer (nullable = false)
Then I create a partitioned table and insert the data into database.table_name. But when I look at the Hive database and run show partitions database.table_name, I only get partition_date=0 and partition_date=1, and 0 and 1 are not values from the partition_date column.
I don't know whether I wrote something wrong, whether there are some Scala concepts I don't understand, or whether the DataFrame is too large.
I've tried different ways to do this, based on similar questions, such as:
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
or
result_df.write.mode(SaveMode.Overwrite).saveAsTable("table_name")
In case it helps, here are some INFO messages from the Spark log:
Looking at these messages, I think my result_df is partitioned properly.
19/07/31 07:53:57 INFO TaskSetManager: Starting task 11.0 in stage 2822.0 (TID 123456, ip-xx-xx-xx.aws.local.somewhere, executor 45, partition 11, PROCESS_LOCAL, 7767 bytes)
19/07/31 07:53:57 INFO TaskSetManager: Starting task 61.0 in stage 2815.0 (TID 123457, ip-xx-xx-xx-xyz.aws.local.somewhere, executor 33, partition 61, NODE_LOCAL, 8095 bytes)
Then it starts saving the partitions as a Vector(0, 1, 2...), but maybe it only saves 0 and 1? I don't really know.
19/07/31 07:56:02 INFO DAGScheduler: Submitting 35 missing tasks from ShuffleMapStage 2967 (MapPartitionsRDD[130590] at insertInto at evaluate_decay_factor.scala:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/31 07:56:02 INFO YarnScheduler: Adding task set 2967.0 with 35 tasks
19/07/31 07:56:02 INFO DAGScheduler: Submitting ShuffleMapStage 2965 (MapPartitionsRDD[130578] at insertInto at evaluate_decay_factor.scala:165), which has no missing parents
My code looks like this:
val createTableSQL = s"""
CREATE TABLE IF NOT EXISTS table_name (
cookie string,
updated_score float,
probability float,
date_last_score string,
target int
)
PARTITIONED BY (partition_date string)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY')
"""
spark.sql(createTableSQL)
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
Given a dataframe like this:
val result = Seq(
(8, "123", 1.2, 0.5, "bat", "2019-04-04_9"),
(64, "451", 3.2, -0.5, "mouse", "2019-04-04_12"),
(-27, "613", 8.2, 1.5, "horse", "2019-04-04_10"),
(-37, "513", 4.33, 2.5, "horse", "2019-04-04_11"),
(45, "516", -3.3, 3.4, "bat", "2019-04-04_10"),
(12, "781", 1.2, 5.5, "horse", "2019-04-04_11")
)
I want to run show partitions table_name on the Hive command line and get:
partition_date=2019-04-04_9
partition_date=2019-04-04_10
partition_date=2019-04-04_11
partition_date=2019-04-04_12
Instead, my output is:
partition_date=0
partition_date=1
In this simple example case it works perfectly, but with my large dataframe I get the previous output.

To change the number of partitions, use repartition(numOfPartitions).
To change the column you partition by when writing, use partitionBy("col").
Example using both together: final_df.repartition(40).write.partitionBy("txnDate").mode("append").parquet(destination)
Two helpful hints:
Make the repartition count equal to the number of worker cores for the quickest write and repartition. In this example, I have 10 executors with 4 cores each (40 cores total), so I set it to 40.
When you write to a destination, don't specify anything deeper than the sub-bucket; let Spark create the partition directories.
good destination: "s3a://prod/subbucket/"
bad destination: s"s3a://prod/subbucket/txndate=$txndate"
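One likely explanation for partition_date=0 and partition_date=1, offered here as an assumption since the answer above does not cover it: DataFrameWriter.insertInto matches columns by position, not by name. In the question's schema the last DataFrame column is target (values 0 and 1), while the table's partition column partition_date comes last in the table layout, so the target values would end up in the partition column. A minimal PySpark-style sketch of reordering the columns before the insert follows; align_to_table and TABLE_COLUMNS are hypothetical names, and the Scala DataFrame API offers the same select and insertInto calls.

```python
# Column order of the Hive table from the question: data columns first,
# partition column last.
TABLE_COLUMNS = [
    "cookie", "updated_score", "probability",
    "date_last_score", "target", "partition_date",
]


def align_to_table(df, table_columns=TABLE_COLUMNS):
    """Return df with its columns selected in the table's order.

    `df` is expected to be a pyspark.sql.DataFrame; anything exposing
    `.columns` and `.select(*names)` works.
    """
    missing = [c for c in table_columns if c not in df.columns]
    if missing:
        raise ValueError("DataFrame lacks columns: %s" % missing)
    return df.select(*table_columns)


# Usage with a real DataFrame (not executed here):
#   align_to_table(result_df).write.mode("overwrite").insertInto("table_name")
```

Depending on the Hive configuration, a dynamic-partition insert may also need hive.exec.dynamic.partition.mode set to nonstrict.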

Related

Rdd with tuples of different size to dataframe

Using PySpark map-reduce methods I created an RDD. I now want to create a DataFrame from this RDD.
The RDD looks like this:
(491023, ((9,), (0.07971896408231094,), 'Debt collection'))
(491023, ((2, 14, 77, 22, 6, 3, 39, 7, 0, 1, 35, 84, 10, 8, 32, 13), (0.017180308460902963, 0.02751921818456658, 0.011887861159888378, 0.00859908577494079, 0.007521091815230704, 0.006522044953782423, 0.01032297079810829, 0.018976833302472455, 0.007634289723749076, 0.003033975857850723, 0.018805184361326378, 0.011217892399539534, 0.05106916198426676, 0.007901136066759178, 0.008895262042995653, 0.006665649645210911), 'Debt collection'))
(491023, ((36, 12, 50, 40, 5, 23, 58, 76, 11, 7, 65, 0, 1, 66, 16, 99, 98, 45, 13), (0.007528732561416072, 0.017248902490279026, 0.008083896178333739, 0.008274896865005982, 0.0210032206108319, 0.02048387345320946, 0.010225319903418824, 0.017842961406992965, 0.012026753813481164, 0.005154201637708568, 0.008274127579967948, 0.0168843021403551, 0.007416385430301767, 0.009257236955148311, 0.00590385362565239, 0.011031745337733267, 0.011076277004617665, 0.01575522984526745, 0.005431270081282964), 'Vehicle loan or lease'))
As you can see, my DataFrame must have 4 different columns. The first should be the int 491023, the second a tuple (I think DataFrames don't have a tuple type, so an array also works), the third another tuple, and the fourth a string. As you can see, my tuples have different sizes.
The simplest command, rdd.toDF(), doesn't work for me. Any ideas how I can achieve that?
You can create your DataFrame as shown below; you can pass a list, which becomes an array (ArrayType()) column.
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',['12','34'],1590038340000)],[ "reg","val1","val2"])
Output
+------+--------+-------------+
| reg| val1| val2|
+------+--------+-------------+
|N110WA|[12, 34]|1590038340000|
+------+--------+-------------+
Schema
df_a.printSchema()
root
|-- reg: string (nullable = true)
|-- val1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- val2: long (nullable = true)
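For the RDD in the question, one option (a sketch, assuming every record has the shape (id, (indices, scores, label)) shown above) is to convert the variable-length tuples to lists before calling toDF. Spark infers Python lists as array columns, and arrays may differ in length from row to row, whereas tuples are inferred as structs with a fixed number of fields. flatten_record and the column names are hypothetical.

```python
def flatten_record(record):
    """Turn (id, (indices, scores, label)) into a flat 4-field row.

    Tuples become lists so that schema inference sees arrays rather
    than structs with a varying number of fields.
    """
    key, (indices, scores, label) = record
    return (key, list(indices), list(scores), label)


sample = (491023, ((9,), (0.07971896408231094,), "Debt collection"))
print(flatten_record(sample))

# With the question's RDD (not executed here, needs a SparkSession):
#   df = rdd.map(flatten_record).toDF(["id", "indices", "scores", "label"])
```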

NumberFormatException when trying to perform sort() or orderBy() on a dataframe in spark using scala?

I have a data frame df that has 3 columns (as included in the image).
[image: data frame]
When I execute
import sqlContext.implicits._
df.sort($"count".desc)
or
import org.apache.spark.sql.functions._
df.orderBy(desc("count"))
this appears to succeed, but when I try to show() or collect(), I get the following error:
18/07/06 05:06:56, 594 INFO SparkContext: Starting job: show at :52
18/07/06 05:06:56, 596 INFO DAGScheduler: Got job 6 (show at :52) with 2 output partitions
18/07/06 05:06:56, 596 INFO DAGScheduler: Final stage: ResultStage 6 (show at :52)
18/07/06 05:06:56, 596 INFO DAGScheduler: Parents of final stage: List()
18/07/06 05:06:56, 596 INFO DAGScheduler: Missing parents: List()
18/07/06 05:06:56, 596 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[20] at show at :52), which has no missing parents
.
.
.
Lost task 1.0 in stage 6.0 (TID 11, localhost): java.lang.NumberFormatException: For input string: "Sint Eustatius"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
.
.
.
.
and so on.
I only included some lines because the trace is too big. Is there any other way to sort this df based on the count column?
edit 1
This is the result of displaying the dataframe.
df.show()
edit 2
When I try to execute using sqlContext, in the following way:
val df1=sqlContext.sql("SELECT * from df order by count desc").collect()
I get a table not found error. How should I convert df into a table?
Understand that Spark won't compute anything until you apply an action to a DataFrame/RDD; in other words, RDDs/DFs are lazily evaluated. See the Spark documentation.
In your case orderBy and sort are transformations, and Spark won't execute any code until you call an action; show and collect are actions, which tell Spark to run the orderBy or sort and return the result.
The error you have now is due to the string Sint Eustatius in the column count: the values are strings, and this one cannot be converted to an Integer.
Validate your data and make sure the column count contains only integer values; this should solve your issue.
Here is my DataFrame and the queries I wrote, which work fine.
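import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val sparkSession=SparkSession.builder().master("local").appName("LearnScala").getOrCreate()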
val data = sparkSession.sparkContext.parallelize(Seq(Row(1, "A", "B"), Row(2, "A", "B")))
val schema = StructType(Array( StructField("col1", IntegerType, false),StructField("col2", StringType, false), StructField("col3", StringType, false)))
val df = sparkSession.createDataFrame(data, schema)
//df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| B|
| 2| A| B|
+----+----+----+
df.orderBy(desc("col1")).show
df.sort(df.col("col1").desc).show
// Both expressions above produce the same output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| A| B|
| 1| A| B|
+----+----+----+
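The stack trace in the question ends in StringOps.toInt, which suggests (an assumption, since the code that builds df is not shown) that count is parsed from text in an earlier step and that the parse only runs once show() or collect() forces evaluation. A tolerant parser in that earlier step lets bad rows be filtered out instead of failing the job. The sketch below is in Python with a hypothetical helper to_int_or_none; in Scala, scala.util.Try(s.toInt).toOption plays the same role.

```python
def to_int_or_none(text):
    """Parse text as an int, returning None instead of raising."""
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        # ValueError: not a number; AttributeError: text is None.
        return None


raw_counts = ["12", " 7 ", "Sint Eustatius", None]
parsed = [to_int_or_none(value) for value in raw_counts]
valid = [value for value in parsed if value is not None]

print(parsed)                       # unparseable entries become None
print(sorted(valid, reverse=True))  # descending sort over the clean values
```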

Spark: 'Requested array size exceeds VM limit' when writing dataframe

I am running into an "OutOfMemoryError: Requested array size exceeds VM limit" error when running my Scala Spark job.
I'm running this job on an AWS EMR cluster with the following makeup:
Master: 1 m4.4xlarge 32 vCore, 64 GiB memory
Core: 1 r3.4xlarge 32 vCore, 122 GiB memory
The version of Spark I'm using is 2.2.1 on EMR release label 5.11.0.
I'm running my job in a spark shell with the following configurations:
spark-shell --conf spark.driver.memory=40G \
  --conf spark.driver.maxResultSize=25G \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2000 \
  --conf spark.rpc.message.maxSize=2000 \
  --conf spark.dynamicAllocation.enabled=true
What I'm attempting to do with this job is to convert a one column dataframe of objects into a one row dataframe that contains a list of those objects.
The objects are as follows:
case class Properties (id: String)
case class Geometry (`type`: String, coordinates: Seq[Seq[Seq[String]]])
case class Features (`type`: String, properties: Properties, geometry: Geometry)
And my dataframe schema is as follows:
root
|-- geometry: struct (nullable = true)
| |-- type: string (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: array (containsNull = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
|-- type: string (nullable = false)
|-- properties: struct (nullable = false)
| |-- id: string (nullable = true)
I'm converting it to a list and adding it to a one row dataframe like so:
val x = Seq(df.collect.toList)
final_df.withColumn("features", typedLit(x))
I don't run into any issues when creating this list and it's pretty quick. However, there seems to be a limit to the size of this list when I try to write it out by doing either of the following:
final_df.first
final_df.write.json(s"s3a://<PATH>/")
I've tried to also convert the list to a dataframe by doing the following, but it seems to never end.
val x = Seq(df.collect.toList)
val y = x.toDF
The largest list I've been capable of getting this dataframe to work with had 813318 Features objects, each of which contains a Geometry object that contains a list of 33 elements, for a total of 29491869 elements.
Attempting to write pretty much any list larger than that gives me the following stacktrace when running my job.
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 33028"...
os::fork_and_exec failed: Cannot allocate memory (12)
18/03/29 21:41:35 ERROR FileFormatWriter: Aborting job null.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.write(UnsafeArrayWriter.java:217)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply1_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd$lzycompute(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.doExecute(LocalTableScanExec.scala:52)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
I've tried making a million configuration changes, including throwing both more driver and executor memory at this job, but to no avail. Is there any way around this? Any ideas?
The problem is here
val x = Seq(df.collect.toList)
When you call collect on a DataFrame, all of its data is sent to the driver. So if your DataFrame is big, the driver will run out of memory.
Note that out of all the memory you assign, the heap the driver can actually use for this is generally about 30% (unless changed). So the driver is choking on the data volume because of the collect operation.
You might think the DataFrame is small because of its size on disk, but that is because the data is stored there in serialized form. When you call collect, the DataFrame is materialized as JVM objects, which inflates memory use considerably (generally 5-7x).
I would recommend removing the collect and using the df DataFrame directly, because I reckon
val x = Seq(df.collect.toList) and df hold essentially the same data.
Well, there is a dataframe aggregation function that does what you want without doing a collect on the driver. For example if you wanted to collect all "feature" columns by key: df.groupBy($"key").agg(collect_list("feature")), or if you really wanted to do that for the whole dataframe without grouping: df.agg(collect_list("feature")).
However I wonder why you'd want to do that, when it seems easier to work with a dataframe with one row per object than a single row containing the entire result. Even using the collect_list aggregation function I wouldn't be surprised if you still run out of memory.

spark error in column type

I have a dataframe column called 'SupplierId', typed as a string, with mostly digit-only values but also some alphanumeric strings
(ex: ['123','456','789',......,'abc']).
I cast this column to a string using
from pyspark.sql.types import StringType
df=df.withColumn('SupplierId',df['SupplierId'].cast(StringType()))
So I check it is treated as a string using:
df.printSchema()
and I get:
root
|-- SupplierId: string (nullable = true)
But when I try to convert to Pandas, or just to use df.collect(),
I obtain the following error:
An error occurred while calling o516.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, servername.ops.somecompany.local, executor 3):
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
Exception parsing 'CPD160001' into a IntegerType$ for column "SupplierId":
Unable to deserialize value using com.somecompany.spark.parsers.text.converters.IntegerConverter.
The value being deserialized was: CPD160001
So it seems Spark treats the value of this column as integers.
I have tried using UDF to force convert to string with python, but it still doesn't work.
Do you have any idea what could cause this error?
Please do share a sample of your actual data, as your issue cannot be reproduced with toy ones:
spark.version
# u'2.2.0'
from pyspark.sql import Row
df = spark.createDataFrame([Row(1, 2, '3'),
Row(4, 5, 'a'),
Row(7, 8, '9')],
['x1', 'x2', 'id'])
df.printSchema()
# root
# |-- x1: long (nullable = true)
# |-- x2: long (nullable = true)
# |-- id: string (nullable = true)
df.collect()
# [Row(x1=1, x2=2, id=u'3'), Row(x1=4, x2=5, id=u'a'), Row(x1=7, x2=8, id=u'9')]
import pandas as pd
df_pandas = df.toPandas()
df_pandas
# x1 x2 id
# 0 1 2 3
# 1 4 5 a
# 2 7 8 9

ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data

I'm attempting to use spark-avro with Google Analytics avro data files, from one of our clients. Also I'm new to spark/scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.
I'm experimenting with the data in the spark-shell which I'm kicking off like this:
spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
Then I'm running the following commands:
import com.databricks.spark.avro._
import scala.collection.mutable._
val gadata = sqlContext.avroFile("[client]/data")
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string,medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem:string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled:boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...
val gaIds = gadata.map(ga => ga.getString(11)).collect()
I get the following error:
[Stage 2:=> (8 + 4) / 430]15/05/14 11:14:04 ERROR Executor: Exception in task 12.0 in stage 2.0 (TID 27)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 ERROR TaskSetManager: Task 12 in stage 2.0 failed 1 times; aborting job
15/05/14 11:14:04 WARN TaskSetManager: Lost task 11.0 in stage 2.0 (TID 26, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 25, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 9.0 in stage 2.0 (TID 24, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 13.0 in stage 2.0 (TID 28, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 1 times, most recent failure: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I thought this might be to do with the index I was using, but the following statement works OK.
scala> gadata.first().getString(11)
res12: String = 29456309767885
So I thought that maybe some of the records might be empty or have a different number of columns, so I attempted to run the following statement to get a list of all the record lengths:
scala> gadata.map(ga => ga.length).collect()
But I get a similar error:
[Stage 4:=> (8 + 4) / 430]15/05/14 11:20:04 ERROR Executor: Exception in task 12.0 in stage 4.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 ERROR TaskSetManager: Task 12 in stage 4.0 failed 1 times; aborting job
15/05/14 11:20:04 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 41, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 ERROR Executor: Exception in task 13.0 in stage 4.0 (TID 43)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 39, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 40, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 4.0 failed 1 times, most recent failure: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Is this an Issue with Spark-Avro or Spark?
Not sure what the underlying issue was, but I've managed to fix the error by breaking up my data into monthly sets. I had 4 months' worth of GA data in a single folder and was operating on all the data. The data ranged from 70MB to 150MB per day.
After creating 4 folders for January, February, March & April and loading them individually, the map succeeds without any issues. Once loaded, I can join the data sets together (only tried two so far) and work on them without issue.
I'm using Spark on a Pseudo Hadoop distribution, not sure if this makes a difference to the volume of data Spark can handle.
UPDATE:
Found the root cause of the error. I loaded each month's data and printed out the schema. January and February are identical, but a field goes missing in the March and April schemas:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- totals: struct (nullable = true)
| |-- visits: long (nullable = true)
| |-- hits: long (nullable = true)
| |-- pageviews: long (nullable = true)
| |-- timeOnSite: long (nullable = true)
| |-- bounces: long (nullable = true)
| |-- transactions: long (nullable = true)
| |-- transactionRevenue: long (nullable = true)
| |-- newVisits: long (nullable = true)
| |-- screenviews: long (nullable = true)
| |-- uniqueScreenviews: long (nullable = true)
| |-- timeOnScreen: long (nullable = true)
| |-- totalTransactionRevenue: long (nullable = true)
(snipped)
After February, the totalTransactionRevenue field at the bottom is no longer present. So I assume this is causing the error and is related to this issue
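A quick way to spot this kind of schema drift before loading every month together is to compare the field names of each batch. The sketch below is plain Python working on lists of field names; schema_diff is a hypothetical helper, and the names would have to be copied from each month's printed schema.

```python
def schema_diff(reference, other):
    """Return (missing, extra): fields only in reference, fields only in other."""
    ref_set, other_set = set(reference), set(other)
    missing = [name for name in reference if name not in other_set]
    extra = [name for name in other if name not in ref_set]
    return missing, extra


# Field names of the `totals` struct, taken from the schema above.
january_totals = [
    "visits", "hits", "pageviews", "timeOnSite", "bounces", "transactions",
    "transactionRevenue", "newVisits", "screenviews", "uniqueScreenviews",
    "timeOnScreen", "totalTransactionRevenue",
]
# March and April lack the last field, as described in the update.
march_totals = january_totals[:-1]

print(schema_diff(january_totals, march_totals))
```

Any non-empty result points at a batch that should be loaded separately or aligned to a common schema before the data sets are combined.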