ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data

ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data - scala

I'm attempting to use spark-avro with Google Analytics avro data files, from one of our clients. Also I'm new to spark/scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.
I'm experimenting with the data in the spark-shell which I'm kicking off like this:
spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
Then I'm running the following commands:
import com.databricks.spark.avro._
import scala.collection.mutable._
val gadata = sqlContext.avroFile("[client]/data")
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,tr ansactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScre en:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string, medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem :string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled: boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...
val gaIds = gadata.map(ga => ga.getString(11)).collect()
I get the following error:
[Stage 2:=> (8 + 4) / 430]15/05/14 11:14:04 ERROR Executor: Exception in task 12.0 in stage 2.0 (TID 27)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 ERROR TaskSetManager: Task 12 in stage 2.0 failed 1 times; aborting job
15/05/14 11:14:04 WARN TaskSetManager: Lost task 11.0 in stage 2.0 (TID 26, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 25, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 9.0 in stage 2.0 (TID 24, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 13.0 in stage 2.0 (TID 28, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 1 times, most recent failure: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I though this might be too do with the index I was using, but the following statement works OK.
scala> gadata.first().getString(11)
res12: String = 29456309767885
So I though that maybe some of the records might be empty or have different amount of columns... so I attempted to run the following statement to get a list of all the record lengths:
scala> gadata.map(ga => ga.length).collect()
But I get a similar error:
[Stage 4:=> (8 + 4) / 430]15/05/14 11:20:04 ERROR Executor: Exception in task 12.0 in stage 4.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 ERROR TaskSetManager: Task 12 in stage 4.0 failed 1 times; aborting job
15/05/14 11:20:04 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 41, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 ERROR Executor: Exception in task 13.0 in stage 4.0 (TID 43)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 39, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 40, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 4.0 failed 1 times, most recent failure: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Is this an Issue with Spark-Avro or Spark?

Not sure what the underlying issue was, but I've managed to fix the error by breaking up my data into monthly sets. I had 4 months worth of GA data in a single folder and was operating on all the data. The data ranged from 70MB to 150MB per day.
Creating 4 folders for January, February, March & April and loading them up individually the map succeeds without any issues. Once loaded I can join the data set together (only tried two so far) and work on them, without issue.
I'm using Spark on a Pseudo Hadoop distribution, not sure if this makes a difference to the volume of data Spark can handle.
UPDATE:
Found the root issue with the error. I loaded up each months data and printout the schema. Both January and February are identical but after this a field goes walk about in March and Aprils schemas:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- totals: struct (nullable = true)
| |-- visits: long (nullable = true)
| |-- hits: long (nullable = true)
| |-- pageviews: long (nullable = true)
| |-- timeOnSite: long (nullable = true)
| |-- bounces: long (nullable = true)
| |-- transactions: long (nullable = true)
| |-- transactionRevenue: long (nullable = true)
| |-- newVisits: long (nullable = true)
| |-- screenviews: long (nullable = true)
| |-- uniqueScreenviews: long (nullable = true)
| |-- timeOnScreen: long (nullable = true)
| |-- totalTransactionRevenue: long (nullable = true)
(snipped)
After February the totalTransactionRevenuse at the bottom is not present anymore. So I assume this is causing the error and is related to this issue

Related

How do I handle a null value in the a scala UDF?

I understand that there are many SO answers related to what I am asking, but since I am very new to scala, I am not able to understand those answer. Would really appreciate if someone please help me correct my UDF.
I have this UDF which is meant to do the timezone conversion from GMT to MST:
val Gmt2Mst = (dtm_str: String, inFmt: String, outFmt: String) => {
if ("".equals(dtm_str) || dtm_str == null || dtm_str.length() < inFmt.length()) {
null
}
else {
val gmtZoneId = ZoneId.of("GMT", ZoneId.SHORT_IDS);
val mstZoneId = ZoneId.of("MST", ZoneId.SHORT_IDS);
val inFormatter = DateTimeFormatter.ofPattern(inFmt);
val outFormatter = DateTimeFormatter.ofPattern(outFmt);
val dateTime = LocalDateTime.parse(dtm_str, inFormatter);
val gmt = ZonedDateTime.of(dateTime, gmtZoneId)
val mst = gmt.withZoneSameInstant(mstZoneId)
mst.format(outFormatter)
}
}
spark.udf.register("Gmt2Mst", Gmt2Mst)
But whenever there is NULL encountered it fails to handle that. I am trying to handle it using dtm_str == null but it still fails. Can some please help me with what correction do I have to make instead of dtm_str == null which can help me achieve my goal?
To give an example, if I run the below spark-sql:
spark.sql("select null as col1, Gmt2Mst(null,'yyyyMMddHHmm', 'yyyyMMddHHmm') as col2").show()
I am getting this error:
22/09/20 14:10:31 INFO TaskSetManager: Starting task 101.1 in stage 27.0 (TID 1809) (10.243.37.204, executor 18, partition 101, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 100.0 in stage 27.0 (TID 1801) on 10.243.37.204, executor 18: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 1]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 100.1 in stage 27.0 (TID 1810) (10.243.37.241, executor 1, partition 100, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 102.0 in stage 27.0 (TID 1803) on 10.243.37.241, executor 1: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 2]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 102.1 in stage 27.0 (TID 1811) (10.243.36.183, executor 22, partition 102, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Finished task 80.0 in stage 27.0 (TID 1781) in 2301 ms on 10.243.36.183 (executor 22) (81/355)
22/09/20 14:10:31 INFO TaskSetManager: Starting task 108.0 in stage 27.0 (TID 1812) (10.243.36.156, executor 4, partition 108, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 103.0 in stage 27.0 (TID 1804) on 10.243.36.156, executor 4: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 3]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 103.1 in stage 27.0 (TID 1813) (10.243.36.180, executor 9, partition 103, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 WARN TaskSetManager: Lost task 105.0 in stage 27.0 (TID 1806) (10.243.36.180 executor 9): org.apache.spark.SparkException: Failed to execute user defined function (anonfun$3: (string, string) => string)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:136)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:123)
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:122)
... 15 more

I did the following test and it seems that it works:
Create a dataframe with a null type. The Schema would be:
root
|-- v0: string (nullable = true)
|-- v1: string (nullable = true)
|-- null: null (nullable = true)
for example:
+----+-----+----+
| v0| v1|null|
+----+-----+----+
|hola|adios|null|
+----+-----+----+
Create the udf:
val udf1 = udf{ v1: Any => { if(v1 != null) s"${v1}_transformed" else null } }
Note that working with Any in Scala is a bad practice, but this is Spark Sql and to handle a value that could be of two different types you would need to work with this supertype.
Register the udf:
spark.udf.register("udf1", udf1)
Create the view:
df2.createTempView("df2")
Apply the udf to the view:
spark.sql("select udf1(null) from df").show()
it shows:
+---------+
|UDF(null)|
+---------+
| null|
+---------+
Apply to a column with not null value:
spark.sql("select udf1(v0) from df2").show()
it shows:
+----------------+
| UDF(v0)|
+----------------+
|hola_transformed|
+----------------+

Re partitioning using pyspark failing with error

I have parquet in s3 folder with below column.Size of the parquet is around 40 mb.
org_id, device_id, channel_id, source, col1, col2
right now partition is on 3 column org_id device_id channel_id
I want change the partition to source, org_id, device_id, channel_id.
I am using pyspark to read file from s3 and write to s3 bucket.
sc = SparkContext(appName="parquet_ingestion1").getOrCreate()
spark = SparkSession(sc)
file_path = "s3://some-bucket/some_folder"
print("Reading parquet from s3:{}".format(file_path))
spark_df = spark.read.parquet(file_path)
print("Converting to parquet")
file_path_re = "s3://other_bucket/re-partition"
partition_columns = ["source", "org_id", "device_id", "channel_id "]
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I am getting error and parquet file is not generated.
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
[Stage 1:> (0 + 8) / 224]20/04/29 13:29:44 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, ip-172-31-43-0.ap-south-1.compute.internal, executor 3): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Then i tried
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
[Stage 3:> (0 + 8) / 224]20/04/29 13:32:11 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 23, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Stage 3:==> (8 + 8) / 224]20/04/29 13:32:22 WARN TaskSetManager: Lost task 0.2 in stage 3.0 (TID 40, ip-172-31-42-4.ap-south-1.compute.internal, executor 5): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
In 2nd case it is giving failure but it is creating parquet also.Now i am not sure it is correctly creating all the data to new partition .
Let me know how is correct way of re partitioning the parquet.
UPDATE 1:
from pyspark.sql.types import StringType
for col1 in partition_columns:
spark_df=spark_df.withColumn(col1, col(col1).cast(dataType=StringType()))
Tried both
spark_df.repartition(1).write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
spark_df.write.partitionBy(partition_columns).mode('append').parquet(file_path_re)
I get following error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 20, ip-172-31-42-4.ap-south-1.compute.internal, executor 4): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:380)
UPDATE 2:
Now i found that there is schema mismatch in one of the column one is string other is float.I have depicted the scenario below.
Here you can see col1 column is string in one row and float for other row
org_id, device_id, channel_id, source, col1, col2
"100" "device1" "channel" "source1" 10 0.1
"100" "device1" "channel" "source2" "10" 0.1
I tried casting col1 column to float.it dodn;t worked
Any suggestion.

Try force type casting all partition_columns to StringType

Root cause of the issue is mentioned in UPDATE2. In my case we have 4 apps(part of different pipeline based on source) that write to parquet store. 2 app APP1 and APP2 don't use col1 and APP 3 used to write it as float.
Recently APP4 started getting col1 in their data and stored it as string in the parquet.parquet don't complain while writing.
While reading such parquet made
I tried casting it didn't worked
merge schema failed with mismatch in data type
I tried filter data based on source type. it worked partially in the sense if filter out APP4 data it worked.but if filter out APP3 data it didn't worked.
This may not be good solution, but i had to content with this for now.
Solutions:
1. filter out app4 source data and create data frame and convert it parquet and then filter only app4 source parquet in data frame and remove col1 and convert it into parquet.
Or Remove col from whole data frame and write to parquet.
df1 =df.select([c for c in df.columns if c!= 'col1'])

How to partition a table in scala with the proper name

I have a large Dataframe in scala 2.4.0, that looks like this
+--------------------+--------------------+--------------------+-------------------+--------------+------+
| cookie| updated_score| probability| date_last_score|partition_date|target|
+--------------------+--------------------+--------------------+-------------------+--------------+------+
|00000000000001074780| 0.1110987111481027| 0.27492987342938174|2019-03-29 16:00:00| 2019-04-07_10| 0|
|00000000000001673799| 0.02621894072693878| 0.2029688362968775|2019-03-19 08:00:00| 2019-04-07_10| 0|
|00000000000002147908| 0.18922034021212567| 0.3520678649755828|2019-03-31 19:00:00| 2019-04-09_12| 1|
|00000000000004028302| 0.06803669083452231| 0.23089047208736854|2019-03-25 17:00:00| 2019-04-07_10| 0|
and this schema:
root
|-- cookie: string (nullable = true)
|-- updated_score: double (nullable = true)
|-- probability: double (nullable = true)
|-- date_last_score: string (nullable = true)
|-- partition_date: string (nullable = true)
|-- target: integer (nullable = false)
then I create a partition table and insert the data into database.table_name. But when I look up at hive database and type: show partitions database.table_name I only got partition_date=0 and partition_date=1, and 0 and 1 are not values from partition_date column.
I don't know if I wrote something wrong, there are some scala concepts that I don't understand or the dataframe is too large.
I've tried differents ways to do this looking up similar questions as:
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
or
result_df.write.mode(SaveMode.Overwrite).saveAsTable("table_name")
In case it helps I provide some INFO message from scala:
Looking at this message, I think I got my result_df partitions properly.
19/07/31 07:53:57 INFO TaskSetManager: Starting task 11.0 in stage 2822.0 (TID 123456, ip-xx-xx-xx.aws.local.somewhere, executor 45, partition 11, PROCESS_LOCAL, 7767 bytes)
19/07/31 07:53:57 INFO TaskSetManager: Starting task 61.0 in stage 2815.0 (TID 123457, ip-xx-xx-xx-xyz.aws.local.somewhere, executor 33, partition 61, NODE_LOCAL, 8095 bytes)
Then, I am starting to saving the partitions as a Vector(0, 1, 2...), but I may only save 0 and 1? I don't really know.
19/07/31 07:56:02 INFO DAGScheduler: Submitting 35 missing tasks from ShuffleMapStage 2967 (MapPartitionsRDD[130590] at insertInto at evaluate_decay_factor.scala:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/31 07:56:02 INFO YarnScheduler: Adding task set 2967.0 with 35 tasks
19/07/31 07:56:02 INFO DAGScheduler: Submitting ShuffleMapStage 2965 (MapPartitionsRDD[130578] at insertInto at evaluate_decay_factor.scala:165), which has no missing parents
My code looks like this:
val createTableSQL = s"""
CREATE TABLE IF NOT EXISTS table_name (
cookie string,
updated_score float,
probability float,
date_last_score string,
target int
)
PARTITIONED BY (partition_date string)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY')
"""
spark.sql(createTableSQL)
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
Given a dataframe like this:
val result = Seq(
(8, "123", 1.2, 0.5, "bat", "2019-04-04_9"),
(64, "451", 3.2, -0.5, "mouse", "2019-04-04_12"),
(-27, "613", 8.2, 1.5, "horse", "2019-04-04_10"),
(-37, "513", 4.33, 2.5, "horse", "2019-04-04_11"),
(45, "516", -3.3, 3.4, "bat", "2019-04-04_10"),
(12, "781", 1.2, 5.5, "horse", "2019-04-04_11")
I want to run: show partitions "table_name" on hive command line and get:
partition_date=2019-04-04_9
partition_date=2019-04-04_10
partition_date=2019-04-04_11
partition_date=2019-04-04_12
Instead in my output is:
partition_date=0
partition_date=1
In this simple example case it works perfectly, but with my large dataframe I get the previous output.

To change the number of partitions, use repartition(numOfPartitions)
To change the column you partition by when writing, use partitionBy("col")
example used together: final_df.repartition(40).write.partitionBy("txnDate").mode("append").parquet(destination)
Two helpful hints:
Make your repartition size equal to the number of worker cores for quickest write / repartition. In this example, I have 10 executors, each with 4 cores (40 cores total). Thus, I set it to 40.
When you are writing to a destination, don't specify anything more than the sub bucket -- let spark handle the indexing.
good destination: "s3a://prod/subbucket/"
bad destination: s"s3a://prod/subbucket/txndate=$txndate"

To print output of SparkSQL to dataframe

I'm currently running Analyze command for particular table and could see the statistics being printed in the Spark-Console
However when I try to write the output to a DF I could not see the statistics.
Spark Version : 1.6.3
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS").collect()
Output in spark Shell
Partition sample{company=aaa, market=aab, etdate=2019-01-03, p=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
19/03/22 02:49:33 INFO Task: Partition sample{company=aaa, market=aab, edate=2019-01-03, pdate=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
Output of dataframe
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=runTasks start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=Driver.execute start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO Driver: OK
19/03/22 02:49:40 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
19/03/22 02:49:40 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 940 bytes result sent to driver
19/03/22 02:49:40 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 4 ms on localhost (1/1)
19/03/22 02:49:40 INFO DAGScheduler: ResultStage 2 (show at <console>:47) finished in 0.004 s
19/03/22 02:49:40 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/22 02:49:40 INFO DAGScheduler: Job 2 finished: show at <console>:47, took 0.007774 s
+------+
|result|
+------+
+------+
Could you please let me know how to get the same statistics output into the Dataframe.
Thanks.!

If you want to print from a Dataframe the way you are using, you can use,
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS")
a.select("*").show()

Parquet error when joining two dataframes. UnsupportedOperationException: ...PlainValuesDictionary$PlainLongDictionary

I have 2 dataframes with the following schema:
root
|-- pid: long (nullable = true)
|-- lv: timestamp (nullable = true)
root
|-- m_pid: long (nullable = true)
|-- vp: double (nullable = true)
|-- created: timestamp (nullable = true)
If I try to show() any of these 2 dataframes everything is ok, top 20 rows are displayed.
If I try to join these 2 dataframes and show the result (error does not appear at the join transformation, only when calling "show" action)
var joined = df1.join(df2, df2("pid") === df1("m_pid")).drop("m_pid")
joined.show()
I get an error which I do not understand. It is related to parquet. One of the dataframes is read from a parquet (the other one from text) but if it would be a problem related to reading the data then why does the problem appear only when joining the dataframes, not when showing the individually.
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 403.0 failed 4 times, most recent failure: Lost task 51.3 in stage 403.0 : java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
...
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:354)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Does anyone have an ideea what causes the error and how to go around it?