Parquet error when joining two dataframes. UnsupportedOperationException: ...PlainValuesDictionary$PlainLongDictionary - scala

I have 2 dataframes with the following schema:
root
|-- pid: long (nullable = true)
|-- lv: timestamp (nullable = true)
root
|-- m_pid: long (nullable = true)
|-- vp: double (nullable = true)
|-- created: timestamp (nullable = true)
If I show() either of these two dataframes, everything is fine; the top 20 rows are displayed.
But if I join the two dataframes and show the result (the error does not appear at the join transformation, only when calling the "show" action):
var joined = df1.join(df2, df1("pid") === df2("m_pid")).drop("m_pid")
joined.show()
I get an error which I do not understand. It is related to Parquet. One of the dataframes is read from a Parquet file (the other one from text), but if it were a problem with reading the data, why does the problem appear only when joining the dataframes and not when showing them individually?
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 403.0 failed 4 times, most recent failure: Lost task 51.3 in stage 403.0 : java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
...
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:354)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Does anyone have an idea what causes the error and how to work around it?
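Edit: for context, the trace goes through Spark's vectorized Parquet reader (OnHeapColumnVector.getDouble on a long dictionary), which suggests a long column in the Parquet file is being decoded as a double. A minimal way to probe this, purely as a sketch (the path is a placeholder, and disabling the vectorized reader is a diagnostic experiment, not a confirmed fix):

// Compare the schema Spark reads from the Parquet file with the schema the join uses.
// decodeToDouble on a long dictionary usually means a long column is being read as double.
val fromParquet = spark.read.parquet("/path/to/parquet")  // placeholder path
fromParquet.printSchema()

// Diagnostic experiment: fall back to the non-vectorized Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")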

Related

pyspark, aggregation results in an error

I can print the dataframe fine before aggregation:
(Pdb) df_interesting.printSchema()
root
|-- userId: long (nullable = true)
|-- screen_index: integer (nullable = true)
|-- type: string (nullable = true)
|-- time_delta: float (nullable = true)
|-- app_open_index: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
(Pdb) df_interesting.show(n=2)
+------+------------+------+----------+--------------+--------------------+
|userId|screen_index| type|time_delta|app_open_index| timestamp|
+------+------------+------+----------+--------------+--------------------+
|214431| 7|screen| 60.0| 13|2020-07-31 07:52:...|
|398910| 3|screen| 60.0| 2|2020-07-29 11:43:...|
+------+------------+------+----------+--------------+--------------------+
However, after aggregation, show() results in an error.
(Pdb) df_interesting.groupBy('app_open_index').agg(F.max("screen_index").alias("max_screen_index")).show(n=2)
[Stage 1:> (0 + 2) / 2]20/08/13 18:07:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: The value (Buffer()) of the type (scala.collection.convert.Wrappers.JListWrapper) cannot be converted to the string type
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:290)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:285)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:248)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
Edit
I tried to single out the column, and here's some progress
(Pdb) df_interesting = df_interesting.select(col('data.userId').alias('userId'))
(Pdb) df_interesting.count()
[Stage 0:> (0 + 2) / 2]20/08/13 18:59:12 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'data.properties.priceObj' not found; typically this occurs \
with arrays which are not mapped as single value
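That last error comes from the elasticsearch-hadoop connector rather than from Spark itself: the field is an array in some documents but is not declared as one. A commonly suggested connector option for this case is sketched below; treat it as an assumption about the setup, not a confirmed fix (it is shown in Scala even though the question uses PySpark; the option string is the same, and the field name is simply the one from the error).

// Assumes an active SparkSession named `spark` (as in spark-shell / pyspark).
// Tell the connector which Elasticsearch fields should be read as arrays,
// so they are not mapped as single values.
val esDF = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "data.properties.priceObj")
  .load("index/doc_type")  // placeholder resource name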

Spark: 'Requested array size exceeds VM limit' when writing dataframe

I am running into a "OutOfMemoryError: Requested array size exceeds VM limit" error when running my Scala Spark job.
I'm running this job on an AWS EMR cluster with the following makeup:
Master: 1 m4.4xlarge 32 vCore, 64 GiB memory
Core: 1 r3.4xlarge 32 vCore, 122 GiB memory
The version of Spark I'm using is 2.2.1 on EMR release label 5.11.0.
I'm running my job in a spark shell with the following configurations:
spark-shell --conf spark.driver.memory=40G
--conf spark.driver.maxResultSize=25G
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.kryoserializer.buffer.max=2000
--conf spark.rpc.message.maxSize=2000
--conf spark.dynamicAllocation.enabled=true
What I'm attempting to do with this job is to convert a one column dataframe of objects into a one row dataframe that contains a list of those objects.
The objects are as follows:
case class Properties (id: String)
case class Geometry (`type`: String, coordinates: Seq[Seq[Seq[String]]])
case class Features (`type`: String, properties: Properties, geometry: Geometry)
And my dataframe schema is as follows:
root
|-- geometry: struct (nullable = true)
| |-- type: string (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: array (containsNull = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
|-- type: string (nullable = false)
|-- properties: struct (nullable = false)
| |-- id: string (nullable = true)
I'm converting it to a list and adding it to a one row dataframe like so:
val x = Seq(df.collect.toList)
final_df.withColumn("features", typedLit(x))
I don't run into any issues when creating this list and it's pretty quick. However, there seems to be a limit to the size of this list when I try to write it out by doing either of the following:
final_df.first
final_df.write.json(s"s3a://<PATH>/")
I've tried to also convert the list to a dataframe by doing the following, but it seems to never end.
val x = Seq(df.collect.toList)
val y = x.toDF
The largest list I've managed to get this dataframe to work with had 813318 Features objects, each containing a Geometry object with a list of 33 elements, for a total of 29491869 elements.
Attempting to write pretty much any list larger than that gives me the following stacktrace when running my job.
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 33028"...
os::fork_and_exec failed: Cannot allocate memory (12)
18/03/29 21:41:35 ERROR FileFormatWriter: Aborting job null.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.write(UnsafeArrayWriter.java:217)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply1_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd$lzycompute(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.rdd(LocalTableScanExec.scala:48)
at org.apache.spark.sql.execution.LocalTableScanExec.doExecute(LocalTableScanExec.scala:52)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
I've tried making a million configuration changes, including throwing both more driver and executor memory at this job, but to no avail. Is there any way around this? Any ideas?
The problem is here
val x = Seq(df.collect.toList)
When you do collect on a dataframe, it sends all the data of the dataframe to the driver. So if your dataframe is big, this will cause the driver to run out of memory.
Note that, of all the memory you assign, the heap the driver can actually use for this is generally around 30% (if not changed). So what is happening is that the driver is choking on the data volume brought in by the collect operation.
Now, you might think the dataframe is small on disk, but that is because the data is stored there in serialized form. When you collect, the dataframe is materialized as JVM objects, which causes memory usage to explode (generally 5-7x).
I would recommend removing the collect part and using the df dataframe directly, because I reckon
val x = Seq(df.collect.toList) and df hold essentially the same data.
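In other words, instead of collecting into a driver-side list and wrapping it back into a dataframe, the dataframe can be written out directly so the work stays distributed. A minimal sketch, reusing the placeholder path from the question:

// Instead of: val x = Seq(df.collect.toList) ... typedLit(x)
// write the distributed dataframe directly, with no driver-side materialization.
df.write.json("s3a://<PATH>/")  // same placeholder path as in the question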
Well, there is a dataframe aggregation function that does what you want without doing a collect on the driver. For example if you wanted to collect all "feature" columns by key: df.groupBy($"key").agg(collect_list("feature")), or if you really wanted to do that for the whole dataframe without grouping: df.agg(collect_list("feature")).
However, I wonder why you'd want to do that, when it seems easier to work with a dataframe that has one row per object than with a single row containing the entire result. Even using the collect_list aggregation function, I wouldn't be surprised if you still run out of memory.
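For reference, a minimal sketch of the two variants mentioned above (the column names "key" and "feature" are placeholders, not from the question's schema):

import org.apache.spark.sql.functions.{col, collect_list}

// Grouped: one output row per key, each holding the list of that key's features.
val grouped = df.groupBy(col("key")).agg(collect_list("feature").alias("features"))

// Whole dataframe: a single row holding every feature; that one row still has to
// fit in memory in a single task, so very large inputs can hit similar limits.
val single = df.agg(collect_list("feature").alias("features"))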

spark error in column type

I have a dataframe column, called 'SupplierId', typed as a string, containing a lot of digits but also some character strings
(ex: ['123','456','789',......,'abc']).
I formatted this column as a string using:
from pyspark.sql.types import StringType
df = df.withColumn('SupplierId', df['SupplierId'].cast(StringType()))
So I check it is treated as a string using:
df.printSchema()
and I get:
root
|-- SupplierId: string (nullable = true)
But when I try to convert to Pandas, or just to use df.collect(),
I obtain the following error:
An error occurred while calling o516.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, servername.ops.somecompany.local, executor 3):
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
Exception parsing 'CPD160001' into a IntegerType$ for column "SupplierId":
Unable to deserialize value using com.somecompany.spark.parsers.text.converters.IntegerConverter.
The value being deserialized was: CPD160001
So it seems Spark treats the values of this column as integers.
I have tried using a UDF to force the conversion to string in Python, but it still doesn't work.
Do you have any idea what could cause this error?
Please do share a sample of your actual data, as your issue cannot be reproduced with toy data:
spark.version
# u'2.2.0'
from pyspark.sql import Row
df = spark.createDataFrame([Row(1, 2, '3'),
Row(4, 5, 'a'),
Row(7, 8, '9')],
['x1', 'x2', 'id'])
df.printSchema()
# root
# |-- x1: long (nullable = true)
# |-- x2: long (nullable = true)
# |-- id: string (nullable = true)
df.collect()
# [Row(x1=1, x2=2, id=u'3'), Row(x1=4, x2=5, id=u'a'), Row(x1=7, x2=8, id=u'9')]
import pandas as pd
df_pandas = df.toPandas()
df_pandas
# x1 x2 id
# 0 1 2 3
# 1 4 5 a
# 2 7 8 9

Writing to Parquet/Kafka: Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError

I am trying to fix an out-of-memory issue I am seeing in my Spark setup, and at this point I am unable to reach a concrete analysis of why I am seeing it. I always see this issue when writing a dataframe to Parquet or Kafka. My dataframe has 5000 rows. Its schema is:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: array (nullable = true)
| |-- element: string (containsNull = true)
|-- E: array (nullable = true)
| |-- element: string (containsNull = true)
|-- F: double (nullable = true)
|-- G: array (nullable = true)
| |-- element: double (containsNull = true)
|-- H: integer (nullable = true)
|-- I: double (nullable = true)
|-- J: double (nullable = true)
|-- K: array (nullable = true)
| |-- element: double (containsNull = false)
Of these, column G can have a cell size of up to 16 MB. My dataframe's total size is about 10 GB, partitioned into 12 partitions. Before writing, I attempt to create 48 partitions out of this using repartition(), but the issue is seen even if I write without repartitioning. At the time of this exception, I have only one dataframe cached, with a size of about 10 GB. My driver has 19 GB of free memory and the 2 executors have 8 GB of free memory each. The Spark version is 2.1.0.cloudera1 and the Scala version is 2.11.8.
I have the below settings:
spark.driver.memory 35G
spark.executor.memory 25G
spark.executor.instances 2
spark.executor.cores 3
spark.driver.maxResultSize 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.rdd.compress true
spark.rpc.message.maxSize 2046
spark.yarn.executor.memoryOverhead 4096
The exception traceback is
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Any insights?
We finally found the issue. We were running k-fold logistic regression in Scala on a 5000-row dataframe with k = 4. After the classification was done, we had 4 test-output dataframes of 1250 rows each, each partitioned into at least 200 partitions. So in all we had more than 800 partitions on 5000 rows of data. The code would then repartition this data into 48 partitions. Our system couldn't handle this repartition, probably because of the shuffle. We fixed it by repartitioning each fold's output dataframe to a smaller number of partitions (instead of repartitioning the combined dataframe), and that resolved the issue.
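A minimal sketch of that change (the variable names and partition counts below are hypothetical, not taken from the actual job):

import org.apache.spark.sql.DataFrame

// Hypothetical: the 4 test-output dataframes from the k-fold run,
// each of ~1250 rows spread over 200+ partitions.
val foldOutputs: Seq[DataFrame] = ???

// Repartition each fold's output separately to a small number of partitions,
// rather than repartitioning the combined 800+-partition dataframe in one shot.
val combined = foldOutputs.map(_.repartition(12)).reduce(_ union _)
// The union of four 12-partition dataframes already has ~48 partitions,
// so no further repartition of the combined data is needed.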

Matching two dataframes in scala

I have two RDDs in Scala and converted them to dataframes.
Now I have two dataframes. The first, prodUniqueDF, has two columns named prodid and uid and holds the product master data:
scala> prodUniqueDF.printSchema
root
|-- prodid: string (nullable = true)
|-- uid: long (nullable = false)
The second, ratingsDF, has columns named prodid, custid and ratings:
scala> ratingsDF.printSchema
root
|-- prodid: string (nullable = true)
|-- custid: string (nullable = true)
|-- ratings: integer (nullable = false)
I want to join the two and replace ratingsDF.prodid with prodUniqueDF.uid in ratingsDF.
To do this, I first registered them as temp tables:
prodUniqueDF.registerTempTable("prodUniqueDF")
ratingsDF.registerTempTable("ratingsDF")
And I run the code
val testSql = sql("SELECT prodUniqueDF.uid, ratingsDF.custid, ratingsDF.ratings FROM prodUniqueDF, ratingsDF WHERE prodUniqueDF.prodid = ratingsDF.prodid")
But I get the following error:
org.apache.spark.sql.AnalysisException: Table not found: prodUniqueDF; line 1 pos 66
Please help! How can I achieve the join? Is there another method to map RDDs instead?
Joining the DataFrames can easily be achieved. The format is:
DataFrameA.join(DataFrameB)
By default it performs an inner join, but you can also specify the type of join you want; there are APIs for that.
You can look here for more information.
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
To replace the values in an existing column you can use the withColumn method from the API.
It would be something like this:
val newDF = dfA.withColumn("newColumnName", dfB("columnName")).drop("columnName").withColumnRenamed("newColumnName", "columnName")
I think this might do the trick !
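Putting the join and the column replacement together for the schemas in the question, a minimal sketch (untested, using only the column names given above) could be:

import org.apache.spark.sql.functions.col

// Inner join on the shared prodid column, then expose uid as the new prodid.
val replacedDF = ratingsDF
  .join(prodUniqueDF, Seq("prodid"))
  .select(
    col("uid").alias("prodid"),  // prodUniqueDF.uid replaces the original prodid
    col("custid"),
    col("ratings"))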