Unsupported UCS-4 endianness (3412) detected - Scala/Spark

I was running code on Spark/Scala and got an error I don't understand:
Error: Caused by: java.io.CharConversionException: Unsupported UCS-4 endianness (3412) detected
I imported some files, stored each one in a variable, and passed it as a parameter to a function. Some files work, but some don't.
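This exception is typically thrown by Jackson's byte-order-mark detection while Spark parses JSON: the first bytes of the failing files look like middle-endian (3412) UCS-4, which usually means those files are not in the encoding the parser expects. A minimal sketch of a possible fix, assuming the failing files are JSON and are actually UTF-16LE (both assumptions; the path is hypothetical):
// Force the encoding instead of letting the parser guess it from the BOM.
// The JSON reader's "encoding" option exists since Spark 2.4; non-UTF-8
// encodings generally need multiLine mode (or an explicit lineSep).
val df = spark.read
  .option("encoding", "UTF-16LE") // match the real file encoding
  .option("multiLine", "true")
  .json("/data/input/some_file.json") // hypothetical path
If the encodings vary from file to file, re-encoding everything to UTF-8 before ingestion is the more robust route.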

Related

Spark 3.0 scala.None$ is not a valid external type for schema of string

While using the elasticsearch-hadoop library to read an Elasticsearch index with an empty attribute, I get the exception
Caused by: java.lang.RuntimeException: scala.None$ is not a valid external type for schema of string
There is an open defect on GitHub for this, with steps to reproduce: https://github.com/elastic/elasticsearch-hadoop/issues/1635
Spark: 3.1.1
Elasticsearch-Hadoop: elasticsearch-spark-30_2.12-7.12.0
Elasticsearch: 2.3.4
It worked after setting the elasticsearch-hadoop property es.field.read.empty.as.null = no:
.option("es.field.read.empty.as.null", "no")
From the Elasticsearch documentation:
es.field.read.empty.as.null (default yes)
Whether elasticsearch-hadoop will treat empty fields as null.
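For context, a hedged sketch of where that option fits into a full read (the node address and index name here are hypothetical):
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")        // hypothetical cluster address
  .option("es.field.read.empty.as.null", "no") // the fix from above
  .load("my-index")                            // hypothetical index name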

Apache Flink Kryo serializer - ClassNotFoundException

I have a project in Apache Flink 1.8.1, with Scala 2.11 and Java 8. I used to use Maven for compilation and all the dependency management, but switched to Gradle, which led me to the problem below:
j.l.ClassNotFoundException: om.tinker.my.project.ProjectPayload
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
... 3 frames excluded
at c.e.k.u.DefaultClassResolver.readName(DefaultClassResolver.java:172)
... 15 common frames omitted
Wrapped by: c.e.kryo.KryoException: Unable to find class: om.tinker.my.project.ProjectPayload
Serialization trace:
eventOutputTag (com.my.project.contexts.ProjectContext)
at c.e.k.u.DefaultClassResolver.readName(DefaultClassResolver.java:178)
at c.e.k.u.DefaultClassResolver.readClass(DefaultClassResolver.java:147)
at c.e.kryo.Kryo.readClass(Kryo.java:674)
at c.e.k.s.ReflectField.read(ReflectField.java:107)
at c.e.k.s.FieldSerializer.read(FieldSerializer.java:122)
at c.e.kryo.Kryo.readClassAndObject(Kryo.java:793)
at o.a.f.a.j.t.r.k.KryoSerializer.deserialize(KryoSerializer.java:346)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:202)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at o.a.f.r.p.NonReusingDeserializationDelegate.read(NonReusin...
First, the error message is missing a 'c': the class name should be 'com.tinker.my.project.ProjectPayload'. I checked the files using that class and there's no missing 'c' in my import statements...
I also edited the Flink conf file to use a parent-first classloading strategy...
Further background info:
I have another class called ProjectContext which has an ArrayList<ProjectPayload>. It also has the eventOutputTag (as mentioned in the serialization trace)... When I comment out the ArrayList<ProjectPayload> and its getters/setters, EVERYTHING WORKS!
When I put back the instance variable and its getters/setters in ProjectContext, the ClassNotFoundException occurs...
Furthermore, I sprinkled in tons of print statements, and I was able to create an instance of ProjectPayload and log it out fine.
### Edit (June 30, 2020) ###
In light of this serialization issue, I added this code:
env.getConfig.registerTypeWithKryoSerializer(classOf[ProjectPayload], classOf[JavaSerializer[ProjectPayload]])
and now I have this awkward (but similar) error:
"j.l.ClassNotFoundException: \u0005sr\u00008com.tinker.my.project.ProjectPayload+\"v
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
... 3 frames excluded
at c.e.k.u.DefaultClassResolver.readName(DefaultClassResolver.java:172)
... 15 common frames omitted
Wrapped by: c.e.kryo.KryoException: Unable to find class: \u0005sr\u00008com.tinker.my.project.ProjectPayload+\"v
Serialization trace:
allMyPayloads (com.tinker.my.project.ProjectContext)
at c.e.k.u.DefaultClassResolver.readName(DefaultClassResolver.java:178)
at c.e.k.u.DefaultClassResolver.readClass(DefaultClassResolver.java:147)
at c.e.kryo.Kryo.readClass(Kryo.java:674)
at c.e.k.s.ReflectField.read(ReflectField.java:107)
at c.e.k.s.FieldSerializer.read(FieldSerializer.java:122)
at c.e.kryo.Kryo.readClassAndObject(Kryo.java:793)
at o.a.f.a.j.t.r.k.KryoSerializer.deserialize(KryoSerializer.java:346)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:202)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at o.a.f.r.p.NonReusingDeserializationDelegate....
It turns out \u0005 is the Unicode character 'ENQUIRY', and \u00008 leads to gibberish in Google search results... will report back later. (Those bytes look like the header of a Java serialization stream being read as a Kryo class name, which would suggest the writer and the reader are not agreeing on which serializer to use.)
### Edit (July 1, 2020) ###
Some progress: I was initializing the ArrayList<ProjectPayload> inside ProjectContext. When I removed that initialization, moved it outside, and then set the ArrayList value, my code got much further along (a rough before/after sketch follows). It then complained about a HashMap<String, String> instance variable as well -- I ended up deleting it since it wasn't used.
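Roughly, the change looks like this (class and field names are from this question; everything else is assumed for illustration):
class ProjectPayload // stub for illustration; the real class is mine
class ProjectContext {
  // Before (assumed), this field was initialized inline:
  //   var allMyPayloads = new java.util.ArrayList[ProjectPayload]()
  // After: leave it unset here...
  var allMyPayloads: java.util.ArrayList[ProjectPayload] = _
}
// ...and initialize it from outside, after construction:
val ctx = new ProjectContext()
ctx.allMyPayloads = new java.util.ArrayList[ProjectPayload]()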
That now brings me to an IndexOutOfBoundsException:
j.l.IndexOutOfBoundsException: Index: 93, Size: 9
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at c.e.k.u.MapReferenceResolver.getReadObject(MapReferenceResolver.java:62)
at c.e.kryo.Kryo.readReferenceOrNull(Kryo.java:838)
at c.e.kryo.Kryo.readObjectOrNull(Kryo.java:761)
at c.e.k.s.ReflectField.read(ReflectField.java:120)
... 12 common frames omitted
Wrapped by: c.e.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 93, Size: 9
Serialization trace:
fooBarStr (com.tinker.my.project.contexts.ProjectContext)
at c.e.k.s.ReflectField.read(ReflectField.java:133)
at c.e.k.s.FieldSerializer.read(FieldSerializer.java:122)
at c.e.kryo.Kryo.readClassAndObject(Kryo.java:793)
at o.a.f.a.j.t.r.k.KryoSerializer.deserialize(KryoSerializer.java:346)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:202)
at o.a.f.s.r.s.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at o.a.f.r.p.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at o.a.f.r.i.n.a.s.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRec...
and this GitHub issue on Kryo: https://github.com/EsotericSoftware/kryo/issues/456
Try this:
env.getConfig.registerTypeWithKryoSerializer(classOf[ProjectPayload], classOf[JavaSerializer[ProjectPayload]])
env.getConfig.registerTypeWithKryoSerializer(classOf[ProjectContext], classOf[JavaSerializer[ProjectContext]])
and make sure you are importing org.apache.flink.api.java.typeutils.runtime.kryo.JavaSerializer
https://ci.apache.org/projects/flink/flink-docs-stable/dev/custom_serializers.html#issue-with-using-kryos-javaserializer
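Put together, the fix looks roughly like this (a hedged sketch: the environment setup is assumed, and the class names are the asker's):
import org.apache.flink.api.java.typeutils.runtime.kryo.JavaSerializer
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Route both types through Java serialization instead of Kryo's
// field-by-field serializer, which was mis-reading the stream.
env.getConfig.registerTypeWithKryoSerializer(classOf[ProjectPayload], classOf[JavaSerializer[ProjectPayload]])
env.getConfig.registerTypeWithKryoSerializer(classOf[ProjectContext], classOf[JavaSerializer[ProjectContext]])
The import matters: Flink ships its own JavaSerializer because Kryo's built-in one is known to hit classloading problems inside Flink jobs, which is what the linked docs page describes.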

databricks-connect, py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache

The connection to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I work with Java 8 as required; clearing pycache doesn't help.
The same code submitted as a job to Databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and the Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions: databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
Behavior depends on how the DF is created: if the source of the DF is external, it works fine; if the DF is created locally, the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. It was seen on Azure; I am not sure whether you are using Azure or AWS, but it's solved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431

How to save data-frame in MySQL using PySpark

I am new to Apache Spark. I have a use case where I have to save DataFrame data in MySQL. I found the below code to do this:
data_frame.write.format('jdbc').options(
    url='URI',
    driver='com.mysql.jdbc.Driver',
    dbtable=table_name,
    user=user_name,
    password='your_password').mode('append').save()
But when I ran the code, I got the below error:
File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
I might be missing some very minute detail. How can I fix this?
The error clearly indicates that Spark is not able to locate the JDBC driver class. You will have to include the JAR file containing com.mysql.jdbc.Driver using
pyspark --jars <jar-file-location>
See this question - How to add third-party Java JAR files for use in PySpark.
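For example (the JAR path and version below are hypothetical; point --jars at wherever your MySQL Connector/J JAR actually lives):
pyspark --jars /path/to/mysql-connector-java-5.1.49.jar
# or, when submitting an application:
spark-submit --jars /path/to/mysql-connector-java-5.1.49.jar my_script.py
Note that com.mysql.jdbc.Driver is the legacy driver class name; with MySQL Connector/J 8.x the class is com.mysql.cj.jdbc.Driver.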

Quartz JobPersistenceException with HSQLDB

I'm trying to run a scheduler using Quartz. I use HSQLDB (version 2.2.7), but I get the following exception:
Caused by: org.quartz.JobPersistenceException: Couldn't store job: data exception: string data, right truncation [See nested exception: java.sql.SQLDataException: data exception: string data, right truncation]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJob(JobStoreSupport.java:1132)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport$3.execute(JobStoreSupport.java:1071)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport$40.execute(JobStoreSupport.java:3716)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3788)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreTX.executeInLock(JobStoreTX.java:90)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInLock(JobStoreSupport.java:3712)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJobAndTrigger(JobStoreSupport.java:1059)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:822)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:243)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at com.aepona.ase.services.terminalstatus.utils.SimpleTriggerExample.afterPropertiesSet(SimpleTriggerExample.java:32)[320:com.aepona.ase.services.terminalstatus:3.0.7.VFB-SNAPSHOT]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1514)[439:org.springframework.beans:3.1.4.RELEASE]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1452)[439:org.springframework.beans:3.1.4.RELEASE]
... 14 more
Caused by: java.sql.SQLDataException: data exception: string data, right truncation
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102)[301:org.apache.servicemix.bundles.commons-dbcp:1.2.2.7]
at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.insertJobDetail(StdJDBCDelegate.java:530)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJob(JobStoreSupport.java:1126)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
Is anyone familiar with this issue?
I am using quartz-scheduler 1.8.6. The DB script that ships with it for HSQLDB uses the BINARY data type in the qrtz_job_details, qrtz_triggers, qrtz_calendars and qrtz_blob_triggers tables. I just changed BINARY to BLOB and the problem was solved.
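For an existing schema, the change looks roughly like this (a hedged sketch: the column names follow the standard Quartz 1.8 HSQLDB script and assume the default qrtz_ table prefix):
-- Widen the serialized-data columns from BINARY to BLOB so the stored
-- blobs are no longer right-truncated on insert.
ALTER TABLE qrtz_job_details ALTER COLUMN job_data SET DATA TYPE BLOB;
ALTER TABLE qrtz_triggers ALTER COLUMN job_data SET DATA TYPE BLOB;
ALTER TABLE qrtz_calendars ALTER COLUMN calendar SET DATA TYPE BLOB;
ALTER TABLE qrtz_blob_triggers ALTER COLUMN blob_data SET DATA TYPE BLOB;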