Parsing JSON with a Hive UDF in Scala

I'm trying to write a Scala UDF for Hive which acts on a JSON array, extending org.apache.hadoop.hive.ql.exec.UDF and relying on play-json's play.api.libs.json.parse.
When attempting to call this from within Hive, I see java.lang.NoSuchMethodError: com.fasterxml.jackson.core.JsonToken.id()I.
I'm not sure what the cause is here -- is it some incompatibility between Jackson versions, and if so, how can I work around it?
The only component/version that I'm tied to is Hive 1.2.
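For reference, such a UDF looks roughly like this (a reconstructed sketch, not the actual code from this question; the class name and the exact play-json calls are assumptions):
    import org.apache.hadoop.hive.ql.exec.UDF
    import play.api.libs.json.{JsArray, Json}

    // Reconstruction for illustration: a Hive UDF that parses a JSON array column.
    class JsonArrayLength extends UDF {
      // Hive resolves evaluate() by reflection; a String/Int signature keeps it simple.
      def evaluate(json: String): Int = {
        if (json == null) return 0
        Json.parse(json) match {               // the NoSuchMethodError surfaces here at runtime,
          case JsArray(values) => values.size  // because play-json delegates parsing to Jackson
          case _               => 0
        }
      }
    }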

Take a look at the JSON UDFs in Brickhouse (http://github.com/klout/brickhouse). Brickhouse has the UDFs to_json and from_json, as well as the convenience functions json_map and json_split to deal directly with maps and arrays.
Regarding your versioning problem, Brickhouse uses Jackson under the covers (version 1.8.8, among others), and I haven't come across this particular issue.

The guess that this is a Jackson incompatibility makes sense.
Hive 1.2 uses Jackson 1.9.2, while recent releases of Play-JSON (i.e. from the last couple of years) depend on later Jackson versions.
If reverting to an old enough version of Play-JSON doesn't make sense, then perhaps the simplest workaround would be to use a Scala JSON parsing library that doesn't depend on Jackson; Rapture JSON can be used with multiple backends and so might be a good choice.
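As a sketch of that workaround: the same kind of UDF written against a parser with no Jackson dependency. spray-json is used here purely as one example of such a library (the Rapture JSON suggestion above would follow the same pattern), and the class name is illustrative:
    import org.apache.hadoop.hive.ql.exec.UDF
    import spray.json._

    // Illustrative sketch: extract the first element of a JSON array without touching Jackson.
    class JsonArrayFirst extends UDF {
      def evaluate(json: String): String = {
        if (json == null) return null
        json.parseJson match {
          case JsArray(elements) if elements.nonEmpty =>
            elements.head match {
              case JsString(s) => s            // unwrap plain strings
              case other       => other.compactPrint
            }
          case _ => null                       // not an array, or an empty one
        }
      }
    }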

Related

Random Avro data generator in Java/Scala

Is it possible to generate random Avro data for a specified schema using the org.apache.avro library?
I need to produce this data with Kafka.
I tried to find some kind of random data generator for testing; however, I have only stumbled upon tools for such data generation or plain GenericRecord usage. The tools are not very suitable for me, as there is a specific file dependency (reading a file and so on), and GenericRecords, as I understand it, have to be generated one by one.
Are there any other solutions for Java/Scala?
UPDATE: I have found this class, but it does not seem to be accessible from org.apache.avro version 1.8.2.
The reason you need to read a file is that the file contains a Schema, which defines the fields that need to be created and their types.
That is not a hard requirement, though; nothing prevents you from creating random Generic or Specific Records built in code via Avro's SchemaBuilder class.
See this repo for an example; it generates a POJO (a Java class) from an AVSC schema, which again could be done with SchemaBuilder instead.
Even the class you linked to uses a schema file.
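A minimal sketch of that idea, assuming a made-up two-field record; the schema is built entirely in code with SchemaBuilder and the values come from scala.util.Random:
    import org.apache.avro.SchemaBuilder
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import scala.util.Random

    object RandomRecords {
      // Schema defined in code, no .avsc file involved.
      val schema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord()

      // One random GenericRecord conforming to that schema.
      def randomUser(): GenericRecord = {
        val rec = new GenericData.Record(schema)
        rec.put("name", Random.alphanumeric.take(8).mkString)
        rec.put("age", Random.nextInt(100))
        rec
      }
    }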
So I personally would probably use avro4s (https://github.com/sksamuel/avro4s) in conjunction with ScalaCheck's (https://www.scalacheck.org) Gen to model such tests.
You could use ScalaCheck to generate random instances of case classes and avro4s to convert them to generic records, extract their schema, etc.
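A rough sketch of that combination, assuming a made-up case class; the exact avro4s imports may vary between versions:
    import com.sksamuel.avro4s.{AvroSchema, RecordFormat}
    import org.apache.avro.generic.GenericRecord
    import org.scalacheck.Gen

    case class User(name: String, age: Int)

    object RandomAvro {
      // ScalaCheck generator for random case class instances.
      val userGen: Gen[User] = for {
        name <- Gen.alphaStr
        age  <- Gen.chooseNum(0, 99)
      } yield User(name, age)

      val schema = AvroSchema[User]      // Avro schema derived from the case class
      val format = RecordFormat[User]    // converts User <-> GenericRecord

      // A batch of random GenericRecords, e.g. to feed a Kafka producer.
      def randomRecords(n: Int): List[GenericRecord] =
        Gen.listOfN(n, userGen).sample.getOrElse(Nil).map(format.to)
    }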
There's also avro-mocker (https://github.com/speedment/avro-mocker), though I don't know how easy it is to hook into the code.
I'd just use Podam (http://mtedone.github.io/podam/) to generate POJOs and then output them to Avro using the Java Avro library (https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Serializing).

Does Apache Spark 2.2 support user-defined types (UDT)?

From this JIRA ticket, Hide UserDefinedType in Spark 2.0, it seems that Spark has hidden the UDT API since version 2.0.
Is there an alternative function or API we can use in version 2.2 to define a UserDefinedType? I wish to use a custom type in a DataFrame or in Structured Streaming.
There is no alternative API and UDT remains private (https://issues.apache.org/jira/browse/SPARK-7768).
Generic encoders (org.apache.spark.sql.Encoders.kryo and org.apache.spark.sql.Encoders.javaSerialization) serve a similar purpose for Datasets, but they are not a direct replacement (a short sketch of the Kryo-encoder approach follows the links below):
How to store custom objects in Dataset?
Questions about the future of UDTs and Encoders
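For illustration, the generic-encoder workaround looks roughly like this (class and field names are made up; note that the Kryo encoder stores the whole object in a single binary column, so you lose the columnar view a real UDT would give):
    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    // A custom type with no UDT support.
    class Point(val x: Double, val y: Double)

    object KryoEncoderExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("udt-workaround").getOrCreate()
        import spark.implicits._

        // Generic encoder: Point is serialized with Kryo into one binary column.
        implicit val pointEnc: Encoder[Point] = Encoders.kryo[Point]

        val ds = spark.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))
        ds.map(p => p.x + p.y).show()    // typed operations still work

        spark.stop()
      }
    }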

Can I customize the data serialization mechanism in Spark

In my project, I need to ship plenty of existing Java objects to Spark workers, and most of them do not extend java.io.Serializable. I also want the ability to control which variables/attributes of an object are included. I only want to serialize the useful attributes, not everything in an object.
The Spark documentation indicates that there are two ways to serialize an object in Spark: java.io.Serializable or Kryo. I think both ways would require writing a bunch of wrappers or extra code for every business object. However, my current code base has already implemented protocol-buffers serialization. I am wondering if there is any way to embed this serialization mechanism into Spark.
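A hypothetical sketch of one way to do that (not from the question): plug the existing protobuf encoding into Spark through a custom Kryo serializer. MyProto stands in for a protobuf-generated message class and is an assumption for illustration only:
    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.serializer.KryoRegistrator

    // Stand-in for a protobuf-generated class; a real one already provides toByteArray/parseFrom.
    class MyProto(val payload: Array[Byte]) {
      def toByteArray: Array[Byte] = payload
    }
    object MyProto {
      def parseFrom(bytes: Array[Byte]): MyProto = new MyProto(bytes)
    }

    class MyProtoKryoSerializer extends Serializer[MyProto] {
      override def write(kryo: Kryo, output: Output, obj: MyProto): Unit = {
        val bytes = obj.toByteArray            // protobuf decides which fields are included
        output.writeInt(bytes.length)
        output.writeBytes(bytes)
      }
      override def read(kryo: Kryo, input: Input, clazz: Class[MyProto]): MyProto = {
        val length = input.readInt()
        MyProto.parseFrom(input.readBytes(length))
      }
    }

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[MyProto], new MyProtoKryoSerializer)
    }

    // Enabled on the SparkConf:
    //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   .set("spark.kryo.registrator", classOf[ProtoRegistrator].getName)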

Does Apache Thrift work with Scala

I am new to Apache Thrift, and I am familiar with Scala. But I have not seen any example or reference on the internet saying it supports Scala.
Can someone tell me whether there is a way to work with Scala on Thrift? Thank you.
Yes, Thrift works with Scala smoothly. That's not surprising, since Scala runs on the JVM. One open-source example is Twitter's Scalding, a Scala DSL for Cascading. In Scalding, one can handle various Cascading flows whose tuples are instances of Thrift-based classes.
See LongThriftTransformer for an example.
I don't mean to be rude, but the very first link for the Google query "apache thrift scala" turns up Scrooge, which notes that, thanks to the interoperability between Scala and Java, ordinary thrift-java will work fine, and which is itself a way to work with Thrift in a Scala-native way.
So yes, there are ways to work with Thrift in Scala.
Twitter's 'Scrooge' is meant as a replacement for the standard Thrift generator and generates Java and Scala code. It works with SBT and seems relatively mature.

Scala, Morphia and Enumeration

I need to store a Scala class in Morphia. With annotations it works well unless I try to store a collection of _ <: Enumeration.
Morphia complains that it does not have serializers for that type, and I am wondering how to provide one. For now I changed the type of the collection to Seq[String] and fill it by invoking toString on every item in the collection.
That works well; however, I'm not sure if that is the right way.
This problem is common to several of the available abstraction layers on top of MongoDB. It all comes back to one root cause: there is no enum equivalent in JSON/BSON. Salat, for example, has the same problem.
In fact, the MongoDB Java driver does not support enums, as you can read in the discussion here: https://jira.mongodb.org/browse/JAVA-268, where you can see the problem is still open. Most of the frameworks I have seen for using MongoDB with Java do not implement low-level functionality such as this. I think this choice makes a lot of sense, because it leaves you the choice of how to deal with data structures not handled by the low-level driver, instead of imposing a way to do it.
In general I feel that the absence of support comes not from a technical limitation but rather from a design choice. For enums there are multiple ways to map them, each with its pros and cons, while for other data types the situation is probably simpler. I don't know the MongoDB Java driver in detail, but I guess supporting multiple "modes" would have required some refactoring (maybe that's why they are talking about a new version of serialization?).
These are two strategies I am thinking about:
If you want to index on an enum and minimize space usage, you would map the enum to an integer (not the ordinal, please; see "can set enum start value in java").
If your concern is queryability from the mongo shell, because your data will be accessed by data scientists, you would rather store the enum using its string value.
To conclude, there is nothing wrong with adding an intermediate data structure between your native object and MongoDB. Salat supports it through CustomTransformers; in Morphia you may need to do the conversion explicitly. Go for it.
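As a small sketch of both strategies (the enum and its values here are made up): explicit ids give an integer mapping that does not depend on declaration order, and toString/withName give the string mapping the question already uses:
    object Color extends Enumeration {
      // Explicit ids, so reordering the declarations never changes the stored values.
      val Red   = Value(1, "Red")
      val Green = Value(2, "Green")
      val Blue  = Value(3, "Blue")
    }

    object ColorCodec {
      // Integer mapping: compact and index-friendly.
      def toInt(c: Color.Value): Int       = c.id
      def fromInt(i: Int): Color.Value     = Color(i)

      // String mapping: readable from the mongo shell.
      def toName(c: Color.Value): String   = c.toString
      def fromName(s: String): Color.Value = Color.withName(s)
    }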