Pig's AvroStorage LOAD removes unicode chars from input

I am using Pig to read Avro files and normalize/transform the data before writing it back out. The Avro files have records of the form:
{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : "The key"
  }, {
    "name" : "value",
    "type" : {
      "type" : "map",
      "values" : "bytes"
    },
    "doc" : "The value"
  } ]
}
I have used the avro-tools command-line utility in conjunction with jq to dump the first such record to JSON:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
  "key": "some-record-uuid",
  "value": {
    "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
  }
}
I run the following pig commands:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
The corresponding record for the last command above is as follows:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
As you can see, the unicode chars have been removed from the pf_v value. These characters are actually used as delimiters within the values, so I will need them in order to fully parse the records into their desired normalized state. The unicode characters are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Is anybody aware of a way to get AvroStorage to not remove the unicode chars when loading records?
Thank you!
Update:
I have also performed the same operation using Avro's python DataFileReader:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break
reader.close()
This prints a dict in which the unicode chars show up as hex escapes rather than being removed (which is preferable to stripping them entirely):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
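Given that the Python reader preserves those characters, parsing becomes straightforward. A minimal sketch, assuming (from the dumps above) that \x04 separates entries, \x03 separates the version tag from the name, \x01 separates the name from the trailing code, and \x1a is a terminator:
# Rough parsing sketch; the delimiter roles are inferred from the dumped records
# above and may need adjusting for other values.
pf_v = 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'
for entry in filter(None, pf_v.split('\x04')):
    entry = entry.rstrip('\x1a')            # drop the terminator
    tag, _, rest = entry.partition('\x03')  # e.g. 'v1' / 'Basic\x01slcvi'
    name, _, code = rest.partition('\x01')
    print("%s | %s | %s" % (tag, name, code))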

Related

Issue while parsing a Mongo collection that has multiple schemas in Spark

I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent (I mean that a single collection contains several schemas with different data types and small variations). When I try to read the data with Spark, the sampling is unable to capture all the schemas in the data and throws the error below. (I have a complex schema which I can't specify explicitly, so Spark infers it by sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it as an RDD, but the issue still persists.
Any help on this?
Thanks.
All these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If a particular field has lots of NULLs, you can get unlucky and the type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference.
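For example, with the MongoDB Spark connector the number of sampled documents can (as far as I remember) be raised through its sampleSize read option; a minimal sketch with a placeholder URI:
# Sketch only: "sampleSize" is the connector's read option controlling how many
# documents are sampled for schema inference; the URI below is a placeholder.
collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://localhost:27017/test.my_collection") \
    .option("sampleSize", 50000) \
    .load()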
Another option is to get the inferred schema first, fix it, and reload the dataframe with the fixed schema:
import pyspark.sql.types

def fix_spark_schema(schema):
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema

collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case all problematic fields could be represented with StringType; you might make the logic more complex if needed.
As far as I understood your problem:
* either Spark incorrectly detected your schema and considered some fields as required (nullable = false); in that case, you can still define the schema explicitly and set nullable to true. This would work if your schema has been evolving and at some point in the past you added or removed a field while keeping the column types (e.g. a String is always a String and never becomes a Struct or some other completely different type),
* or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or some other completely different type. In that case I don't see any solution other than using the RDD abstraction, working with very permissive types such as Any in Scala (Object in Java), and using isInstanceOf tests to normalize all fields into one common format (a rough sketch of that idea follows this list).
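As a rough illustration of that second case, here is what the per-record normalization could look like, assuming you can get each document as a plain Python dict (the field name fees and the target list-of-dicts shape are just assumptions for the example):
# Sketch: force an inconsistent field into one common shape (a list of sub-documents).
def normalize(doc):
    fees = doc.get("fees")
    if isinstance(fees, dict):
        fees = [fees]   # a single embedded document becomes a one-element list
    elif not isinstance(fees, list):
        fees = []       # null or any other unexpected type becomes an empty list
    doc["fees"] = fees
    return doc

# applied per record, e.g. normalized_rdd = raw_rdd.map(normalize)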
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that data between 2018-01-01 and 2018-02-01 uses schema#1 and the rest uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Later you could simply union both datasets and apply your transformations to consistently structured values.
Edit:
I've just tried code similar to yours and it worked correctly against my local MongoDB instance:
val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))
// Load sample data
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.stripMargin.split("[\\r\\n]+").toSeq

MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in MongoDB shell I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }

Handling double quotes in parameter with @Query annotation

I have a query implemented in my Elasticsearch repository using the @Query annotation. One of the fields in the query is a String, and I found that when I attempt to execute the query with a parameter that has double quotes in it, the query fails to parse.
It appears that the ElasticsearchStringQuery class (around line 88) simply substitutes the parameter value into the query string without escaping any quotes in the parameter. Am I doing something wrong, or is this just a known limitation? If it is known, it seems odd that the caller must escape the quotes in the parameter before executing the finder method on the repository. This litters my code with double-quote stripping or escaping and leaks the query format externally (i.e. the fact that double quotes are used to define the query and therefore can't be used in parameters).
An example query:
@Query("{ \"filtered\" : { \"query\" : { \"bool\" : { \"must\" : [ { \"match\" : { \"programTitle\" : { \"query\" : \"?0\", \"type\" : \"boolean\", \"operator\" : \"AND\", \"fuzziness\" : \"0\" } } }, { \"match\" : { \"title\" : { \"query\" : \"?2\", \"type\" : \"boolean\", \"operator\" : \"AND\",\"fuzziness\" : \"0\" } } } ] } }, \"filter\" : { \"term\" : { \"programAirDate\" : \"?1\" } } }}")
The query is being executed with the test case:
repository.save(piece);
List<Piece> actual = repository.findByProgramTitleAndProgramAirDateAndTitle(
piece.getProgramTitle(),
airDate.format(DateTimeFormatter.ISO_INSTANT),
piece.getTitle(), new PageRequest(0, 1));
If piece.getTitle() contains double quotes such as with Franz Joseph Haydn: String Quartet No. 53 "The Lark", the query fails with the error:
at [Source: [B#1696066a; line: 1, column: 410]]; }{[7x1Q86yNTdy0a0zHarqYVA][metapub][4]: SearchParseException[[metapub][4]: from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query_binary":"eyAgImZpbHRlcmVkIiA6IHsgICAgInF1ZXJ5IiA6IHsgICAgICAiYm9vbCIgOiB7ICAgICAgICAibXVzdCIgOiBbIHsgICAgICAgICAgIm1hdGNoIiA6IHsgICAgICAgICAgICAicHJvZ3JhbVRpdGxlIiA6IHsgICAgICAgICAgICAgICJxdWVyeSIgOiAiQmVzdCBQcm9ncmFtIEV2ZXIiLCAgICAgICAgICAgICAgInR5cGUiIDogImJvb2xlYW4iLCAgICAgICAgICAgICAgIm9wZXJhdG9yIiA6ICJBTkQiLCAgICAgICAgICAgICAgImZ1enppbmVzcyIgOiAiMCIgICAgICAgICAgICB9ICAgICAgICAgIH0gICAgICAgIH0sIHsgICAgICAgICAgIm1hdGNoIiA6IHsgICAgICAgICAgICAidGl0bGUiIDogeyAgICAgICAgICAgICAgInF1ZXJ5IiA6ICJGcmFueiBKb3NlcGggSGF5ZG46IFN0cmluZyBRdWFydGV0IE5vLiA1MyAiVGhlIExhcmsiIiwgICAgICAgICAgICAgICJ0eXBlIiA6ICJib29sZWFuIiwgICAgICAgICAgICAgICJvcGVyYXRvciIgOiAiQU5EIiwgICAgICAgICAgICAgICJmdXp6aW5lc3MiIDogIjAiICAgICAgICAgICAgfSAgICAgICAgICB9ICAgICAgICB9IF0gICAgICB9ICAgIH0sICAgICJmaWx0ZXIiIDogeyAgICAgICJ0ZXJtIiA6IHsgICAgICAgICJwcm9ncmFtQWlyRGF0ZSIgOiAiMjAxNi0wNC0wMVQxMzowMDowMFoiICAgICAgfSAgICB9ICB9fQ=="}]]];
nested: QueryParsingException[[metapub] Failed to parse]; nested: JsonParseException[Unexpected character ('T' (code 84)): was expecting comma to separate OBJECT entries
Decoding the base64 query I can see that the double quotes are causing the issue:
... "title" : { "query" : "Franz Joseph Haydn: String Quartet No. 53 "The Lark"", "type" : "boolean", "operator" : "AND", ...
I can make things work by removing all the double quotes (for example, piece.getTitle().replaceAll("\"", "")) but that seems like something the framework should handle. When using a JDBC repository, the user doesn't have to remove/replace special characters because the JDBC driver does it automatically.
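For what it's worth, the escaping the framework would need to do is plain JSON string escaping; a quick Python illustration of the difference between raw substitution and escaping the value first (the title mirrors the example above):
import json

title = 'Franz Joseph Haydn: String Quartet No. 53 "The Lark"'
template = '{ "match" : { "title" : { "query" : %s } } }'

broken = template % ('"' + title + '"')  # raw substitution, as the placeholder replacement appears to do
fixed = template % json.dumps(title)     # json.dumps wraps the value in quotes and escapes the inner ones

try:
    json.loads(broken)
except ValueError as err:
    print("unescaped parameter fails to parse: %s" % err)

print(json.loads(fixed))                 # parses cleanly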
So what is the best way to handle this?

Specifying Numbers in VTL on AWS API Gateway

Doc Ref: http://docs.aws.amazon.com/apigateway/latest/developerguide/models-mappings.html
In AWS VTL one specifies dictionary fields in a model schema thus:
"field1" : {"type":"string"},
"field2" : {"type":"number"},
and so a mapping template can populate such fields thus:
#set($inputRoot = $input.path('$'))
"questions" :
[
#foreach($elem in $inputRoot)
{
"field1" : "$elem.field1",
"field2" : $elem.field2
}#if($foreach.hasNext),#end
#end
]
However... my iOS app complains the received data isn't in JSON format. If I add quotes around $elem.field2 then iOS accepts the JSON and converts all fields to strings.
My Lambda function is returning a standard JSON list of dictionaries with field2 defined as an integer.
But APIG returns strings for all my fields, delimited with {} and a prefix:
{S=some text}
{N=10000000500}
So I can see that field2 isn't a number but a string {N=10000000500}.
How do I handle numbers in this system?
Undocumented but you can simply specify the type after the field name in a mapping template:
#set($inputRoot = $input.path('$'))
"questions" :
[
#foreach($elem in $inputRoot)
{
"field1" : "$elem.field1.S",
"field2" : $elem.field2.N
}#if($foreach.hasNext),#end
#end
]
Note that string fields need to be delimited in quotes.

mongodb find element within a hash within a hash

I am attempting to build a query, to be run from the Mongo shell client, that will allow access to the following element of a hash within a hash within a hash.
Here is the structure of the data:
{
  "_id" : ObjectId("BSONID"),
  "e1" : "value",
  "e2" : "value",
  "e3" : "value",
  "updated_at" : ISODate("2015-08-31T21:04:37.669Z"),
  "created_at" : ISODate("2015-01-05T07:20:17.833Z"),
  "e4" : 62,
  "e5" : {
    "sube1" : {
      "26444745" : {
        "subsube1" : "value",
        "subsube2" : "value",
        "subsube3" : "value I am looking for",
        "subsube4" : "value",
        "subsube5" : "value"
      },
      "40937803" : {
        "subsube1" : "value",
        "subsube2" : "value",
        "subsube3" : "value I am looking for",
        "subsube4" : "value",
        "subsube5" : "value"
      },
      "YCPGF5SRTJV2TVVF" : {
        "subsube1" : "value",
        "subsube2" : "value",
        "subsube3" : "value I am looking for",
        "subsube4" : "value",
        "subsube5" : "value"
      }
    }
  }
}
So I have tried dotted notation, based on a suggestion for "diving" into a wildcard-named hash, using db.my_collection.find({"e5.sube1.subsube4": "value I am looking for"}), which keeps coming back with an empty result set. I have also tried the find using a regex match instead of an exact value, using /value I am lo/, and still get an empty result set. I know there is at least one document which has the "value I am looking for".
Any ideas - note I am restricted to using the Mongo shell client.
Thanks.
So, since this is not capable of being handled as a JavaScript/Mongo shell array, I will go to plan B, which is to write some code (be it Perl or Ruby), pull the result set into an array of hashes, and walk each document/sub-document.
Thanks Mario for the help.
You have two issues:
* You're missing one level.
* You are checking subsube4 instead of subsube3.
Depending on what subdocument of sube1 you want to check, you should do
db.my_collection.find({"e5.sube1.26444745.subsube3": "value I am looking for"})
or
db.my_collection.find({"e5.sube1.40937803.subsube3": "value I am looking for"})
or
db.my_collection.find({"e5.sube1.YCPGF5SRTJV2TVVF.subsube3": "value I am looking for"})
You could use the $or operator if you want to look in any one of the three.
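A minimal sketch of that $or query (shown with pymongo for brevity; the filter document is exactly what you would pass to find() in the shell, and the connection details are placeholders):
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["my_collection"]  # placeholder connection

query = {"$or": [
    {"e5.sube1.26444745.subsube3": "value I am looking for"},
    {"e5.sube1.40937803.subsube3": "value I am looking for"},
    {"e5.sube1.YCPGF5SRTJV2TVVF.subsube3": "value I am looking for"},
]}

for doc in coll.find(query):
    print(doc["_id"])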
If you don't know the keys of your documents, that's an issue with your schema design: you should use arrays instead of objects. Similar case: How to query a dynamic key - mongodb schema design
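To illustrate the array-based alternative, here is a hypothetical reshaping of the example document and the query it enables (the id field name is an assumption):
# Hypothetical reshaped document: the dynamic keys become an "id" field inside array elements.
reshaped = {
    "e5": {
        "sube1": [
            {"id": "26444745", "subsube3": "value I am looking for"},
            {"id": "40937803", "subsube3": "value I am looking for"},
            {"id": "YCPGF5SRTJV2TVVF", "subsube3": "value I am looking for"},
        ]
    }
}

# With that shape, a single filter matches any element regardless of its id
# (the same filter document works in the shell or in pymongo):
query = {"e5.sube1": {"$elemMatch": {"subsube3": "value I am looking for"}}}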
EDIT
Since you explain that this is a one-off request to count occurrences of "value I am looking for", we can run a map-reduce. You can run these commands in the shell.
Define map function
var iurMapFunction = function() {
    // Skip documents that don't have the nested e5.sube1 object.
    if (!this.e5 || !this.e5.sube1) return;
    for (var key in this.e5.sube1) {
        if (this.e5.sube1[key].subsube3 == "value I am looking for") {
            var value = {
                count: 1,
                subkey: key
            };
            emit(key, value);
        }
    }
};
Define reduce function
var iurReduceFunction = function(keys, countObjVals) {
    var reducedVal = {
        count: 0
    };
    for (var idx = 0; idx < countObjVals.length; idx++) {
        reducedVal.count += countObjVals[idx].count;
    }
    return reducedVal;
};
Run mapreduce command
db.my_collection.mapReduce(iurMapFunction,
    iurReduceFunction, {
        out: {
            replace: "map_reduce_result"
        }
    }
);
Find your counts
db.map_reduce_result.find()
This should give you, for each dynamic key in your object, the number of times it had an embedded field subsube3 with the value "value I am looking for".

How do you import binary data with mongoimport?

I've tried every combination to import binary data for Mongo and I CANNOT get it to work. I've tried using new BinData(0, <bindata>) and I've tried using
{
"$binary" : "<bindata>",
"$type" : "0"
}
The first one gives me a parsing error. The second gives me an error reading "Invalid use of a reserved field name."
I can import other objects fine. For reference, I'm trying to import a BASE64-encoded image string. Here is my current version of the JSON I'm using:
{"_id" : "72984ce4-de03-407f-8911-e7b03f0fec26","OriginalWidth" : 73, "OriginalHeight" : 150, { "$binary" : "", "$type" : "0" }, "ContentType" : "image/jpeg", "Name" : "test.jpg", "Type" : "5ade8812-e64a-4c64-9e23-b3aa7722cfaa"}
I actually figured out this problem and thought I'd come back to SO to help anyone out who might be struggling.
Essentially, what I was doing was using C# to generate a JSON file. That file was used on an import script that ran and brought in all kinds of data. One of the fields in a collection required storing binary image data as a Base64-encoded string. The Mongo docs (Import Export Tools and Importing Interesting Types) were helpful, but only to a certain point.
To format the JSON properly for this, I had to use the following C# snippet to get an image file as a byte array and dump it into a string. There is a more efficient way of doing this for larger strings (StringBuilder for starters), but I'm simplifying for the purpose of illustrating the example:
byte[] bytes = File.ReadAllBytes(imageFile);
output = "{\"Data\" : {\"$binary\" : \"" + Convert.ToBase64String(bytes) + "\", \"$type\" : \"00\"}, \"ContentType\" : \"" + GetMimeType(fileInfo.Name) + "\", \"Name\" : \"" + fileInfo.Name + "\"}";
I kept failing on the type part, by the way. The "00" type translates to generic binary data, as specified in the BSON spec here: http://bsonspec.org/#/specification.
If you want to skip straight to the JSON, the above code outputs a string very similar to this:
{"Data": {"$binary": "[Byte array as Base64 string]", "$type": "00"}, "ContentType": "image/jpeg", "Name": "test.jpg"}
Then, I just used the mongoimport tool to process the resulting JSON.
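For anyone not working in C#, the same file can be produced with a few lines of Python using the identical $binary/$type convention (the paths and field names here are just placeholders):
import base64
import json

image_file = "test.jpg"                   # placeholder path
with open(image_file, "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

doc = {
    "Data": {"$binary": encoded, "$type": "00"},
    "ContentType": "image/jpeg",
    "Name": image_file,
}

# One document per line, ready for mongoimport.
with open("images.json", "w") as out:
    out.write(json.dumps(doc) + "\n")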
Note: since I'm already in C#, I could've just used the Mongo DLL and done processing there, but for this particular case, I had to create the JSON files raw in the code. Fun times.