Issue parsing a MongoDB collection that contains multiple schemas in Spark - mongodb

I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent (a single collection contains several schemas whose data types differ slightly). When I try to read the data with Spark, the sampling fails to pick up all of the schemas present in the data and throws the error below. (The schema is complex, so I can't specify it explicitly; I have to rely on Spark's sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it back as an RDD, but the issue persists.
Any help on this would be appreciated.
Thanks.

All these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans only a small fraction of the data to determine the types. If a particular field has lots of NULLs, you can get unlucky and its type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference (see the sketch below).
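For the MongoDB Spark connector that is the sampleSize read option. A minimal Scala sketch, with an assumed SparkSession (spark), URI and sample size, none of which come from the question:

val df = spark.read
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://localhost:27017/test.init_data")  // placeholder URI
  .option("sampleSize", 100000)                               // the default samples far fewer documents, so rare shapes get missed
  .load()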
Another option is to get the inferred schema first, fix it, and reload the dataframe with the fixed schema:
import pyspark.sql.types

def fix_spark_schema(schema):
    # Recursively replace every NullType leaf with StringType
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema
collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case all the problematic fields could be represented with StringType; you can make the logic more complex if needed.

As far as I understood your problem:
* either Spark incorrectly detected your schema and considered some fields as required (nullable = false) - in that case, you can still define the schema explicitly and set nullable to true. This would work if your schema was evolving and at some point in the past you added or removed a field while keeping the column type (e.g. a String stays a String and never becomes a Struct or some other completely different type),
* or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or some other completely different type. In that case I don't see any solution other than using the RDD abstraction, working with very permissive types such as Any in Scala (Object in Java), and using isInstanceOf tests to normalize all fields into one common format (see the sketch after this list).
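A minimal Scala sketch of that RDD-based normalization, with a made-up fees field that is sometimes a single value and sometimes already a list (the field name and its shapes are assumptions, not taken from your collection):

import com.mongodb.spark._
import org.bson.Document

val raw = MongoSpark.load(sc)  // RDD[Document], no schema inference involved

val normalized = raw.map { doc =>
  val fees: Any = doc.get("fees")
  val feesAsList =
    if (fees == null) new java.util.ArrayList[Any]()                                        // missing -> empty list
    else if (fees.isInstanceOf[java.util.List[_]]) fees.asInstanceOf[java.util.List[Any]]   // already the new shape
    else java.util.Arrays.asList[Any](fees)                                                 // old single-value shape -> one-element list
  doc.put("fees", feesAsList)
  doc
}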
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that the data between 2018-01-01 and 2018-02-01 uses schema#1 and the rest uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Later you can simply union both datasets and apply your transformations on consistently structured values, as sketched below.
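Purely as an illustration of that idea (spark is an assumed SparkSession; the date column, the single fee field of schema#1 and the array-of-structs fees field of schema#2 are all hypothetical):

import org.apache.spark.sql.functions._

val readOpts = Map("uri" -> "mongodb://localhost:27017/db.collection")  // placeholder
val raw = spark.read.format("com.mongodb.spark.sql").options(readOpts).load()

// schema#1: a single numeric fee -> wrap it into the array-of-structs shape used by schema#2
val oldShape = raw
  .filter(col("createdAt") < lit("2018-02-01"))
  .withColumn("fees", array(struct(col("fee").as("minAmount"), col("fee").as("maxAmount"))))
  .drop("fee")

val newShape = raw.filter(col("createdAt") >= lit("2018-02-01"))

// align column order and union into one consistently structured dataset
val unified = oldShape.select(newShape.columns.map(col): _*).union(newShape)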
Edit:
I've just tried code similar to yours and it worked correctly against my local MongoDB instance:
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))

// Load sample data
val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.split("[\\r\\n]+").toSeq
MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in MongoDB shell I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }

Related

Transforming ReactiveMongo JSON with Play JSON

ReactiveMongo's JSON functionality generates objects (JsObject in play-json parlance) rather than scalars for certain MongoDB datatypes like BSONObjectID and BSONDateTime. For example, you get JSON like this:
{
  "_id" : {
    "$oid" : "5de32e618f02001d8d521757"   // BSONObjectID
  },
  "createdAt" : {
    "$date" : 15751396221447              // BSONDateTime
  }
}
Aside from being cumbersome to deal with, I would prefer not to expose JSON that leaks MongoDB concerns to REST clients.
The tricky thing is that these values occur throughout the tree, so I need to write a Play JSON transformer smart enough to recursively transform the above at every level to look like this:
{
  "id" : "5de32e618f02001d8d521757",
  "createdAt" : 15751396221447
}
One failed attempt to do this for just BSONObjectID is this:
(JsPath \ "_id").json.update(
  JsPath.read[JsObject].map { o => o ++ Json.obj("id" -> (o \ f"$$oid").as[String]) }
)
How can I do this?
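Not part of the original question, but a hedged sketch of such a recursive transformer (assuming the play-json 2.6.x API, and that every single-key {"$oid": ...} or {"$date": ...} wrapper should simply collapse to its inner value; renaming _id to id is left as a separate step):

import play.api.libs.json._

// Recursively collapse {"$oid": ...} and {"$date": ...} wrappers anywhere in the tree.
def collapseMongoWrappers(js: JsValue): JsValue = js match {
  case obj: JsObject if obj.value.keySet == Set("$oid") || obj.value.keySet == Set("$date") =>
    collapseMongoWrappers(obj.value.head._2)
  case obj: JsObject =>
    JsObject(obj.value.map { case (k, v) => k -> collapseMongoWrappers(v) })
  case arr: JsArray =>
    JsArray(arr.value.map(collapseMongoWrappers))
  case other => other
}

// usage: val cleaned = collapseMongoWrappers(Json.parse(rawMongoJson))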

Apache Spark export PostgreSQL data in Parquet format

I have a PostgreSQL database with ~1000 different tables. I'd like to export all of these tables and the data inside them into Parquet files.
In order to do that, I'm going to read each table into a DataFrame and then store that df as a Parquet file. Many of the PostgreSQL tables contain user-defined types.
The biggest issue is that I can't manually specify the schema for each DataFrame. Will Apache Spark, in this case, be able to automatically infer the PostgreSQL table schemas and store them appropriately in Parquet format, or is this impossible with Apache Spark, so that some other technology must be used for this purpose?
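For the export loop itself, a minimal sketch along those lines (the table list, connection options and output path are placeholders; spark is an existing SparkSession):

import org.apache.spark.sql.SaveMode

// Placeholder list; in practice the ~1000 names could be read from information_schema.tables
val tables = Seq("moving_boxes")

tables.foreach { table =>
  spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql:sparktest")
    .option("dbtable", table)
    .option("user", "user")
    .option("password", "password")
    .load()
    .write
    .mode(SaveMode.Overwrite)
    .parquet(s"export/$table")
}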
UPDATED
I have created the following PostgreSQL user-defined type, table and records:
create type dimensions as (
  width integer,
  height integer,
  depth integer
);

create table moving_boxes (
  id serial primary key,
  dims dimensions not null
);

insert into moving_boxes (dims) values (row(3,4,5)::dimensions);
insert into moving_boxes (dims) values (row(1,4,2)::dimensions);
insert into moving_boxes (dims) values (row(10,12,777)::dimensions);
Implemented the following Spark application:
import org.apache.spark.sql.SaveMode

// this gives a one-partition Dataset
val opts = Map(
  "url" -> "jdbc:postgresql:sparktest",
  "dbtable" -> "moving_boxes",
  "user" -> "user",
  "password" -> "password")

val df = spark.read
  .format("jdbc")
  .options(opts)
  .load

df.printSchema()
df.write.mode(SaveMode.Overwrite).format("parquet").save("moving_boxes.parquet")
This is df.printSchema output:
root
|-- id: integer (nullable = true)
|-- dims: string (nullable = true)
As you can see, the Spark DataFrame infers the dims schema as a string rather than as a complex nested type.
This is the log information from ParquetWriteSupport:
18/11/06 10:08:52 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "dims",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional int32 id;
  optional binary dims (UTF8);
}
Could you please explain whether the original complex dims type (defined in PostgreSQL) will be lost in the saved Parquet file?

Pig's AvroStorage LOAD removes unicode chars from input

I am using Pig to read Avro files and normalize/transform the data before writing it back out. The Avro files have records of the form:
{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : "The key"
  }, {
    "name" : "value",
    "type" : {
      "type" : "map",
      "values" : "bytes"
    },
    "doc" : "The value"
  } ]
}
I have used the AvroTools command-line utility in conjunction with jq to dump the first record to JSON:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
  "key": "some-record-uuid",
  "value": {
    "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
  }
}
I run the following pig commands:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
The corresponding record for the last command above is as follows:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
As you can see, the unicode characters have been removed from the pf_v value. These characters are actually used as delimiters in the values, so I need them in order to fully parse the records into their desired normalized state. The unicode characters are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Is anybody aware of a way to get AvroStorage to not remove the unicode characters when loading records?
Thank you!
Update:
I have also performed the same operation using Avro's python DataFileReader:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break
reader.close()
This prints a dict with what looks like hex chars substituted for the unicode chars (which is preferable to removing them entirely):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}

Mongo db : query on nested json

Sample JSON object:
{
  "_id" : ObjectId("55887982498e2bef5a5f96db"),
  "a" : "x",
  "q" : "null",
  "p" : "",
  "s" : "{\"f\":{\"b\":[\"I\"]},\"time\":\"fs\"}"
}
I need all documents where time = fs.
My query :
{"s":{"time" : "fs"}}
The query above returns zero results, but that is not true - there are matching documents.
There are two problems here. First of all, s is clearly a string, so your query cannot work. You can use $regex as shown below, but it won't be very efficient:
{s: {$regex: '"time"\:"fs"'}}
I would suggest converting the s fields to proper documents. You can use JSON.parse to do that, and the documents can be updated based on their current value using db.foo.find().snapshot().forEach. See this answer for details.
The second problem is that your query is simply wrong. To match the nested time field you should use dot notation:
{"s.time" : "fs"}

MapReduce MongoDB User Agent

I have 5 million entries in a MongoDB collection that look like this:
{
  "_id" : ObjectId("525facace4b0c1f5e78753ea"),
  "productId" : null,
  "name" : "example name",
  "time" : ISODate("2013-10-17T09:23:56.131Z"),
  "type" : "hover",
  "url" : "www.example.com",
  "userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}
I need to add to every entry a new field called device, which will have either the value desktop or mobile. That means the goal is to end up with entries like the following:
{
  "_id" : ObjectId("525facace4b0c1f5e78753ea"),
  "productId" : null,
  "device" : "desktop",
  "name" : "example name",
  "time" : ISODate("2013-10-17T09:23:56.131Z"),
  "type" : "hover",
  "url" : "www.example.com",
  "userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}
I am working with the MongoDB Java driver and so far I am doing the following:
DBObject query = new BasicDBObject();
query.put("device", new BasicDBObject("$exists", false)); //some entries already have such field
DBCursor cursor = resource.find(query);
cursor.addOption(Bytes.QUERYOPTION_NOTIMEOUT);
Iterator<DBObject> iterator = cursor.iterator();
int size = cursor.count();
Then I iterate with a while(iterator.hasNext()), doing an if-else with a huge regular expression I found out there, and depending on the result of that if-else I execute something like:
BasicDBObject newDocument = new BasicDBObject("$set", new BasicDBObject().append("device", "desktop")); // or "mobile", depending on the if-else
BasicDBObject searchQuery = new BasicDBObject("_id", id);
resource.getCollection(DatabaseConfiguration.WEBSITE_STATISTICS).update(searchQuery, newDocument);
However, due to the large amount of data (more than 5 million entries) this takes forever.
Is there a way of doing this with map reduce? So far I've only used MapReduce for counting, so I am not sure if it can be used for other matters.
I found a way, which was kind of tricky due to all the configuration involved.
After installing Hadoop following this link, I did the following:
* Created a class called MongoUpdate with a run method where I set up all the configuration (such as the input and output URIs), create the job and configure its settings. Among those settings there is job.setMapperClass(MongoMapper.class).
* Created MongoMapper, which has the map method that receives a BSONObject. Here I perform the if-else condition, and at the very end I do (a sketch of such a mapper is shown below):
Text id = new Text(pValue.get("_id").toString());
pContext.write(id, new BSONWritable(pValue));
* Created a Main class whose main method simply instantiates a MongoUpdate object and calls its run method.
* Exported the jar with all the libraries and ran it from the terminal: hadoop jar NameOfTheJar.jar
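Not from the original answer, but a hedged Scala sketch of the MongoMapper described above, assuming the mongo-hadoop connector's BSONObject/BSONWritable types and using a much-simplified stand-in for the "huge" user-agent regex:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper
import org.bson.BSONObject
import com.mongodb.hadoop.io.BSONWritable

class MongoMapper extends Mapper[Object, BSONObject, Text, BSONWritable] {

  override def map(key: Object, value: BSONObject,
                   context: Mapper[Object, BSONObject, Text, BSONWritable]#Context): Unit = {
    val ua = Option(value.get("userAgent")).map(_.toString).getOrElse("")
    // Simplified placeholder for the real user-agent classification regex
    val device = if (ua.matches("(?i).*(mobile|android|iphone|ipad).*")) "mobile" else "desktop"
    value.put("device", device)
    context.write(new Text(value.get("_id").toString), new BSONWritable(value))
  }
}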