The problem is quite simple: I have an SQL query obtained from an external source (thus neither the query nor the schema (data types) is known at compile time), and I want to create a Stream of "raw" rows (e.g. Array[AnyRef] or similar), thereby deferring the actual type-checking to the stream processing.
However, creating a Query0, e.g. via
val query: String = ...
Query0[Array[AnyRef]](query)
  .stream
does not work (quite expectedly), since Array[AnyRef] has no Read instance.
The question is: should I try to construct my own Read instance for the raw row, or should I use more low-level methods (manually dealing with the statement/result set APIs)?
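For reference, a minimal sketch of what the low-level route could look like with plain JDBC (resource handling and doobie integration omitted; the java.sql.Connection is assumed to be available from wherever the query comes from):
import java.sql.Connection

def rawRows(conn: Connection, query: String): Iterator[Array[AnyRef]] = {
  val stmt = conn.createStatement()
  val rs   = stmt.executeQuery(query)
  val n    = rs.getMetaData.getColumnCount
  // pull each row as an untyped Array[AnyRef]; JDBC column indices are 1-based
  Iterator.continually(rs).takeWhile(_.next()).map { r =>
    Array.tabulate[AnyRef](n)(i => r.getObject(i + 1))
  }
}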
I have read this friendly article about encoding and decoding custom objects using the official Go MongoDB driver.
There is a nice example of how to marshal them into the extended JSON format (bson.MarshalExtJSONWithRegistry), but I would like to know how to put such a document into a collection with InsertOne() (and later get it back). Look at this pseudo code:
// myReg - variable created according to linked article in question.
// WithRegistry does **not** exist in the mongo-driver lib; it is part of this pseudocode
mongoCollection := client.Database("db").Collection("coll").WithRegistry(myReg)
// Now InsertOne() honors myReg (type *bsoncodec.Registry) when serializing `val` and putting it into MongoDB
mongoCollection.InsertOne(context.TODO(), val)
I have gone through the API docs and found the Marshaler and Unmarshaler interfaces, but with the registry approach I would be able to (de)serialize the same type in different ways for different collections (for example, when migrating from an old format to a new one).
So the question is: how do I use a *bsoncodec.Registry with the collection functions (like InsertOne, UpdateOne, FindOne, etc.), and if that is not possible, what is the most idiomatic way to achieve my goal (custom (de)serialization)?
The Database.Collection() method has an optional options.CollectionOptions parameter, which lets you set the bsoncodec.Registry. If you acquire your collection using options configured with a registry, that registry will be used for all operations performed with that collection.
Use it like this:
opts := options.Collection().SetRegistry(myReg)
c := client.Database("db").Collection("coll", opts)
Quoting from my related answer: How to ignore nulls while unmarshalling a MongoDB document?
Registries can be set / applied at multiple levels, even to a whole mongo.Client, or to a mongo.Database or just to a mongo.Collection, when acquiring them, as part of their options, e.g. options.ClientOptions.SetRegistry().
So when you're not doing migration from old to new format, you may set the registry at the "client" level and "be done with it". Your registry and custom coders / decoders will be applied whenever the driver deals with a value of your registered custom type.
Kafka 0.8
I want to publish/consume byte[] objects, Java bean objects, serializable objects, and much more.
What is the best way to define a publisher and a consumer for this type of scenario?
When I consume a message from the consumer iterator, I do not know what type the message is.
Can anybody point me to a guide on how to design such scenarios?
I enforce a single schema or object type per Kafka Topic. That way when you receive messages you know exactly what you are getting.
At a minimum, you should decide whether a given topic is going to hold binary or string data, and depending on that, how it will be further encoded.
For example, you could have a topic named Schema that contains JSON-encoded objects stored as strings.
If you use JSON and a loosely-typed language like JavaScript, it could be tempting to store different objects with different schemas in the same topic. With JavaScript, you can just call JSON.parse(...), take a peek at the resulting object, and figure out what you want to do with it.
But you can't do that in a strictly-typed language like Scala. The Scala JSON parsers generally want you to parse the JSON into an already defined Scala type, usually a case class. They do not work with this model.
One solution is to keep the one schema / one topic rule, but cheat a little: wrap an object in an object. A typical example would be an Action object where you have a header that describes the action, and a payload object with a schema dependent on the action type listed in the header. Imagine this pseudo-schema:
{name: "Action", fields: [
  {name: "actionType", type: "string"},
  {name: "actionObject", type: "string"}
]}
This way, even in a strongly-typed language, you can do something like the following (again, this is pseudo-code):
action = JSONParser[Action].parse(msg)
action.actionType match {
  case "foo" => val foo = JSONParser[Foo].parse(action.actionObject)
  case "bar" => val bar = JSONParser[Bar].parse(action.actionObject)
}
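A hedged Scala version of that pseudo-code, assuming circe for the JSON handling (any library that decodes into case classes would do; Action, Foo and Bar are hypothetical case classes for the header and the two payload schemas):
import io.circe.generic.auto._
import io.circe.parser.decode

case class Action(actionType: String, actionObject: String)
case class Foo(fooField: String)   // hypothetical payload schema
case class Bar(barField: Int)      // hypothetical payload schema

def handle(msg: String): Unit =
  // decode only the header first, then the payload that matches the action type
  decode[Action](msg).foreach { action =>
    action.actionType match {
      case "foo" => decode[Foo](action.actionObject).foreach(foo => println(s"got $foo"))
      case "bar" => decode[Bar](action.actionObject).foreach(bar => println(s"got $bar"))
      case _     => () // this consumer ignores every other action type
    }
  }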
One of the neat things about this approach is that if you have a consumer that's waiting for only a specific action.actionType, and is just going to ignore all the others, it's pretty lightweight for it to decode just the header and put off decoding action.actionObject until when and if it is needed.
So far this has all been about string-encoded data. If you want to work with binary data, of course you can wrap it in JSON as well, or any of a number of string-based encodings like XML. But there are a number of binary-encoding systems out there, too, like Thrift and Avro. In fact, the pseudo-schema above is based on Avro. You can even do cool things in Avro like schema evolution, which amongst other things provides a very slick way to handle the above Action use case -- instead of wrapping an object in an object, you can define a schema that is a subset of other schemas and decode just the fields you want, in this case just the action.actionType field. Here is a really excellent description of schema evolution.
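For illustration, a sketch of that subset-schema trick using Avro's GenericDatumReader from Scala (the schema strings mirror the pseudo-schema above, and the bytes are assumed to be an Avro-encoded Action record):
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// full schema the producer wrote with
val writerSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "Action", "fields": [
    |  {"name": "actionType",   "type": "string"},
    |  {"name": "actionObject", "type": "string"}
    |]}""".stripMargin)

// reader schema keeps only the field this consumer cares about
val readerSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "Action", "fields": [
    |  {"name": "actionType", "type": "string"}
    |]}""".stripMargin)

def actionTypeOf(bytes: Array[Byte]): String = {
  val reader  = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder).get("actionType").toString
}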
In a nutshell, what I recommend is:
Settle on a schema-based encoding system (be it JSON, XML, Avro, whatever)
Enforce a one-schema-per-topic rule
Can anyone please explain the difference between map and mapAsync with respect to Akka Streams? The documentation says:
Stream transformations and side effects involving external non-stream
based services can be performed with mapAsync or mapAsyncUnordered
Why can't we simply use map here? I assume that Flow, Source, and Sink are all monadic in nature, so map should work fine with respect to the delayed/asynchronous nature of these stages?
Signature
The difference is best highlighted in the signatures: Flow.map takes in a function that returns a type T while Flow.mapAsync takes in a function that returns a type Future[T].
Practical Example
As an example, suppose that we have a function which queries a database for a user's full name based on a user id:
type UserID = String
type FullName = String
val databaseLookup : UserID => FullName = ??? //implementation unimportant
Given an akka stream Source of UserID values we could use Flow.map within a stream to query the database and print the full names to the console:
val userIDSource : Source[UserID, _] = ???
val stream =
  userIDSource.via(Flow[UserID].map(databaseLookup))
              .to(Sink.foreach[FullName](println))
              .run()
One limitation of this approach is that this stream will only make 1 db query at a time. This serial querying will be a "bottleneck" and likely prevent maximum throughput in our stream.
We could try to improve performance through concurrent queries using a Future:
def concurrentDBLookup(userID : UserID) : Future[FullName] =
  Future { databaseLookup(userID) }
val concurrentStream =
  userIDSource.via(Flow[UserID].map(concurrentDBLookup))
              .to(Sink.foreach[Future[FullName]](_ foreach println))
              .run()
The problem with this simplistic addendum is that we have effectively eliminated backpressure.
The Sink is just pulling in the Future and adding a foreach println, which is relatively fast compared to database queries. The stream will continuously propagate demand to the Source and spawn off more Futures inside of the Flow.map. Therefore, there is no limit to the number of databaseLookup running concurrently. Unfettered parallel querying could eventually overload the database.
Flow.mapAsync to the rescue; we can have concurrent db access while at the same time capping the number of simultaneous lookups:
val maxLookupCount = 10
val maxLookupConcurrentStream =
  userIDSource.via(Flow[UserID].mapAsync(maxLookupCount)(concurrentDBLookup))
              .to(Sink.foreach[FullName](println))
              .run()
Also notice that the Sink.foreach got simpler: it no longer takes in a Future[FullName] but just a FullName.
Unordered Async Map
If maintaining a sequential ordering of the UserIDs to FullNames is unnecessary, then you can use Flow.mapAsyncUnordered, for example when you just need to print all of the names to the console and don't care about the order in which they are printed.
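A sketch of that variant, reusing the names defined above; the only change is the operator, and completed lookups are emitted as soon as they finish rather than in input order:
val unorderedStream =
  userIDSource.via(Flow[UserID].mapAsyncUnordered(maxLookupCount)(concurrentDBLookup))
              .to(Sink.foreach[FullName](println))
              .run()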
I'm working on an application that uses a generic Slick class to make queries based on information (such as url, user, pass, column count, etc) provided in metadata files or property files. As a result, I am unable to hardcode any information about the tables I will be accessing. Thus, I will be using a lot of raw SQL queries within Slick, and then proceeding to filter and paginate through the data using Slick tools.
My question is this:
In the example provided in Slick's documentation:
import slick.driver.H2Driver.api._
val db = Database.forConfig("h2mem1")
val action = sql"select ID, NAME, AGE from PERSON".as[(Int,String,Int)]
db.run(action)
You can see that action has .as[(Int, String, Int)] at the end of it, I'm guessing to tell the compiler what to expect. That makes sense. However, what I'd like to do would require me to know that information from something other than the source code. Is there any way to have the rows returned from the query be some sort of List or Array that I could access with dynamic information (such as index numbers)? I'd be willing to accept a List[String], for example, to make this less of a type headache.
I'll keep working at it, but as a Slick newbie, I was wondering if anyone more experienced than me would have a solution off the top of their head.
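One hedged sketch of a possible direction, assuming Slick 3.x: a GetResult that reads every column of the current row as a String, combined with the plain-SQL interpolator's #$ literal splicing, gives back rows that can be indexed dynamically.
import slick.driver.H2Driver.api._
import slick.jdbc.GetResult

// read however many columns the result set actually has, all as Strings
implicit val getRowAsStrings: GetResult[List[String]] =
  GetResult(r => (1 to r.numColumns).map(_ => r.nextString()).toList)

val query: String = "select ID, NAME, AGE from PERSON" // would come from metadata files at runtime
val action = sql"#$query".as[List[String]]
// db.run(action) yields a Seq[List[String]] whose elements can be accessed by index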
I want to perform geoip lookups of my data in spark. To do that I'm using MaxMind's geoIP database.
What I want to do is to initialize a geoip database object once on each partition, and later use that to lookup the city related to an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like this (this is Python, but I want a Scala solution):
class IPLookup(object):
    database = None

    def getCity(self, ip):
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
Of course, doing this requires that Spark serialise the whole object, which is something the docs caution against.
In Spark, per-partition operations can be done using:
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)
This mapper will execute the function f once per partition over an iterator of elements. The idea is that the cost of setting up resources (like DB connections) will be offset with the usage of such resources over a number of elements in the iterator.
Example:
val logsRDD = ???
logsRDD.mapPartitions { iter =>
  val geoIp = new GeoIPLookupDB(...)
  // this is a local map over the iterator - do not confuse it with rdd.map
  iter.map(elem => (geoIp.resolve(elem.ip), elem))
}
This seems like a good use case for a broadcast variable. Have you looked at the documentation for that functionality, and if you have, does it fail to meet your requirements in some way?
As @bearrito mentioned, you can load your GeoDB and then broadcast it from your driver.
Another option to consider is to provide an external service that you can use to do the lookups. It could be an in-memory cache such as Redis/Memcached/Tachyon, or a regular datastore.