Performance Optimization: SQL query and looping over the grouped object (hash) - postgresql

My associations look like this:
Field has_many :field_values
FieldValue belongs_to :field
I am querying the field_values table to get field_values grouped by field_id:
def field_values
  FieldValue.where(field_id: @field_ids)
            .pluck(:id, :value, :field_id, :active, :old_value)
            .map { |obj| { id: obj[0], value: obj[1], field_id: obj[2],
                           active: obj[3], old_value: obj[4] } }
            .group_by { |a| a[:field_id] }
end
The above method executes this SQL, which takes 20 ms to fetch 30k records:
SELECT id, value, field_id, active, old_value FROM field_values WHERE field_id IN (85, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 99, 100, 101, 102, 103, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 117, 118, 119);
Then I construct my @shared_fields object (an array of hashes like the one below) and insert the field_values into it:
@shared_fields
{
  id: 1,
  field: {
    id: 11,
    field_values: [
      {
        id: 111,
        value: 'abc',
        field_id: 11,
        active: true,
        old_value: nil
      },
      {
        id: 112,
        value: 'pqr',
        field_id: 11,
        active: true,
        old_value: nil
      }
    ]
  }
}
def construct_obj
  field_values.each do |id, values|
    sh_field = @shared_fields.detect { |shf| shf[:field][:id] == id }
    next unless sh_field
    sh_field[:field][:field_values] = values
  end
end
The field_values method takes 20 ms to fetch 30k records; construct_obj takes around 170 ms to complete its processing.
Any thoughts on how to optimize the SQL query, and the loop over the grouped object that is taking 170 ms?
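One likely optimization, sketched below under the assumption that @shared_fields is an array of hashes shaped as above: detect rescans @shared_fields for every group, making construct_obj roughly O(groups x shared_fields), whereas building a lookup hash once (here via ActiveSupport's index_by) makes each lookup O(1). On the SQL side, an index on field_values.field_id (if one isn't already present) is the usual first step; at 20 ms for 30k rows, most of that time is probably data transfer.
def construct_obj
  # Build the id => shared_field lookup once, instead of calling
  # detect (a linear scan over @shared_fields) inside the loop.
  shared_by_field_id = @shared_fields.index_by { |shf| shf[:field][:id] }

  field_values.each do |id, values|
    sh_field = shared_by_field_id[id]
    next unless sh_field
    sh_field[:field][:field_values] = values
  end
end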

Related

CAST jsonb column into INT[]

I'm in a situation where I get a jsonb value (from the scrape field, which is jsonb) that looks like this:
SELECT COALESCE(scrape->'amenity_ids', '[]'::jsonb) AS ids
FROM my_table
ids |
-------------------------------------------------------------------------------------------------------------+
[] |
[33, 34, 35, 4, 5, 37, 8, 40, 9, 41, 42, 11, 44, 45, 46, 47, 16, 21, 56] |
[129, 35, 4, 36, 37, 103, 40, 41, 45, 77, 17, 23, 30] |
[1, 33, 34, 35, 4, 36, 8, 40, 41, 44, 45, 77, 46, 47, 85, 56, 90, 91, 92, 93, 30, 95] |
[1, 129, 2, 4, 8, 9, 77, 85, 89, 90, 91, 92, 93, 30, 94, 95, 96, 33, 34, 100, 37, 38, 40, 41, 44, 45, 46, 57]|
Note that there are NULL values in the jsonb object. So at this point ids is of type jsonb, and what I need is an array of integers, since I'm trying to query for:
SELECT int_array_ids @> '{33,34,35}' FROM my_table;
Once I'm able to convert ids to INT[], I can create indexes to speed up my array-contains queries.
I tried a subquery using array_agg, but it's terribly slow:
SELECT array_agg(arrayed.am_id) FROM (
  SELECT
    id,
    jsonb_array_elements_text(scrape->'amenity_ids') AS am_id
  FROM my_table
) AS arrayed
GROUP BY arrayed.id
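One common approach, sketched under two assumptions: PostgreSQL 9.4+ and arrays that contain only integers (a JSON null inside an array would make the ::int cast fail). Materialize the cast into a real INT[] column (amenity_ids_int is a hypothetical name), backfill it once, and put a GIN index on it so @> containment queries can use the index:
-- Hypothetical column name; any INT[] column works.
ALTER TABLE my_table ADD COLUMN amenity_ids_int INT[];

-- One-time backfill: cast each jsonb array element to int;
-- missing keys become the empty array via COALESCE.
UPDATE my_table
SET amenity_ids_int = ARRAY(
  SELECT elem::int
  FROM jsonb_array_elements_text(
         COALESCE(scrape->'amenity_ids', '[]'::jsonb)
       ) AS elem
);

-- GIN indexes support the @> "array contains" operator.
CREATE INDEX my_table_amenity_ids_int_idx
  ON my_table USING GIN (amenity_ids_int);

SELECT * FROM my_table WHERE amenity_ids_int @> '{33,34,35}';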

Intersection of Two Map rdd's in Scala

I have two RDDs, for example:
firstmapRDD - (0-14,List(0, 4, 19, 19079, 42697, 444, 42748))
secondmapRdd - (0-14, List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94))
I want to find the intersection.
I tried var interResult = firstmapRDD.intersection(secondmapRdd), which produces no result in the output file.
I also tried cogrouping based on keys, mapRDD.cogroup(secondMapRDD).filter(x => ...), but I don't know how to find the intersection between the values. Is it x => x._1.intersect(x._2)? Can someone help me with the syntax?
Even this throws a compile-time error: mapRDD.cogroup(secondMapRDD).filter(x => x._1.intersect(x._2))
var mapRDD = sc.parallelize(map.toList)
var secondMapRDD = sc.parallelize(secondMap.toList)
var interResult = mapRDD.intersection(secondMapRDD)
It may be because of the ArrayBuffer[List[]] values that the intersection is not working. Is there any hack to remove them?
I tried doing this:
var interResult = mapRDD.cogroup(secondMapRDD)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l.toList.intersect(r.toList)) }
Still getting an empty list!
Since you are looking to intersect the values, you need to join both RDDs on their keys, get all the matched value pairs, then intersect the values. (intersection itself compares whole (key, value) pairs, so it only matches when the lists are identical on both sides, which is why it returned nothing.)
Sample code:
val firstMap = Map(1 -> List(1, 2, 3, 4, 5))
val secondMap = Map(1 -> List(1, 2, 5))
val firstKeyRDD = sparkContext.parallelize(firstMap.toList, 2)
val secondKeyRDD = sparkContext.parallelize(secondMap.toList, 2)
val joinedRDD = firstKeyRDD.join(secondKeyRDD)
val finalResult = joinedRDD.map(tuple => {
  val matchedLists = tuple._2
  val intersectValues = matchedLists._1.intersect(matchedLists._2)
  (tuple._1, intersectValues)
})
finalResult.foreach(println)
The output will be
(1,List(1, 2, 5))
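As for why the cogroup attempt above still returned an empty list: cogroup yields (key, (Iterable[List[Int]], Iterable[List[Int]])), so l.toList.intersect(r.toList) intersects lists of lists, which only match when an entire inner list is identical on both sides. A sketch of a working cogroup variant, flattening each side to the element level first:
val viaCogroup = firstKeyRDD.cogroup(secondKeyRDD)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  // Flatten Iterable[List[Int]] to Iterable[Int] before intersecting.
  .map { case (k, (l, r)) => (k, l.flatten.toList.intersect(r.flatten.toList)) }
viaCogroup.foreach(println)  // (1,List(1, 2, 5))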

infer type parameter from function argument

In the context of another Stack Overflow question, I have this snippet:
import scala.annotation.tailrec

def orderedGroupBy[T, P](seq: Traversable[T], f: T => P): Traversable[Tuple2[P, Traversable[T]]] = {
  @tailrec
  def accumulator(seq: Traversable[T], f: T => P, res: List[Tuple2[P, Traversable[T]]]): Traversable[Tuple2[P, Traversable[T]]] = seq.headOption match {
    case None => res.reverse
    case Some(h) => {
      val key = f(h)
      val subseq = seq.takeWhile(f(_) == key)
      accumulator(seq.drop(subseq.size), f, (key -> subseq) :: res)
    }
  }
  accumulator(seq, f, Nil)
}
I'd like to use it just like one can use .groupBy, e.g.:
orderedGroupBy(1 to 100, (_ / 10))
But the compiler yields an error about not having enough type information:
<console>:10: error: missing parameter type for expanded function ((x$1) => x$1.$div(10))
orderedGroupBy(1 to 100, (_ / 10))
What is the idiomatic way to do this?
You can curry the parameters, so that T is inferred solely from seq: Traversable[T].
def orderedGroupBy[T, P](seq: Traversable[T])(f: T => P): Traversable[Tuple2[P, Traversable[T]]] = ???
scala> orderedGroupBy(1 to 100)(_ / 10)
res110: Traversable[(Int, Traversable[Int])] = List((0,Range(1, 2, 3, 4, 5, 6, 7, 8, 9)), (1,Range(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)), (2,Range(20, 21, 22, 23, 24, 25, 26, 27, 28, 29)), (3,Range(30, 31, 32, 33, 34, 35, 36, 37, 38, 39)), (4,Range(40, 41, 42, 43, 44, 45, 46, 47, 48, 49)), (5,Range(50, 51, 52, 53, 54, 55, 56, 57, 58, 59)), (6,Range(60, 61, 62, 63, 64, 65, 66, 67, 68, 69)), (7,Range(70, 71, 72, 73, 74, 75, 76, 77, 78, 79)), (8,Range(80, 81, 82, 83, 84, 85, 86, 87, 88, 89)), (9,Range(90, 91, 92, 93, 94, 95, 96, 97, 98, 99)), (10,Range(100)))
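If changing the signature isn't an option, annotating the lambda's parameter type also gives the compiler enough information (a small alternative, not part of the original answer):
orderedGroupBy(1 to 100, (i: Int) => i / 10)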

Elixir streaming mongodb fail

I am using elixir-mongo and trying to stream the results of a query. Here's the code...
def archive_stream(z) do
  Stream.resource(
    fn ->
      {jobs, datetime} = z
      lt = datetime_to_bson_utc datetime
      c = jobs |> Mongo.Collection.find(%{updated_at: %{"$lt": lt}}) |> Mongo.Find.exec
      {:cont, c.response.buffer, c}
    end,
    fn(z) ->
      {j, {cont, therest, c}} = next(z)
      case cont do
        :cont -> {j, {cont, therest, c}}
        :halt -> {:halt, {cont, therest, c}}
      end
    end,
    fn(:halt, resp) -> resp end
  )
end
All of the sub-bits seem to work (like the query), but when I try to get at the stream, I fail...
Mdb.archive_stream({jobs, {{2013,11,1},{0,0,0}}}) |> Enum.take(2)
I get...
(BadArityError) #Function<2.49475906/2 in Mdb.archive_stream/1> with arity 2 called with 1 argument ({:cont, <<90, 44, 0, 0, 7, 95, 105, 100, 0, 82, 110, 129, 221, 102, 160, 249, 201, 109, 0, 137, 233, 4, 95, 115, 108, 117, 103, 115, 0, 51, 0, 0, 0, 2, 48, 0, 39, 0, 0, 0, 109, 97, 110, 97, 103, 101, 114, 45, ...>>, %Mongo.Cursor{batchSize: 0, collection: %Mongo.Collection{db: %Mongo.Db{auth: {"admin", "edd5404c4f906060b125688e26ffb281"}, mongo: %Mongo.Server{host: 'db-stage.member0.mongolayer.com', id_prefix: 57602, mode: :passive, opts: %{}, port: 27017, socket: #Port<0.27099>, timeout: 6000}, name: "db-stage", opts: %{mode: :passive, timeout: 6000}}, name: "jobs", opts: %{}}, exhausted: false, response: %Mongo.Response{buffer: <<188, 14, 0, 0, 7, 95, 105, 100, 0, 82, 110, 129, 221, 102, 160, 249, 201, 109, 0, 137, 242, 4, 95, 115, 108, 117, 103, 115, 0, 45, 0, 0, 0, 2, 48, 0, 33, 0, 0, 0, 114, 101, 116, 97, ...>>, cursorID: 3958284337526732701, decoder: &Mongo.Response.bson_decode/1, nbdoc: 101, requestID: 67280413, startingFrom: 0}}})
(elixir) lib/stream.ex:1020: Stream.do_resource/5
(elixir) lib/enum.ex:1731: Enum.take/2
I'm stumped. Any ideas?
thanks for the help
Dang! Rookie error.
:halt -> {:halt, {cont, therest, c}} should be _ -> {:halt, z}, and fn(:halt, resp) -> resp end should be fn(resp) -> resp end.
I've been d**king around with everything but the after function for a day and a half.
A little more explanation for fellow rookies...
The last clause in the next_fun should probably be _, in order to catch other "bad behavior" and not just :halt.
The after_fun expects exactly one argument; in the code above that is the z tuple from the last clause of the next_fun. It is not expecting :halt and z, just z.
I'd welcome input from real experts.
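For reference, a sketch of archive_stream with both fixes applied (untested against a live elixir-mongo setup, so treat it as illustrative):
def archive_stream(z) do
  Stream.resource(
    fn ->
      {jobs, datetime} = z
      lt = datetime_to_bson_utc datetime
      c = jobs |> Mongo.Collection.find(%{updated_at: %{"$lt": lt}}) |> Mongo.Find.exec
      {:cont, c.response.buffer, c}
    end,
    fn(z) ->
      {j, {cont, therest, c}} = next(z)
      case cont do
        :cont -> {j, {cont, therest, c}}
        # catch-all: halt with the accumulator itself
        _ -> {:halt, z}
      end
    end,
    # after_fun takes exactly one argument
    fn(resp) -> resp end
  )
end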

$all query operator with range?

Is it possible to specify a range when using the $all operator in a query?
Example:
Where favorites_ordered contains [174, 225, 25, 165, 65, 87, 158], the standard way would be:
Select where 174 and 158 are in the favorites_ordered field.
db.inventory.find( { favorites_ordered: { $all: [ 174, 158 ] } } )
[174, 225, 25, 165, 65, 87, 158] : found
I would like to make:
Select where 174 and 158 are in the favorites_ordered field and are among the first 3 values.
[174, 225, 25, 165, 65, 87, 158] : not found
You can query specific array positions in mongo; for example, matching documents whose first element is one of the two values:
> db.foo.insert({favorites_ordered: [174, 225, 25, 165, 65, 87, 158]})
> db.foo.insert({favorites_ordered: [158, 225, 25, 165, 65, 87, 174]})
> db.foo.insert({favorites_ordered: [100, 158, 225, 25, 165, 65, 87, 174]})
> db.foo.find({ 'favorites_ordered.0': { $in : [174, 158] } })
{ "_id" : ObjectId("51fa394f59a0a6afeec5b138"), "favorites_ordered" : [ 174, 225, 25, 165, 65, 87, 158 ] }
{ "_id" : ObjectId("51fa3a2d59a0a6afeec5b139"), "favorites_ordered" : [ 158, 225, 25, 165, 65, 87, 174 ] }