How to use spark and mongo with play to calculate prediction? - mongodb

I am using Play, Scala and MongoDB (Salat).
I have the following database structure:
[{
    "id": mongoId,
    "name": "abc",
    "utilization": 20,
    "timestamp": 1416668402352
},
{
    "id": mongoId,
    "name": "abc",
    "utilization": 30,
    "timestamp": 1415684102290
},
{
    "id": mongoId,
    "name": "abc",
    "utilization": 90,
    "timestamp": 1415684402210
},
{
    "id": mongoId,
    "name": "abc",
    "utilization": 40,
    "timestamp": 1415684702188
},
{
    "id": mongoId,
    "name": "abc",
    "utilization": 35,
    "timestamp": 1415684702780
}]
Using the above data, I want to calculate the utilization for the current timestamp (by applying a statistical algorithm).
To calculate it I am using Spark, and I have added the necessary dependencies to the Play project's build.sbt.
I have the following questions:
1) How can I calculate the current utilization (using Spark's MLlib)?
2) Is it possible to query a Mongo collection for only some of its fields using Spark?

There is a project named Deep-Spark that takes care of integrating Spark with MongoDB (and other datastores such as Cassandra, Aerospike, etc.).
https://github.com/Stratio/deep-spark
You can check how to use it here:
https://github.com/Stratio/deep-spark/blob/master/deep-examples/src/main/java/com/stratio/deep/examples/java/ReadingCellFromMongoDB.java
It is a very simple way to start working with MongoDB and Spark.
Sorry, I cannot help you with MLlib, but surely somebody will add something useful.
Disclaimer: I currently work at Stratio.
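For question 1, a common first model is a least-squares line of utilization against timestamp, which is the model MLlib's linear regression (e.g. LinearRegressionWithSGD over an RDD of LabeledPoint) would learn at scale. Below is a framework-free Python sketch of that math using the sample documents from the question; fit_line and predict are illustrative names, not library APIs, and a real pipeline would run the equivalent fit inside Spark.

```python
# Minimal sketch: ordinary least-squares fit of utilization vs. timestamp.
# Integer sums keep the normal equations exact despite the large timestamps.
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# (timestamp, utilization) samples from the question's documents.
samples = [
    (1416668402352, 20),
    (1415684102290, 30),
    (1415684402210, 90),
    (1415684702188, 40),
    (1415684702780, 35),
]

slope, intercept = fit_line(samples)

def predict(timestamp_ms):
    # Extrapolate utilization at the given (e.g. current) timestamp.
    return slope * timestamp_ms + intercept
```

Whether a straight line is an appropriate model for utilization is a separate question; with more history, MLlib also offers regularized regression and other regressors that drop into the same shape of pipeline.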

Related

What is the proper index format that I should use in MongoDB for the particular scenario explained below?

I have the following query to be executed on my MongoDB collection order_error, which has over 60 million documents. The main concern is that the query contains an $in operator. I tried several combinations of indices, but none of them gave a significant performance improvement. The query is as follows:
db.getCollection("order_error").find({
    "$and": [
        { "type": "order" },
        {
            "Origin.SN": {
                "$in": ["4095", "4100", "4509", "4599", "4510"]
            }
        }
    ]
}).sort({ "timestamp.milliseconds": 1 }).skip(1).limit(100).explain("executionStats")
One thing to note is that I allow sorting on timestamp.milliseconds in both directions (ASC and DESC). I have limited the entries within the $in here; usually there are more. So what kind of index would give a performance improvement? I have already tried creating the following indices:
type_1_Origin.SN_1_timestamp.milliseconds_-1
type_1_timestamp.milliseconds_-1_Origin.SN
Is there a better way to create this index?
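One option not in the list above follows MongoDB's Equality, Sort, Range (ESR) guideline for compound indexes: equality-matched fields first, then the sort field, then range-like fields (an $in combined with a sort is often best treated as the range part). A single ascending index can serve both sort directions here, since the index can be walked backwards. Whether this beats the existing indices depends on the data and should be confirmed with explain(); the sketch below just shows the spec as it would be passed to PyMongo's create_index, with 1 standing for pymongo.ASCENDING.

```python
# Hypothetical index following the ESR (Equality, Sort, Range) guideline:
#   equality  -> "type"
#   sort      -> "timestamp.milliseconds"
#   range-ish -> "Origin.SN" (the $in clause)
index_spec = [
    ("type", 1),
    ("timestamp.milliseconds", 1),
    ("Origin.SN", 1),
]

# With a live connection this would be created via:
#   db.order_error.create_index(index_spec)
# or in the shell:
#   db.order_error.createIndex({type: 1, "timestamp.milliseconds": 1, "Origin.SN": 1})
fields = [field for field, _ in index_spec]
```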

Performing a date comparison with a MongoDB Find using only JSON

As far as I can tell, there seems to be no way to perform a Find lookup in MongoDB that does a Date $lte comparison without using either the Date or ISODate JavaScript constructor.
I am using a technology stack in which Lua scripts are invoked by a secondary Perl system. The issue is that I can only pass pure JSON between these systems, and subsequently on to the MongoDB instance itself.
This presents a number of issues, the major one being that any JavaScript functions supported by Mongo are not an option. (I am assuming there is no way to get Mongo to interpret functions stored within string values?)
Currently I can perform an $lte lookup using the aggregate method with something like:
{
    "$match": {
        "$expr": {
            "$and": [
                {
                    "$lte": [
                        "$start_date",
                        {
                            "$dateFromString": {
                                "dateString": "2020-06-07T04:44:39.993Z"
                            }
                        }
                    ]
                }
            ]
        }
    }
}
While this works, it is just a single date comparison; it would be nice to reduce the amount of code, since I expect to be creating some fairly complex queries.
Ideally I would be able to use Find and do the following:
{
    "start_date": {
        "$lte": "2020-06-07T04:44:39.993Z"
    }
}
As far as I know, Find does not support $lte date lookups using a string-based date value.
Things I've tried:
Using various ISO and JavaScript formatted date strings.
Using Epoch seconds and millisecond string values.
Attempting to escape the JavaScript function calls within the string value.
Is what I am attempting to do possible at all? Is using pure JSON an option? Or do I just have to concede defeat, use aggregate to perform the string-to-date conversion, and live with the additional boilerplate code? Or is there a fundamental thing I'm missing with Mongo that would let me do this?
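One pure-JSON escape hatch worth checking: if start_date is stored as a string rather than a BSON Date, fixed-width ISO-8601 UTC strings sort lexicographically in the same order as the instants they represent, so the plain Find query shown above works with no Date constructor at all. If the field is a real BSON Date this does not apply, because the server compares BSON types and a string never range-matches a Date. A small Python sketch of the string-comparison property (the dates list is made up for illustration):

```python
# Zero-padded ISO-8601 UTC strings of identical format compare
# lexicographically in chronological order.
dates = [
    "2020-06-08T00:00:00.000Z",
    "2020-06-07T04:44:39.993Z",
    "2019-12-31T23:59:59.999Z",
]

cutoff = "2020-06-07T04:44:39.993Z"

# Mimics {"start_date": {"$lte": cutoff}} against string-typed fields.
matches = [d for d in dates if d <= cutoff]
```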

Simple migration of data from SQL to MongoDB using Pentaho

I am a newbie to MongoDB and Pentaho, and I am having trouble converting my existing RDBMS table to MongoDB.
The structure of RDBMS is:
user_id,question_id,option_id
12,23,4
12,24,7
12,24,8
12,25,9
I want this to be converted into:
{
    user_id: 12,
    questions: [
        { question_id: 23, Options: [4] },
        { question_id: 24, Options: [7, 8] },
        { question_id: 25, Options: [9] }
    ]
}
I am using Pentaho's MongoDB Output step; I have tried various combinations of settings, but none worked.
One of the combinations I used is below:
user_id,,Y,N,Y,N/A,Insert&Update
question_id,Questions[0],Y,N,N,$push,Insert&Update
options_id,Questions[1],Y,N,N,$push,Insert&Update
With the above settings I got:
{
    user_id: 12,
    questions: [
        { question_id: 23, Options: 4 },
        { question_id: 24, Options: 7 },
        { question_id: 24, Options: 8 },
        { question_id: 25, Options: 9 }
    ]
}
But I need Options 7 and 8 to be grouped into a single array.
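If the MongoDB Output step cannot be configured to nest the second level, a practical fallback is to do the grouping in a small script before inserting the documents with a driver. A hedged Python sketch, using the sample rows from the question; group_rows is an illustrative helper, not a Pentaho or driver API:

```python
# Sample RDBMS rows: (user_id, question_id, option_id).
rows = [
    (12, 23, 4),
    (12, 24, 7),
    (12, 24, 8),
    (12, 25, 9),
]

def group_rows(rows):
    """Group flat rows into one document per user with nested questions."""
    users = {}
    for user_id, question_id, option_id in rows:
        user = users.setdefault(
            user_id, {"user_id": user_id, "questions": []}
        )
        # Append the option to an existing question, or start a new one.
        for q in user["questions"]:
            if q["question_id"] == question_id:
                q["Options"].append(option_id)
                break
        else:
            user["questions"].append(
                {"question_id": question_id, "Options": [option_id]}
            )
    return list(users.values())

docs = group_rows(rows)
# docs could then be inserted with e.g. PyMongo's collection.insert_many(docs).
```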

Restructuring data inside of Mongo or out?

I have some simple transaction-style data in a flat format like the following:
{
    'giver': 'Alexandra',
    'receiver': 'Julie',
    'amount': 20,
    'year_given': 2015
}
There can be multiple entries for any 'giver' or 'receiver'.
I mostly query this data by the giver field, and then split the results up by year, so I would like to speed up these queries.
I'm fairly new to Mongo so I'm not sure which of the following methods would be the best course of action:
1. Restructure the data into the format:
{
    'giver': 'Alexandra',
    'transactions': {
        '2015': [
            {
                'receiver': 'Julie',
                'amount': 20
            },
            ...
        ],
        '2014': ...,
        ...
    }
}
This makes the most sense to me. We place all transactions into subdocuments of a person rather than having transactions scattered across the collection. It stores the data in the form I query it by most often, so it should be fast to query by 'giver' and then 'transactions.year'.
I'm unsure whether restructuring data like this is possible inside of Mongo, or whether I should export it and modify it outside of Mongo with some programming language.
2. Simply index by 'giver'
This doesn't quite match the way I'm querying this data (by 'giver' then 'year'), but it could be fast enough for what I'm looking for. It's simple to do within Mongo and doesn't require restructuring the data.
How should I go about adjusting my database to make my queries faster? And which way is the 'Mongo way'?
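On the first part of option 1: the reshaping can be done outside Mongo with a few lines of driver-side code (a $group/$push aggregation can do something similar server-side). A hedged Python sketch, where restructure is an illustrative helper and the second sample document is made up to show multiple years:

```python
# Flat transaction documents as they exist in the collection today.
docs = [
    {"giver": "Alexandra", "receiver": "Julie", "amount": 20, "year_given": 2015},
    {"giver": "Alexandra", "receiver": "Julie", "amount": 15, "year_given": 2014},
]

def restructure(docs):
    """Group flat transactions into one document per giver, keyed by year."""
    out = {}
    for d in docs:
        giver = out.setdefault(
            d["giver"], {"giver": d["giver"], "transactions": {}}
        )
        year = str(d["year_given"])  # subdocument keys must be strings
        giver["transactions"].setdefault(year, []).append(
            {"receiver": d["receiver"], "amount": d["amount"]}
        )
    return list(out.values())

reshaped = restructure(docs)
# reshaped could be written to a new collection with insert_many.
```

Note that option 2 (an index on 'giver', or a compound index on 'giver' plus 'year_given') avoids this migration entirely and keeps each transaction independently updatable, which is often the safer starting point.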

MongoDB selection with existing value

I'm using PyMongo to fetch data from MongoDB. All documents in the collection look like the structure below:
{
    "_id" : ObjectId("50755d055a953d6e7b1699b6"),
    "actor": {
        "languages": ["nl"]
    },
    "language": {
        "value": "nl"
    }
}
I'm trying to fetch all the conversations where the property language.value is inside the property actor.languages.
At the moment I know how to look for all conversations with a constant value inside actor.languages (eg. all conversations with en inside actor.languages).
But I'm stuck on how to do the same comparison with a variable value (language.value) inside the current document.
Any help is welcome, thanks in advance!
db.testcoll.find({$where:"this.actor.languages.indexOf(this.language.value) >= 0"})
You could use a $where provided your query set is small, but at any real size you could start seeing problems, especially since this query seems like one that needs to run in real time on a page, and the JS engine is single-threaded, among other problems.
I would actually consider a better way in this case to be the client side; it is quite straightforward: pull out records based on one of the values, then iterate and test the other value against it (i.e. pull out records where language.value is nl, then check each document's actor.languages for that value).
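That client-side approach might look like the sketch below in Python; here the docs list stands in for the cursor a PyMongo call such as collection.find({"language.value": "nl"}) would return, and the second document is made up to show a non-match:

```python
# Stand-in for documents fetched by an initial find() on language.value.
docs = [
    {"actor": {"languages": ["nl"]}, "language": {"value": "nl"}},
    {"actor": {"languages": ["en"]}, "language": {"value": "nl"}},
]

# Keep only documents whose actor.languages contains language.value,
# the same condition the $where expression tests server-side.
matches = [
    d for d in docs
    if d["language"]["value"] in d["actor"]["languages"]
]
```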
I would imagine you might be able to do this with the aggregation framework; however, at the minute you cannot use computed fields within $match. I would imagine it would look something like this:
{$project:
    {language_value: "$language.value", languages: "$actor.languages"}
}, {$match: {language_value: {$in: "$languages"}}}
If you could. But there might be a way.