Read from MongoDB into sklearn (scikit-learn)

Apologies if this has been asked, but I have looked here and elsewhere on the web for an answer and nothing has been forthcoming.
The basic examples of using sklearn either read a prepared sklearn dataset into memory, or read a file such as a .csv into memory and then process it. Can someone kindly provide an equivalent example of reading database data into sklearn for processing, preferably from MongoDB, but I will take what I can get at this point. I have been struggling with this for a little while now.
I can post what I have done so far, but I don't think it will help. Thanks for any help/advice/pointers.
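
A common pattern is to query the documents with pymongo, load them into a pandas DataFrame (or a NumPy array), and pass that to sklearn, since its estimators only care about array-like input. Below is a minimal sketch along those lines; the database, collection and field names and the choice of LogisticRegression are placeholders, and pymongo and pandas are assumed to be installed.

    import pandas as pd
    from pymongo import MongoClient
    from sklearn.linear_model import LogisticRegression

    # Connect to a local MongoDB instance (placeholder database/collection names).
    client = MongoClient("mongodb://localhost:27017/")
    collection = client["mydb"]["mycollection"]

    # Pull only the needed fields into memory; _id is excluded because sklearn
    # just needs the numeric feature columns and the label.
    cursor = collection.find({}, {"_id": 0, "feature1": 1, "feature2": 1, "label": 1})
    df = pd.DataFrame(list(cursor))

    # sklearn estimators accept any array-like, so a DataFrame works directly.
    X = df[["feature1", "feature2"]]
    y = df["label"]

    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict(X.head()))

If the collection does not fit in memory, the same idea can be applied in batches with estimators that support partial_fit.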

Related

accessing p-values in PySpark UnivariateFeatureSelector module

I'm currently in the process of performing feature selection on a fairly large dataset and decided to try out PySpark's UnivariateFeatureSelector module.
I've been able to get everything sorted out except one thing -- how on earth do you access the actual p-values that have been calculated for a given set of features? I've looked through the documentation and searched online and I'm wondering if you can't... but that seems like such a gross oversight for a package like this.
thanks in advance!
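
Not an authoritative answer, but as far as I can tell the fitted selector model does not expose the p-values directly. One possible workaround, if your features and label fit the chi-squared case, is to compute them separately with pyspark.ml.stat.ChiSquareTest; a small sketch with made-up data:

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import ChiSquareTest

    spark = SparkSession.builder.getOrCreate()

    # Tiny made-up dataset: a label column and a 2-feature vector column.
    data = [
        (0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (1.0, Vectors.dense(3.5, 30.0)),
    ]
    df = spark.createDataFrame(data, ["label", "features"])

    # The result has one row with pValues/degreesOfFreedom/statistics,
    # one entry per feature, in the same order as the feature vector.
    result = ChiSquareTest.test(df, "features", "label").head()
    print(result.pValues)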

Pyspark Dataframe metadata

Is there a way to get metadata about ALL the dataframes? I'd be interested in something like a list of dataframes with information about each (memory distribution information, etc.). I don't see anything in the docs that shows how to do this.
I'm thinking of this as a troubleshooting tool for when I'm having memory issues. If I'm working on a big script that creates, caches and unpersists lots of dataframes, it would be really nice to be able to display a list of dataframes so that I could see if I've missed anything or my caching settings are wrong, or something like that.
Thanks

Using PrintNode to Transfer data to Excel

I am trying to take data from a USB-HID scale (Dymo S100) and get it into Excel. I came across a program called PrintNode: https://www.printnode.com/docs/reading-usb-scales-over-the-internet/
It seems like the solution; however, I don't have enough knowledge to make use of the API to transfer data to Excel. Any advice would be a tremendous help. Thank you.

How to make Weka API work with MongoDB?

I'm looking to use WEKA to train and predict from data in MongoDB. Specifically, I intend to use the Weka API to analyse data (e.g. build a recommendation engine). But I have no idea how to proceed, because the data in MongoDB is stored in the BSON format, while WEKA uses the ARFF format. I would like to use the WEKA API to read data from MongoDB, analyse it, and provide recommendations to the user in real-time. I cannot find a bridge between WEKA and MongoDB.
Is this even possible or should I try another approach?
Before I begin, I should say that WEKA isn't the best tool for working with Big Data. If you really have Big Data, you will likely want to use Spark and the Hadoop family as they are more suited to analysis.
To answer your question as written, I would advise doing the training manually (i.e. creating a training file using any programmatic tools available to you) and pretraining a model. These models can then be saved and integrated into a program accordingly.
For testing, you can follow the official instructions, but I usually take a bit of a shortcut: I usually preprocess my data into a CSV-like format (as if it was going into an ARFF file) and just prepend a valid ARFF header (the same one as your training file uses). From there, it is very easy to test the instances. In my experience, this greatly simplifies the process of writing code that actually makes novel predictions.
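For example, that shortcut might look roughly like the Python sketch below; the MongoDB connection details, field names and ARFF attributes are invented for illustration, and the header should really be copied from your own training file.

    from pymongo import MongoClient

    # Placeholder connection, database and collection names.
    client = MongoClient("mongodb://localhost:27017/")
    docs = client["mydb"]["ratings"].find({}, {"_id": 0})

    # The same ARFF header your training file uses (attributes invented here).
    arff_header = (
        "@relation recommendations\n"
        "@attribute user_age numeric\n"
        "@attribute item_price numeric\n"
        "@attribute liked {yes,no}\n"
        "@data\n"
    )

    # Write each MongoDB document out as a CSV-like row under the ARFF header,
    # using "?" for the unknown class value of instances to be predicted.
    with open("test_instances.arff", "w") as f:
        f.write(arff_header)
        for doc in docs:
            f.write("{},{},?\n".format(doc["user_age"], doc["item_price"]))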

PPDB paraphrase searching

There is a well-known lexical resource of paraphrases, PPDB.
It comes in several sizes, ranging from the highest precision to the highest recall. The biggest paraphrase set, XXXL, contains ~5 GB of data.
I want to use PPDB for my research, and I wonder what the best engine is for searching such a big resource. I haven't tried it yet, but I think using it as a flat file is not a good idea.
I was thinking about exporting all the data to Mongo, but I am not sure if this is the best solution.
If you have some ideas, please share them with us.
Thank you.
You need to consider the following aspects:
1. For your use-case you will need a schemaless database
2. Transactions not required
3. Fast queries/searching
4. Easy to setup and deploy
5. Ability to handle large volumes of data
All the above aspects point towards adopting MongoDB.
You may run into some teething troubles exporting the data to MongoDB, but it is definitely worth the effort. Your data model for each document can be as follows: {key: [value1, value2, ...]}.
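A rough sketch of loading the data in that shape with pymongo; here each document gets a "key" field holding the phrase and a "values" array holding its paraphrases. The file name is a placeholder, and the field positions (phrase and paraphrase in the second and third " ||| "-separated fields) are an assumption about the PPDB lexical packs, so adjust them if your pack differs.

    from collections import defaultdict
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    coll = client["ppdb"]["paraphrases"]

    # Group paraphrases by source phrase. For the full XXXL pack you would
    # want to do this in batches rather than holding everything in memory.
    groups = defaultdict(list)
    with open("ppdb-xxxl-lexical.txt", encoding="utf-8") as f:  # placeholder file name
        for line in f:
            parts = [p.strip() for p in line.split("|||")]
            if len(parts) < 3:
                continue
            phrase, paraphrase = parts[1], parts[2]
            groups[phrase].append(paraphrase)

    # One document per phrase: {"key": <phrase>, "values": [<paraphrases>]}
    coll.insert_many([{"key": k, "values": v} for k, v in groups.items()])

    # An index on "key" keeps lookups fast even with millions of documents.
    coll.create_index("key")
    print(coll.find_one({"key": "car"}))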