Training data for the unsupervised learning API - fastText

I am trying to use the "crawl-300d-2M.vec" pre-trained model to cluster the documents for my project. I am not sure what format the training data (train.txt) should be in when I use
ft_model = fasttext.train_unsupervised(input='train.txt', pretrainedVectors=path, dim=300)
My corpus contains 10k documents. What I did was put them all in a text file and feed it to the train_unsupervised method. I did not get good results. Can someone explain what I am missing? Thank you.
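For reference, fastText's unsupervised trainer just expects a plain UTF-8 text file; a common convention is one preprocessed (lowercased, whitespace-tokenized) document per line, with dim matching the pretrained vectors. Below is a minimal sketch of that pipeline; the preprocessing choices and file paths are assumptions, not something confirmed by the original question.

# Sketch of preparing train.txt for fasttext.train_unsupervised.
# Assumption: `documents` is a list of raw strings from your corpus,
# and the .vec path below is only a placeholder.
import re
import fasttext

def preprocess(text: str) -> str:
    """Lowercase, strip non-alphanumeric characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

documents = ["First document ...", "Second document ..."]  # placeholder corpus

with open("train.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(preprocess(doc) + "\n")  # one tokenized document per line

# dim must match the pretrained vectors (crawl-300d-2M.vec is 300-dimensional)
ft_model = fasttext.train_unsupervised(
    input="train.txt",
    model="skipgram",
    dim=300,
    pretrainedVectors="crawl-300d-2M.vec",
)

# Document vectors you can then cluster, e.g. with k-means
vectors = [ft_model.get_sentence_vector(preprocess(d)) for d in documents]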

Related

Design of a real-time sentiment analysis system

We're trying to design a real time sentiment analysis system (on paper) for a school project. We got some (very vague) negative feedback on how we store our data, but it isn't fully clear why this would be a bad idea or how we'd best improve this.
The setup of the system is as follows:
Data from real-time news RSS feeds is gathered in a Kafka messaging queue, which connects to our preprocessing platform. This preprocessing step transforms the news articles into semi-structured data on which we can run sentiment analysis.
We then want to store both the sentiment analysis and the preprocessed, semi-structured news article for reference.
We were thinking of using MongoDB as the database, since it gives a lot of freedom in defining different fields in the value of each key:value pair you store, rather than Cassandra (which would be faster).
The basic use case is for people to look up an institution and get the sentiment analysis of a bunch of news articles in a certain timeframe.
As a possible improvement: do we need to use a NoSQL database, or would it make sense to use a SQL database? I think our system could benefit from being denormalized (as is the default in NoSQL), and we wouldn't need operations such as joins, which are significantly faster in SQL systems.
Does anyone know of existing systems that do similar things, for comparison?
Any input would be highly appreciated.
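For what it's worth, here is a minimal sketch of what one stored record could look like in MongoDB using pymongo, shaped so that the "look up an institution within a timeframe" query stays simple. The connection URI, database/collection names, and field names are all assumptions for illustration.

# Minimal sketch of storing one processed article plus its sentiment score
# in MongoDB with pymongo. Field names, the connection URI, and the
# database/collection names are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["news"]["articles"]

doc = {
    "source": "example-rss-feed",
    "institution": "Acme Corp",            # the entity users will search for
    "published_at": datetime(2024, 1, 15, tzinfo=timezone.utc),
    "title": "Acme Corp announces ...",
    "text": "preprocessed, semi-structured article body ...",
    "sentiment": {"label": "negative", "score": -0.62},
    "ingested_at": datetime.now(timezone.utc),
}
articles.insert_one(doc)

# The basic use case: sentiment for one institution within a timeframe
cursor = articles.find({
    "institution": "Acme Corp",
    "published_at": {"$gte": datetime(2024, 1, 1, tzinfo=timezone.utc)},
})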

GeoTools filters for shapefiles

I am looking at using GeoTools to read shapefiles. The tutorial for using it is straightforward, showing how to set a filter with "Filter filter = Filter.INCLUDE;" to select everything.
I want to split up the reading for performance purposes on very large shapefiles. In essence, I want to split the reading of the info in the DBF file from the reading of the "THE_GEOM" data. We have a lot of our own filtering already built, and it is easier to just use it and then retrieve the actual geometry as required.
How do I specify a filter to retrieve all the DBF info without the geometry?
How do I specify a filter to retrieve the geometry without the DBF info? This isn't as important since it probably won't impact performance so much but I would like to know.
Thanks.
By design, GeoTools' shapefile datastore goes to great lengths to keep the geometry and the attributes (the DBF data) together, so you are going to have to poke around in the internals to do this. You could use a DbaseFileReader and a ShapefileReader to split the reading up.
I would consider porting your filters to GeoTools as it gives you the flexibility to switch data sources later when Shapefiles prove too small or too slow. It might be worth looking at the CQL and ECQL classes to help out in constructing them.
If you are really dealing with large shapefiles (>2 GB), then using a proper spatial database like PostGIS is almost certainly going to give better performance, and GeoTools will access the data in exactly the same way with exactly the same filters.

How to make Weka API work with MongoDB?

I'm looking to use WEKA to train and predict from data in MongoDB. Specifically, I intend to use the Weka API to analyse data (e.g. build a recommendation engine). But I have no idea how to proceed, because the data in MongoDB is stored in the BSON format, while WEKA uses the ARFF format. I would like to use the WEKA API to read data from MongoDB, analyse it, and provide recommendations to the user in real time. I cannot find a bridge between WEKA and MongoDB.
Is this even possible or should I try another approach?
Before I begin, I should say that WEKA isn't the best tool for working with big data. If you really have big data, you will likely want to use Spark and the Hadoop family, as they are better suited to analysis at that scale.
To answer your question as written, I would advise doing the training manually (i.e. creating a training file using any programmatic tools available to you) and pretraining a model. These models can then be saved and integrated into a program accordingly.
For testing, you can follow the official instructions, but I usually take a bit of a shortcut: I preprocess my data into a CSV-like format (as if it were going into an ARFF file) and just prepend a valid ARFF header (the same one your training file uses). From there, it is very easy to test the instances. In my experience, this greatly simplifies the process of writing code that actually makes novel predictions.
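As a rough illustration of that shortcut, the sketch below pulls records out of MongoDB with pymongo and writes them as an ARFF file that Weka (GUI or API) can load. The collection name, field names, attribute list, and class labels are all assumptions.

# Sketch of the "CSV plus ARFF header" approach: read records from MongoDB
# with pymongo and write an ARFF file for Weka. Database, collection,
# fields, and labels are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["mydb"]["ratings"].find()

header = """@relation recommendations

@attribute user_id numeric
@attribute item_id numeric
@attribute rating numeric
@attribute liked {yes,no}

@data
"""

with open("train.arff", "w", encoding="utf-8") as f:
    f.write(header)
    for r in records:
        liked = "yes" if r["rating"] >= 4 else "no"
        f.write(f"{r['user_id']},{r['item_id']},{r['rating']},{liked}\n")

# train.arff can now be opened in the Weka GUI or loaded via the Weka API
# (e.g. weka.core.converters.ArffLoader) to train and save a model.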

Data mining on unstructured text

I am working right now on an academic project and I want to use data mining techniques for market segmentation.
I want to store text data (which is expected to be a large amount of text), like tweets, news feeds, etc., so they are different sources of data (and have different structures).
There are two questions:
What is the best way to get all these news articles, posts, etc., so that I finally have enough text data to process and draw good conclusions from? Or what other kinds of unstructured data can I use?
Where should I store all the unstructured text, in order to access it later and apply all these text mining techniques? What about MongoDB?
Thank you so much!
Take a look at the following:
Apache Lucene
Apache Solr
Elasticsearch
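For example, with Elasticsearch the official Python client makes it straightforward to store raw posts and run full-text queries over them. This is only a sketch using the 8.x-style client API; the index name, document fields, and query terms are assumptions.

# Minimal sketch of storing and searching unstructured text with the
# official Elasticsearch Python client (8.x-style API). Index name,
# fields, and query are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="posts", document={
    "source": "twitter",
    "author": "someuser",
    "text": "Example tweet text about a product launch ...",
    "created_at": "2024-01-15T10:00:00Z",
})

# Full-text search across whatever was indexed
resp = es.search(index="posts", query={"match": {"text": "product launch"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["text"])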

Read from MongoDB into sklearn (scikit-learn)

Apologies if this has been asked, but I have looked here and elsewhere on the web for an answer and nothing has been forthcoming.
The basic examples of using sklearn either read from a prepared sklearn dataset into memory, or read from a file such as a .csv into memory and then process it. Can someone kindly provide an equivalent example of reading database data into sklearn for processing, preferably MongoDB, but I will take what I can get at this point. I have been struggling with this for a little while now.
I can post what I have done so far, but I don't think it will help. Thanks for any help/advice/pointers.
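Not an authoritative answer, but one common pattern is to query MongoDB with pymongo, load the results into a pandas DataFrame, and hand the columns to scikit-learn. The sketch below assumes hypothetical field names (feature_a, feature_b, label) and a local MongoDB instance.

# Sketch: MongoDB -> pandas DataFrame -> scikit-learn.
# Database, collection, and field names are assumptions for illustration.
import pandas as pd
from pymongo import MongoClient
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["samples"]

# Project only the fields you need; exclude Mongo's _id
cursor = collection.find({}, {"_id": 0, "feature_a": 1, "feature_b": 1, "label": 1})
df = pd.DataFrame(list(cursor))

X = df[["feature_a", "feature_b"]].values
y = df["label"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))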