Data mining on unstructured text - MongoDB

I am currently working on an academic project and I want to use data mining techniques for market segmentation.
I want to store text information (which is supposed to be a large amount of text), like tweets, news feeds, etc. - so they are different sources of data (they have different structures).
There are 2 questions:
What is the best way to get all these news articles, posts, etc., so that I can finally have enough text data to process it and draw good conclusions from it? Or what other kinds of unstructured data could I use?
Where should I store all the unstructured text, in order to access it later and apply all these text mining techniques? What about MongoDB?
Thank you so much!

Take a look at the following:
Apache Lucene
Apache Solr
Elasticsearch
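
If you go the search-engine route, here is a minimal sketch of pushing raw text (tweets, articles) into Elasticsearch with the official Python client; the local endpoint, index name, and field names are only placeholders, and the calls assume the 8.x client API:

    from elasticsearch import Elasticsearch

    # Assumes a local Elasticsearch node and the 8.x Python client.
    es = Elasticsearch("http://localhost:9200")

    # Store each piece of unstructured text with whatever metadata its source provides.
    es.index(index="raw_text", document={
        "source": "twitter",            # or "news", "rss", ...
        "text": "Just tried the new espresso blend, amazing!",
        "collected_at": "2024-01-15T10:30:00",
    })

    # Full-text search later, when you start mining the corpus.
    hits = es.search(index="raw_text", query={"match": {"text": "espresso"}})
    for hit in hits["hits"]["hits"]:
        print(hit["_source"]["text"])

Because every document is just JSON, sources with different structures (tweets vs. news items) can live side by side in the same index.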

Related

Postgres or MongoDB

I have to make a website. I have the choice between Postgres and MongoDB.
Case 1: Postgres
Each page has its own table, and each table has only one row, for that one page (no page is structured like another).
I have a timeline page with media (albums and videos).
So I have multiple media items (pictures and videos), and I also display them on a picture-albums page and a videos page.
Therefore I have a medias table, linked to an albums table (many-to-many), with a type column to determine whether an item is a picture or a video.
Case 2: MongoDB
I'm completely new to NoSQL and I don't know how to store the data.
Problems that I see
Only one row per table - that bothers me.
In the medias table, I can end up with an album that contains videos, which I'd like to avoid. But if I split this table into a pictures table and a videos table, how can I make a single call to get all the media for the timeline page?
That's why I think it's better for me to build the website with MongoDB.
What is the best solution, Postgres or MongoDB? What do I need to know if it's MongoDB? Or maybe I am missing something about Postgres.
It will depend on time: if you don't have time to learn another technology, the answer is to go straight ahead with the one you know and solve the issues with it.
If scalability is more important, then you'll have to take a deeper look at your architecture and know very well how to scale PostgreSQL.
PostgreSQL can handle JSON columns for unstructured data; I use it and it's great. I would have a single table with the unstructured data in a column named page_structure, so you'd have one single big indexed table instead of a lot of one-row tables.
It's relatively easy to query just what you want, so there is no need for separate tables for images and videos. To be more specific, you'd need to provide some schema.
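A rough sketch of that single-table idea, assuming PostgreSQL with a jsonb column and the psycopg2 driver; the table, column, and connection details are placeholders:

    import psycopg2
    from psycopg2.extras import Json

    # Connection string is a placeholder; adjust to your environment.
    conn = psycopg2.connect("dbname=mysite user=app password=secret")
    cur = conn.cursor()

    # One big indexed table instead of one table per page.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id   serial PRIMARY KEY,
            slug text UNIQUE NOT NULL,
            page_structure jsonb NOT NULL
        );
        CREATE INDEX IF NOT EXISTS pages_structure_idx
            ON pages USING gin (page_structure);
    """)

    # Each page keeps its own, differently shaped structure in the jsonb column.
    cur.execute(
        "INSERT INTO pages (slug, page_structure) VALUES (%s, %s)",
        ("timeline", Json({"medias": [{"type": "picture", "url": "a.jpg"},
                                      {"type": "video", "url": "b.mp4"}]})),
    )

    # Pull only the videos of a page without needing a separate videos table.
    cur.execute("""
        SELECT m FROM pages, jsonb_array_elements(page_structure->'medias') AS m
        WHERE slug = %s AND m->>'type' = 'video'
    """, ("timeline",))
    print(cur.fetchall())
    conn.commit()

The GIN index lets you filter on keys inside page_structure, so the single big table stays queryable even though every page stores a different shape.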
I think you are coming to the right conclusion in considering a NoSQL database, because you are not sure about the columns in a table for a page, and that's the reason you are creating different tables for different pages. I would still advise keeping the columns somewhat consistent across records. Anyway, by using MongoDB you can have different records (called documents in MongoDB) with different columns, based on the attributes of your page, within a single collection (the equivalent of a table in SQL). You can have separate pictures and videos collections if you want and wire them to your pages collection using something like a foreign key, such as page_id. Or you can query the pages collection to get all the attributes, including an array of the IDs of all videos or pictures, with which you can retrieve the corresponding videos and pictures of a particular page, as illustrated below:
Collections
Pages [{id, name, ..., [video1_id, video2_id, ...], [pic1_id, pic2_id, pic78_id, ...]}, ...]
Videos [{video1_id, content,... }, {video2_id, content,...}]
Pictures [{pic1_id, content,... }, {pic2_id, content,...}]
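A quick pymongo sketch of that layout (collection and field names are made up), just to show separate collections wired together by IDs:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["mysite"]  # placeholder connection / db name

    video_id = db.videos.insert_one({"title": "Holiday clip", "url": "v1.mp4"}).inserted_id
    pic_id = db.pictures.insert_one({"title": "Sunset", "url": "p1.jpg"}).inserted_id

    # The page document only stores references, like foreign keys.
    db.pages.insert_one({
        "name": "timeline",
        "video_ids": [video_id],
        "picture_ids": [pic_id],
    })

    # One call for the page, then fetch its media by ID.
    page = db.pages.find_one({"name": "timeline"})
    videos = list(db.videos.find({"_id": {"$in": page["video_ids"]}}))
    pictures = list(db.pictures.find({"_id": {"$in": page["picture_ids"]}}))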
I suggest you use the Clean Code architecture. Personally, I believe that you MUST keep your application logic and your data access functions apart so they can work separately. Your code must not rely on your database. I would rather code in such a way that I could migrate my data to any database I like and it would still work.
Think about when your project gets big and you want to try caching to solve a problem: if your data access functions are not separated from your business logic, you cannot easily achieve that.
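As a sketch of that separation (all names here are hypothetical): the business logic depends on a small repository interface, while the database-specific code, and a later caching layer, live behind it, so swapping Postgres for MongoDB or adding a cache does not touch the business logic:

    from abc import ABC, abstractmethod

    class PageRepository(ABC):
        """The business logic depends only on this interface, never on a concrete database."""

        @abstractmethod
        def get_page(self, slug: str) -> dict: ...

        @abstractmethod
        def save_page(self, slug: str, data: dict) -> None: ...

    class MongoPageRepository(PageRepository):
        def __init__(self, collection):
            self._collection = collection  # a pymongo collection

        def get_page(self, slug: str) -> dict:
            return self._collection.find_one({"slug": slug})

        def save_page(self, slug: str, data: dict) -> None:
            self._collection.update_one({"slug": slug}, {"$set": data}, upsert=True)

    class CachedPageRepository(PageRepository):
        """Caching added later without touching the business logic."""

        def __init__(self, inner: PageRepository):
            self._inner = inner
            self._cache: dict[str, dict] = {}

        def get_page(self, slug: str) -> dict:
            if slug not in self._cache:
                self._cache[slug] = self._inner.get_page(slug)
            return self._cache[slug]

        def save_page(self, slug: str, data: dict) -> None:
            self._cache.pop(slug, None)
            self._inner.save_page(slug, data)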
I agree with #espino316 about using the technology you are already familiar with,
and also with #Actung that you should consider learning a database like MongoDB, but in some training projects first, because there are many projects where the best way to go is to use NoSQL.
Just consider that you might find out about this 2 years AFTER you deployed your website, or the other way around: you go for MongoDB and you realize the best way to go was Postgres or, I don't know, MySQL, etc.
I think the best way to go is to make the migration easy for yourself.
all the best <3

How to data model a live web app from SQL Server to ElasticSearch?

In our web application we use a denormalized data mart in SQL Server for geo-based user project content.
Users have 1..* projects and 1..* geo areas. Content is stored (in the data mart) with UserID, ProjectID, text values for geo areas, title and description (both free-text search indexed):
UserID, ProjectID, Geo, Title, Description, Timestamp
Now wanting to move this over to ElasticSearch, what would be a good data modeling approach?
Simply for the data mart, I was thinking of just serializing the data object (currently using .NET and Entity Framework) to get the JSON representation and stuffing that into ES. Is this a good approach (it also requires the least rework)?
With regards to modeling the entire application, I have seen examples where an ES type would be organized by, say Users, so the model may look something like this:
User
    UserID, Name, etc...
    ProfileSettings
        Setting1, Setting2, etc...
    Geographies
        GeoID, GeoName
    Projects
        ProjectID, ProjectName
        ProjectContent
            Key (UserID:ProjectID:ProjectContentID), GeoName, Title, Description, Timestamp
So this looks like the whole web application could run off of one index/type. A bit scary, no?
I would like to use Kibana and other analysis tools in the future, and have read about data modeling limitations like not using parent/child types.
What would a good ElasticSearch data model look like for something like this?
Another way of asking would be, how would one model a live web application using ElasticSearch, and/or would it be better to store user configs and profiles in a separate RDBMS?
Thank you.
These questions are always difficult to answer without understanding the business and the reporting requirements. But here are a couple guidelines I learned from my admittedly brief experience with ES:
1) You don't have to put it all in one index, so separate indexes for "user" and "project" may work best. Since ES indexes all fields by default, searching a project index by user will be fast. Kibana can search multiple indexes.
2) The prevailing wisdom at the time was to keep the indexes as flat as possible, so same thing applies to having a separate index for user profile settings.
3) It may be advantageous to create a mapping, in addition to serializing and stuffing.
Regarding user configs and profiles, I don't see any compelling reason to use an RDBMS. They'll be keyed by user ID with no join requirements, and will not require the ACID consistency and concurrency model. A NoSQL solution will give you the schema flexibility those use cases demand.
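For point 3, a rough sketch of what an explicit mapping for a flat project-content index could look like, assuming the elasticsearch-py 8.x client; the index and field names simply mirror the columns in the question and are not prescriptive:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # A flat index for project content, kept separate from a "users" index.
    es.indices.create(
        index="project_content",
        mappings={
            "properties": {
                "user_id":     {"type": "keyword"},
                "project_id":  {"type": "keyword"},
                "geo":         {"type": "keyword"},   # exact matches / aggregations
                "title":       {"type": "text"},      # free-text search
                "description": {"type": "text"},
                "timestamp":   {"type": "date"},
            }
        },
    )

    # The serialized entity (e.g. JSON produced from Entity Framework) is indexed as-is.
    es.index(index="project_content", document={
        "user_id": "u42", "project_id": "p7", "geo": "Oslo",
        "title": "Harbour redevelopment", "description": "Waterfront housing study",
        "timestamp": "2015-06-01T12:00:00",
    })

Defining the mapping up front keeps keyword fields (IDs, geo) aggregatable in Kibana while title and description stay analyzed for free-text search.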

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx. 200k) that have a title and an abstract. There is other metadata available for each document, for example category (only one of cooking, health, exercise, etc.) and genre (only one of humour, action, anger, etc.). The metadata is well structured and all of it is available in a MySQL DB.
I need to show our users related documents while they are reading one of these documents on our site. I need to let the product managers set weights for title, abstract and metadata so they can experiment with this service.
I am planning to run clustering on top of this data, but I am hampered by the fact that the Mahout clustering examples all use either DenseVectors built from numbers, or Lucene-based text vectorization.
The examples are either numeric data only or text data only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I can do this from first principles: extract all titles and abstracts into a DB, calculate TF-IDF & LLR, treat each word as a dimension and go about this experiment with a lot of code writing. That seems like a long way to the solution.
That, in a nutshell, is where I am stuck: am I doomed to first principles, or does there exist a tool / methodology that I have somehow missed? I would love to hear from folks out there who have solved a similar problem.
Thanks in advance
You have a text similarity problem here and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done; then feed it into whatever clusterer you want. Term extraction is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
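Mahout examples are Java, but the idea is the same in any stack: vectorize the text, one-hot encode the metadata, scale each block by the product managers' weights, concatenate, and cluster. A small scikit-learn sketch with made-up data and weights:

    from scipy.sparse import hstack
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import OneHotEncoder

    # Placeholder data: titles, abstracts and [category, genre] pairs per document.
    titles = ["Quick pasta dinners", "Stretching for runners"]
    abstracts = ["Weeknight recipes for busy cooks", "A short routine before every run"]
    metadata = [["cooking", "humour"], ["exercise", "action"]]

    # Weights the product managers can tune.
    w_title, w_abstract, w_meta = 2.0, 1.0, 0.5

    title_vecs = TfidfVectorizer().fit_transform(titles)
    abstract_vecs = TfidfVectorizer().fit_transform(abstracts)
    meta_vecs = OneHotEncoder().fit_transform(metadata)

    # One sparse matrix: weighted text dimensions plus weighted metadata dimensions.
    features = hstack([w_title * title_vecs,
                       w_abstract * abstract_vecs,
                       w_meta * meta_vecs]).tocsr()

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print(labels)

The same concatenation of weighted blocks works with any vector-based clusterer, Mahout included, once the combined vectors are written out.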
I'm actually working on something similar, but without the need to distinguish between numeric and text fields.
I have decided to go with the semanticvectors package, which handles all the TF-IDF work, the semantic space vector building, and the similarity search. It uses a Lucene index.
Please note that you can also use the S-Space package if semanticvectors doesn't suit you (if you go down that road, of course).
The only caveat I'm facing with this approach is that the indexing part can't be incremental: I have to re-index everything every time a new document is added or an old document is modified. People using semanticvectors say they have very good indexing times, but I don't know how large their corpora are. I'm going to test these issues with the Wikipedia dump to see how fast it can be.

Log viewing utility database choice

I will be implementing a log viewing utility soon, but I am stuck on the DB choice. My requirements are as follows:
Store 5 GB data daily
Total size of 5 TB data
Search in this log data in less than 10 sec
I know that PostgreSQL will work if I partition the tables. But will I be able to get the performance described above? As I understand it, NoSQL is a better choice for storing logs, since logs are not very structured. I saw an example like the one below using Hadoop + HBase + Lucene and it seems promising:
http://blog.mgm-tp.com/2010/03/hadoop-log-management-part1/
But before deciding I wanted to ask whether anybody has made a choice like this before and could give me an idea. Which DBMS will fit this task best?
My logs are very structured :)
I would say you don't need a database, you need a search engine:
Solr - based on Lucene; it packages everything you need together
ElasticSearch - another Lucene-based search engine
Sphinx - the nice thing is that you can use multiple sources per search index, so you can enrich your raw logs with other events
Scribe - Facebook's way to search and collect logs
Update for #JustBob:
Most of the mentioned solutions can work with flat files without affecting performance. All of them need an inverted index, which is the hardest part to build and maintain. You can update the index in batch mode or online. The index can be stored in an RDBMS, NoSQL, or a custom "flat file" storage format (custom meaning maintained by the search engine application).
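For illustration, a rough elasticsearch-py sketch (the endpoint, index naming scheme, and fields are placeholders) of bulk-loading a day's parsed log lines into a per-day index and then running a combined full-text and time-range query:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # Placeholder for whatever your log parser produces.
    parsed_log_lines = [
        {"ts": "2013-05-20T09:15:02", "level": "ERROR", "message": "upstream timeout on /api/orders"},
        {"ts": "2013-05-20T09:15:07", "level": "INFO",  "message": "request completed in 120 ms"},
    ]

    # Bulk-load one day's worth of log lines into a per-day index.
    actions = (
        {"_index": "logs-2013-05-20", "_source": line}
        for line in parsed_log_lines
    )
    helpers.bulk(es, actions)

    # Search across daily indexes: full text plus a time-range filter.
    result = es.search(
        index="logs-*",
        query={
            "bool": {
                "must": [{"match": {"message": "timeout"}}],
                "filter": [{"range": {"ts": {"gte": "2013-05-18T00:00:00"}}}],
            }
        },
    )
    print(result["hits"]["total"]["value"])

Per-day indexes also make the 5 TB retention limit easy to enforce: dropping an old day is just deleting its index.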
You can find a lot of information here:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
See which fits your needs.
Anyway, for such a task NoSQL is the right choice.
You should also consider the learning curve: MongoDB / CouchDB, even though they don't perform as well as Cassandra or Hadoop, are easier to learn.
MongoDB is used by Craigslist to store old archives: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

Where can NoSQL be successfully implemented?

I took the time to watch the entire Hadi Hariri presentation, CouchDB for .NET Developers, that took place at the OreDev conference last year.
And I keep asking myself, where should I use such a way of storing data?
What small, medium and large examples can be given using a NoSQL model?
In what application context would I save the data in JSON, data that does not follow a pattern? In what application context would retrieving such data be better and faster (over the application's lifetime) compared to getting it from a SQL server? Licensing price? Is that the only one?
Let me share our case: we use a NoSQL system of the document type to store and search our documents in full text. This requires full-text indexing. We also do a facet search on the entire data set; that is, we produce only "hit" counts for a specific search, broken down by the categories we need. You can imagine an electronic shop selling photo cameras, where the facet search can be over price ranges. Thus you would be able to say which types of cameras fall into which price range.
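To make the camera example concrete, here is roughly what such a facet query could look like; this sketch uses the Elasticsearch Python client with invented index and field names purely for illustration (Solr has an equivalent facet API):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # "How many cameras match 'mirrorless', broken down by price range?"
    response = es.search(
        index="cameras",           # invented index name
        size=0,                    # we only want the facet counts, not the documents
        query={"match": {"description": "mirrorless"}},
        aggs={
            "price_ranges": {
                "range": {
                    "field": "price",
                    "ranges": [
                        {"to": 300},
                        {"from": 300, "to": 800},
                        {"from": 800},
                    ],
                }
            }
        },
    )
    for bucket in response["aggregations"]["price_ranges"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])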
If you think about using a NoSQL system for document search, then a small dataset would be on the order of GBs (let's say up to 10), a medium one up to 100 GB, and a large dataset up to 1 TB. This is based on what I have seen people use Apache Solr for (from their mailing list) and on the data volume we have in our company.
There are other types of NoSQL systems and associated use / business cases, where you can utilize them in conjunction with SQL systems or on their own. You can have a look at this short presentation I made for an introductory talk on NoSQL systems: http://www.slideshare.net/dmitrykan/nosql-apache-solr-and-apache-hadoop