How to make Weka API work with MongoDB?

I'm looking to use WEKA to train on and predict from data in MongoDB. Specifically, I intend to use the WEKA API to analyse data (e.g. build a recommendation engine). But I have no idea how to proceed, because the data in MongoDB is stored in the BSON format, while WEKA uses the ARFF format. I would like to use the WEKA API to read data from MongoDB, analyse it, and provide recommendations to the user in real time. I cannot find a bridge between WEKA and MongoDB.
Is this even possible or should I try another approach?

Before I begin, I should say that WEKA isn't the best tool for working with Big Data. If you really have Big Data, you will likely want to use Spark and the Hadoop family as they are more suited to analysis.
To answer your question as written, I would advise doing the training manually (i.e. creating a training file using whatever programmatic tools are available to you) and pretraining a model. That model can then be saved and loaded into your program as needed.
For testing, you can follow the official instructions, but I usually take a bit of a shortcut: I usually preprocess my data into a CSV-like format (as if it was going into an ARFF file) and just prepend a valid ARFF header (the same one as your training file uses). From there, it is very easy to test the instances. In my experience, this greatly simplifies the process of writing code that actually makes novel predictions.
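To make that shortcut concrete, here is a minimal sketch in plain Java. The rows stand in for documents fetched from MongoDB (the driver's `find()` calls are omitted as beside the point), and the relation/attribute names are invented for illustration - use the header from your own training file.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ArffBridge {
    // The same header your training .arff file uses; the names here are invented.
    static final String HEADER = String.join("\n",
        "@relation user_item_ratings",
        "@attribute userId numeric",
        "@attribute itemId numeric",
        "@attribute rating numeric",
        "@data");

    // Each row would come from a MongoDB document (e.g. via find() in the Java driver).
    static String toArff(List<String[]> rows) {
        String data = rows.stream()
            .map(r -> String.join(",", r))
            .collect(Collectors.joining("\n"));
        return HEADER + "\n" + data + "\n";
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"1", "42", "3.5"},
            new String[]{"2", "17", "5.0"});
        System.out.print(toArff(rows));
    }
}
```

The resulting string can then be wrapped in a `StringReader` and handed to `weka.core.Instances`, so no intermediate file is needed.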

Related

Design of real time sentiment analysis

We're trying to design a real time sentiment analysis system (on paper) for a school project. We got some (very vague) negative feedback on how we store our data, but it isn't fully clear why this would be a bad idea or how we'd best improve this.
The setup of the system is as follows:
Data from real-time news RSS feeds is gathered in a Kafka messaging queue, which connects to our preprocessing platform. This preprocessing step transforms all the news articles into semi-structured data on which we can run sentiment analysis.
We then want to store both the sentiment analysis and the preprocessed, semi-structured news article for reference.
We were thinking of using MongoDB as the database, rather than Cassandra (which would be faster), since MongoDB gives you a lot of freedom in defining different fields in the value (of the key:value pair you store).
The basic use case is for people to look up an institution and get the sentiment analysis of a bunch of news articles in a certain timeframe.
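For concreteness, the stored record and the basic lookup could look roughly like this. The field names are invented, and the filtering is done in plain Java for illustration - in MongoDB you would express it as a query with an index on institution and timestamp rather than scanning in application code.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class SentimentStore {
    // Shape of one stored document: the preprocessed article plus its sentiment score.
    record Article(String institution, Instant publishedAt, String text, double sentiment) {}

    // The basic use case: all sentiment scores for an institution within a timeframe
    // (inclusive start, exclusive end).
    static List<Double> sentimentFor(List<Article> db, String institution,
                                     Instant from, Instant to) {
        return db.stream()
            .filter(a -> a.institution().equals(institution))
            .filter(a -> !a.publishedAt().isBefore(from) && a.publishedAt().isBefore(to))
            .map(Article::sentiment)
            .collect(Collectors.toList());
    }
}
```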
As a possible improvement: do we need to use a NoSQL database, or would it make sense to use a SQL database? I think our system could benefit from being denormalized (as is the default in NoSQL), and we wouldn't need any join operations, which are what SQL systems do significantly faster.
Does anyone know of existing systems that do similar things, for comparison?
Any input would be highly appreciated.

How to show data from the RDF archive in scala flink

I am looking for a way to load and print data from the .n3 files inside a .tar.gz archive in Scala. Or should I extract it first?
If you want to download the file, it is located at
http://wiki.knoesis.org/index.php/LinkedSensorData
Could anyone describe how I can print the data from this archive to the screen using Scala?
The files that you are dealing with are large. I therefore suggest you import them into an RDF store of some sort rather than try to parse them yourself. You can use GraphDB, Blazegraph, Virtuoso, and the list goes on; a search for RDF stores will give many other options. Then you can use SPARQL (which is to RDF stores what SQL is to relational databases) to query the store.
As for a Scala library that can access RDF data, you can see this related SO question, though it does not look promising. I would suggest you look at Apache Jena, a Java library.
You may also want to look at the DBPedia Extraction Framework where they extract data from Wikipedia and store it as RDF data using Scala. It is certainly not exactly what you are trying to do, but it could give you insight into the tools they used for generating RDF from Scala and related issues.
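If you just want to peek at the data before committing to an RDF store, the JVM standard library is enough to stream a gzipped .n3 file (a .tar.gz additionally needs a tar reader such as Apache Commons Compress, not shown here). This is Java, but it is callable directly from Scala; the naive whitespace split is only good for eyeballing simple triples, and real parsing should go through Jena.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class N3Peek {
    // Crude split of a simple N-Triples-style line into subject/predicate/object.
    // Only for a first look; a real parser (Jena) handles literals, prefixes, etc.
    static String[] splitTriple(String line) {
        String trimmed = line.trim();
        if (trimmed.endsWith(".")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return trimmed.split("\\s+", 3);
    }

    public static void main(String[] args) throws Exception {
        // Print the first few lines of a gzipped N3 file given on the command line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Path.of(args[0]))),
                StandardCharsets.UTF_8))) {
            in.lines().limit(10).forEach(System.out::println);
        }
    }
}
```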

How to build Scala report projects

Is there a common standard to follow for building a Scala-based report engine from scratch? Data will be sourced from HDFS, then filtered, formatted and emailed. Please share any experience or hurdles to expect.
I used to produce such reports as PDF, HTML and XLSX.
We used ElasticSearch, but here was the general workflow:
get the filtered data from storage into Scala (no real trouble, just make sure your filters are well tested)
fill the holes to get consistent data: think about missing points, awkward timezones...
format it (we used an XSLT processor to produce the email HTML; email HTML is quite idiosyncratic, and size for emails is limited, so aim for ~15 MB as an absolute maximum)
if the file is too big, store it somewhere and send a link instead
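Of these steps, "fill the holes" is where the subtle bugs tend to live. A minimal sketch, assuming hourly buckets keyed by epoch-hour (which sidesteps timezone arithmetic) and a carry-forward policy for missing points - both choices you would adapt:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ReportGaps {
    // Fill missing hourly buckets between the first and last key by carrying
    // the last seen value forward, so downstream formatting sees no holes.
    static List<Double> fillForward(NavigableMap<Long, Double> byHour) {
        List<Double> out = new ArrayList<>();
        if (byHour.isEmpty()) return out;
        double last = byHour.firstEntry().getValue();
        for (long h = byHour.firstKey(); h <= byHour.lastKey(); h++) {
            last = byHour.getOrDefault(h, last);
            out.add(last);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, Double> byHour = new TreeMap<>();
        byHour.put(0L, 1.0);
        byHour.put(3L, 4.0);
        System.out.println(fillForward(byHour)); // hours 1 and 2 are filled
    }
}
```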

DataStage capabilities

I'm a Linux programmer.
I used to write code in order to get things done: Java, Perl, PHP, C.
Now I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data, processing it line by line.
I want to know if DataStage can work on files that are not table/CSV-like. Can it load
data into data structures and run functions on them, or is it limited to working
on only one line at a time?
Thank you for any information you can give on the capabilities of DataStage.
IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language - BASIC - that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (either using the DSExecute function, Before Job property, After Job property or the Command stage).
Please check the IBM Information Center for a comprehensive documentation on BASIC Programming.
You could also check the DSXchange forums for DataStage specific topics.
Yes it can. As Razvan said, you can join, aggregate and split. It can use loops and external scripts, and it can also handle XML.
My advice: if you have large quantities of data to work on, DataStage is your friend; if the data you're going to load is not very big, it will be easier to use Java, C, or any programming language that you know.
You can use all kinds of functions and conversions to manipulate the data. Mainly, DataStage is valued for its ease of use when you are handling huge volumes of data from a data mart/data warehouse.
The main process in DataStage is ETL: Extraction, Transformation, Loading.
Where a programmer might use 100 lines of code to connect to some database, here it can be done with one click.
Almost anything can be done here, even C or C++ code in a routine activity.
If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as are produced by COBOL, the answer is yes.
All using in-built functionality (e.g. Hierarchical Data stage, Complex Flat File stage). Review the DataStage palette to find other examples.

What are the real-time compute solutions that can take raw semistructured data as input?

Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions that take big semi-structured input and produce queryable output, but they are usually
unique
expensive
fairly secretive
If you can avoid heavy computations such as neural networks or expert systems, you will be close to a low-latency system. All you need is a team of brilliant mathematicians to model your problem, a team of programmers to implement it in code, and some cash to buy servers and provision the input/output channels they need.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs, and Splunk does an excellent job of indexing this information to allow fast querying of any string that appears in the data.