What is the best data structure to store road network data - matlab

I am working on a map-matching/trajectory-matching project. After reading a number of research papers, I am still unsure which data structure is the most efficient for storing the road network (described by a weighted directed graph) so that it can be searched quickly in real time. I keep coming across grids, M-trees, quadtrees and so on; do I need a database in the backend for these? I am working in MATLAB at the moment but can switch languages. Which programming languages are used in actual satellite navigators?
Help will be much appreciated.

There are many index structures, such as k-d trees and R-trees, that can be built on the data you store in a database. You can store your data in SQL Server and use these indexes. You might also want to take a look at the SQL Server Spatial tools, which help you perform spatial queries on your data.
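To make the idea of a spatial index concrete, here is a minimal in-memory sketch of the simplest structure mentioned in the question, a uniform grid. It is only an illustration under assumptions: the names `GridIndex`, `point_segment_distance` and `nearest_segment` are hypothetical, coordinates are assumed to be planar (already projected), and edge weights and direction are left out.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid: each cell lists the road segments whose bounding box touches it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (cx, cy) -> set of segment ids
        self.segments = {}              # segment id -> (p, q) end points

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, seg_id, p, q):
        """Register a segment in every cell covered by its bounding box."""
        self.segments[seg_id] = (p, q)
        cx0, cy0 = self._cell(min(p[0], q[0]), min(p[1], q[1]))
        cx1, cy1 = self._cell(max(p[0], q[0]), max(p[1], q[1]))
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self.cells[(cx, cy)].add(seg_id)

    def candidates(self, x, y, radius):
        """Ids of segments registered in cells within `radius` of (x, y)."""
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        found = set()
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                found |= self.cells.get((cx, cy), set())
        return found


def point_segment_distance(pt, p, q):
    """Euclidean distance from point pt to the segment p-q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        t = ((pt[0] - p[0]) * dx + (pt[1] - p[1]) * dy) / length_sq
        t = max(0.0, min(1.0, t))
    return math.hypot(pt[0] - (p[0] + t * dx), pt[1] - (p[1] + t * dy))


def nearest_segment(index, pt, radius):
    """Closest segment within `radius` of pt as (id, distance), or None."""
    best = None
    for seg_id in index.candidates(pt[0], pt[1], radius):
        p, q = index.segments[seg_id]
        d = point_segment_distance(pt, p, q)
        if d <= radius and (best is None or d < best[1]):
            best = (seg_id, d)
    return best


index = GridIndex(cell_size=10.0)
index.insert("a", (0.0, 0.0), (100.0, 0.0))
index.insert("b", (0.0, 50.0), (100.0, 50.0))
match = nearest_segment(index, (40.0, 3.0), radius=20.0)
```

The grid only narrows down which segments need an exact distance check for a GPS fix; a database with a spatial index does the same job at larger scale and with persistence.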

Related

Design of real time sentiment analysis

We're trying to design a real time sentiment analysis system (on paper) for a school project. We got some (very vague) negative feedback on how we store our data, but it isn't fully clear why this would be a bad idea or how we'd best improve this.
The setup of the system is as follows:
Data from real-time news RSS feeds is gathered in a Kafka message queue, which connects to our preprocessing platform. This preprocessing step would transform all the news articles into semi-structured data, on which we can do a sentiment analysis.
We then want to store both the sentiment analysis and the preprocessed, semi-structured news article for reference.
We were thinking of using MongoDB as the database, since it gives you a lot of freedom in defining different fields in the value (of the key:value pair you store), rather than Cassandra (which would be faster).
The basic use case is for people to look up an institution and get the sentiment analysis of a bunch of news articles in a certain timeframe.
As a possible improvement: do we need a NoSQL database, or would it make sense to use a SQL database? I think our system could benefit from being denormalized (as is the case by default in NoSQL), and we wouldn't need operations such as joins, which are significantly faster in SQL systems.
Does anyone know of existing systems that do similar things, for comparison?
Any input would be highly appreciated.

NoSQL for time series/logged instrument reading data that is also versioned

My Data
It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like the data you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As for your concerns about the community, I'm not sure what is giving you that impression, but there is quite a large community (see IRC, mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly you could solve this multiple ways. You could start writing a new row when the version changes. Alternatively you could use composite columns to store the version along with the timestamp/value pair.
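The second option, storing the version alongside the timestamp, can be sketched in plain Python. This is not Cassandra code and does not use any Cassandra API; the class `VersionedSeries` and its methods are hypothetical and only mimic how a composite (version, timestamp) key keeps each version's readings contiguous and ordered.

```python
import bisect


class VersionedSeries:
    """Readings for one metric, kept sorted by the composite key (version, timestamp)."""

    def __init__(self):
        self._keys = []     # sorted list of (version, timestamp)
        self._values = []   # values aligned with _keys

    def append(self, version, timestamp, value):
        """Insert a reading at its sorted position."""
        key = (version, timestamp)
        i = bisect.bisect_right(self._keys, key)
        self._keys.insert(i, key)
        self._values.insert(i, value)

    def latest_version(self):
        """Highest version seen so far, or None if the series is empty."""
        return self._keys[-1][0] if self._keys else None

    def read(self, version, start=float("-inf"), end=float("inf")):
        """All (timestamp, value) pairs of one version with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._keys, (version, start))
        hi = bisect.bisect_right(self._keys, (version, end))
        return [(k[1], v) for k, v in zip(self._keys[lo:hi], self._values[lo:hi])]


series = VersionedSeries()
series.append(1, 100, 0.5)
series.append(1, 200, 0.6)
series.append(2, 300, 6.0)   # the calculation changed here, so the version is bumped
series.append(2, 400, 6.5)

current = series.read(series.latest_version())
old_slice = series.read(1, start=150, end=250)
```

Because the version is the leading part of the key, "only data from the most recent version" becomes a single contiguous range read rather than a filter over every row.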
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo. Every server in the cluster is the same, and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers: for example, setting up HDFS, ZooKeeper, and HBase/Accumulo-specific server types.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience in Cassandra and MongoDB but my experience might add something.
So you're basically doing time-based metrics?
OK, if I understand right, you use the timestamp as a versioning mechanism, so that you query by a certain timestamp; say, to get the latest calculation used, you look up by the metric ID or whatever, sort by ts DESC and take the first row?
It sounds like a versioned key-value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical, too based around how you query, to the point where you can only make one pivot of graph data (I presume you would want to graph these metrics) from the column family, which is crazy, hence why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.
MongoDB: well, I love MongoDB and I am an elite of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is not set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good at a tech that you don't like, so stay away from it.
Though I would picture this happening in Mongo much like:
{
_id: ObjectId(),
metricId: 'AvailableMessagesInQueue',
formula: '4+5/10.01',
result: NaN,
ts: ISODate()
}
And you query for the latest version of your calculation by:
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
var latest = results.next();
Which would output the doc structure you see above. Without knowing more about exactly how you wish to query, the general server and app scenario, etc., that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
Which might be of interest; it seems to support the argument that HBase is a good time-based key-value store.
I have not personally used HBase, so do not take anything I say about it seriously.
I hope I have added something; if not, you could try narrowing your criteria so we can answer more dedicated questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM redbooks, but we're trying. API, configuration, and administration is documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground-up so aggregation, filtering and non-parametric ARIMA and HW forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in SE and EE editions. Versions keep track of status, change-time and source changes for the same timestamp for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on CE edition or you need to extend default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.

MongoDB for Forex

I was wondering, can MongoDB be used for storing Forex data which would later be presented in client applications as real-time data, with analysis in the form of graphs? I will have different sources with different feeds which cannot be found from mainstream data providers.
Look at these papers coming from the MongoSF convention. Particularly about the analytics. Be aware that the data storage is only one aspect of - in this case - a very complex system design.
MongoDB can be used to store Forex data, the same as all databases (that I can think of) can. I think the big question is: what do you want to get out of your data storage?
If you are after high performance, then NoSQL is certainly a good direction to go in, as such systems typically provide better speeds on large datasets when the table relationships get complex.
To be honest though, regardless of feeds, Forex data can typically be stored with a DateTime/High/Low/Open/Close/Currency/Interval, right? I use SQL Server to do something very similar to what you described, and accessing the stored data is NOT the performance bottleneck. When you start trying to translate the data into graphs, add indicators, etc., that's when the good design decisions pay off.
MongoDB can be used to store Forex data just like other database systems. However, if you are after high performance, NoSQL is a better option because it provides better speeds on large data sets.
A little information on MongoDB and use with financial markets:
https://www.mongodb.com/blog/post/mongodb-single-platform-all-financial-data-ahl
Arctic is a great open source datastore solution which uses MongoDB and Python:
http://www.slideshare.net/JamesBlackburn1/2015-pydata-highperformance-iot-and-financial-data-storage-with-python-and-mongodb

Should I use Core Data or SQLite to represent a graph structure with weighted edges?

I am building an iPhone application that will be using a weighted graph (probably around 1000 nodes and 500-700 relationships, ball park estimate, could be quite wrong).
Core Data would be wonderful since it's easy to work with and is optimized for the iPhone. At the same time, the graph nodes need to be weighted. I could add a level of indirection in Core Data (edit/clarification: I would represent the weighted edges as relationships), but I feel that might defeat the point of the increase in performance.
SQLite would enable me to do it without any extra indirection; the problem then becomes performance.
Another option is to use Core Data to store everything else and store only the weights in SQLite tables, building custom Core Data objects that create the necessary links in SQLite whenever they are fetched or created. I would then use SQLite alone for the calculation (since the calculation only traverses the graph and looks at weights, I need nothing but the final result/id of the object) and afterwards look up the actual object in Core Data.
Also, there will be a server that takes all of the users' graphs into consideration as part of a much larger graph. So it might be easier to use SQLite straight up, since replicating the information on the beastly server would be easier :-)
Thanks for the help.
At first glance, Core Data is wonderful for your use case because it's an object graph management framework (with persistence). Its lazy loading and caching features allow you to load a partial graph in memory, which is usually desirable if you have a large graph.
Core Data doesn't seem to support weighted relationships out of the box (I'm not a Core Data expert at all). If "a level of indirection" means Core Data objects representing relationships, I agree it will be a performance hit.
For performance and portability considerations, straight SQLite may be a better choice because you have complete control over the schemas and indices.
I vote for your idea of "store only the weights in sqlite". In my understanding, this allows you to take advantage of Core Data features while keeping performance-critical tasks efficient.
The author of NetNewsWire has written a blog article explaining why he switched away from Core Data (mostly for performance and flexibility reasons), which may be helpful for you. http://inessential.com/2010/02/26/on_switching_away_from_core_data
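As a rough illustration of what "complete control over the schemas and indices" could look like for a weighted graph, here is a sketch using Python's built-in `sqlite3` module rather than the iPhone SDK. The table names `node` and `edge` and the helper `neighbours` are assumptions made for this example, not anything taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE edge (
    src    INTEGER NOT NULL REFERENCES node(id),
    dst    INTEGER NOT NULL REFERENCES node(id),
    weight REAL    NOT NULL,
    PRIMARY KEY (src, dst)   -- also serves as the index for lookups by source node
);
""")

conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, 2, 0.5), (1, 3, 2.0), (2, 3, 0.25)])


def neighbours(conn, node_id):
    """Outgoing edges of node_id as (destination id, weight), lightest first."""
    rows = conn.execute(
        "SELECT dst, weight FROM edge WHERE src = ? ORDER BY weight",
        (node_id,),
    )
    return rows.fetchall()


out_edges = neighbours(conn, 1)
```

A traversal only ever touches the `edge` table and returns node ids, which fits the plan of looking the full objects up elsewhere afterwards.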

Are there any data warehouse frameworks?

I've got a lot of MySQL data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of PHP scripts that run some complex queries and output CSV and Excel files. I also use phpMyAdmin with bookmarked queries, which I manually edit to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.
I started reading about data warehousing the other day and it seems that this is an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.
Creating a reporting system for my data has always been on a to-do list, but until recently I figured it would be a highly niche programming venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frameworks available to ease the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like, and stick to writing queries and setting up relations.
I've mostly been a LAMP guy, but I'm not above switching languages or platforms. I just need a more robust solution, as my one-off scripts don't scale well.
So where's a good place to get started?
I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at
WikipediaDataWarehouseArticle
Operational database layer
The source data for the data warehouse - Normalized for In One Place Only data maintenance
Data access layer
The transformation of your source data into your informational access layer. ETL tools to extract, transform, load data into the warehouse fall into this layer.
Informational access layer
• Report-facilitating data structure
Data is not maintained here. It is merely a reflection of your source data. Hence, denormalized structures (containing duplicate, but systematically derived data) are usually most effective here.
• Reporting tools
How do you actually allow your users access to the data?
• pre-canned reports (simple)
• more dynamic slice-and-dice access methods
The data accessed for reporting and analyzing, and the tools for reporting and analyzing data, fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in the Wikipedia article, also have to do with this layer.
Metadata layer (facilitates automation, organization, etc)
Roll your own (low-end)
For very little out-of-pocket cost, just recognizing the need for denormalized structures can buy some efficiencies for those who are not yet using them.
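A roll-your-own denormalized reporting structure can be as small as one derived table that is rebuilt from the normalized source on a schedule. The sketch below uses Python's built-in `sqlite3` as a stand-in for the MySQL source; the tables `customer` and `orders`, the derived table `report_sales_by_region_month`, and the function `rebuild_report` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized operational tables (hypothetical schema and sample rows).
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    order_date  TEXT    NOT NULL,   -- ISO date, e.g. 2010-01-05
    amount      REAL    NOT NULL
);
INSERT INTO customer VALUES (1, 'north'), (2, 'south');
INSERT INTO orders VALUES
    (1, 1, '2010-01-05', 10.0),
    (2, 1, '2010-01-20', 5.0),
    (3, 2, '2010-02-01', 7.5);
""")


def rebuild_report(conn):
    """Recreate the denormalized reporting table from the source tables."""
    conn.executescript("""
    DROP TABLE IF EXISTS report_sales_by_region_month;
    CREATE TABLE report_sales_by_region_month AS
    SELECT c.region                   AS region,
           substr(o.order_date, 1, 7) AS month,
           COUNT(*)                   AS order_count,
           SUM(o.amount)              AS total_amount
    FROM orders AS o
    JOIN customer AS c ON c.id = o.customer_id
    GROUP BY c.region, substr(o.order_date, 1, 7);
    """)


rebuild_report(conn)
report = conn.execute(
    "SELECT region, month, order_count, total_amount "
    "FROM report_sales_by_region_month ORDER BY region, month"
).fetchall()
```

Reports then read only from the derived table, so the join and aggregation cost is paid once per rebuild instead of once per report.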
Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)
Microsoft (the platform of our 110 employee firm)
SAP
Oracle
IBM
BiMarketStateArticle
My firm is at this stage. In the "Data Access Layer" we use some of the ETL capability offered by SQL Server Integration Services (SSIS), along with some alternate usage of Talend (open source, but in practice requiring a license). We also use a denormalized reporting structure (implemented completely in the basic SQL Server database) and SQL Server Reporting Services (SSRS) to largely automate (based on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime via the SSRS engine. Choices such as export to an Excel file are simple options.
Serious Commitment (some significant human commitment required)
Notice above that we have yet to utilize the data mining/dynamic slicing/dicing capabilities of SQL Server Analysis Services. We are working toward that, but are now focused on improving the quality of our data cleansing in the "Data Access Layer".
I hope this helps you to get a sense of where to start looking.
Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.
I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.
I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.
There are a number of tools which try to make the process of designing, implementing and managing/operating a data warehouse easier. They each have their strengths and weaknesses, and often vastly differing price points. Under the covers you are always going to be best off if you have a good knowledge of warehousing principles from the Kimball and/or Inmon camps.
As well as tools like Kalido and Wherescape RED (which do similar things in very different ways), many of the ETL platforms now have good built-in support for the donkey work of implementation: SCD components, lineage tracking, etc.
It is best, though, to view all of these as tools to be used in the hands of you, the craftsman. They make certain easy things even easier (or even trivial) and some hard things easier, but with some things they just get in the way IMHO ;) Learn the methodology and principles first and get a good understanding of them, and then you will know which tools to apply from your kitbag and when...
It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.
But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.
The best framework you can currently get is Anchor Modeling.
It might look quite complex because of its generic structure and built-in capability to historize data.
Also, the modeling technique is quite different from ERD.
But you end up with SQL code to generate all DB objects, including 3NF views, and:
insert/update handled by triggers
query any point/range in history
your application developers will not see the underlying 6NF anchor model.
The technology is open source and at the moment is unbeatable.
If you have an AM question, you may want to ask under the anchor-modeling tag.
Kimball is the simpler method for data warehousing.
We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED as a DW tool, and of using MS SQL's Linked Servers to obviate the need for an ETL tool.