How to store multiple graphs in a database? - mongodb

I need to store many independent graphs:
Each graph with 100 to 2000 nodes
Each node with 1 to 7 edges plus 2 to 3 extra fields.
Undirected (no edge direction).
Right now I'm storing them as MongoDB documents, with one collection for nodes and one collection for edges.
I have some questions:
Does MongoDB have any "best practices" for storing graphs?
Would it be better to store them in another database like Neo4j? (It seems to be really powerful when you have really large graphs)
I would like to be able to version each graph
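A minimal sketch of how the two-collection layout could look with pymongo, assuming made-up field names (graph_id, node_id, version, and the extra fields); this is not an official MongoDB pattern, just one way to keep many independent, versioned graphs in shared collections:

```python
# Sketch only: one document per node and one per edge, each tagged with the graph
# it belongs to and a version number. All field names are assumptions.
from pymongo import ASCENDING, MongoClient

db = MongoClient()["graphs"]

db.nodes.insert_one({
    "graph_id": "g1", "version": 1,
    "node_id": 42,
    "label": "A", "weight": 3.5,          # the 2-3 extra fields per node
})
db.edges.insert_one({
    "graph_id": "g1", "version": 1,
    "nodes": [42, 43],                    # undirected: store the pair, no direction
})

# Compound indexes so a whole graph at a given version can be pulled out quickly.
db.nodes.create_index([("graph_id", ASCENDING), ("version", ASCENDING)])
db.edges.create_index([("graph_id", ASCENDING), ("version", ASCENDING)])
```

One simple versioning choice (among others) is copy-on-write at the graph level: bumping the version number means writing a new set of node/edge documents, which keeps older versions of the graph queryable.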

Related

Star Schema horizontal scaling

AFAIK, in the case of a relational database on MPP hardware, the key to performance is correct data distribution. Dimensional modeling, on the other hand, is about query flexibility: you don't even know how the data will be queried (shuffled) in the future.
For example, say you have an MPP data warehouse (Greenplum, Redshift, Synapse Analytics) and expect that in 1-2 years your fact table will grow to 10 billion rows, with 15-30 dimension tables of tens of millions of rows each. How should the data be distributed across DW nodes? Are there any common techniques, like sharding the fact table and replicating the dimension tables? Or should I minimize the number of nodes in the MPP DW?
I can provide a specific use case, but I believe the question arises from my misunderstanding of how dimensional modeling can be paired with scaling out.
One technique I’ve seen applied with success in the past is: segment the fact table (e.g., by mod’ing the date key), and distribute all dimensions across all nodes. That way all joins can be done locally.
Note that even with large dimensions, their total size on disk should be a small fraction of the total needed for the fact table.
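To make the idea concrete, here is a small, product-agnostic Python sketch of that placement rule (the node count and date-key format are arbitrary assumptions); real systems express this through distribution settings in DDL rather than application code:

```python
# Conceptual sketch only: fact rows are assigned to a node by mod'ing the date key,
# while every dimension table is replicated in full to every node, so each fact row
# can always join locally.
NUM_NODES = 8

def fact_node(date_key: int) -> int:
    """Node that owns a fact row, derived from its integer date key (e.g. 20240315)."""
    return date_key % NUM_NODES

def replicate_dimension(dim_rows: list) -> dict:
    """Dimensions are small relative to the fact table, so copy them to every node."""
    return {node: list(dim_rows) for node in range(NUM_NODES)}

# A fact row stamped with date key 20240315 lands on exactly one node, and the
# dimension rows it joins to are already present on that node.
print(fact_node(20240315))   # -> 3
```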

Microstrategy Data Model

I am new to MSTR.
We are working on migrating from Essbase to Microstrategy 10.2.
After migration, we expect business users to be able to create reports on top of an MSTR cube and play around with the data, similar to the way they have been doing with Essbase and Excel.
I need help designing a data model for the following scenario:
FactTb:
Subcategory Revenue
1 100
2 200
3 300
DimensionTb:
Category Subcategory
A 1
A 2
B 1
B 2
B 3
C 2
C 3
User wants to see revenue by category or subcategory.
FactTb has 3 rows. Assuming each row is 10 bytes, the size of FactTb is 30 bytes.
If it is joined with DimensionTb there will be 7 rows, and the size will grow to (approximately) 70 bytes.
Is there any way to restrict the size of the cube?
Mapping of Category and Subcategory is static and there is no need to maintain a table for it.
Can I create/define DimensionTb outside of the cube (store it in the report, creating a derived element using Subcategory)?
We want to restrict the size of the cube so it can be kept in memory, and to ensure that reports always hit the cube rather than the database.
A cube is just the result of a SQL query, copied into memory for faster access. Just as you cannot split the result of a query in two, you cannot split a cube.
In-memory cubes are compressed by MicroStrategy using multiple algorithms (to pick the best compression for each column's data type and value distribution), but cubes also contain internal indexes (to speed up data access) that are created automatically depending on the queries used against the cube.
A VLDB setting can help reduce the size of the cube.
If you check the technote TN32540: Intelligent Cube Population methods in MicroStrategy 9.x, you will see different options. In my experience the last setting (Direct loading of dimensional data and filtered fact data) is quite helpful in speeding up cube loading and reducing the size, but you can also try the others (such as Normalize Intelligent Cube data in the Database).
With this approach the values from the dimension tables are still stored in memory, but separately from the fact data, which saves space.
Finally, to be sure that your users always use the cube, allow/teach them to create reports and dashboards by clicking directly on the cube (or selecting it).
This is the safe way. MicroStrategy also offers a dynamic way to map reports to cubes (when certain conditions are satisfied), but users can surprise even the most thorough designer.

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated with each of the geometries. This works out to 365 * 96 * 100 million, or about 3.5 trillion, documents.
Rather than store the 100 million geometry entries 365 * 96 times over, I want to keep them in separate collections and do some kind of JOIN/DBRef/whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geo-intersection. This filters the 100 million geometries down to about 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents by those geometries plus additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all and, when dealing with this much data, would not be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15-minute increments isn't a killer, and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you handle missing data sparsely: you can encode the data differently when it's sparse, rather than indexing into a 35,040-slot array.
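For illustration, a sketch of what such a combined document might look like (the field names and the sparse slot-index encoding are assumptions, not a prescribed schema):

```python
# Sketch of the "geometry plus its time series in one document" layout. A sparse
# mapping keyed by slot index avoids materializing all 35,040 slots
# (365 days * 96 quarter-hours) when data is missing.
geometry_doc = {
    "guid": "geom-000001",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]],
    },
    "series": {          # slot index -> measurement; only slots that actually exist
        "0": 12.7,       # first quarter-hour of day 1
        "96": 13.1,      # first quarter-hour of day 2
    },
}
```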
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have an appropriate index (like 2dsphere) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
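A rough pymongo sketch combining the two suggestions above, the 2dsphere index and a cheap pre-qualifier on an assumed state field, purely as an illustration of the query shape:

```python
# Sketch only: a 2dsphere index supports $geoIntersects, and a cheap equality filter
# on an assumed "state" field winnows candidates before the expensive geo stage runs.
from pymongo import MongoClient

geom = MongoClient()["gis"]["geometry"]
geom.create_index([("geometry", "2dsphere")])

search_polygon = {
    "type": "Polygon",
    "coordinates": [[[-84.8, 38.4], [-84.8, 42.0], [-80.5, 42.0],
                     [-80.5, 38.4], [-84.8, 38.4]]],
}

guids = geom.distinct("guid", {
    "state": {"$in": ["OH", "KY"]},   # assumed pre-qualifier field on each document
    "geometry": {"$geoIntersects": {"$geometry": search_polygon}},
})
```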
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

What is a better approach of storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question.
Previously I was looking in the direction of MongoDB (I have used it for many of my previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container structures that can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.
This looks like arrays and embedded objects in Mongo, and it also supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data.
The data:
A specific region is divided into smaller squares. At each intersection a sensor is located (a dot).
Each sensor collects the following information every X minutes:
solar luminosity
wind location and speed
humidity
and so on (this information is mostly the same, sometimes a sensor does not collect all the information)
It also collects this at different heights (0 m, 10 m, 25 m). The heights are not always the same. Also, each sensor has some metainformation:
name
lat, lng
is it in water, and many others
Given this, I do not expect the size of one element to be bigger than 1 MB.
Also, I have enough storage in one place to save all the data (so, as far as I understand, no sharding is required).
Operations on the data:
There are several ways I am going to interact with the data:
Convert and store a big amount of it: a few TB of data will be given to me at some point in NetCDF format and I will need to store it (it is relatively easy to convert to HDF5). Then, periodically, smaller chunks of data (1 GB per week) will be provided and I will have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
Query the data: often there is a need to query the data in real time. The most frequent queries are: tell me the temperature of the sensors in a specific region at a specific time, show me the data from a specific sensor at a specific time, show me the wind for some region for a given time range (a query of this shape is sketched just after this list). Aggregated queries (what is the average temperature over the last two months?) are highly unlikely. Here I think that Mongo is nicely suitable, but HDF5 + PyTables is an alternative.
Perform some statistical analysis: currently I do not know exactly what it will be, but I know it does not need to happen in real time. So I was thinking that using Hadoop with Mongo might be a nice idea, but HDF5 with R is a reasonable alternative.
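To make the "region + time" queries above concrete, here is a rough pymongo sketch under an assumed one-document-per-reading layout (all field names are hypothetical):

```python
# Sketch only: one document per sensor reading, a 2dsphere index on the sensor
# location, and a plain index on the timestamp for time-range queries.
from datetime import datetime
from pymongo import ASCENDING, MongoClient

readings = MongoClient()["meteo"]["readings"]
readings.create_index([("loc", "2dsphere")])
readings.create_index([("ts", ASCENDING)])

region = {
    "type": "Polygon",
    "coordinates": [[[30.0, 50.0], [30.0, 51.0], [31.0, 51.0],
                     [31.0, 50.0], [30.0, 50.0]]],
}

cursor = readings.find(
    {
        "loc": {"$geoWithin": {"$geometry": region}},
        "ts": {"$gte": datetime(2015, 6, 1), "$lt": datetime(2015, 6, 2)},
        "height_m": 10,                       # assumed field for measurement height
    },
    {"sensor": 1, "ts": 1, "temperature": 1},
)
```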
I know that questions about the "better approach" are not encouraged, but I am looking for the advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S. I reviewed some interesting discussions similar to mine: hdf-forum, searching in hdf5, storing meteorological data.
It's a difficult question and I am not sure if I can give a definite answer, but I have experience with both HDF5/PyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of an index. It's only a hierarchical storage format that is well suited for multidimensional numeric data. It's possible to build an index on top of HDF5 (e.g. PyTables, HDF5 FastQuery) for the data.
HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
HDF5 supports compression filters which can, contrary to popular belief, actually make data access faster (however, you have to think about a proper chunk size, which depends on the way you access the data).
HDF5 is not a database. MongoDB has ACID properties, HDF5 doesn't (this might be important).
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out-of-core computation (i.e. when the data is too big to fit into memory).
PyTables supports some fast "in-kernel" computations directly on HDF5 data using numexpr.
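As an illustration of the compression and in-kernel points above, a small PyTables sketch with a compressed, chunked table and an in-kernel query (the table layout is hypothetical):

```python
# Sketch only: a chunked, compressed PyTables table plus an "in-kernel" query,
# where the condition string is evaluated by numexpr inside the HDF5 read loop.
import tables

class Reading(tables.IsDescription):
    sensor = tables.StringCol(16)
    ts = tables.Int64Col()            # epoch seconds
    height_m = tables.Float32Col()
    temperature = tables.Float32Col()

with tables.open_file("meteo.h5", mode="w") as h5:
    table = h5.create_table(
        "/", "readings", Reading,
        filters=tables.Filters(complevel=5, complib="blosc"),   # compression filter
    )
    # ... append rows here, e.g. table.append(list_of_tuples) ...
    table.flush()

    # In-kernel selection: the condition is evaluated chunk by chunk via numexpr.
    temps = [row["temperature"]
             for row in table.where("(height_m == 10.0) & (ts >= 1433116800)")]
```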
I think your data is generally a good fit for storing in HDF5. You can also do statistical analysis either in R or via NumPy/SciPy.
But you could also think about a hybrid approach: store the raw bulk data in HDF5 and use MongoDB for the metadata or for caching specific values that are often used.
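A minimal sketch of that hybrid split, with the HDF5 file path kept as a plain string inside a MongoDB metadata document (all field names are assumptions):

```python
# Sketch only: bulk arrays stay in HDF5 files, while a MongoDB collection holds
# per-sensor metadata plus a pointer to the file/node where the raw series lives.
from pymongo import MongoClient

sensors = MongoClient()["meteo"]["sensors"]
sensors.insert_one({
    "name": "station-042",
    "loc": {"type": "Point", "coordinates": [30.52, 50.45]},
    "in_water": False,
    "hdf5_file": "2015/station-042.h5",    # where the raw time series is stored
    "hdf5_node": "/readings",
})
```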
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time consuming. I'm afraid this is a problem for all the databases. Anyway, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform selection queries efficiently, you should use an index; if you want to perform aggregation queries in real time (within seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or a correlation coefficient, we have products that do it in real time. If the analysis is very complex and ad hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure whether SciHadoop can currently support HDF5 directly.

Reducing Large Datasets with MongoDB and D3

I'm working on a D3 visualization and luckily have made some progress. However, I've run into an issue, and to be honest, I'm not sure if it's a MongoDB issue or a D3 issue. I'm trying to make a series of graphs from a set of sensor points (my JSON object contains timestamps, light, temperature, humidity, and motion detection levels for each datapoint). However, my sensors are uploading data to my MongoDB database every 8 seconds. So, if I query the MongoDB database for just one day's worth of data, I get 10,800 datapoints. Worse, if I were to ask for one month of data, I'd be swamped with 324,000 datapoints. My issue is that my D3 visualization slows to a crawl when dealing with more than about 1000 points (I'm visualizing the data on four different graphs, each of which uses a single brush to select a certain domain on the graph).
Is there a way to limit the amount of data I'm trying to visualize? Is this better done in MongoDB (basically filtering the data I'm querying and only getting every nth data point, based on how big a time range I'm asking for)? Or is there a better way? Should I try to filter the data in D3 once I've retrieved the entire dataset? What is the best way to reduce the number of points I need to deal with? Thanks in advance.
MongoDB is great at filtering. If you really only need a subset of the data, specify that in a find query; this could limit results to a subset of time, or, if you're clever, only fetch the data for the first minute of every hour, or similar.
Or you can literally reduce the amount of data coming out of MongoDB by using the aggregation framework. This could be used to get partial sums or averages or similar.
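For example, a hedged pymongo sketch of such an aggregation, bucketing the 8-second readings into 15-minute averages (field names like ts, temperature, light, humidity are assumptions):

```python
# Sketch only: let MongoDB downsample before D3 ever sees the data. Readings arriving
# every 8 s are grouped into 15-minute buckets and averaged, so one day shrinks from
# ~10,800 documents to 96 buckets.
from datetime import datetime
from pymongo import MongoClient

readings = MongoClient()["sensors"]["readings"]

pipeline = [
    {"$match": {"ts": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2014, 1, 2)}}},
    {"$group": {
        "_id": {
            "day":  {"$dayOfYear": "$ts"},
            "hour": {"$hour": "$ts"},
            # start minute of the 15-minute bucket: 0, 15, 30 or 45
            "bucket": {"$subtract": [{"$minute": "$ts"},
                                     {"$mod": [{"$minute": "$ts"}, 15]}]},
        },
        "avg_temp":     {"$avg": "$temperature"},
        "avg_light":    {"$avg": "$light"},
        "avg_humidity": {"$avg": "$humidity"},
    }},
    {"$sort": {"_id": 1}},
]
buckets = list(readings.aggregate(pipeline))
```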