Big Data Characteristics and Challenges to DBMSs: precise correspondences

If we consider the seven features of Big Data individually as challenges to (NoSQL) DBMSs, which features of NoSQL DBMSs meet each of these challenges?
Volume: (horizontal) Scalability
Velocity: Availability (over Consistency)
Variety: Schemaless / Schema Flexibility
Veracity: ?
Variability: ?
Value: ?
Complexity: ?

Related

range of meaningful values for ranger importance using impurity_corrected

Using the ranger R package and obtaining variable importance with the 'impurity_corrected' option, I get some negative importance values.
My understanding was that importance values range from 0 to 1, with 0 meaning an unimportant feature (i.e. one not adding any value).
Since I get negative importance values, what do the negative values mean?
Negatively correlated? Or worse than zero (I don't know what that would mean)?
Any reference papers / docs would be helpful.
I did check https://github.com/imbs-hl/ranger/blob/master/R/ranger.R and https://github.com/cran/ranger/blob/master/src/rangerCpp.cpp but could not really reach a conclusion.
The reference paper for this method, https://academic.oup.com/bioinformatics/article/34/21/3711/4994791, does not specify what negative values mean.

What's a good way to "weight" the importance of item/user features in LightFM?

I have been using LightFM in an e-commerce application with a decent amount of success. Thanks for this really cool package!
I'm in the process of optimizing the recommendation outputs. Because of specific business-domain considerations, certain features are "more important" and have less leeway in terms of variability than others, like "price" (i.e. certain recommended items need to fall within a soft price range).
Is there a way to have LightFM weight such a feature more than the others?
Thanks in advance!
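In case it helps others, one thing I have been experimenting with (a sketch, not an official LightFM recipe): Dataset.build_item_features accepts per-feature weights, so a feature such as price can be given a larger weight than the rest. The user/item IDs, feature names, and weight values below are made up for illustration.
from lightfm.data import Dataset

# Hypothetical users, items, and features; the weight values are made up.
dataset = Dataset()
dataset.fit(
    users=["u1", "u2"],
    items=["item1", "item2"],
    item_features=["price_bucket:low", "price_bucket:high", "brand:acme"],
)

# build_item_features accepts (item_id, {feature_name: weight}) pairs,
# so the price feature can be given a larger weight than the brand feature.
# Note: with the default normalize=True each item's feature row is rescaled,
# so it is the relative weights that matter.
item_features = dataset.build_item_features([
    ("item1", {"price_bucket:low": 3.0, "brand:acme": 1.0}),
    ("item2", {"price_bucket:high": 3.0, "brand:acme": 1.0}),
])
The resulting matrix would then be passed to model.fit(..., item_features=item_features) as usual.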

2dsphere vs 2d index performance

I need to do fast queries to find all documents within a certain GPS radius of a point. The radius will be small and accuracy is not that critical, so I don't need to account for the spherical geometry. There will be lots of writes. Will I get better performance with a 2d index than a 2dsphere?
If you definitely don't need spherical geometry or more than one field in a compound geo index (see the notes on Geospatial Indexes in the MongoDB manual), a 2d index would be more appropriate. There will also be a slight storage advantage in saving coordinates as legacy pairs (longitude, latitude) rather than GeoJSON points. This probably isn't enough to significantly impact your write performance, but it depends what you mean by "lots of writes" and whether these will be pushing your I/O limits.
I'm not sure about the relative query performance of the different geo index types, but you can easily set up a representative test case in your own dev/staging environment to compare. Make sure you average the measurements over a number of iterations so documents are loaded into memory and the comparison is fair.
You may also want to consider a haystack index, which is designed to return results for 2d queries within a small area in combination with an additional field criterion (for example, "find restaurants near longitude, latitude"). If you are not too fussed about accuracy or sorting by distance (and have an additional search field), this index type may work well for your use case.
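If it helps to see it end to end, here is a minimal PyMongo sketch (the collection and field names are made up) of a 2d index over legacy coordinate pairs and a $near query:
from pymongo import MongoClient, GEO2D

client = MongoClient()  # assumes a local mongod
places = client.testdb.places

# Legacy coordinate pairs are stored as [longitude, latitude].
places.insert_one({"name": "example", "loc": [4.89, 52.37]})

# 2d index over the legacy pairs.
places.create_index([("loc", GEO2D)])

# Documents within roughly 0.05 degrees of a point; for a 2d index,
# $maxDistance uses the same units as the coordinates (degrees here).
for doc in places.find({"loc": {"$near": [4.9, 52.37], "$maxDistance": 0.05}}):
    print(doc["name"])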
2dsphere indexes are now at version 3 as of MongoDB 3.2.
2dsphere is better.
Data from https://jira.mongodb.org/browse/SERVER-18056
More details: https://www.mongodb.com/blog/post/geospatial-performance-improvements-in-mongodb-3-2
3.1.6 - 2dsphere V2:
  "executionTimeMillis" : 1875,
  "totalKeysExamined" : 24335,
  "totalDocsExamined" : 41848,
After reindex:
3.1.6 - 2dsphere V3:
  "executionTimeMillis" : 94,
  "totalKeysExamined" : 21676,
  "totalDocsExamined" : 38176,
Compared to 2d:
3.1.6 - 2d:
  "executionTimeMillis" : 359,
  "totalKeysExamined" : 95671,
  "totalDocsExamined" : 112968,

What is a better approach for storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question.
Previously I was looking in the direction of MongoDB (I have used it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container structures which can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.
This looks like arrays and embedded objects in Mongo, and HDF5 also supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data.
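To make that model concrete, here is a minimal h5py sketch (the file, group, and dataset names are hypothetical) of how groups, datasets, and attributes fit together:
import numpy as np
import h5py

# Hypothetical layout: one group per sensor, one dataset per variable,
# with metadata stored as attributes on the group.
with h5py.File("weather.h5", "w") as f:
    grp = f.create_group("sensor_001")
    grp.attrs["name"] = "sensor_001"
    grp.attrs["lat"] = 52.37
    grp.attrs["lng"] = 4.89

    # A 2-D dataset: rows are time steps, columns are measurement heights.
    temp = grp.create_dataset("temperature",
                              data=np.random.rand(1000, 3),
                              compression="gzip")
    temp.attrs["heights_m"] = [0, 10, 25]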
The data:
A specific region is divided into smaller squares. At the intersection of each of them a sensor is located (a dot).
This sensor collects the following information every X minutes:
solar luminosity
wind direction and speed
humidity
and so on (this information is mostly the same, though sometimes a sensor does not collect all of it)
It also collects this at different heights (0 m, 10 m, 25 m), though the heights will not always be the same. Each sensor also has some metainformation:
name
lat, lng
whether it is in water, and many others
Given this, I do not expect the size of one element to be bigger than 1 MB.
Also, I have enough storage in one place to save all the data (so, as far as I understand, no sharding is required).
Operations with the data:
There are several ways I am going to interact with the data:
convert and store a big amount of it: a few TB of data will be given to me at some point in NetCDF format and I will need to store it (it is relatively easy to convert to HDF5). Then, periodically, smaller portions of data (about 1 GB per week) will be provided and I will have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
query the data: often there is a need to query the data in real time. The most frequent queries are: tell me the temperature of the sensors in a specific region at a specific time; show me the data from a specific sensor at a specific time; show me the wind for some region for a given time range. Aggregated queries (what is the average temperature over the last two months) are highly unlikely. Here I think Mongo is nicely suitable (see the sketch after this list for what such a query could look like), but hdf5 + pytables is an alternative.
perform some statistical analysis: currently I do not know exactly what it will be, but I know it does not have to happen in real time. So I was thinking that using Hadoop with Mongo might be a nice idea, but HDF5 with R is a reasonable alternative.
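For example, I imagine a document-per-reading layout in Mongo roughly like this (a sketch; the field names and values are made up):
from datetime import datetime
from pymongo import MongoClient, ASCENDING, GEOSPHERE

db = MongoClient().weather  # assumes a local mongod

# One document per sensor reading (hypothetical fields).
db.readings.insert_one({
    "sensor": "S-001",
    "loc": {"type": "Point", "coordinates": [4.89, 52.37]},  # [lng, lat]
    "ts": datetime(2016, 1, 1, 12, 0),
    "height_m": 10,
    "temp_c": 3.4,
    "wind": {"dir_deg": 270, "speed_ms": 6.1},
})

db.readings.create_index([("sensor", ASCENDING), ("ts", ASCENDING)])
db.readings.create_index([("loc", GEOSPHERE)])

# "Temperature of the sensors in a specific region for a specific time range":
cursor = db.readings.find(
    {
        "loc": {"$geoWithin": {"$geometry": {
            "type": "Polygon",
            "coordinates": [[[4.5, 52.0], [5.5, 52.0], [5.5, 53.0],
                             [4.5, 53.0], [4.5, 52.0]]],
        }}},
        "ts": {"$gte": datetime(2016, 1, 1), "$lt": datetime(2016, 1, 2)},
    },
    {"sensor": 1, "temp_c": 1, "_id": 0},
)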
I know that "which approach is better" questions are not encouraged, but I am looking for the advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S. I reviewed some interesting discussions similar to mine: hdf-forum, searching in hdf5, storing meteorological data.
It's a difficult question and I am not sure if I can give a definite answer, but I have experience with both HDF5/PyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of an index. It's only a hierarchical storage format that is well suited for multidimensional numeric data. It's possible to build on top of HDF5 to implement an index over the data (e.g. PyTables, HDF5 FastQuery).
HDF5 (unless you are using the MPI version) does not support concurrent write access (concurrent read access is possible).
HDF5 supports compression filters which can - contrary to popular belief - actually make data access faster (however, you have to think about a proper chunk size, which depends on the way you access the data).
HDF5 is not a database. MongoDB has ACID properties; HDF5 doesn't (this might be important).
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out-of-core computation (i.e. when the data is too big to fit into memory).
PyTables supports some fast "in kernel" computations directly on HDF5 data using numexpr.
I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via NumPy/SciPy.
But you can also think about a hybrid approach: store the raw bulk data in HDF5 and use MongoDB for the metadata or for caching specific values that are often used.
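To make the PyTables points concrete, here is a minimal sketch (the file, table, and column names are made up) of a compressed, indexed table and an "in kernel" (numexpr-evaluated) selection:
import tables as tb

# Hypothetical schema for one sensor reading.
class Reading(tb.IsDescription):
    sensor = tb.StringCol(16)
    ts = tb.Int64Col()        # Unix timestamp in seconds
    height_m = tb.Float32Col()
    temp_c = tb.Float32Col()

with tb.open_file("met.h5", mode="w") as h5:
    table = h5.create_table("/", "readings", Reading,
                            filters=tb.Filters(complevel=5, complib="blosc"))

    # Append one example row (a real loader would bulk-append here).
    row = table.row
    row["sensor"], row["ts"] = b"S-001", 1451649600
    row["height_m"], row["temp_c"] = 10.0, 3.4
    row.append()
    table.flush()

    # Index the time column, then run an in-kernel query evaluated via numexpr.
    table.cols.ts.create_csindex()
    hits = [(r["sensor"], r["temp_c"])
            for r in table.where("(ts >= 1451606400) & (ts < 1451692800)"
                                 " & (height_m == 10.0)")]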
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data-loading phase will be very time-consuming. I'm afraid this is a problem for all databases. In any case, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product described here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform selection queries efficiently, you should use an index; if you want to perform aggregation queries in real time (in seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or a correlation coefficient, we have products that can do it in real time. If the analysis is very complex and ad hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure whether SciHadoop can currently support HDF5 directly.

Mixed variables (categorical and numerical) distance function

I want to fuzzy-cluster a set of jobs.
Job attributes are:
Categorical: position, diploma, skills
Numerical: salary, years of experience
My question is: how do I calculate the distance between different jobs?
e.g. job1(programmer, bs computer science, (java, .net, responsibility), 1500, 3)
and job2(tester, bs computer science, (black and white box testing), 1200, 1)
PS: I'm a beginner in data mining / clustering; I highly appreciate your help.
You may take this as your starting point: http://www.econ.upf.edu/~michael/stanford/maeb4.pdf. Distance between categorical data is nicely explained at the end.
Here is a good walk-through of several different clustering methods and how to use them in R: http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/clustering.pdf
In general, clustering of discrete data is related either to the use of counts (e.g. overlaps in vectors) or to some statistic derived from counts. As much as I'd like to address the statistical side, I suppose you're interested in the algorithm, so I'll leave it at that.
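If you want something concrete to start from, here is a minimal sketch of a Gower-style mixed dissimilarity (a common choice for mixed data, not necessarily the exact measure from the notes linked above): categorical attributes contribute a 0/1 mismatch, the skill set a Jaccard distance, and numeric attributes a range-normalized absolute difference.
def mixed_distance(job_a, job_b, num_ranges):
    """Gower-style dissimilarity between two jobs.

    job = (position, diploma, skills_set, salary, years_experience)
    num_ranges = (salary_range, years_range), used to normalize the numeric gaps.
    """
    pos_a, dip_a, skills_a, sal_a, yrs_a = job_a
    pos_b, dip_b, skills_b, sal_b, yrs_b = job_b

    parts = [
        0.0 if pos_a == pos_b else 1.0,                             # categorical mismatch
        0.0 if dip_a == dip_b else 1.0,
        1.0 - len(skills_a & skills_b) / len(skills_a | skills_b),  # Jaccard distance on skills
        abs(sal_a - sal_b) / num_ranges[0],                         # range-normalized numeric gap
        abs(yrs_a - yrs_b) / num_ranges[1],
    ]
    return sum(parts) / len(parts)  # equal weight per attribute; adjust weights as needed

job1 = ("programmer", "bs computer science", {"java", ".net", "responsibility"}, 1500, 3)
job2 = ("tester", "bs computer science", {"black and white box testing"}, 1200, 1)

# The numeric ranges would normally be taken over the whole dataset; these are made up.
print(mixed_distance(job1, job2, num_ranges=(1000, 10)))
The resulting pairwise distance matrix can then be handed to a fuzzy clustering method that accepts precomputed distances (e.g. a fuzzy c-medoids style algorithm).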