As we know, Redshift does not enforce referential constraints. Should we still opt for dimensional modeling?
If so, how do we get around this limitation and maintain the data integrity of our data warehouse?
Yes, dimensional modeling is feasible and strongly encouraged on Redshift. Redshift is even optimized for star schema queries:
Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift
See also: Is dimensional modeling feasible in Amazon Redshift?
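Since the constraints are informational only, one common way to keep integrity is to check it yourself after each load. Below is a minimal sketch, assuming hypothetical tables `dim_customer` and `fact_sales`, and using Python's built-in `sqlite3` purely as a stand-in for a Redshift connection. The anti-join query is plain SQL and should carry over, but treat that as an assumption to verify on your own cluster.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO fact_sales VALUES (10, 1, 9.5), (11, 2, 3.0), (12, 99, 7.25);
""")

# Anti-join: fact rows whose customer_key has no matching dimension row.
ORPHAN_CHECK = """
    SELECT f.sale_id, f.customer_key
    FROM fact_sales AS f
    LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
    ORDER BY f.sale_id
"""

orphans = conn.execute(ORPHAN_CHECK).fetchall()
print(orphans)  # fact rows that reference a missing dimension key
```

A non-empty result means the load produced orphaned fact rows, which the ETL job can then reject or route to a default "unknown" dimension member.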
I am trying to find the number of clusters in the DBLP V11 dataset using the field of study.
I've tried pretrained doc2vec and averaged pretrained word2vec embeddings, clustering the results with DBSCAN and hierarchical clustering, and estimating the number of clusters with the elbow method, the silhouette method and the gap statistic.
I get only one or two clusters because all the articles are computer science related, but I need to find the number of subfields within computer science.
There is no single "the" number of clusters in such data.
Instead, many answers are correct. Or none.
Is machine learning part of artificial intelligence? Is deep learning a separate topic? And data science? How is data science different from statistics? Doesn't statistics have lots of subtopics? What about big data, and how does it relate to data science? Isn't data mining the same as data science? Humans won't all agree on all of these topics either.
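To illustrate with a toy sketch (made-up 1-D points, not DBLP data, and `clusters_at` is a hypothetical helper): the same data gives a different cluster count depending on the granularity you ask for, and neither count is wrong.

```python
def clusters_at(points, max_gap):
    """Single-linkage clustering of 1-D points: split wherever the gap
    between sorted neighbours exceeds max_gap."""
    ordered = sorted(points)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            groups.append([cur])      # gap too large: start a new cluster
        else:
            groups[-1].append(cur)    # close enough: extend the current cluster
    return groups


# Two broad areas, each made of two tighter sub-areas (toy values).
points = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 10.0, 10.1, 10.2, 11.0, 11.1, 11.2]

coarse = clusters_at(points, max_gap=2.0)
fine = clusters_at(points, max_gap=0.5)
print(len(coarse), len(fine))  # cluster counts at the two granularities
```

The coarse cut finds the two broad areas and the fine cut finds the four sub-areas, so the "number of clusters" depends on the level of detail you choose rather than on the data alone.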
Please tell me what the difference is between the hierarchical, network and relational data models.
Hierarchical model
1. One-to-many or one-to-one relationships.
2. Based on a parent-child relationship.
3. Retrieval algorithms are complex and asymmetric.
4. More data redundancy.
Network model
1. Many-to-many relationships.
2. A record can have many parents as well as many children.
3. Retrieval algorithms are complex and symmetric.
4. More data redundancy.
Relational model
1. One-to-one, one-to-many and many-to-many relationships.
2. Based on relational data structures.
3. Retrieval algorithms are simple and symmetric.
4. Less data redundancy.
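The three shapes can be sketched with plain Python structures. This is only an illustration under assumed, made-up data (the department, project and employee names are hypothetical), not how any particular DBMS stores records.

```python
# Hierarchical: a tree; each child has exactly one parent, reached from the root.
hierarchical = {
    "Engineering": {"Alice": {}, "Bob": {}},
    "Sales": {"Carol": {}},
}

# Network: a graph; a record may have several owners (parents).
network_parents = {
    "Alice": ["Engineering", "ProjectX"],
    "Bob": ["Engineering"],
    "Carol": ["Sales", "ProjectX"],
}

# Relational: flat tables linked by key values rather than pointers.
employees = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
projects = [(10, "ProjectX")]
assignments = [(1, 10), (3, 10)]  # (employee_id, project_id)


def members_of(project_name):
    """Relational-style lookup: join the three tables on their keys."""
    project_ids = {pid for pid, name in projects if name == project_name}
    employee_ids = {eid for eid, pid in assignments if pid in project_ids}
    return sorted(name for eid, name in employees if eid in employee_ids)


print(members_of("ProjectX"))
```

In the tree, placing Alice under a second parent would mean duplicating her record, which is one source of the extra redundancy. The relational version stores her once and expresses the many-to-many link as rows in `assignments`.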
Performing exploratory data analysis is the first step in any machine learning project. I mostly use pandas for data exploration on datasets that fit in memory, but I would like to know how to perform data cleaning, handle missing data and outliers, draw single-variable plots and density plots of how a feature impacts the label, compute correlations, etc.
Pandas is easy and intuitive for data analysis in Python, but I have difficulty handling multiple larger dataframes in pandas because of limited system memory.
I mean datasets that are larger than RAM: hundreds of gigabytes.
I have seen tutorials that use Spark to filter by rules and produce a dataframe that fits in memory, so eventually the data always resides entirely in memory. I want to know how to work with a big dataset and perform exploratory data analysis on it.
Another challenge is visualizing big data for exploratory data analysis. It is easy with packages like seaborn or matplotlib when the data fits in memory, but how do you do it for big data?
To offer something concrete:
Normally you will want to reduce your data, by aggregation, sampling, etc., to something small enough that a direct visualisation makes sense.
Some tools, such as Dask, can work directly with bigger-than-memory data to create visuals. One good link is: http://pyviz.org/tutorial/10_Working_with_Large_Datasets.html
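The "reduce first" idea can be sketched in one streaming pass that never holds the full dataset in memory. This is a minimal sketch using only the standard library. `stream_rows` is a hypothetical stand-in for reading a huge file row by row, and the column names `label` and `value` are made up.

```python
import random
from collections import defaultdict


def stream_rows(n_rows, seed=0):
    """Hypothetical stand-in for reading a huge file row by row."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield {"label": rng.choice(["a", "b"]), "value": rng.gauss(0.0, 1.0)}


def summarise(rows, sample_size, seed=0):
    """One pass: per-label count and mean, plus a uniform reservoir sample."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    totals = defaultdict(float)
    sample = []
    for i, row in enumerate(rows):
        counts[row["label"]] += 1
        totals[row["label"]] += row["value"]
        if len(sample) < sample_size:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # reservoir sampling (Algorithm R)
            if j < sample_size:
                sample[j] = row
    means = {label: totals[label] / counts[label] for label in counts}
    return dict(counts), means, sample


counts, means, sample = summarise(stream_rows(100_000), sample_size=1_000)
print(sum(counts.values()), len(sample))
```

The aggregates answer summary questions directly, and the fixed-size sample is small enough to hand to seaborn or matplotlib for density plots and scatter plots.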
I can't find which partitioning algorithm OrientDB supports.
I need a graph database that supports a clever partitioning or rebalancing algorithm to decrease the number of cut edges (edges that point to another server), because I have a lot of reads but few writes.
Also, does the Titan database support some clever algorithm?
Topic modeling identifies the distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique for document clustering?
A topic is quite different from a cluster of docs; after all, a topic is not composed of docs.
However, these two techniques are indeed related. I believe topic modeling is a viable way of deciding how similar documents are, and hence a viable way to do document clustering.
By representing each document as a topic distribution (actually a vector), topic modeling techniques reduce the feature dimensionality from the number of distinct words appearing in the corpus to the number of topics. The similarity between docs' topic distributions can be calculated using cosine similarity or many other metrics, and it reflects the similarity of the docs themselves in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
And in this sense, I think it is right to say that topic modeling is a technique to do document clustering.
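As a rough sketch of that similarity step, assuming made-up topic distributions over three topics (a real topic model such as LDA would produce these vectors; the document names are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up document-topic distributions over 3 topics (each row sums to 1).
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.20, 0.10],
    "doc_c": [0.05, 0.10, 0.85],
}

sim_ab = cosine(doc_topics["doc_a"], doc_topics["doc_b"])
sim_ac = cosine(doc_topics["doc_a"], doc_topics["doc_c"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Here `doc_a` and `doc_b` share a dominant topic and score much higher than `doc_a` and `doc_c`. A pairwise matrix of such scores is what a clustering algorithm would then consume.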
The relation between clustering and classification is very similar to the relation between topic modeling and multi-label classification.
In single-label multi-class classification we assign just one label to each document, and in clustering we put each document in just one group. The difference is that we can't define the clusters in advance the way we define labels. If we ignore this fact, grouping and labeling are essentially the same thing.
However, in real-world problems flat classification is not sufficient. Documents are often related to multiple categories/classes, so we use multi-label classification. We can then see topic modeling as the unsupervised version of multi-label classification, since we can put each document under multiple groups/topics. Here again, I'm ignoring the fact that we can't decide in advance which topics to use as labels.
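A small sketch of the analogy, assuming a made-up topic distribution for one document and an arbitrary threshold of 0.25 (both are illustrative choices, not part of any topic modeling library):

```python
def hard_cluster(topic_dist):
    """Clustering-style assignment: the single most probable topic."""
    return max(range(len(topic_dist)), key=lambda k: topic_dist[k])


def soft_topics(topic_dist, threshold=0.25):
    """Multi-label-style assignment: every topic at or above the threshold."""
    return [k for k, p in enumerate(topic_dist) if p >= threshold]


doc = [0.45, 0.40, 0.15]  # made-up topic distribution for one document
print(hard_cluster(doc), soft_topics(doc))
```

The hard assignment keeps only topic 0, like a single cluster or a single class label, while the thresholded version keeps topics 0 and 1, like a multi-label prediction.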