How do I perform EDA and visualize my data if it cannot fit in memory? My dataset is 200 GB - PySpark

Performing exploratory data analysis is the first step in any machine learning project. I mostly use pandas for data exploration on datasets that fit in memory, but I would like to know how to perform data cleaning, handle missing data and outliers, make single-variable plots and density plots of how a feature impacts the label, compute correlations, and so on.
Pandas is easy and intuitive for doing data analysis in Python, but I find it difficult to handle multiple larger dataframes in pandas due to limited system memory.
For datasets that are larger than RAM... hundreds of gigabytes.
I have seen tutorials where they use Spark to filter data based on rules and generate a dataframe that fits in memory. Eventually there is always data that resides entirely in memory, but I want to know how to work with a big dataset and perform exploratory data analysis on it.
Another challenge is visualizing big data for exploratory data analysis. It is easy to do with packages like seaborn or matplotlib if the data fits in memory, but how do you do it for big data?

To put it concretely:
Normally you will want to reduce your data, by aggregation, sampling, etc., to something small enough that a direct visualization makes sense.
Some tools exist for dealing directly with bigger-than-memory data (e.g. Dask) to create visuals. One good link is this: http://pyviz.org/tutorial/10_Working_with_Large_Datasets.html
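To make the reduce-then-plot idea concrete, here is a minimal PySpark sketch; the Parquet path and the column names "amount" and "category" are hypothetical placeholders for your own data:

```python
# Minimal sketch: reduce on the cluster, then visualize the small result locally.
# "data/events.parquet", "amount", and "category" are hypothetical placeholders.
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("big-data-eda").getOrCreate()
df = spark.read.parquet("data/events.parquet")

# 1) Summary statistics computed by Spark, collected as a tiny pandas frame
summary = df.select("amount").describe().toPandas()
print(summary)

# 2) Aggregation: per-category counts fit in memory even if the raw data doesn't
counts = df.groupBy("category").count().toPandas()
counts.plot.bar(x="category", y="count")

# 3) Sampling: draw a small random fraction for the usual seaborn/matplotlib plots
sample = df.select("amount").sample(fraction=0.001, seed=42).toPandas()
sample["amount"].plot.hist(bins=50)
plt.show()
```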

Related

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2vec model. After experimenting with different training datasets, I am fairly confused about what the ideal training data size for a doc2vec model should be.
I will share my understanding here. Please feel free to correct me or suggest changes:
Training on a general-purpose dataset: if I want to use a model trained on a general-purpose dataset in a specific use case, I need to train on a lot of data.
Training on a context-related dataset: if I train on data that has the same context as my use case, the training dataset can usually be smaller.
But how many words are needed for training in each of these cases?
Generally, we stop training an ML model when the error curve reaches an "elbow point", where further training won't significantly decrease the error. Has any study been done in this direction, where a doc2vec model's training is stopped after reaching an elbow?
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
Sometimes parameter tweaks can help get the most out of thin data; in particular, more training iterations or a smaller model (fewer vector dimensions) can slightly offset some issues with small corpora. But Word2Vec/Doc2Vec really benefit from lots of subtly varied, domain-specific data: it's the constant, incremental tug-of-war between all the text examples during training that helps the final representations settle into a useful constellation of arrangements, with the desired relative-distance/relative-direction properties.
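As a rough illustration of those two points, here is a minimal gensim sketch (assuming gensim 4.x) that trains with fewer dimensions and more epochs and uses a crude self-similarity check as a stand-in for a proper domain-specific evaluation; the toy documents are made up:

```python
# Minimal sketch (assumes gensim 4.x): smaller vectors + more epochs for a thin
# corpus, plus a crude self-similarity sanity check as an evaluation stand-in.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "network intrusion detection with clustering",
    "training doc2vec on small domain corpora",
    "evaluating document embeddings quantitatively",
]  # toy stand-in for a real domain corpus
corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(raw_docs)]

model = Doc2Vec(vector_size=50, epochs=100, min_count=1)  # fewer dims, more passes
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Sanity check: re-inferring a training doc should rank that same doc near the top.
hits = 0
for doc in corpus:
    inferred = model.infer_vector(doc.words)
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    hits += int(top_tag == doc.tags[0])
print(f"self-similarity hit rate: {hits / len(corpus):.2f}")
```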

How to deal with data when making a decision tree

I am trying to make a decision tree for a dataset I got from Kaggle.
Since I don't have any experience with real-life datasets, I have no idea how to clean, integrate, and scale the data (mainly scaling).
For example, let's say I have a feature with real-number values. I want to turn that feature into something like categorical data by binning it into a specific number of groups (for building the decision tree).
In this case, I have no idea how many groups is reasonable for a decision tree.
I am sure it depends on the distribution of the feature and the number of unique values in the target, but I don't know how to come up with a good guess by looking at them.
My best guess is to divide the feature into roughly as many groups as there are unique values of the target. (I don't even know if this makes sense.)
When I learned this at school, every feature was already given as categorical data with 2-5 categories, so I didn't have to worry about it, but real life is totally different from school.
Please help me out.
For a decision tree, numerical data should stay numerical and categorical data should be in dummy-variable style. No scaling is needed for numerical columns.
To process categorical data, use one-hot encoding. Before one-hot encoding, make sure each category appears reasonably often (>= 5% of rows); otherwise, group the rare categories together.
Also consider other models. Decision trees are good, but they are old school and easy to overfit.
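For example, a minimal pandas sketch of grouping rare categories and then one-hot encoding; the column name "city" and the toy values are made up:

```python
# Minimal sketch: group rare categories (< 5% of rows) before one-hot encoding.
# The column "city" and its values are hypothetical toy data.
import pandas as pd

df = pd.DataFrame({"city": ["NY"] * 10 + ["LA"] * 8 + ["SF"] * 5 + ["Rare1", "Rare2"]})

freq = df["city"].value_counts(normalize=True)
rare = freq[freq < 0.05].index                         # categories below 5%
df["city"] = df["city"].where(~df["city"].isin(rare), "other")

encoded = pd.get_dummies(df, columns=["city"])         # dummy-style encoding
print(encoded.head())
```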
You can use decision tree regressors, which eliminate the need for discretizing real numbers into categories: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
When you do this, it will help to scale the input data to zero mean and unit variance; this helps prevent any large-valued inputs from dominating the model.
That being said, a decision tree may not be the best option. Try an SVM or an ANN, or (most likely) an ensemble of many models (or even just a random forest).
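As a sketch of that last suggestion, here is a single DecisionTreeRegressor compared against a random-forest ensemble on toy synthetic data (the hyperparameters are arbitrary):

```python
# Minimal sketch: a single DecisionTreeRegressor on raw numeric features (no
# binning needed) versus a RandomForestRegressor ensemble, on toy data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0)

# Cross-validated R^2: the ensemble usually overfits less than a single tree.
print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```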

Does tensorflow convnet only duplicate model across multiple GPUs?

I am currently running a TensorFlow convnet for image recognition and I am considering buying new GPUs to enable more complex graphs, batch sizes, and input dimensions. I have read posts like this that do not recommend using AWS GPU instances to train convnets, but more opinions are always welcome.
I've read TensorFlow's guide 'Training a Model Using Multiple GPU Cards', and it seems that the graph is duplicated across the GPUs. I would like to know: is this the only way to use parallel GPUs in a TensorFlow convnet?
The reason I am asking is that if TensorFlow can only duplicate graphs across multiple GPUs, then each GPU must have at least enough memory for my model with one batch. (For example, if the minimum memory required is 5 GB, two cards with 4 GB each would not do the job.)
Thank you in advance!
No, it is definitely possible to use different variables on different GPUs.
For every variable and every layer that you declare, you can choose where to place it.
Even in the specific case where you want multiple GPUs only to duplicate your model and increase the batch_size training parameter to train faster, you still need to explicitly build your model using the concept of shared parameters and manage how those parameters communicate.
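As a rough sketch of manual placement (assuming TensorFlow 2.x with at least two visible GPUs; exact variable placement can vary by version), different layers, and hence their variables, can be pinned to different devices:

```python
# Minimal sketch: pin different layers (and their weights) to different GPUs.
# Assumes TensorFlow 2.x and at least two visible GPUs; layer sizes are arbitrary.
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))

with tf.device("/GPU:0"):          # first half of the network lives on GPU 0
    x = tf.keras.layers.Dense(4096, activation="relu")(inputs)

with tf.device("/GPU:1"):          # second half (and its weights) on GPU 1
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```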

K means Analysis on KDD Cup Dataset 99

What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup dataset and plotted them.
The main problem I am facing is that, due to lack of domain knowledge, I can't determine what inferences can be drawn from these graphs. Another issue: if I have chosen the wrong axes, what would be the correct features to choose?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be done using this data and these plots?
Just to give you some domain knowledge: the KDD Cup dataset contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size', and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their own unique 'fingerprint' (a unique combination of different feature values) that separates them from the good ones.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
You can try k-means clustering to initially separate the normal and bad connections. Also, the bad connections themselves fall into 4 main categories, so you can try k = 5, where one cluster captures the good connections and the other four capture the four malicious categories. Look at the first section of the tasks page for details.
You can also check whether some dimensions in your dataset are highly correlated. If so, you can use something like PCA to reduce the dimensionality. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
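A minimal scikit-learn sketch of that pipeline; the CSV path, the assumption that all columns are numeric, and the choice of 10 components are placeholders:

```python
# Minimal sketch: standardize, reduce with PCA, then k-means with k = 5.
# "kddcup99_numeric.csv" and the component count are hypothetical placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("kddcup99_numeric.csv")            # numeric features only

X = StandardScaler().fit_transform(df.values)       # scale before PCA/k-means
X_reduced = PCA(n_components=10).fit_transform(X)   # keep 10 principal components

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
print(pd.Series(labels).value_counts())             # cluster sizes
```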
What should be the correct chosen feature?
This is hard to tell. The data is currently very high-dimensional, so I don't think visualizing two or three of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest the following (see the sketch after this list):
1. Use all the dimensions for training and testing the model. This gives you a measure of the best achievable performance.
2. Then try removing one dimension at a time to see how much performance is affected. For example, if you remove the dimension 'srv_serror_rate' from your data and the model's performance stays almost the same, you know this dimension is not giving you any important information about the problem at hand.
3. Repeat step two until you can't find any dimension that can be removed without hurting performance.
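A minimal sketch of that leave-one-feature-out loop, using a RandomForestClassifier as a stand-in model; the CSV path and the "label" column name are hypothetical:

```python
# Minimal sketch of the leave-one-feature-out ablation described above.
# "kddcup99_labeled.csv" and the "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("kddcup99_labeled.csv")
X, y = df.drop(columns=["label"]), df["label"]

baseline = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=3).mean()
print(f"all features: {baseline:.3f}")

for col in X.columns:                                # drop one dimension at a time
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100), X.drop(columns=[col]), y, cv=3
    ).mean()
    print(f"without {col}: {score:.3f} (change of {score - baseline:+.3f})")
```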

Why we need training and test datasets in research?

I'm a newbie in the research area of data mining (text clustering) and I have a couple of questions regarding training and test datasets.
Does clustering need training and testing datasets?
Why do we need to separate the data into training and test datasets?
Sorry for the rookie questions; I hope the experts in this group can help me.
As your question is on clustering:
In cluster analysis, there usually is no training or test data split.
Because you do cluster analysis when you do not have labels, you cannot "train".
Training is a concept from machine learning, and train-test splitting is used to avoid overfitting.
But if you are not learning labels, you cannot overfit.
Properly used cluster analysis is a knowledge discovery method. You want to discover some new structure in your data, not rediscover something that is already labeled.
To train your model you need a set of relevant data that is similar, but not identical, to your testing data. For example, you could split your data so that 0.7 of it is used for training and the rest for testing. This allows your algorithm to get a feel for what it should be looking for. The remaining 0.3 can be used for testing, since it is (hopefully) a distinct set of information, which allows the algorithm to test itself.
Why split it up?
Well, if you train your algorithm on data A and then test it on data A, it will be able to identify all the information correctly, because that is what it was trained on.
For example, if when learning addition you were given the sums 3+4, 4+5, and 6+9, which you correctly solved, it would be redundant to test your knowledge of addition using the same sums.
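For completeness, a minimal scikit-learn sketch of a 70/30 split on toy data (the dataset and model here are just placeholders):

```python
# Minimal sketch: a 70/30 train/test split with scikit-learn on toy data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0    # hold out 30% the model never sees
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))   # the honest estimate
```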
Further information:
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.nltk.org/book
Hope this helps.