Deciding parameters of the DBSCAN algorithm for tweet clustering - cluster-analysis

I am trying to cluster tweets to detect breaking news, using DBSCAN as the clustering technique, but I am unable to arrive at good values of epsilon and min_sample_points. To cluster the tweets I am making batches of 2000 tweets and applying the clustering algorithm to each batch. For feature extraction I am using the TF-IDF vectorizer from the scikit-learn package, with max_df=0.6, min_df=5, and bi-grams as parameters for the vectorizer. By and large, the result shows most tweets as outliers or lumps many random tweets into a single cluster. Example of values I have used: eps=0.2 and min_samples=8. I am also avoiding the k-means algorithm, since the number of clusters (k) cannot be known in advance for this problem and the shape of the clusters may not necessarily be spherical.
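For reference, a minimal sketch of the pipeline described above, assuming scikit-learn and a list of tweet strings called `tweets` (the variable name is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# tweets: a batch of ~2000 tweet strings, assumed to be loaded already
vectorizer = TfidfVectorizer(max_df=0.6, min_df=5, ngram_range=(1, 2))  # (2, 2) for bi-grams only
X = vectorizer.fit_transform(tweets)

# Cosine distance is often a better fit than Euclidean for TF-IDF vectors;
# eps and min_samples below are the values mentioned in the question.
db = DBSCAN(eps=0.2, min_samples=8, metric="cosine").fit(X)
labels = db.labels_  # -1 marks outliers ("noise" points)

print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("outliers:", (labels == -1).sum())
```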

For breaking news, there are much better approaches than clustering.
Text data, and Twitter in particular, is incredibly noisy. Many tweets are just complete nonsense. But the main problem is that they are too short. If you only have a few words, there is too little data to measure distance. "The car hit a wall." and "A car on Wall street" have very similar words (based on TF-IDF), yet they have very different meanings.
So I'm not surprised this does not work well. It's actually not the clustering which "fails" but your distance function.
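To see the point about the distance function concretely, one can compare the two example sentences directly; a small sketch using scikit-learn (the exact similarity value depends on the vectorizer settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The car hit a wall.", "A car on Wall street"]
X = TfidfVectorizer().fit_transform(docs)

# The two sentences share the tokens "car" and "wall", so their TF-IDF
# vectors end up fairly similar even though the meanings are unrelated.
print(cosine_similarity(X[0], X[1])[0, 0])
```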

Related

The performance of k-means evaluated by different metrics

I am trying to evaluate the clusters generated by K-means with different metrics, but I am not sure about whether the results are good or not.
I have 40 documents to cluster in 6 categories.
I first converted them into tf-idf vectors, then I clustered them by K-means (k = 6). Finally, I tried to evaluate the results by different metrics.
Because I have the real labels of the documents, I tried to calculate the F1 score and accuracy. But I also want to know the performance on metrics that do not need real labels, such as the silhouette score.
For F1 score and accuracy, the results are about 0.65 and 0.88 respectively, while for the silhouette score, it is only about 0.05, which means I may have overlapping clusters.
In this case, can I say that the results are acceptable? Or should I handle the overlapping issue by trying other methods instead of tf-idf to represent the documents or other algorithms to cluster?
With such tiny data sets, you really need to use a measure that is adjusted for chance.
Do the following: label each document randomly with an integer 1..6
What F1 score do you get? Now repeat this 100 times: what is the best result you get? A completely random result can score pretty well on such tiny data!
Because of this problem, the standard measure used in clustering is the adjusted Rand index (ARI). A similar adjustment also exists for NMI: adjusted mutual information (AMI), but AMI is much less common.
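A quick sketch of the random-labeling check described above, assuming the real labels are available in an array `y_true` with 6 classes (here a random placeholder is used, since the actual labels are not part of the question):

```python
import numpy as np
from sklearn.metrics import f1_score, adjusted_rand_score

rng = np.random.default_rng(0)
n_docs, n_classes = 40, 6
# y_true: the real document labels; random integers are only a placeholder here
y_true = rng.integers(0, n_classes, size=n_docs)

best_f1, best_ari = 0.0, -1.0
for _ in range(100):
    y_rand = rng.integers(0, n_classes, size=n_docs)
    best_f1 = max(best_f1, f1_score(y_true, y_rand, average="macro"))
    best_ari = max(best_ari, adjusted_rand_score(y_true, y_rand))

# On 40 documents the best random F1 can be surprisingly high,
# while the adjusted Rand index stays close to 0.
print("best random F1:", best_f1, "best random ARI:", best_ari)
```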

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm will be best for this kind of data?
I think hierarchical clustering is a good choice. Have a look here: Clustering Algorithms.
The simpler way to do clustering is the k-means algorithm. If all of your attributes are numerical, then this is the easiest way of doing the clustering. Even if they are not, you would just have to find a distance measure for categorical or nominal attributes, but k-means is still a good choice. K-means is a partitional clustering algorithm; I wouldn't use hierarchical clustering for this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether all clusters have to be completely separate from each other and not nested.
Take care.
1) First, try k-means. If that fulfills your needs, that's it. Play with different numbers of clusters (controlled by the parameter k); see the sketch after this list. There are a number of implementations of k-means, and you can implement your own version if you have good programming skills.
K-means generally works well if the data has a circular/spherical shape. This means there is some Gaussianity in the data (the data comes from a Gaussian distribution).
2) If k-means doesn't fulfill your expectations, it is time to read and think more. I suggest reading a good survey paper; the most common techniques are implemented in several programming languages and data mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is not enough, it is time to design a new technique. Then you can think it through yourself or team up with a machine learning expert.
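As a rough sketch of point 1, assuming the attributes have been assembled into a numeric matrix `X` and scaled (the variable names and the range of k are illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(X)  # X: one row per data point, numeric columns

# Try several values of k and compare a simple internal quality measure.
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, silhouette_score(X_scaled, labels))
```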
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering, such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
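A minimal sketch of the Gaussian mixture approach, assuming the numeric attributes (consumption, generation, grid feed-in, tariff) are in a matrix `X` and that BIC is used to pick the number of components (both are assumptions, not part of the original answer):

```python
from sklearn.mixture import GaussianMixture

# X: rows are data points, columns are the numeric energy attributes
candidates = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 7)]
best = min(candidates, key=lambda gm: gm.bic(X))  # lower BIC is better

labels = best.predict(X)       # hard cluster assignments
probs = best.predict_proba(X)  # soft (probabilistic) memberships
print("chosen components:", best.n_components)
```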

Can tfidf be weighted to improve classification of sparse data in a corpus?

I am currently using tfidf prior to performing classification on a number of websites based on their content. Unfortunately, my training data is not uniform: about 70% of the pre-labeled websites are news sites, while the rest (tech, arts, entertainment, etc.) are each a vast minority.
My questions are the following:
Is it possible to adjust tfidf so that it weighs different labels differently and behaves as if the data were uniform? Should I perhaps be using a different approach in this case? I am currently using the Gaussian Naive Bayes classifier after the tfidf analysis; would something else be better suited in this specific case?
Is it possible to have tfidf give me a list of possible labels when the probability that it is exactly a given label is below a certain threshold? For example, if the vector entries are close enough that it is only slightly (< 1-2%) more probable that it is one class rather than another, can it print both?
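As a sketch of what the second question describes, assuming a fitted scikit-learn classifier with `predict_proba` (e.g. the Gaussian Naive Bayes model mentioned above) and a 2% margin; the helper name, threshold, and variables are illustrative:

```python
def candidate_labels(clf, X_dense, classes, margin=0.02):
    """Return every label whose probability is within `margin` of the top label."""
    proba = clf.predict_proba(X_dense)  # shape: (n_samples, n_classes)
    results = []
    for row in proba:
        top = row.max()
        results.append([c for c, p in zip(classes, row) if top - p <= margin])
    return results

# Example usage (clf and X_test assumed to exist; GaussianNB needs dense input):
# print(candidate_labels(clf, X_test.toarray(), clf.classes_))
```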

K-means analysis on the KDD Cup 99 dataset

What kind of knowledge/inference can be made from a k-means clustering analysis of the KDD Cup 99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup dataset and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inference can be drawn from these graphs. Another problem is: if I have chosen the wrong axes, then what should be the correct feature to choose?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be done with this data and these plots?
Just to give you some domain knowledge: the KDD Cup dataset contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their own unique 'fingerprint' (a unique combination of feature values) that separates them from the good ones.
What kind of knowledge/inference can be made from a k-means clustering analysis of the KDD Cup 99 dataset?
You can try k-means clustering to initially separate the normal and the bad connections. The bad connections themselves fall into 4 main categories, so you can try k = 5, where one cluster captures the good connections and the other four capture the four malicious categories. Look at the first section of the tasks page for details.
You can also check whether some dimensions in your data set are highly correlated. If so, you can use something like PCA to reduce the number of dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
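A short sketch of the k = 5 clustering and the PCA idea, assuming the numeric KDD Cup features are already in a matrix `X` (loading the data and encoding the categorical columns is omitted):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain ~95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)

# One cluster for normal traffic, four for the main attack categories.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
```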
What should be the correct feature to choose?
This is hard to tell. The data is currently very high dimensional, so I don't think visualizing 2 or 3 of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest:
Use all the dimensions for training and testing the model. This will give you a measure of the best achievable performance.
Then try removing one dimension at a time to see how much the performance is affected. For example, if you remove the dimension 'srv_serror_rate' from your data and the model performance stays almost the same, then you know this dimension is not giving you any important information about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.
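A sketch of that remove-one-dimension-at-a-time procedure as a simple loop, assuming labelled data `X` (a pandas DataFrame) and `y`, and using a basic classifier as a placeholder model (both the model choice and the variable names are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier(random_state=0)
baseline = cross_val_score(model, X, y, cv=5).mean()  # performance with all dimensions

for col in X.columns:
    score = cross_val_score(model, X.drop(columns=[col]), y, cv=5).mean()
    # A drop close to zero suggests the dimension carries little extra information.
    print(col, baseline - score)
```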

Clustering of data - Pre-processing of data

These days I am using some clustering algorithms, and I wanted to ask a question related to this field. Maybe those who work in this field already have the answer.
During clustering I need some training data, which I am going to cluster. The number of iterations (e.g. in the k-means algorithm) depends on the amount of training data (the number of vectors). Is there any method to find the most important data within the training data? What I mean is: instead of training k-means with all the data, maybe there is a method to find just the important vectors (those that affect the clusters the most) and use only these "important" vectors from the training data to train the algorithm.
I hope you understood me.
Thank you for reading and trying to answer.
"Training" and "Test" data is a concept from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
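A sketch of that sampling idea, assuming scikit-learn and a NumPy data matrix `X`: fit k-means on a random subset, then assign all remaining points to the learned centers (the sample size and number of clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=min(10_000, len(X)), replace=False)

# Fit on the random sample only, which is much faster on large data sets.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X[sample_idx])

labels = km.predict(X)  # assign every point to the nearest learned center
```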