How is `sample` different from `TABLESAMPLE` in Spark? - scala

I'd like to create a random sub-sample of my data.
Spark's sample function (link) is the API I'd like to use, particularly because it lets me toggle whether the sampling is done with or without replacement. However, executing this function takes a long time. Based on the answers to the question "Spark sample is too slow", it seems that sample requires a full table scan.
TABLESAMPLE seems like a faster alternative, albeit the ability to toggle between sampling with and without replacement is lost.
I'd like to understand how sample and TABLESAMPLE differ, and why TABLESAMPLE executes faster than sample. Could it be that TABLESAMPLE does not require a full table scan?

TABLESAMPLE has three ways of sampling:
- percentage (or fraction): under the hood, does the same thing as sample. It is used to create a uniform random sample.
- num_rows: under the hood, does the same thing as LIMIT, which is why this form is very fast.
- BUCKET x OUT OF y: samples the specified portion (x buckets out of y) of the total.
This is stated in the documentation:
Always use TABLESAMPLE (percent PERCENT) if randomness is important. TABLESAMPLE (num_rows ROWS) is not a simple random sample but instead is implemented using LIMIT.
So the answer to whether sample and TABLESAMPLE are the same thing is no, but TABLESAMPLE used with a percentage (fraction) and sample do the same thing.
If you want to read more, Databricks has some good documentation on this.
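For reference, a minimal Scala sketch of both APIs (the input path, the events view name, the fraction and the seed below are placeholders):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("sampling-demo").getOrCreate()
val df = spark.read.parquet("/path/to/events")  // placeholder input
// DataFrame API: fraction-based sampling, with or without replacement
val withRepl    = df.sample(withReplacement = true,  fraction = 0.1, seed = 42L)
val withoutRepl = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
// SQL TABLESAMPLE: PERCENT samples a fraction like sample does, ROWS is LIMIT-like
df.createOrReplaceTempView("events")
val pctSample = spark.sql("SELECT * FROM events TABLESAMPLE (10 PERCENT)")
val rowSample = spark.sql("SELECT * FROM events TABLESAMPLE (1000 ROWS)")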

How to pass a vector from Tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am using IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are columns in my table (instead of rows/measures). So I want to collect all those cash flows in a vector and pass it on to Rserve. Passing them one at a time slows down IO.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The above code works well when there are few cash flows but takes a long time when I have a few hundred. Further, the time is spent not in calculation but in IO when communicating with the remote Rserve. If I have a local Rserve, this calculation finishes within a few seconds, while on the remote one it takes well over a minute.
Also, I want to point out that Tableau/Rserve set one argument after another, and that takes time. My expectation is that once I have a vector, it would be just one transfer and one setting of arguments, which should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau SCRIPT_XXX() functions are table calculations, which means that you invoke them on a vector of aggregate query results, and the corresponding R or Python code needs to return a vector, usually of the same size. (I think you may be able to return a scalar or smaller vector which gets replicated to appear like a vector of the same size as the argument -- but I'm not certain.)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is to create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on the row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, you're ready to do more complex tasks - such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM() etc -- and finally dig into PREVIOUS_VALUE(). Warning, PREVIOUS_VALUE() is a bit odd, and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and is about as close to a for loop as Tableau gets.

Results of test statistics evaluation of Recommender Systems given data

I was wondering if there is a source where, given some train data and test data, the evaluation statistics of recommender systems are also provided. For example, given two files train.dat and test.dat, where the data have already been split into a training and a test set containing user_id, item_id and ratings (just like in the GroupLens dataset), in the end some answer for a precision, recall or MAP@k test is provided for the performance of a non-personalized recommender system (like top rated items / most viewed) or any other recommender system.
Thank you in advance,
Regards,
Marios
The fastest way to find out how known recommenders perform on a new dataset is to grab an existing implementation and run it. If you have a prepared split, that's OK, but make sure the splitting strategy is adequate for what you are trying to measure.
Helpful projects for evaluation:
- librec
- project Rival
- MediaLens
And make sure the right measure is used for the right purpose: in the question a rating-prediction example is given, but then ranking evaluation measures are listed.

How can I set a goal for a recommendation system? (mean average precision, baselineRmse)

I am starting to develop an offline recommendation system using the ALS algorithm, and I need to set a goal for the system, so I want to know what criteria are used to evaluate recommendation systems.
I already know MAP (mean average precision) and improvement over baselineRmse, and I would like to know what performance modern recommendation systems achieve on these criteria, so I can set my goal.
Back in the early days of recommenders, people thought predicting ratings was a good idea. This has since proven to be nearly useless by itself. If you only have enough space in a UI to show a few recommendations, are you going to pick the ones you think the user will rate highest? That will always result in bad performance. Rating prediction is what RMSE was designed to measure.
MAP@k, on the other hand, is meant to measure the predictiveness of a recommender. It measures how well the training data predicts what is in the test data. It also accounts for the ordering of recommendations. Ranking/ordering of recommendations has more recently been found to have a much greater effect on the effectiveness of recommendations, because if you can only show a limited number, they had better be the ones most likely to cause a user to take action.
MAP@k also takes account of ranking in the sense that if you measure MAP@1 and MAP@10, you will see decreasing MAP scores if your first recommendation was more likely to be in the test data than the 10th. This means you are ordering recommendations roughly correctly.
For these reasons we use MAP@k. Split the "gold standard" dataset you will use in later tests and keep the split static; something like 80%-20% will work, split by random choice or by time, with the most recent 20% used as the test split. Build your model on the 80%, then for each interaction in the 20% get recommendations and see if they contain the item actually interacted with in the test set. The aggregate of all of these goes into the MAP@k calculation; k is based on how many recommendations you ask for.
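To make the calculation concrete, here is a small Scala sketch of average precision at k (my own illustration, not the ActionML or Kaggle code; recommended is the ranked list from the model built on the 80%, actual is the set of items the user touched in the held-out 20%):
def apk(recommended: Seq[String], actual: Set[String], k: Int): Double =
  if (actual.isEmpty) 0.0
  else {
    var hits = 0
    var score = 0.0
    // walk the top-k recommendations and add precision at each hit position
    for ((item, i) <- recommended.take(k).zipWithIndex if actual.contains(item)) {
      hits += 1
      score += hits.toDouble / (i + 1)
    }
    score / math.min(actual.size, k)
  }
// MAP@k is the mean of apk over all users in the test split
def mapAtK(perUser: Seq[(Seq[String], Set[String])], k: Int): Double =
  perUser.map { case (recs, heldOut) => apk(recs, heldOut, k) }.sum / perUser.size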
See these references and some tools we have to do this:
The Kaggle blog references the Python code that both they and we at ActionML use: https://www.kaggle.com/wiki/MeanAveragePrecision
ActionML's analysis Python code to split data sets and run MAP@k, where we use the Kaggle function: https://github.com/actionml/analysis-tools

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced to this area by doing some tests, and as a first experiment I'd like to create clusters of tweets based on content similarity. The basic idea for the experiment would be storing tweets in a database and periodically recalculating the clustering (i.e. using a cron job). Please note that the database would receive new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Leaving aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10, which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold, but the similarity with C1 is higher. In that case Tn should go to C1. This could also be done by brute force, assigning the tweet to the cluster with maximum similarity.
And last of all, there's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case would be to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ...", which are similar to each other) would be one case, and "My Klout score is ..." another. Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks to me like you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
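A minimal sketch of the min-hashing step itself (my own illustration, reusing the hypothetical shingles helper above; the number of hash functions is an arbitrary choice):
import scala.util.hashing.MurmurHash3
// for each of numHashes seeded hash functions, keep the minimum hash over the tweet's features
def minHashSignature(features: Set[String], numHashes: Int = 64): Array[Int] =
  Array.tabulate(numHashes) { seed =>
    if (features.isEmpty) Int.MaxValue
    else features.map(f => MurmurHash3.stringHash(f, seed)).min
  }
// the fraction of matching signature positions estimates the Jaccard similarity of two tweets
def estimatedJaccard(a: Array[Int], b: Array[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length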
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
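As a rough sketch of that assignment rule (my own illustration, reusing the hypothetical shingles and jaccard helpers from above; the two similarity thresholds play the role of the inner and outer radii and need tuning):
// canopy assignment over tweets, using similarity instead of distance:
// simLoose corresponds to the outer radius, simTight to the inner radius
case class Canopy(center: Set[String], var members: List[String])
def canopies(tweets: Seq[String], simTight: Double = 0.6, simLoose: Double = 0.3): List[Canopy] = {
  var result = List.empty[Canopy]
  for (t <- tweets) {
    val feats = shingles(t)
    // join every existing canopy within the outer radius (canopies may overlap)
    for (c <- result if jaccard(feats, c.center) >= simLoose)
      c.members = t :: c.members
    // if no canopy center is within the inner radius, this tweet starts a new canopy
    if (!result.exists(c => jaccard(feats, c.center) >= simTight))
      result = Canopy(feats, List(t)) :: result
  }
  result
}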
However, don't expect useful results from clustering tweets. Tweet data is just too much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Non-linear regression models in PostgreSQL using R

Background
I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website and the simplest page allows users to choose category and city. They then get back a very simple report (without the parameters and calculations section).
The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.)
Tool Set
The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports.
Poor Model Choice
Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows:
SELECT
regr_slope( amount, year_taken ),
regr_intercept( amount, year_taken ),
corr( amount, year_taken )
FROM
temp_regression
INTO STRICT slope, intercept, correlation;
The results are returned to JasperReports using:
SELECT
year_taken,
amount,
year_taken * slope + intercept,
slope,
intercept,
correlation,
total_measurements
INTO result;
JasperReports calls into PostgreSQL using the following parameterized analysis function:
SELECT
year_taken,
amount,
measurements,
regression_line,
slope,
intercept,
correlation,
total_measurements,
execute_time
FROM
climate.analysis(
$P{CityId},
$P{Elevation1},
$P{Elevation2},
$P{Radius},
$P{CategoryId},
$P{Year1},
$P{Year2}
)
ORDER BY year_taken
This is not an optimal solution because it gives the false impression that the climate is changing at a slow, but steady rate.
Questions
Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope:
What is a better regression model to apply?
What CRAN (R) packages provide such models? (Installable, ideally, using apt-get.)
How can the R functions be called within a PostgreSQL function?
If no such functions exist:
What parameters should I try to obtain for functions that will produce the desired fit?
How would you recommend showing the best fit curve?
Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.)
Thank you!
The awesome pl/r package allows you to run R inside PostgreSQL as a procedural language. There are some gotchas because R likes to think about data in terms of vectors, which is not how an RDBMS works. It is still a very useful package, as it gives you R inside of PostgreSQL, saving you some of the round trips of your architecture.
And pl/r is apt-get-able for you as it has been part of Debian / Ubuntu for a while. Start with apt-cache show postgresql-8.4-plr (that is on testing, other versions/flavours have it too).
As for the appropriate modeling: that is a whole different ballgame. loess is a fair suggestion for something non-parametric, and you probably also want some sort of dynamic model, either ARMA/ARIMA or lagged regression. The choice of modeling is pretty critical given how politicized the topic is.
I don't think autoregression is what you want. Non-linear isn't what you want either, because that implies discontinuous data. You have continuous data, it just may not be a straight line. If you're just visualizing, and especially if you don't know what the shape is supposed to be, then loess is what you want.
It's easy to also get a confidence interval band around the line if you just plot the data with ggplot2.
qplot(x, y, data = df, geom = 'point') + stat_smooth()
That will make a nice plot.
If you want a simpler graph in base R:
plot(x, y)
lines(loess.smooth(x,y))
May I propose a different solution? Just use PostgreSQL to pull the data, feed it into some R script and finally show the results. The R script may be as complicated as you want as long as the user doesn't have to deal with it.
You may want to have a look at rapache, an Apache module that allows running R scripts in a webpage.
A couple of videos illustrating its use:
Hello world application
Jeffrey Horner's presentation of RApache + links to working apps
In particular, check how the San Francisco Estuary Institute Web Query Tool allows the user to interact with the parameters.
As for the regression, I'm not an expert, so I may be saying something extremely stupid... but wouldn't something like a LOESS regression be OK for this?