Bipartite clustering - cluster-analysis

I have a bipartite graph of queries and their associated clicked URLs. Based on this data, I want to cluster the queries. I am planning to use a simple set-based formula to determine whether two queries are the same or not. My initial input data is in the following format:
Url    Queries
Url1   Q1, Q2, Q3
Url2   Q1, Q5, Q6
I am having a tough time determining an appropriate algorithm to solve this problem.
I would appreciate any help.
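For concreteness, here is the kind of set-based comparison I have in mind, as a small Python sketch. It assumes the formula is Jaccard similarity over the clicked URLs, and the greedy grouping at the end is only a stand-in for whatever proper algorithm gets suggested:

from collections import defaultdict

# Hypothetical input, mirroring the "Url -> Queries" format above.
url_to_queries = {
    "Url1": {"Q1", "Q2", "Q3"},
    "Url2": {"Q1", "Q5", "Q6"},
}

# Invert the bipartite graph: each query maps to the set of URLs clicked for it.
query_to_urls = defaultdict(set)
for url, queries in url_to_queries.items():
    for q in queries:
        query_to_urls[q].add(url)

def jaccard(a, b):
    # Set-based similarity: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 0.0

# Greedy single-pass grouping: put a query into the first cluster whose
# representative shares enough URLs with it, otherwise start a new cluster.
THRESHOLD = 0.5  # assumed cut-off, to be tuned
clusters = []    # list of (representative_urls, [queries])
for q, urls in query_to_urls.items():
    for rep_urls, members in clusters:
        if jaccard(urls, rep_urls) >= THRESHOLD:
            members.append(q)
            break
    else:
        clusters.append((set(urls), [q]))

for rep_urls, members in clusters:
    print(members)

A more principled route would be to build the full query-query similarity matrix from the same Jaccard scores and feed it to an agglomerative clustering routine, but the set-based comparison itself stays the same.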

Related

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use hospital access logs recording users accessing patient information and try to detect abnormal access behavior. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip code) and a patient (e.g. age, sex, zip code). There are about 13-15 variables under consideration.
I was using R before and now I am using Python. I am able to use either, depending on which tools/libraries you suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data and let each category in a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables; see the small sketch after these questions.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more, or whether the proximity of data points still indicates similar behavior. Does expanding each category into a dummy column with true/false values help? What would the distance be then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and looking at just two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events triggered, I would just look at the frequency of each event.
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approach) would you suggest? And what's the best way to apply it?
Thanks a lot.
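To make the dummy-coding question above concrete, this is roughly what I understand by expanding the categoricals; the column names are invented and the sketch is only meant to illustrate the question:

import pandas as pd

# Hypothetical access-log rows; column names are made up for illustration.
df = pd.DataFrame({
    "employee_role": ["nurse", "physician", "clerk", "nurse"],
    "department":    ["ICU", "ER", "billing", "ICU"],
    "patient_age":   [67, 34, 51, 80],
})

# Expand each categorical variable into 0/1 dummy columns.
encoded = pd.get_dummies(df, columns=["employee_role", "department"])
print(encoded.shape)  # a ~100-level role would become ~100 extra binary columns

# Scale the numeric columns to [0, 1] so they don't dominate a Euclidean distance.
age = encoded["patient_age"]
encoded["patient_age"] = (age - age.min()) / (age.max() - age.min())

My worry is exactly whether a Euclidean distance over a mix of scaled numerics and hundreds of 0/1 columns still means anything.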
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
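As an illustration (and only one of several ways to define it), a hand-rolled Gower distance plus a simple k-nearest-neighbour outlier score could look like this; the split into numeric and categorical columns is assumed:

import numpy as np
import pandas as pd

def gower_distance(df, numeric_cols, categorical_cols):
    # Range-scaled absolute difference for numeric columns,
    # simple mismatch (0/1) for categorical columns, averaged over columns.
    parts = []
    for col in numeric_cols:
        col_range = (df[col].max() - df[col].min()) or 1.0
        vals = df[col].to_numpy(dtype=float)
        parts.append(np.abs(vals[:, None] - vals[None, :]) / col_range)
    for col in categorical_cols:
        vals = df[col].to_numpy()
        parts.append((vals[:, None] != vals[None, :]).astype(float))
    return np.mean(parts, axis=0)

# Toy data with assumed column names.
df = pd.DataFrame({
    "employee_role": ["nurse", "nurse", "clerk", "nurse"],
    "patient_age":   [60, 62, 35, 95],
})
D = gower_distance(df, numeric_cols=["patient_age"], categorical_cols=["employee_role"])

# Distance-based outlier score: mean Gower distance to the k nearest neighbours.
k = 2
knn_score = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
print(knn_score)  # rows with the largest scores are the most isolated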
You can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a python implementation.
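I can't speak for which implementation the link points to, but one commonly used Python package is kmodes; a rough sketch, where the column layout, cluster count and data are all made up:

import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Rows are [employee_role, department, patient_age]; the first two columns
# are categorical, the last one is numeric.
X = np.array([
    ["nurse",     "ICU",     67],
    ["physician", "ER",      34],
    ["clerk",     "billing", 51],
    ["nurse",     "ICU",     80],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", n_init=5, random_state=0)
labels = kp.fit_predict(X, categorical=[0, 1])  # indices of the categorical columns
print(labels)

Very small clusters, or points far from their cluster's prototype, are then the natural outlier candidates.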

Different Aggregation calculations of a measure using two dimensions in Tableau

It is a Tableau 8.3 Desktop Edition question.
I am trying to aggregate data using two different dimensions. So, I want to aggregate twice: first I want to sum over all the rows, and then multiply the results in a cumulative manner (so I can build a graph). How do I do that? OK, too vague; here are some more details:
I have a set of historical data. The columns are the date, the rows are the categories.
Easy part: I would like to sum all the rows.
Hard part: Given those sums, I want to build a graph that, for each date, shows the product of all the sums from the earliest date up to that date.
In other words:
Take the sum of all rows, call it x_i, where i is the date.
For each date i find y_i such that y_i = x_0 * x_1 * ... * x_i (if there is missing data, consider it to be one)
Then show a line graph for the y values versus the date.
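Outside of Tableau this is easy to express; for example, the calculation I'm after looks like the following pandas sketch (the column names are just placeholders):

import pandas as pd

# Long-format data: one row per (date, category, value); names are placeholders.
df = pd.DataFrame({
    "date":     ["2015-01", "2015-01", "2015-02", "2015-04"],
    "category": ["A", "B", "A", "B"],
    "value":    [1.1, 0.9, 1.2, 0.8],
})

# x_i: sum over all categories (rows) for each date.
x = df.groupby("date")["value"].sum()

# Dates with no data count as 1 so they don't affect the product.
all_dates = ["2015-01", "2015-02", "2015-03", "2015-04"]
x = x.reindex(all_dates, fill_value=1.0)

# y_i = x_0 * x_1 * ... * x_i
y = x.cumprod()
print(y)  # this is the series I want to plot against the date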
I have searched for a solution for this and tried to figure it out by myself, but failed.
Thank you very much for your time and help :)
You need n calculated fields (one per column you have), and you have to do the calculation manually:
y_i = sum(field0)*sum(field1)
Basically, this is because you cannot iterate over columns. For Tableau, each column represents a different dimension or measure, so it won't assume there is a logical order among them; that is, it won't assume that column A comes before column B. It will treat A and B as different things.
Tableau works better with tables organized as databases. So if you have year columns, you should reorganize your data: eliminate all those columns and create a single field called 'Date', which will identify the value of your measure for that date. Yes, you will have fewer columns but far more rows. But Tableau works better this way (for very good reasons).
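If the data passes through Python first, that reshape is a standard wide-to-long "melt"; a small sketch with made-up column names:

import pandas as pd

# Wide format: one column per date (what Tableau struggles with).
wide = pd.DataFrame({
    "category": ["A", "B"],
    "2015-01":  [1.1, 0.9],
    "2015-02":  [1.2, 0.8],
})

# Long format: a single 'Date' field plus the measure -- fewer columns, more rows.
long_df = wide.melt(id_vars="category", var_name="Date", value_name="value")
print(long_df)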
Tableau 9.0 allows you to do that reshaping directly. I only watched a demo (it was launched yesterday), but I understand that now there is an option to select those columns (in the Data Connection tab) and convert them to a database format.
With that done, you can use the PREVIOUS_VALUE function to help you. I'm not with Tableau right now; as soon as I get to it I'll update this with the final answer. Unless you take the lead and discover it yourself before then ;)

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates of these vectors by tag. So, for example, if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and, for each one, a 3000x1 vector that is the element-wise sum of all the vectors tagged with that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables: one with the tagging information and an ID, and a second table with "VectorDate, ID, ElementNumber, Value", which has a row for each element of each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) and have seen that some, like MongoDB, allow storing entire vectors as part of a single document - but I'm unsure whether they would efficiently support the aggregations we're trying to do (adding each element of a document's vector to the corresponding element of other documents' vectors). I've also read that the $unwind operation this requires isn't that efficient.
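For reference, the kind of MongoDB aggregation I think this would require looks roughly like the following (the collection and field names are made up, and I have no idea how it performs at 6000x3000 scale):

from pymongo import MongoClient

# Assumed document shape:
# { "vector_date": 20140101, "country": "CA", "sector": "Energy",
#   "vector": [0.12, 3.4, ...] }   # 3000 doubles
client = MongoClient()
coll = client["finance"]["daily_vectors"]

pipeline = [
    {"$match": {"vector_date": 20140101}},
    # One document per vector element, keeping its position in the array.
    {"$unwind": {"path": "$vector", "includeArrayIndex": "element"}},
    # Sum corresponding elements across all vectors sharing the same sector.
    {"$group": {
        "_id": {"sector": "$sector", "element": "$element"},
        "total": {"$sum": "$vector"},
    }},
    {"$sort": {"_id.sector": 1, "_id.element": 1}},
]

for row in coll.aggregate(pipeline, allowDiskUse=True):
    pass  # reassemble the rows into one 3000-element vector per sector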
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!

Reducing Large Datasets with MongoDB and D3

I'm working on a D3 visualization and luckily have made some progress. However, I've run into an issue... and to be honest, I'm not sure if it's a MongoDB issue or a D3 issue. You see, I'm trying to make a series of graphs from a set of sensor points (my JSON objects contain timestamps, light, temperature, humidity, and motion detection levels for each datapoint). However, my sensors are uploading data to my MongoDB database every 8 seconds. So, if I query the MongoDB database for just one day's worth of data, I get 10,800 datapoints. Worse, if I were to ask for one month of data, I'd be swamped with 324,000 datapoints. My issue is that my D3 visualization slows to a crawl when dealing with more than about 1000 points (I'm visualizing the data on four different graphs, each of which uses a single brush to select a certain domain on the graph).
Is there a way to limit the amount of data I'm trying to visualize? Is this better done in MongoDB (basically filtering the data I'm querying and only getting every nth data point, based on how big a time range I'm querying)? Or is there a better way? Should I try to filter the data using D3 once I've retrieved the entire dataset? What is the best way to go about reducing the number of points I need to deal with? Thanks in advance.
MongoDB is great at filtering. If you really only need a subset of the data, specify that in a find query -- this could limit you to a subset of time, or, if you're clever, only fetch data for the first minute of every hour, or similar.
Or you can literally reduce the amount of data coming out of MongoDB using the aggregation framework. This could be used to get partial sums or averages or similar.
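A rough sketch of what that could look like with pymongo (the collection and field names are assumptions, and the timestamps are assumed to be stored as BSON dates), bucketing the 8-second readings into 10-minute averages:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
coll = client["sensors"]["readings"]  # docs: {timestamp, light, temperature, humidity, motion}

pipeline = [
    {"$match": {"timestamp": {"$gte": datetime(2015, 4, 1),
                              "$lt": datetime(2015, 5, 1)}}},
    # Group readings into 10-minute buckets and average each sensor channel.
    {"$group": {
        "_id": {
            "year":   {"$year": "$timestamp"},
            "day":    {"$dayOfYear": "$timestamp"},
            "hour":   {"$hour": "$timestamp"},
            "bucket": {"$subtract": [{"$minute": "$timestamp"},
                                     {"$mod": [{"$minute": "$timestamp"}, 10]}]},
        },
        "light":       {"$avg": "$light"},
        "temperature": {"$avg": "$temperature"},
        "humidity":    {"$avg": "$humidity"},
        "motion":      {"$avg": "$motion"},
    }},
    {"$sort": {"_id.year": 1, "_id.day": 1, "_id.hour": 1, "_id.bucket": 1}},
]

points = list(coll.aggregate(pipeline))  # ~4,300 points per month instead of 324,000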

Non-linear regression models in PostgreSQL using R

Background
I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website and the simplest page allows users to choose a category and a city. They then get back a very simple report (without the parameters and calculations section).
The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.)
Tool Set
The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports.
Poor Model Choice
Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows:
SELECT
regr_slope( amount, year_taken ),
regr_intercept( amount, year_taken ),
corr( amount, year_taken )
FROM
temp_regression
INTO STRICT slope, intercept, correlation;
The results are returned to JasperReports using:
SELECT
year_taken,
amount,
year_taken * slope + intercept,
slope,
intercept,
correlation,
total_measurements
INTO result;
JasperReports calls into PostgreSQL using the following parameterized analysis function:
SELECT
year_taken,
amount,
measurements,
regression_line,
slope,
intercept,
correlation,
total_measurements,
execute_time
FROM
climate.analysis(
$P{CityId},
$P{Elevation1},
$P{Elevation2},
$P{Radius},
$P{CategoryId},
$P{Year1},
$P{Year2}
)
ORDER BY year_taken
This is not an optimal solution because it gives the false impression that the climate is changing at a slow, but steady rate.
Questions
Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope:
What is a better regression model to apply?
What CRAN packages provide such models? (Installable, ideally, using apt-get.)
How can the R functions be called within a PostgreSQL function?
If no such functions exist:
What parameters should I try to obtain for functions that will produce the desired fit?
How would you recommend showing the best fit curve?
Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.)
Thank you!
The awesome pl/r package allows you to run R inside PostgreSQL as a procedural language. There are some gotchas, because R likes to think about data in terms of vectors, which is not what an RDBMS does. It is still a very useful package, as it gives you R inside of PostgreSQL, saving you some of the round trips of your architecture.
And pl/r is apt-get-able for you as it has been part of Debian / Ubuntu for a while. Start with apt-cache show postgresql-8.4-plr (that is on testing, other versions/flavours have it too).
As for the appropriate modeling: that is a whole different ballgame. loess is a fair suggestion for something non-parametric, and you probably also want some sort of dynamic model, either ARMA/ARIMA or lagged regression. The choice of modeling is pretty critical given how politicized the topic is.
I don't think autoregression is what you want. Non-linear isn't what you want either, because that implies discontinuous data. You have continuous data; it just may not be a straight line. If you're just visualizing, and especially if you don't know what the shape is supposed to be, then loess is what you want.
It's easy to also get a confidence interval band around the line if you just plot the data with ggplot2.
qplot(x, y, data = df, geom = 'point') + stat_smooth()
That will make a nice plot.
If you want a simpler graph in straight R:
plot(x, y)
lines(loess.smooth(x,y))
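If any part of the stack ends up in Python rather than R, statsmodels has a comparable LOWESS smoother; a minimal sketch with made-up data standing in for (year_taken, amount):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Toy stand-in for the query results; replace with the real (year, amount) pairs.
x = np.arange(1900, 2010)
y = 0.002 * (x - 1900) + np.random.normal(scale=0.5, size=x.size)

smoothed = sm.nonparametric.lowess(y, x, frac=0.3)  # columns: [x, fitted y]

plt.scatter(x, y, s=10)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.show()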
May I propose a different solution? Just use PostgreSQL to pull the data, feed it into some R script and finally show the results. The R script may be as complicated as you want as long as the user doesn't have to deal with it.
You may want to have a look at rapache, an Apache module that allows running R scripts in a webpage.
A couple of videos illustrating its use:
Hello world application
Jeffrey Horner's presentation of RApache + links to working apps
In particular check how the San Francisco Estuary Institute Web Query Tool allows the user to interact with the parameters.
As for the regression, I'm not an expert, so I may be saying something extremely stupid... but wouldn't something like a LOESS regression be OK for this?