How to find all lines in a set that coincide in a single point? - matlab

Suppose I am given a set of lines. How can I partition this set into a number of clusters, such that all the lines in each cluster intersect at a single common point?

If the number of lines N is reasonable, then you can use an O(N^3) algorithm:
for every pair of lines (in the form A*x + B*y + C = 0), check whether they intersect; exclude pairs of parallel, non-coincident lines, i.e. those with determinant
|A1 B1|
|A2 B2| = 0
For every intersecting pair, check whether another line shares the same intersection point, which is the case when the determinant
|A1 B1 C1|
|A2 B2 C2| = 0
|A3 B3 C3|
If N is too large for the cubic algorithm, then calculate all intersection points (up to O(N^2) of them) and add these points to a map structure (a hash table, for example). Then check for matching points. Don't forget about numerical error issues.
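As an illustrative sketch of the map-based approach (Python here rather than MATLAB; the tolerance and the rounding used to build the hash key are assumptions you would tune for your data):

from collections import defaultdict

def cluster_lines(lines, tol=1e-9):
    # lines: list of (A, B, C) coefficients for A*x + B*y + C = 0
    # returns {rounded intersection point: set of line indices through it}
    clusters = defaultdict(set)
    n = len(lines)
    for i in range(n):
        A1, B1, C1 = lines[i]
        for j in range(i + 1, n):
            A2, B2, C2 = lines[j]
            det = A1 * B2 - A2 * B1
            if abs(det) < tol:                 # parallel (or coincident): skip
                continue
            x = (B1 * C2 - B2 * C1) / det      # Cramer's rule for the
            y = (A2 * C1 - A1 * C2) / det      # intersection point
            key = (round(x, 6), round(y, 6))   # crude tolerance via rounding
            clusters[key].update((i, j))
    return clusters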


Clustering matrix distance between 3 time series

I have a question about the application of clustering techniques, more concretely K-means.
I have a data frame with 3 sensors (A,B,C):
time        A   B   C
8:00:00     6  10  11
8:30:00    11  17  20
9:00:00    22  22  15
9:30:00    20  22  21
10:00:00   17  26  26
10:30:00   16  45  29
11:00:00   19  43  22
11:30:00   20  32  22
...       ...  ...  ...
And I want to group sensors that have the same behavior.
My question is: looking at the data frame above, should I calculate the correlation between the columns of the data frame and then apply the Euclidean distance to this correlation matrix, thus obtaining a 3 x 3 matrix of distances?
Or should I transpose my data frame and then compute the dist() matrix with the Euclidean metric, which would also give me a 3 x 3 matrix of distances?
You have just three sensors. That means you'll need three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me, and certainly not k-means. K-means is for points (!) in R^d for small d.
Choose any form of time series similarity that you like. Could be simply correlation, but also DTW and the like.
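For example, a minimal sketch (Python/pandas; the file name and column layout are assumed from the table above) of a correlation-based dissimilarity between the three series:

import pandas as pd

df = pd.read_csv("sensors.csv")        # assumed columns: time, A, B, C
corr = df[["A", "B", "C"]].corr()      # 3 x 3 Pearson correlation matrix
dissimilarity = 1 - corr               # 0 = same behaviour, 2 = opposite
print(dissimilarity)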
Q1: No. Why: The correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently
For the first row, R's built-in dist() function (which uses Euclidean distance by default)
dist(c(6, 10, 11))
gives you the pairwise distances between the values:
  1 2
2 4
3 5 1
Items 2 and 3 are the closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
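As an illustration of the linkage idea, here is a small scipy sketch (assuming scipy is available) of single, complete and average linkage over those same three values:

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

d = pdist([[6], [10], [11]])            # pairwise distances: 4, 5, 1
print(linkage(d, method="single"))      # min linkage
print(linkage(d, method="complete"))    # max linkage
print(linkage(d, method="average"))     # average linkage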
What I would do using R's built-in kmeans() function:
Ignore the date column,
(assuming there are no NA values in any A,B,C columns)
scale the data if necessary (here they all seem to have the same order of magnitude)
perform a KMeans analysis on the A, B, C columns, with k = 1...n; evaluate the results
perform a final KMeans with your suitable choice of k
get the cluster assignments for each row
put them in a new column to the right of C
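A rough sketch of that workflow, using pandas and scikit-learn instead of R's kmeans() (the file name, column names, and the choice k = 3 are assumptions):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sensors.csv")                 # assumed: columns time, A, B, C
X = df[["A", "B", "C"]].dropna()                # ignore the time column, drop NAs
X_scaled = StandardScaler().fit_transform(X)    # scale if magnitudes differ

# try several k and inspect the within-cluster sum of squares
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, km.inertia_)

# final run with the chosen k; put the assignments in a new column
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
df.loc[X.index, "cluster"] = final.labels_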

How does perl resolve possible hash collisions in hashes?

As we know, perl implements its 'hash' type as a table with calculated indexes, where these indexes are truncated hashes.
As we also know, a hashing function can (and will, by probability) collide, giving the same hash to 2 or more different inputs.
Then: how does the perl interpreter handle it when it finds that a key generated the same hash as another key? Does it handle it at all?
Note: This is not about the algorithm for hashing but about collision resolution in a hash table implementation.
A Perl hash is an array of linked lists.
+--------+      +--------+
|   ----------->|        |
+--------+      +--------+
|        |      |  key1  |
+--------+      +--------+
|   ---------+  |  val1  |
+--------+   |  +--------+
|        |   |
+--------+   |  +--------+      +--------+
             +->|   ----------->|        |
                +--------+      +--------+
                |  key2  |      |  key3  |
                +--------+      +--------+
                |  val2  |      |  val3  |
                +--------+      +--------+
The hashing function produces a value which is used as the array index, then a linear search of the associated linked list is performed.
This means the worst-case lookup is O(N). So why do people say it's O(1)? You could claim O(1) if you kept the lists from exceeding some constant length, and that's what Perl does. It uses two mechanisms to achieve this:
Increasing the number of buckets.
Perturbing the hashing algorithm.
Doubling the number of buckets should cut the number of entries in a given bucket in half, on average. For example,
305419896 % 4 = 0 and 943086900 % 4 = 0
305419896 % 8 = 0 and 943086900 % 8 = 4
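A quick check of those figures (Python, just for illustration):

for h in (305419896, 943086900):
    print(h % 4, h % 8)    # prints "0 0" and "0 4"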
However, a malicious actor could choose values where this doesn't happen. This is where the hash perturbation comes into play. Each hash has its own random number that perturbs (causes variances in) the output of the hashing algorithm. Since the attacker can't predict the random number, they can't choose values that will cause collisions. When needed, Perl can rebuild the hash using a new random number, causing keys to map to different buckets than before, and thus breaking down long chains.
Sets of key-value pairs where the keys produce the same hash value are stored together in a linked list. The gory details are available in hv.c.
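To make this concrete, here is a minimal sketch of separate chaining with a per-table perturbation seed, written in Python (an illustration of the general technique, not Perl's actual hv.c code):

import random

class ChainedHash:
    # illustration only: buckets hold lists of (key, value) pairs, and a
    # per-table random seed perturbs the hash so collisions are unpredictable
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.seed = random.getrandbits(32)

    def _index(self, key):
        return hash((self.seed, key)) % len(self.buckets)   # truncate to a slot

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                          # key already present: replace
                chain[i] = (key, value)
                return
        chain.append((key, value))                # colliding keys share a chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:   # linear scan of the chain
            if k == key:
                return v
        raise KeyError(key)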

Normalize Count Measure in Tableau

I am trying to create a plot similar to those created by Google's ngram viewer. I have the ngrams that correspond to year, but some years have much more data than others; as a result, plotting from absolute counts doesn't get me the information I want. I'd like to normalize it so that I get the counts as a percentage of the total samples for that year.
I've found ways to normalize data to ranges in Tableau, but nothing about normalizing by count. I also see that there is a count distinct function, but that doesn't appear to do what I want.
How can I do this in Tableau?
Thanks in advance for your help!
Edit:
Here is some toy data and the desired output.
Toy Data:
+---------+------+
| Pattern | Year |
+---------+------+
| a | 1 |
| a | 1 |
| a | 1 |
| b | 1 |
| b | 1 |
| b | 1 |
| a | 2 |
| b | 2 |
| a | 3 |
| b | 4 |
+---------+------+
Desired Output:
Put [Year] on the Columns shelf. If it is really a Date field instead of a number, choose any truncation level you'd like, or choose exact date. Make sure to treat it as a discrete dimension field (the pill should be blue).
Put [Number of Records] on the Rows shelf. Should be a continuous measure, i.e. SUM([Number of Records])
Put Pattern on the Color shelf.
At this point, you should be looking at a graph of raw counts. To convert them to percentages, right-click on the [Number of Records] field on the Rows shelf and choose Quick Table Calc->Percent of Total. Finally, right-click on [Number of Records] a second time and choose Compute Using->Pattern.
You might want to sort the patterns. One easy way is to just drag them in the color legend.
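If you want to sanity-check the numbers outside Tableau, a small pandas sketch over the toy data above shows what the percent-of-total-per-year values should be (not a Tableau feature, just a cross-check):

import pandas as pd

# the toy data from the question
df = pd.DataFrame({"Pattern": ["a", "a", "a", "b", "b", "b", "a", "b", "a", "b"],
                   "Year":    [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]})
counts = df.groupby(["Year", "Pattern"]).size()
percent = 100 * counts / counts.groupby(level="Year").transform("sum")
print(percent)   # Year 1: a 50, b 50; Year 2: a 50, b 50; Year 3: a 100; Year 4: b 100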

Stata: Keep only observations with minimum, maximum and median value of a given variable

In Stata, I have a dataset with two variables: id and var, and say 1000 observations. The variable var is of type float and takes distinct values for all observations. I would like to keep only the three observations where var is either the minimum of var, the maximum of var, or the median of var.
The way I currently do this:
summarize var, detail
local varmax = r(max)
local varmin = r(min)
local varmedian= r(p50)
keep if inlist(float(var),float(`varmax') , float(`varmedian'), float(`varmin'))
The problem I face is that sometimes the inlist condition will not match one of the values. E.g. I end up with two observations instead of three, for instance the one with the min and the one with the max, but not the one with the median. I suspect this has to do with a precision problem. As you see, I tried to convert all numbers to float, but this is apparently not sufficient.
Any fix to my solution, or alternative solution would be greatly appreciated (if possible without installing additional packages), thanks!
This is not in the first instance a precision problem.
It is an inevitable problem when (1) the number of values is even and (2) the median is the mean of two central values that are different. Then the median itself is not a value in the dataset and will not be found by keep.
Consider a data set 1, 2, 3, 4. The median 2.5 is not in the data. This is very common; indeed it is what is expected with all values distinct and the number of observations even.
Other problems can arise because two or even three of the minimum, median and maximum could be equal to each other. This is not your present problem, but it can bite with other variables (e.g. indicator variables).
Precision problems are possible.
Here is a general solution intended to avoid all these difficulties.
If you collapse to min, median, max and then reshape, you can avoid the problem. You will always get three results, even if they are numerically equal and/or not present in the data.
In the trivial example below, the identifier is needed only to appease reshape. In other problems, you might want to collapse using by() and then your identifier comes ready-made. However, you will be less likely to want to reshape in that case.
. clear
. set obs 4
number of observations (_N) was 0, now 4
. gen y = _n
. collapse (min)ymin=y (max)ymax=y (median)ymedian=y
. gen id = _n
. reshape long y, i(id) j(statistic) string
(note: j = max median min)
Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        1   ->       3
Number of variables                   4   ->       3
j variable (3 values)                     ->   statistic
xij variables:
                      ymax ymedian ymin   ->   y
-----------------------------------------------------------------------------
. list
+---------------------+
| id statis~c y |
|---------------------|
1. | 1 max 4 |
2. | 1 median 2.5 |
3. | 1 min 1 |
+---------------------+
All that said, having (lots of?) datasets with just three observations sounds like a poor data management strategy. Perhaps this is extracted from some larger question.
UPDATE
Here is another way to keep precisely 3 observations. Apart from the minimum and maximum, we use the rule that we keep the "low median", i.e. the lower of two values averaged for the median, when the number of observations is even, and a single value that is the median otherwise. (In Stephen Stigler's agreeable terminology, we can talk of "comedians" in the first case.)
. sysuse auto, clear
(1978 Automobile Data)
. sort mpg
. drop if missing(mpg)
(0 observations deleted)
. keep if inlist(_n, 1, cond(mod(_N, 2), ceil(_N/2), floor(_N/2)), _N)
(71 observations deleted)
. l mpg
+-----+
| mpg |
|-----|
1. | 12 |
2. | 20 |
3. | 41 |
+-----+
mod(_N, 2) is 1 if _N is odd and 0 if _N is even. The expression in cond() selects ceil(_N/2) if the number of observations is odd and floor(_N/2) if it is even.
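A quick cross-check of that index rule outside Stata (a Python sketch; positions are 1-based, like Stata's _n):

def median_position(n):
    # ceil(n/2) if n is odd (the middle value), floor(n/2) if n is even (low median)
    return (n + 1) // 2 if n % 2 else n // 2

print(median_position(74))   # 37: the low median position in the auto data
print(median_position(75))   # 38: the single middle position when N is odd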

Vowpal Wabbit ignore linear terms, only keep interaction terms

I have a Vowpal Wabbit file with two namespaces, for example:
1.0 |A snow |B ski:10
0.0 |A snow |B walk:10
1.0 |A clear |B walk:10
0.0 |A clear |B walk:5
1.0 |A clear |B walk:100
1.0 |A clear |B walk:15
Using -q AB, I can get the interaction terms. Is there any way for me to keep only the interaction terms and ignore the linear terms?
In other words, the result of vw sample.vw -q AB --invert_hash sample.model right now is this:
....
A^clear:24861:0.153737
A^clear^B^walk:140680:0.015292
A^snow:117127:0.126087
A^snow^B^ski:21312:0.015803
A^snow^B^walk:28234:-0.010592
B^ski:107733:0.015803
B^walk:114655:0.007655
Constant:116060:0.234153
I would like it to be something like this:
....
A^clear^B^walk:140680:0.015292
A^snow^B^ski:21312:0.015803
A^snow^B^walk:28234:-0.010592
Constant:116060:0.234153
The --keep and --ignore options do not produce the desired effect because they appear to be applied before the quadratic terms are generated. Is it possible to do this with vw, or do I need a custom preprocessing step that creates all of the combinations?
John Langford (the main author of VW) wrote:
There is not a good way to do this at present. The easiest approach would be to make --ignore apply to the foreach_feature<> template in the source code.
You can use a trick with transforming each original example into four new examples:
1 |first:1 foo bar gah |second:1 loo too rah
-1 |first:1 foo bar gah |second:-1 loo too rah
1 |first:-1 foo bar gah |second:-1 loo too rah
-1 |first:-1 foo bar gah |second:1 loo too rah
This makes the quadratic features all be perfectly correlated with the label, but the linear features have zero correlation with the label. Hence a mild l1 regularization should kill off the linear features.
I'm skeptical that this will improve performance enough to care (hence the design), but if you do find that it's useful, please tell us about it.
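A hypothetical preprocessing sketch of that trick (Python; it assumes plain two-namespace input lines with +/-1 labels and no pre-existing namespace weights, so the parsing is deliberately simple):

import sys

# reads lines like "1 |A snow |B ski:10" on stdin and writes four examples per
# input line with +/-1 namespace weights, as in the quoted transformation
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    label_part, ns_a, ns_b = [part.strip() for part in line.split("|")]
    label = float(label_part.split()[0])
    name_a, feats_a = ns_a.split(None, 1)
    name_b, feats_b = ns_b.split(None, 1)
    for sign_a, sign_b in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        new_label = label * sign_a * sign_b
        print(f"{new_label:g} |{name_a}:{sign_a} {feats_a} |{name_b}:{sign_b} {feats_b}")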
See the original posts:
https://groups.yahoo.com/neo/groups/vowpal_wabbit/conversations/topics/2964
https://groups.yahoo.com/neo/groups/vowpal_wabbit/conversations/topics/4346