Difference between Matlab JOIN vs. INNERJOIN

In SQL, JOIN and INNER JOIN mean the same thing. In Matlab, they are different commands. Just from perusing the documentation thus far, they appear on the surface to fulfill the same general function, with possible differences in the details, as controlled by parameters. I am slogging through the individual examples and may (or may not) find the fundamental difference. However, I feel that the difference should not be a subtlety that users have to ferret out of the examples. These are two separate commands, and the documentation should make it clear up front why they are both needed. Would anyone be able to chime in about the key difference? Perhaps it could become a request to place it front and centre in the documentation.

I've empirically characterized the difference between JOIN and INNERJOIN (some would refer to this as reverse engineering). I'll summarize from the perspective of one who is comfortable with SQL. As I am new to SQL-like operations in Matlab, I've only been able to test-drive it to a limited degree, but INNERJOIN appears to join records in the same manner as SQL. Since SQL is a pretty open language, the behavioural specification of INNERJOIN is readily available, and I won't dwell on it. It's Matlab's JOIN that I need to suss out.
In short, from my testing, Matlab's JOIN seems to "join" the rows in the two operand tables in a manner more like Excel's VLOOKUP than any of the JOINs in SQL. In general, the main differences from SQL joins seem to be (i) that the right-hand table cannot have repeated values in the columns used to match up rows between the two tables, and (ii) that all combinations of values in the key columns of the left-hand table must show up in the right-hand table.
Here is the empirical testing. First, prepare the test tables:
a=array2table([
1 2
3 4
5 4
],'VariableNames',{'col1','col2'})
b=array2table([
4 7
4 8
6 9
],'VariableNames',{'col2','col3'})
c=array2table([
2 10
4 8
6 9
],'VariableNames',{'col2','col3'})
d=array2table([
2 10
4 8
6 9
6 11
],'VariableNames',{'col2','col3'})
a2=array2table([
1 2
3 4
5 4
20 99
],'VariableNames',{'col1','col2'})
Here are the tests:
>> join(a,b)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a,c)
ans =
    col1    col2    col3
    ____    ____    ____
     1       2      10
     3       4       8
     5       4       8
>> join(a,d)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a2,c)
Error using table/join (line 130)
The key variable for B must contain all values in the key
variable for A.
The first thing to notice is that JOIN is not a symmetric operation with respect to the two tables.
It seems that the 2nd table argument is used as a lookup table. Unlike SQL joins, Matlab throws an error if it can't find a match in the 2nd table [see join(a2,c)]. This is somewhat hinted at in the documentation, though not entirely clearly. For example, it says that the key values must be common to both tables, but join(a,c) clearly shows that the tables do not have to have identical key values. On the contrary, just as one would expect of a lookup table, entries in the 2nd table that aren't matched do not cause errors.
Another difference from SQL joins is that records that cause the key values to repeat in the 2nd table are not allowed in Matlab's JOIN [see join(a,b) and join(a,d)]. In contrast, the fields used for matching records between tables aren't even referred to as keys in SQL, and hence can have non-unique values in either of the two tables. The disallowance of repeated key values in the 2nd table is consistent with the view of the 2nd table as a lookup table. On the other hand, repetition of key values is permitted in the 1st table.
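For comparison, here is a quick empirical check (using the same test tables) suggesting that INNERJOIN does behave like a SQL inner join: duplicate key values on either side yield all matching combinations, and unmatched keys are silently dropped rather than raising errors. Since the exact row ordering of the output may vary by release, I show only the calls:
>> innerjoin(a,b)   % works: col2=4 matches two rows on each side, giving 4 result rows
>> innerjoin(a2,c)  % works: a2's unmatched key value 99 is simply dropped from the result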


Attributes internal working in aj for performance benefits in kdb

Considering the trade table 't' and quotes table 'q' in memory:
q)t:([] sym:`GOOG`AMZN`GOOG`AMZN; time:10:01 10:02 10:02 10:03; px:10 20 11 19)
q)q:([] sym:`GOOG`AMZN`AMZN`GOOG`AMZN; time:10:01 10:01 10:02 10:02 10:03; vol:100 200 210 110 220)
To get performance benefits, I applied the grouped attribute to the 'sym' column of the quotes table and sorted the 'time' column within sym, producing a table q1.
Using this, I can clearly see the performance benefits from it:
q)\t:1000000 aj[`sym`time;t;q]
9573
q)\t:1000000 aj[`sym`time;t;q1]
8761
q)\t:100000 aj[`sym`time;t;q]
968
q)\t:100000 aj[`sym`time;t;q1]
893
And on larger tables the performance difference is far greater.
Now, I'm trying to understand how it works internally when we are applying grouped attribute to sym column and sort time within sym.
My understanding of how the aj should happen internally is below; can someone please let me know the correct internal working?
* Since the grouped attribute is applied on sym, it creates a hash table for table q1; and since we are sorting on time, the internal q1 table might look like:
GOOG|(10:01;10:02)|(100;110)
AMZN|(10:01;10:02;10:03)|(200;210;220)
So in this case, if the interpreter has to join (AMZN;10:02) of table t, it will find it directly in q1's hash table in less time; but to join the same value (AMZN;10:02) of table t against table q, the interpreter will have to search linearly through table q, hence taking more time.
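(For intuition — and this is my assumption about what `g#sym actually stores — the grouped attribute maintains essentially the dictionary that the group keyword returns, mapping each distinct sym to the row indices where it occurs, so a sym can be located by hash lookup instead of a linear scan:)
q)group `GOOG`AMZN`GOOG`AMZN`AMZN
GOOG| 0 2
AMZN| 1 3 4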
I believe you're on the right track, though we can't know for sure as we don't have access to the kdb source code to see precisely what it does.
If you look at the definition of aj you'll see that it's based on bin:
q)aj
k){.Q.ft[{d:x_z;$[&/j:-1<i:(x#z)bin x#y;y,'d i;+.[+.Q.ff[y]d;(!+d;j);:;.+d i j:&j]]}[x,();;0!z]]y}
specifically,
(`sym`time#q)bin `sym`time#t
and the bin documentation provides some more details on how bin behaves: https://code.kx.com/q/ref/bin/
I believe in the two-column case it will first match on the sym column and then use bin on the second column. Like you said, the grouped attribute on sym speeds up the sym-matching part, and the sorting on time ensures that bin returns the correct results. Note that for on-disk queries it's optimal to put `p# on sym rather than `g#, as the parted attribute is optimal for matching/retrieving by sym from disk.
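As an aside, the question doesn't show how q1 was built; a typical construction (my assumption, not taken from the post) would be:
q)q1:update `g#sym from `sym`time xasc q   / time sorted within sym, grouped attribute on sym
and the bin step that aj performs can be run standalone to inspect the per-row indices it selects:
q)(`sym`time#q1)bin `sym`time#t            / index of the last quote at or before each trade, per sym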

Matlab `unstack`: Safe to assume ordering of new columns?

According to the documentation, Matlab's unstack can take this table:
S = 12×3 table
    Storm    Town    Snowfall
    _____    ____    ________
      3      'T1'        0
      3      'T3'        3
      1      'T1'        5
      3      'T2'        5
      1      'T2'        9
      1      'T3'       10
      4      'T2'       12
      2      'T1'       13
      4      'T3'       15
      2      'T3'       16
      4      'T1'       17
      2      'T2'       21
...and convert it into:
U = unstack(S,'Snowfall','Town')
U = 4×4 table
    Storm    T1    T2    T3
    _____    __    __    __
      3       0     5     3
      1       5     9    10
      4      17    12    15
      2      13    21    16
It seems reasonable to assume that the new columns are generated in alphabetical order. Assuming this would be fine if one is manually manipulating data, but it is a deal breaker for automated data processing if one cannot be 100% assured of the ordering of the columns. For example, if the Town column were actually a numerical index, then the new column names would be automatically generated so as to be legitimate variable names, and the ordering would be the key piece of information linking the new columns back to the values in the Town field. If one extracts U{:,2:end} for manipulation, the data could be all wrong unless one could be 100% sure of whatever the scheme is for ordering the new columns.
I actually create a new column in place of Town containing a valid string, suffixed with the numerical index value. These become the new column headings. But the reality is, having to write extra code to assure that the columns appear in the right order is too much trouble. It cancels out the benefit of unstack, and I ended up just creating loops to build up the new columns one by one. Not efficient or elegant in terms of time and code. I am trying to find a way to reliably exploit unstack in the future.
I have already submitted feedback describing the criticality of this bit of information, but I don't expect a response back any time soon. Meanwhile, unstacking is such a useful function that I wonder whether anyone can weigh in about the advisability of assuming alphabetic ordering of the new columns?
Yes. From what I understood of the source code of unstack.m (you can read it by typing edit unstack), the columns will be ordered alphabetically, following Unicode code-point order, by a function that converts each identifier to a unique index before checking whether the identifier is valid.
The Unicode code-point order means, in particular:
that T10 will come before T9;
that t10 will come after T10.
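A quick sanity check of this ordering (the display format varies by release):
>> sort({'T9','T10','t10'})
ans =
  1×3 cell array
    {'T10'}    {'T9'}    {'t10'}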
According to unstack, the function that converts the identifier to a unique index, subs2inds, relies on a class tabularDimension, which is (as of R2018b) documented as internal and subject to change:
%tabularDimension Internal abstract class to represent a tabular's dimension.
% This class is for internal use only and will change in a
% future release. Do not use this class.
After sorting the identifiers comes the validity check with the function matlab.lang.makeValidName (using the default option 'Prefix','x'), which will modify an identifier if it is not valid (replacing illegal characters with underscores by default).
A valid MATLAB identifier is a character vector of alphanumerics (A–Z, a–z, 0–9) and underscores, such that the first character is a letter and the length of the character vector is less than or equal to namelengthmax.
makeValidName deletes any whitespace characters before replacing any characters that are not alphanumerics or underscores. If a whitespace character is followed by a lowercase letter, makeValidName converts the letter to the corresponding uppercase character.
For example:
2A will be changed to x2A.
ça will be changed to x_A.
Name collisions are then resolved with the help of the matlab.lang.makeUniqueStrings function.
For example, given the identifiers ç1 and à1, Matlab will still be able to distinguish them, renaming them x_1_1 and x_1 respectively.
In your case, I would suggest automatically generating column names with a constant starting letter, then the index with leading zeros, giving a constant number of characters: T0001, T0002, ..., T0100, ..., T9999.
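A minimal sketch of that naming scheme (compose requires R2016b or later; the variable names and index values here are purely illustrative):
townIdx  = [3 1 4 2]';                 % hypothetical numeric Town codes
townName = compose("T%04d", townIdx);  % "T0003" "T0001" "T0004" "T0002"
sort(townName)                         % "T0001" "T0002" "T0003" "T0004"
With a fixed width, Unicode code-point order coincides with numeric order, so the unstack column ordering becomes predictable.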

Spark window functions: how to implement complex logic with good performance and without looping

I have a data set that lends itself to window functions, 3M+ rows that once ranked can be partitioned into groups of ~20 or less rows. Here is a simplified example:
id   date1     date2     type           rank
171  20090601  20090601  attempt        1
171  20090701  20100331  trial_fail     2
171  20090901  20091101  attempt        3
171  20091101  20100201  attempt        4
171  20091201  20100401  attempt        5
171  20090601  20090601  fail           6
188  20100701  20100715  trial_fail     1
188  20100716  20100730  trial_success  2
188  20100731  20100814  trial_fail     3
188  20100901  20100901  attempt        4
188  20101001  20101001  success        5
The data is ranked by id and date1, and the window created with:
Window.partitionBy("id").orderBy("rank")
In this example the data has already been ranked by (id, date1). I could also work on the unranked data and rank it within Spark.
I need to implement some logic on these rows, for example, within a window:
1) Identify all rows that end during a failed trial (i.e. a row's date2 is between date1 and date2 of any previous row within the same window of type "trial_fail").
2) Identify all trials after a failed trial (i.e. any row with type "trial_fail" or "trial_success" after a row within the same window of type "trial_fail").
3) Identify all attempts before a successful attempt (i.e. any row with type "attempt" with date1 earlier than date1 of another later row of type "success").
The exact logic of these conditions is not important to my question (and there will be other, different conditions); what's important is that the logic depends on values in many rows of the window at once. This can't be handled by the simple Spark SQL functions like first, last, lag, lead, etc., and isn't as simple as the typical example of finding the largest/smallest 1 or n rows in the window.
What's also important is that the partitions don't depend on one another, so this seems like a great candidate for Spark to do in parallel: 3 million rows with 150,000 partitions of 20 rows each. In fact, I wonder if this is too many partitions.
I can implement this with a loop something like (in pseudocode):
for i in 1..20:
    for j in 1..20:
        // compare window[j]'s type and dates to window[i]'s, etc.
        // add a Y/N flag to the DF to identify target rows
This would require 400+ iterations (the choice of 20 for the max i and j is an educated guess based on the data set and could actually be larger), which seems needlessly brute force.
However, I am at a loss for a better way to implement it. I think this would essentially collect() in the driver, which I suppose might be OK if it is not much data. I thought of trying to implement the logic as sub-queries, or by creating a series of sub-DataFrames, each with a subset or reduction of the data.
If anyone is aware of any APIs or techniques that I am missing, any info would be appreciated.
Edit: This is somewhat related:
Spark SQL window function with complex condition
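One approach worth sketching (my own untested suggestion, not from the thread; it assumes the dates remain integer yyyymmdd values as shown) is to gather each row's predecessors into an array with collect_list over the window and evaluate the cross-row condition in a UDF, avoiding the i,j loop entirely. In PySpark:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Running window: all rows from the start of the partition up to the current row
w = (Window.partitionBy("id").orderBy("rank")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# Condition 1: the current row's date2 lies within [date1, date2]
# of any earlier row of type "trial_fail" in the same partition
@F.udf(BooleanType())
def ends_in_failed_trial(rows):
    *prev, cur = rows
    return any(r.type == "trial_fail" and r.date1 <= cur.date2 <= r.date2 for r in prev)

flagged = (df.withColumn("hist", F.collect_list(F.struct("date1", "date2", "type")).over(w))
             .withColumn("in_failed_trial", ends_in_failed_trial("hist"))
             .drop("hist"))
Conditions 2) and 3) follow the same pattern (condition 3 would look at successors instead, e.g. via a window ordered descending). Each partition is still processed independently, so the parallelism is preserved.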

SAS PROC SQL - Concatenate variable values into a single value by group

I have a data set which contains 'factor' values and corresponding 'response' values:
data inTable;
input fact $ val $;
datalines;
a 1
a 2
a 3
b 4
b 5
b 6
c 7
d 8
e 9
e 10
f 11
;
run;
I want to aggregate response options by factor, i.e. I need to get:
fact  vals
a     1 2 3
b     4 5 6
c     7
d     8
e     9 10
f     11
I know perfectly well how to implement this in a data step, running a loop through the values and applying CATX (posted here). But can I do the same with PROC SQL, using a combination of GROUP BY and some character analog of SUM() or CATX()?
Thanks for help,
Dmitry
The data step is the appropriate tool to use in SAS if you want to apply any sort of logic that carries lots of values forward from previous rows.
Any SQL solution would be extremely unwieldy - you would need to join the input table to itself n times, where n is the maximum number of distinct values for any of your factors, and you would also need to define a sequential key preserving the row order to use for the join.
A list of aggregation functions you can use in proc sql is available here:
http://support.sas.com/kb/25/279.html
Although a few of these do work with character variables, there is no aggregation function for string concatenation.
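For completeness, here is a minimal data-step sketch of the approach the answer recommends (the output table name, result variable, and its length are my assumptions; inTable must be sorted by fact, as it already is above):
data outTable(keep=fact vals);
   length vals $200;
   retain vals;
   set inTable;
   by fact;                          /* first./last. flags require sorted input */
   if first.fact then vals = val;
   else vals = catx(' ', vals, val); /* append this row's value to the running list */
   if last.fact then output;         /* emit one row per factor */
run;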

MySQL Normalization - 1 table with 3 columns or 2 tables with 2 columns?

I am building a database with a couple of million records, and I've got a question regarding one of the relational tables, which will be used to store two searchable reference numbers. I am new to this, so I apologize if this has been asked before.
id          digit1      digit2
varchar(9)  varchar(9)  varchar(9)
Is it better to a) keep two separate optional columns in one table, or b) use two separate tables for digit1 and digit2?
What kind of MySQL column type should I use if digit1 always consists of 6-9 digits and digit2 always consists of the same 3 letters followed by 6 digits? How do I limit the input by a set of such rules?
Thanks!
Actually, if you're going to store numbers and you don't want to query by digit1 and digit2 at the same time, it's better to keep them apart in different tables. Otherwise, it's better to keep them in the same table, or you'll have a painful join. It also depends on how sparse your matrix is (I mean, if there are many values in one column and only a few in the other, it's probably better to keep them apart too).
Now, what will make a bigger difference here, if you want to store numbers, is to use a numeric field to store the values (instead of varchar), which will be smaller and faster to search and index (and thus to retrieve).
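A sketch of what that could look like in a single table (the table name and constraints are my assumptions; note that CHECK constraints are only enforced from MySQL 8.0.16 onward):
CREATE TABLE refs (
    id     VARCHAR(9) NOT NULL PRIMARY KEY,
    digit1 INT UNSIGNED NULL,   -- 6-9 digits: 999999999 still fits in INT UNSIGNED
    digit2 CHAR(9) NULL,        -- fixed format: 3 letters + 6 digits
    CONSTRAINT chk_digit2 CHECK (digit2 REGEXP '^[A-Za-z]{3}[0-9]{6}$'),
    INDEX idx_digit1 (digit1),
    INDEX idx_digit2 (digit2)
);
And if the 3-letter prefix of digit2 really is always the same, you could drop it entirely and store the remaining 6 digits numerically as well.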