Matlab `unstack`: Safe to assume ordering of new columns?

According to the documentation, Matlab's unstack can take this table:
S = 12×3 table

    Storm    Town    Snowfall
    _____    ____    ________
      3      'T1'        0
      3      'T3'        3
      1      'T1'        5
      3      'T2'        5
      1      'T2'        9
      1      'T3'       10
      4      'T2'       12
      2      'T1'       13
      4      'T3'       15
      2      'T3'       16
      4      'T1'       17
      2      'T2'       21
...and convert it into:
U = unstack(S,'Snowfall','Town')
U = 4×4 table

    Storm    T1    T2    T3
    _____    __    __    __
      3       0     5     3
      1       5     9    10
      4      17    12    15
      2      13    21    16
It seems reasonable to assume that the new columns are generated in alphabetical order. That assumption is fine when manipulating data manually, but it is a deal breaker for automated data processing if the ordering of the columns cannot be guaranteed. For example, if the Town column were actually a numerical index, the new column names would be automatically generated so as to be legitimate variable names, and the ordering would be the key piece of information linking the new columns back to the values in the Town field. If one extracts U{:,2:end} for manipulation, the data could be all wrong unless one could be 100% sure of the scheme used for ordering the new columns.
I actually create a new column in place of Town containing a valid string, suffixed with the numerical index value. These become the new column headings. But the reality is, having to write extra code to assure that the columns appear in the right order is too much trouble. It cancels out the benefit of unstack, and I ended up just creating loops to build up the new columns one by one. Not efficient or elegant in terms of time and code. I am trying to find a way to reliably exploit unstack in the future.
I have already submitted feedback describing the criticality of this bit of information, but I don't expect a response back any time soon. Meanwhile, unstacking is such a useful function that I wonder whether anyone can weigh in about the advisability of assuming alphabetic ordering of the new columns?

Yes: from what I understood of the source code of unstack.m (you can read it by typing edit unstack), the columns will be ordered alphabetically by Unicode code point. A function converts each identifier to a unique index before the identifier is checked for validity.
In particular, Unicode ordering means:
T10 will come before T9 (comparison is character by character, and '1' sorts before '9').
t10 will come after T10 (lowercase letters have higher code points than uppercase).
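A quick sanity check of this ordering with a plain sort, which compares character codes the same way:

>> sort({'T9','T10','t10'})
ans =
  1×3 cell array
    {'T10'}    {'T9'}    {'t10'}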
Within unstack, the function that converts each identifier to a unique index, subs2inds, relies on a class tabularDimension, which (as of R2018b) is explicitly marked as internal and subject to change:
%tabularDimension Internal abstract class to represent a tabular's dimension.
% This class is for internal use only and will change in a
% future release. Do not use this class.
After sorting the identifiers comes the validity check, using the function matlab.lang.makeValidName (with the default option 'Prefix','x'), which modifies an identifier if it is not valid (replacing illegal characters with underscores by default).
A valid MATLAB identifier is a character vector of alphanumerics (A–Z, a–z, 0–9) and underscores, such that the first character is a letter and the length of the character vector is less than or equal to namelengthmax.
makeValidName deletes any whitespace characters before replacing any characters that are not alphanumerics or underscores. If a whitespace character is followed by a lowercase letter, makeValidName converts the letter to the corresponding uppercase character.
For example:
2A will be changed to x2A.
ça will be changed to x_a ('ç' is replaced by an underscore, and since the result then starts with an underscore rather than a letter, the 'x' prefix is added).
Collisions are dealt with using the matlab.lang.makeUniqueStrings function.
For example, if you supply the identifiers ç1 and à1, both of which sanitize to x_1, Matlab will still be able to distinguish them, renaming one x_1 and the other x_1_1.
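These behaviours are easy to check directly; the outputs below are what I would expect from the default options described above (worth verifying in your own release):

>> matlab.lang.makeValidName('2A')            % leading digit, so prefix 'x' is added
ans =
    'x2A'

>> matlab.lang.makeValidName('hello world')   % whitespace removed, next letter upper-cased
ans =
    'helloWorld'

>> matlab.lang.makeValidName('ça')            % 'ç' replaced by '_', then prefixed
ans =
    'x_a'

>> matlab.lang.makeUniqueStrings(matlab.lang.makeValidName({'ç1','à1'}))
ans =
  1×2 cell array
    {'x_1'}    {'x_1_1'}

makeUniqueStrings appends a numeric suffix to whichever duplicate it encounters after the first, which is how the two colliding names stay distinguishable.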
In your case, I would suggest automatically generating column names with a constant starting letter, then the index with leading zeros resulting in a constant number of characters: T0001, T0002, ..., T0100, ..., T9999. With a fixed width, character-code order and numeric order coincide, as sketched below.
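A minimal sketch of generating such names (idx is a hypothetical vector of your numeric Town indices; compose requires R2016b or later):

idx   = [3 1 4 2];                  % whatever your numeric Town values are
names = compose('T%04d', idx(:))    % {'T0003';'T0001';'T0004';'T0002'}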

Related

Is this table in first normal form?

I am currently studying SQL normal forms.
Let's say I have the following table, where the primary key is userid:

userid  FirstName  LastName  Phone
1       John       Smith     555-555
1       Tim        Jack      432-213
2       Sarah      Mit       454-541
3       Tom        jones     987-125
The book I'm reading states the following conditions must be true in order for a table to be in 1st normal form:
1. Rows contain data about an entity.
2. Columns contain data about attributes of the entities.
3. All entries in a column are of the same kind.
4. Each column has a unique name.
5. Cells of the table hold a single value.
6. The order of the columns is unimportant.
7. The order of the rows is unimportant.
8. No two rows may be identical.
9. A primary key must be assigned.
I'm not sure if my table violates the 8th rule, "No two rows may be identical", because the first two records in my table,

1  John  Smith  555-555
1  Tim   Jack   432-213

share the same userid. Does that mean they are considered duplicate rows? Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?

1  John  Smith  555-555
1  John  Smith  555-555
EDIT1: Sorry for the confusion. The question I was trying to ask is simple: is the table below in 1st normal form?

userid  FirstName  LastName  Phone
1       John       Smith     555-555
1       Tim        Jack      432-213
2       Sarah      Mit       454-541
3       Tom        jones     987-125

Based on the 9 rules given in the textbook I think it is, but I wasn't sure whether rule 8, "No two rows may be identical", was being violated by two records that use the same primary key. The class textbook and prof aren't really clear on this subject, which is why I am asking this question.
Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?

They mean the latter of your choices. Entire rows are what must be "identical". It's OK if two rows share the same values for one or more columns, as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and a set is a collection of values that are all different.
But SQL and some relational algebras have different notions of "identical" in the case of NULLs than the relational model without NULLs: two rows that have NULL in the same column are considered distinct from each other. You should read what your textbook says about NULLs if you want to know exactly what it means by "identical". (Point 9 might be summarizing something involving NULLs, depending on the explanation in the book.)
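The distinction is easy to see in practice: if userid really were declared as the primary key, the second sample row could not even be inserted. A sketch in standard SQL (the table name users is made up):

CREATE TABLE users (
    userid    INTEGER PRIMARY KEY,
    FirstName VARCHAR(20),
    LastName  VARCHAR(20),
    Phone     VARCHAR(10)
);

INSERT INTO users VALUES (1, 'John', 'Smith', '555-555');  -- ok
INSERT INTO users VALUES (1, 'Tim',  'Jack',  '432-213');  -- fails: duplicate key value

-- Without the PRIMARY KEY constraint, both rows are allowed: they share a
-- userid, but the rows as a whole are not identical.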
PS
There's no single notion of what a relation is, no single notion of "identical", and no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems, strangely, to make "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing, so if you have one it necessarily has the defined properties. Then "in 1NF" applies to a relation and either means "is a relation" (and isn't used further) or means certain further restrictions hold.
A relation is a set of tuples/rows/records. In the kind of relation your points 3-8 describe, each tuple is a set of attribute/column/field name-value pairs, and the values paired with a name have to be of the type paired with that name in some schema/heading, which is a set of name-type pairs defined either as part of the relation or externally to it.
Your textbook doesn't seem to present things clearly. Its definition of "1NF" is also idiosyncratic: although points 3-8 are mathematical, 1 and 2 are informal/heuristic (and 9 could be either or both).

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
    tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string holding the path to the directory where all your flat tables are stored, then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns its number of rows without keeping the table around in memory afterwards (though get still reads the whole file in order to count it).
The best solution, if you have control of the process that saves these files down, is to add an extra step of saving the meta information (the row count) somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from that small file instead of reading the full table in each time.
However, note that to avoid pulling the tables into memory at all you would instead have to use read1 and look at the headers on the binary data. As you said, it would be better to save as a splayed table and read in one column.
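A minimal sketch of that save-time bookkeeping (the helper name and the .count suffix are invented; fp is a file symbol such as `:basepath/t):

savewithcount:{[fp;t] fp set t; (hsym `$(1_string fp),".count") set count t}
/ write the table and its row count together:
/ savewithcount[`:basepath/t;t]
/ later, read the size from the tiny side file instead of the full table:
/ get `:basepath/t.count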
UPDATE: I would not recommend doing this and strongly suggest doing the above, but out of curiosity, after looking into read1, here's an example of what a hacky solution might look like:
f:{
  b:read1(y;0;x);
  if[not 0x62630b~b[2 4 5];'`$"not a table"];
  cc:first first((),"i";(),4)1:b 7+til 4;
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
  c:`$"\000" vs "c"$b[11+til ce];
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];
  :`columns`rows!(c;n);
  }[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions: the above was written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.x.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little-endian 4-byte int
scans through the null-terminated strings for that many zero bytes; if the columns are so numerous or verbose that we don't have them all in this chunk, it doubles the size of the chunk and tries again
turns the strings into symbols
using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns
then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the row count is saved as part of the binary file when you save down a flat table, taking up 4 bytes after the initial object type and the column headings (whose length varies from table to table).
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
(offsets 0, 7, 11 and 31 mark the fields broken down in the table below)
bytes   | example                    | meaning
--------|----------------------------|---------------------------------------------------
0 - 5   | 0xff016200630b             | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5), little-endian int
11 - 22 | 0x610062006300616100626200 | column names "a","b","c","aa","bb" as ASCII bytes,
        |                            | each name followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | row count of the first column (3), little-endian int
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first and the most-significant byte is right-most. Doing this in decimal for the number 100 would give 001: the hundreds (most significant) on the right, then the tens, and finally the ones on the left. In the binary file, each group of 2 hex digits is one byte.
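For example, the four row-count bytes from the breakdown above decode like this: reverse flips them to big-endian order, then 0x0 sv joins them into an integer (the same expression shape as in the timings below):

q)0x0 sv reverse 0x03000000
3i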
You can use 1: to read in the contents of a binary file, with additional arguments specifying the offset (where to start reading from) and how many bytes to read. In our case we want to start at byte 31 and read 4 bytes, specifying that the output should be an integer and that the input should be cut into separate 4-byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read 4 bytes instead of the whole file, it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read the number of elements in one of its columns from the 4 bytes starting at offset 8, like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following:
counttables:{count each get each hsym `$basepath,/:system "ls ",basepath}
This improves the speed of the count by skipping the extra directory read and the table construction in your current version. You are correct, though, that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

SAS PROC SQL - Concatenate variable values into a single value by group

I have a data set which contains 'factor' values and corresponding 'response' values:
data inTable;
input fact $ val $;
datalines;
a 1
a 2
a 3
b 4
b 5
b 6
c 7
d 8
e 9
e 10
f 11
;
run;
I want to aggregate the response options by factor, i.e. I need to get:

fact  vals
a     1 2 3
b     4 5 6
c     7
d     8
e     9 10
f     11

I know perfectly well how to implement this in a data step, running a loop through values and applying CATX (posted here). But can I do the same with PROC SQL, using a combination of GROUP BY and some character analog of SUM() or CATX()?
Thanks for help,
Dmitry
The data step is the appropriate tool to use in SAS if you want to apply any sort of logic that carries lots of values forward from previous rows.
Any SQL solution would be extremely unwieldy - you would need to join the input table to itself n times, where n is the maximum number of distinct values for any of your factors, and you would also need to define a sequential key preserving the row order to use for the join.
A list of aggregation functions you can use in proc sql is available here:
http://support.sas.com/kb/25/279.html
Although a few of these do work with character variables, there is no aggregation function for string concatenation.
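For reference, the data-step pattern the question alludes to might look like this minimal sketch (BY-group processing with RETAIN and CATX; it assumes inTable is sorted by fact, as the datalines above are, and that $200 is long enough for the concatenated values):

data outTable;
   set inTable;
   by fact;
   length vals $200;
   retain vals;
   if first.fact then vals = val;        /* start a new group */
   else vals = catx(' ', vals, val);     /* append within the group */
   if last.fact then output;             /* emit one row per factor */
   keep fact vals;
run;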

Difference between Matlab JOIN vs. INNERJOIN

In SQL, JOIN and INNER JOIN mean the same thing. In Matlab, they are different commands. Just from perusing the documentation thus far, they appear on the surface to fulfill the same general function, with possible differences in the details, as controlled by parameters. I am slogging through the individual examples and may (or may not) find the fundamental difference. However, I feel that the difference should not be a subtlety that users have to ferret out of the examples. These are two separate commands, and the documentation should make it clear up front why they are both needed. Would anyone be able to chime in about the key difference? Perhaps it could become a request to place it front and centre in the documentation.
I've empirically characterized the difference between JOIN and INNERJOIN (some would refer to this as reverse engineering). I'll summarize from the perspective of one who is comfortable with SQL. As I am new to SQL-like operations in Matlab, I've only been able to test-drive them to a limited degree, but INNERJOIN appears to join records in the same manner as SQL. Since SQL is a pretty open language, the behavioural specification of INNERJOIN is readily available, and I won't dwell on it. It's Matlab's JOIN that I need to suss out.
In short, from my testing, Matlab's JOIN seems to "join" the rows of the two operand tables in a manner more like Excel's VLOOKUP than like any of the JOINs in SQL. The main differences from SQL joins seem to be (i) that the right-hand table cannot have repeating values in the columns used to match up rows between the two tables, and (ii) that all combinations of values in the key columns of the left-hand table must show up in the right-hand table.
Here is the empirical testing. First, prepare the test tables:
a=array2table([
1 2
3 4
5 4
],'VariableNames',{'col1','col2'})
b=array2table([
4 7
4 8
6 9
],'VariableNames',{'col2','col3'})
c=array2table([
2 10
4 8
6 9
],'VariableNames',{'col2','col3'})
d=array2table([
2 10
4 8
6 9
6 11
],'VariableNames',{'col2','col3'})
a2=array2table([
1 2
3 4
5 4
20 99
],'VariableNames',{'col1','col2'})
Here are the tests:
>> join(a,b)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a,c)
ans = col1    col2    col3
      ____    ____    ____
       1       2      10
       3       4       8
       5       4       8
>> join(a,d)
Error using table/join (line 111)
The key variable for B must have unique values.
>> join(a2,c)
Error using table/join (line 130)
The key variable for B must contain all values in the key
variable for A.
The first thing to notice is that JOIN is not a symmetric operation with respect to the two tables.
It seems that the 2nd table argument is used as a lookup table. Unlike SQL joins, Matlab throws an error if it can't find a match in the 2nd table [see join(a2,c)]. This is somewhat hinted at in the documentation, though not entirely clearly. For example, it says that the key values must be common to both tables, but join(a,c) clearly shows that the 2nd table may contain key values that never appear in the 1st. On the contrary, just as one would expect of a lookup table, entries in the 2nd table that are never matched do not throw errors.
Another difference from SQL joins is that records that cause key values to repeat in the 2nd table are not allowed in Matlab's join [see join(a,b) and join(a,d)]. In contrast, the fields used for matching records between tables aren't even referred to as keys in SQL, and hence can have non-unique values in either of the two tables. The disallowance of repeated key values in the 2nd table is consistent with the view of the 2nd table as a lookup table. On the other hand, repetition of key values is permitted in the 1st table [see join(a,c), where col2 = 4 appears twice in a].
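For contrast, here is what innerjoin does with the same tables. Like a SQL inner join, it allows duplicate key values on either side and silently drops unmatched rows from both tables; the output below is what I'd expect from those semantics (the ordering of tied rows may vary by release):

>> innerjoin(a,b)
ans = col1    col2    col3
      ____    ____    ____
       3       4       7
       3       4       8
       5       4       7
       5       4       8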

Sql query to populate iPhone quick-access-bar (count rows by first letter)

I have a sqlite database, with potentially tens of thousands of rows, which is exposed as a UITableView on the iPhone. There are many columns in the relevant table, but for now I only care about a 'name' column. (The database is in a server app; the iPhone app is a client, talking over a socket.) Naturally, to make this usable I want to support the 'quick access bar', and hence to count the number of rows that start with A, B, C .. Z (and ideally handle rows that start with an awkward character such as '#' or a digit).
I'm having trouble formulating a SQL query (supported by SQLite) to count the rows starting with a given letter; if necessary I could make 26 separate queries, but I wonder if there's some nested-query magic to compute all the values at once. (Obviously SELECT COUNT(*) ... WHERE name LIKE 'A%' would work, repeated twenty-six times, but it feels very crude, and wouldn't handle digits or symbols.)
Relevant facts: modifications to the database are infrequent, and I can easily cache the query, and refresh it when the DB is modified. The DB is much larger than the RAM on the server device, so avoiding paging the whole DB file would be good. There is an index on the relevant 'name' column of the DB already. Performance is more important than brevity, so if separate queries is faster than one complex query, that's fine.
Use substr(X,Y,Z) instead of LIKE.
From the SQLite documentation:
The substr(X,Y,Z) function returns a substring of input string X that begins with the Y-th character and which is Z characters long. If Z is omitted then substr(X,Y) returns all characters through the end of the string X beginning with the Y-th. The left-most character of X is number 1. If Y is negative then the first character of the substring is found by counting from the right rather than the left. If Z is negative then the abs(Z) characters preceding the Y-th character are returned. If X is a string then character indices refer to actual UTF-8 characters. If X is a BLOB then the indices refer to bytes.
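Combined with GROUP BY, substr lets a single query compute every bucket in one pass. A sketch (the table and column names people and name are stand-ins for whatever your schema uses):

SELECT upper(substr(name, 1, 1)) AS letter,
       count(*)                  AS n
  FROM people
 GROUP BY letter
 ORDER BY letter;

This returns one row per distinct first character; names starting with a digit or symbol come back under that character, and the client can fold anything outside A-Z into a '#' bucket.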