Stata mmerge update replace gives wrong output - merge

I wanted to test what happens if I replace a variable with one of a different data type:
clear
input id x0
1 1
2 13
3 .
end
list
save tabA, replace
clear
input id str5 x0
1 "1"
2 "23"
3 "33"
end
list
save tabB, replace
use tabA, clear
mmerge id using tabB, type(1:1) update replace
list
The result is:
     +---------------------------------------------------+
     | id   x0                                    _merge |
     |---------------------------------------------------|
  1. |  1    1   in both, master agrees with using data  |
  2. |  2   13   in both, master agrees with using data  |
  3. |  3    .   in both, master agrees with using data  |
     +---------------------------------------------------+
This seems very strange to me. I expected the command to break down, or at least to report a disagreement. Is this a bug, or am I missing something?

mmerge is user-written (Jeroen Weesie, SSC, 2002).
If you use the official merge in an up-to-date Stata, you will get what you expect:
. merge 1:1 id using tabB, update replace
x0 is str5 in using data
r(106);
I have not looked inside mmerge. My guess is that what you see is a feature from the author's point of view, namely that it's not a problem if one variable is numeric and the other is string, so long as their contents agree. But why are you not using merge directly? There was a brief period several years ago when mmerge had some advantages over merge, but that's long past. BTW, I agree in wanting my merges to be very conservative and not indulgent about variable types.
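If you do need this particular merge to go through, one way (a sketch, not the only way; it assumes every x0 string in tabB is numeric, and tabB_num is just a hypothetical file name) is to harmonize the types first with destring:
use tabB, clear
destring x0, replace        // "1", "23", "33" all parse as numbers, so x0 becomes numeric
save tabB_num, replace      // save under a new name to avoid clobbering tabB
use tabA, clear
merge 1:1 id using tabB_num, update replace
After destring, both x0 variables are numeric, so update replace behaves as documented.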

deleting all records for all tables in memory in Kdb+

I would like to delete all records from all tables in memory but still keep the schemas.
For example:
a:([]a:1 2;b:2 4);
b:([]c:2 3;d:3 5);
I wrote a function:
{[t] t::select from t where i = -1} each tables[]
This didn't work, so I tried:
{[t] ![`t;enlist(=;`i;-1);0b;()]} each tables[]
That didn't work either.
Any ideas, or better ways to do this?
If you pass a global table name as a symbol, it removes all the rows, leaving an empty table
q)delete from `a
`a
q)a
a b
---
q)meta a
c| t f a
-| -----
a| j
b| j
To do it for all global tables in root name space
{delete from x} each tables[]
Your second attempt using the functional form was close. You can achieve it via the following (the functional form of the above):
![;();0b;`symbol$()] each tables[]
The first argument should be the symbol of the table for the same reason I mentioned before
The second argument should be an empty list as we want to delete all records (we do not want to delete where i=-1, as that would delete nothing)
The final argument (list of columns to delete) should be an empty symbol list instead of an empty general list.
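As a quick sanity check, here is what a session should look like with the a table from the question (a sketch, not a captured session):
q)![`a;();0b;`symbol$()]    / same as: delete from `a
`a
q)count a                   / all rows are gone...
0
q)meta a                    / ...but the schema survives
c| t f a
-| -----
a| j
b| j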
Mark's solution is the best way to do what you want, rather than the functional form. I'm just adding to your question about why your attempt failed, since putting kdb code in comments is awkward.
Your functional form fails not because of the `t but because your last argument is not a symbol list `$(). Also, you would want to delete where i > -1, not where i = -1:
q){[t] ![t;enlist(>;`i;-1);0b;`$()]} each `d`t`q
`d`t`q
q)d
date sym time bid1 bsize1 bid2 bsize2 bid3 bsize3 ask1 asize1 ask2 asize2 ask..
-----------------------------------------------------------------------------..
q)t
date sym time src price size
----------------------------
q)q
date sym time src bid ask bsize asize
-------------------------------------

How to import dates correctly from this .csv file into Matlab?

I have a .csv file with the first column containing dates, a snippet of which looks like the following:
date,values
03/11/2020,1
03/12/2020,2
3/14/20,3
3/15/20,4
3/16/20,5
04/01/2020,6
I would like to import this data into Matlab (I think the best way would probably be using the readtable() function, see here). My goal is to bring the dates into Matlab as a datetime array. As you can see above, the problem is that the dates in the original .csv file are not consistently formatted. Some of them are in the format mm/dd/yyyy and some of them are mm/dd/yy.
Simply calling data = readtable('myfile.csv') on the .csv file results in the following, which is not correct:
'03/11/2020' 1
'03/12/2020' 2
'03/14/0020' 3
'03/15/0020' 4
'03/16/0020' 5
'04/01/2020' 6
Does anyone know a way to automatically account for this type of data in the import?
Thank you!
My version: Matlab R2017a
EDIT ---------------------------------------
Following the suggestion of Max, I have tried specifying some of the input options for the read command using the following:
T = readtable('example.csv',...
'Format','%{dd/MM/yyyy}D %d',...
'Delimiter', ',',...
'HeaderLines', 0,...
'ReadVariableNames', true)
which results in:
date values
__________ ______
03/11/2020 1
03/12/2020 2
NaT 3
NaT 4
NaT 5
04/01/2020 6
and you can see that this is not working either.
If you are sure none of the dates involved go back more than 100 years, you can easily apply the pivot method which was in use in the last century (before the Y2K bug warned the world of the danger of the method).
Dates used to be coded with 2 digits only, knowing that 87 actually meant 1987; a user (or a computer) would add the missing century automatically.
In your case, you can read the full table, parse the dates, then it is easy to detect which dates are inconsistent. Identify them, correct them, and you are good to go.
With your example:
a = readtable('example.csv') ; % read the file
dates = datetime(a.date) ; % extract first column and convert to [datetime]
idx2change = dates.Year < 2000 ; % find which dates were in the short format
dates.Year(idx2change) = dates.Year(idx2change) + 2000 ; % Correct truncated years
a.date = dates % reinject corrected [datetime] array into the table
yields:
a =
date values
___________ ______
11-Mar-2020 1
12-Mar-2020 2
14-Mar-2020 3
15-Mar-2020 4
16-Mar-2020 5
01-Apr-2020 6
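If you would rather not patch the Year property after the fact, another sketch (assuming, as above, that every two-digit year means 20xx, and reusing the example.csv name from the question) is to normalize the year text before parsing:
a = readtable('example.csv', 'Format', '%s %d', 'Delimiter', ','); % read dates as plain text
% turn a trailing two-digit year ('3/14/20') into a four-digit one ('3/14/2020');
% four-digit years are untouched because their last two digits do not follow a '/'
s = regexprep(a.date, '/(\d\d)$', '/20$1');
% 'M/d/uuuu' accepts one- or two-digit months and days, so both
% '03/11/2020' and '3/14/2020' parse with the same format
a.date = datetime(s, 'InputFormat', 'M/d/uuuu');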
Instead of specifying the format explicitly (as I also suggested before), one should use import options; in the case of a csv file, that means delimitedTextImportOptions:
opts = delimitedTextImportOptions('NumVariables',2,...% how many variables per row?
'VariableNamesLine',1,... % is there a header? If yes, in which line are the variable names?
'DataLines',2,... % in which line does the actual data start?
'VariableTypes',{'datetime','double'})% as what data types should the variables be read
readtable('myfile.csv',opts)
because this neat little feature recognizes the format of the datetime automatically, as it knows that the column must be a datetime object =)

Need to explain the kdb/q script to save partitioned table

I'm trying to understand this code snippet from:
https://code.kx.com/q/kb/loading-from-large-files/
so that I can customize it myself (e.g. partition by hours, minutes, number of ticks, ...):
$ cat fs.q
\d .Q
/ extension of .Q.dpft to separate table name & data
/ and allow append or overwrite
/ pass table data in t, table name in n, : or , in g
k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
{[d;g;t;i;x]#[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;
#[;f;`p#]#[d;`.d;:;f,r#&~f=r:!r];n}
/ generalization of .Q.dpfnt to auto-partition and save a multi-partition table
/ pass table data in t, table name in n, name of column to partition on in c
k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]}
\d .
r:flip`date`open`high`low`close`volume`sym!("DFFFFIS";",")0:
w:.Q.dcfgnt[`:db;`date;`sym;,;`stats]
.Q.fs[w r#]`:file.csv
But I couldn't find any resources that explain it in detail. For example:
if[~&/qm'r:+en[d]t;'`unmappable];
what does it do with the parameter d?
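For what it's worth, here is a hedged reading of that particular line, pieced together from the documented .Q.en and basic k primitives; qm is an undocumented internal, so treat this as informed guesswork:
/ if[~&/qm'r:+en[d]t;'`unmappable]
/ en[d]t        .Q.en[d;t]: enumerate the symbol columns of t against the
/               sym file under directory d, so d is the database root, e.g. `:db
/ +             flip: turn the enumerated table into a column dictionary r
/ qm'           run the internal "is this column mappable?" check on each column
/ ~&/           not min over the results: true if any column failed the check
/ '`unmappable  signal the `unmappable error in that case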
(Promoting this to an answer as I believe it helps answer the question).
Following on from the comment chain: in order to translate the k code into q code (or simply to understand the k code) you have a few options, none of which are particularly well documented as it defeats the purpose of the q language - to be the wrapper which obscures the k language.
Option 1 is to inspect the built-in functions in the .q namespace
q).q
| ::
neg | -:
not | ~:
null | ^:
string | $:
reciprocal| %:
floor | _:
...
Option 2 is to inspect the q.k script which creates the above namespace (be careful not to edit/change this):
vi $QHOME/q.k
Option 3 is to lookup some of the nuggets of documentation on the code.kx website, for example https://code.kx.com/q/wp/parse-trees/#k4-q-and-qk and https://code.kx.com/q/basics/exposed-infrastructure/#unary-forms
Option 4 is to search online for reference material for other/similar versions of k, for example k2/k3. They tend to be similar-ish.
A final point to note is that in most of these examples you'll see a colon (:) after the primitives. This colon is required in q/kdb to use the monadic form of the primitive (most are heavily overloaded), while in k it is not required to explicitly force the monadic form. This is why where will show as &: in the q reference but will usually be just & in actual k code.
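For example, both of these lines can be typed at the same q prompt (a sketch):
q)k)&0 1 1 0b      / k: monadic & is "where", no colon needed
1 2
q)where 0 1 1 0b   / q: the keyword wrapping the same primitive (&: in q.k)
1 2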

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. For instance, the following command[*]:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
generates output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it into a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb, so what would be the best way to extract the desired component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
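Since the title asks about Perl specifically: one way to collapse the cut step and the quoting advice into a single command (a sketch; it assumes reVerb emits one tab-separated record per sentence with the normalized triple in fields 16-18, as in your listing):
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q \
| perl -F'\t' -lane 'print "\x27$F[16]\x27($F[15], $F[17])."'
which should print a fact you can consult directly:
'be source of'(bananas, potassium).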
sed -n 'N;N
:cycle
$!{N
D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
if the numbers are in the output and not just for reference, change the last sed action to
s/^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
assuming the last 3 lines are the source of your "rules"
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to avoid this kind of fact (i.e., facts of the form be/1 instead of source_of/2). If this relation is all you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
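For example, with that fact loaded, a query looks like:
?- source_of(X, b).
X = a.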
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

Org (version 7.9) converts periods and hyphens in tables to 0

I am using the Org mode that comes with Emacs 24.3, and I am having an issue that when Org creates a table from the result of a code block it is replacing characters like '-' and '.' with 0 (integer zero). Then when I pass the table to another code block that's expecting a column of strings I get type errors etc.
I haven't been able to find anything useful, as it seems to be practically un-Googleable. Has anyone had the same problem? If I update to the latest version of org-mode, will that fix it?
EDIT:
I updated to Org 8.2 and this problem seems to have gone away. Now I have another (related) problem, where returning a table with a cell containing a string consisting of one double quote character ('"' in python) messes something up; Org added 2 extra columns to the table, one of which had something like
(quote (quote ) ())
in it. The reason my tables have things like this in them is that I'm working with part-of-speech tags from natural language data.
It's pretty obvious that Org is doing some work to interpret the table contents, and not dealing well with metacharacters. Technically I think these are bugs: Org should deal better with unexpected input.
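My guess at the mechanism (an assumption on my part; I have not read the Org source) is that cells are coerced with something like Emacs Lisp's string-to-number, which returns 0 for any string it cannot parse as a number:
(string-to-number ".")    ; => 0
(string-to-number "-")    ; => 0
(string-to-number "13")   ; => 13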
EDIT 2:
Here is a minimal reproduction with Org 7.9.3f (system Python is 3.4):
#+TBLNAME: table
| DT | The |
| . | . |
| - | - |
#+BEGIN_SRC python :var table=table
return table
#+END_SRC
#+RESULTS:
| DT | The |
| 0 | 0 |
| 0 | 0 |
Incidentally, Org does not like the '"' character at all, in tables or in code blocks (I just get an "End of file during parsing" message when the above table has a cell with just '"' in it). It's probably better to avoid it altogether, so I think my problem is solved. If nobody wants to add anything, I'll answer this myself in a day or so.