Stata merge changing my values?

I feel like I'm missing something really basic here...
I am trying to merge two datasets in Stata, FranceSQ.dta and FranceHQ.dta. They both have a variable that I created named "uid" that uniquely identifies the observations.
use FranceSQ, clear
merge 1:1 uid using FranceHQ, gen(_merge) keep(match)
Now what's confusing me is that it tells me that uid doesn't uniquely identify my observations. What I realized is happening is that when I open FranceSQ, everything is normal, and when I look at my uid variable, I have the following values...
25010201
25010202
25010203
...
But then once I try to run the merge, it changes all of my values, so that I see...
2.50101e+10
2.50101e+10
2.50101e+10
...
Any help would be very appreciated...I'm sure there's a simple answer but it's eluding me at the moment.
*** EDIT ***
So Nick's advice helped, thanks! This is what I was doing that went wrong, so I wonder if someone could point out why it didn't work.
1) I created the uid variable in each dataset by concatenating two numeric variables, which cast the uid variable as a string.
2) I ran destring on the whole dataset (because there were a lot of incorrectly cast variables), which turned uid into a double.
3) Then I recast uid as a string. It was at this point that I was unable to do the initial merge. I noticed that the value it was changing all of my observations to was the last value in the dataset.
4) Just because I was tweaking around, I recast the uid variable as double, and was getting the same results.
Now I finally got it to work by just starting over and not recasting the uid variable as a string in the first place, but I'm still at a loss as to why my previous efforts did not work, or how the merge command actually decided to change my values.

Very likely, this is a problem with precision. Long integers need to be held in long or double data types. You might need to recast one identifier before the merge.
You should check by looking at the results of describe whether uid has the same data type in both datasets.
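A minimal check along these lines (assuming both files are in the working directory):

* compare the storage type of uid in the two files
use FranceSQ, clear
describe uid
use FranceHQ, clear
describe uid

* if uid is float in either file, promote it before merging
recast double uid
save FranceHQ, replace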

To check whether your variable really identifies observations, type isid uid. Stata would complain if uid is not a unique identifier when performing merge, anyway, but that's a useful check on its own. If uid passes the check in both files, it should still do so in the merged file; it must be failing in at least one of the source files in order to fail in the merged file.
On top of Nick Cox's answer concerning data types, the issue may simply be formatting. Type describe uid to find out what the current format is, and maybe format uid %12.0f to get rid of the scientific notation.
I think Stata promotes variables to the more accurate storage type when it needs to, say when you replace an integer-valued variable with non-integer values; the same thing should happen with merge when you have, say, byte values in one data set and merge in float values for the same variable from the other data set.
Missing values in uid may be the reason Stata does not believe this variable works well. Check for these, too, before and after the merge (see help data types, referenced above, concerning the valid ranges for each type).
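The corresponding checks in code, assuming uid is numeric:

use FranceSQ, clear
isid uid                 // errors out if uid does not uniquely identify the observations
format uid %12.0f        // display all digits instead of scientific notation
count if missing(uid)    // look for missing identifiers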

How does Snowflake calculate its HASH() output?

Take a look at this query
select
    hash( col1, col2 ) as a,
    col1 || col2 as b, -- just taking a guess as to how hash can take multiple values
    hash( b ) as c
from table_name
The results for a and c are different.
So, my question is: how does Snowflake calculate the hash when there are multiple fields, as in a? Is it concatenating the fields first and then hashing the result?
Thank you
To add to NickW's point that HASH is proprietary:
HASH is a proprietary function that accepts a variable number of input expressions of arbitrary types and returns a signed value. It is not a cryptographic hash function and should not be used as such.
I assume the core of what you are trying to achieve is to compute a value in another system and be able to compare the two "safely". Concatenating strings together seems very dangerous for that, because the number and length of the individual strings are properties that concatenation throws away: 'ab' || 'c' and 'a' || 'bc' produce the same result.
The usage notes section has some good hints:
Any two values of type NUMBER that compare equally will hash to the same hash value, even if the respective types have different precision and/or scale.
This implies that values are converted to a common form before hashing. But the notes also warn about conversion:
Note that this guarantee does not apply to other combinations of types, even if implicit conversions exist between the types.
What would really help is for you to describe what you want to happen; then whether "knowing how HASH works" is the best path to that end (or not, as I would suggest) becomes answerable.
In other words, this answer is a long-form comment suggesting that the question needs to be reworked.
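As a sketch of both points (col1, col2, and table_name are the asker's names; the delimiter and SHA2 are illustrative assumptions): the multi-argument form is not the same as hashing a concatenation, and a delimited concatenation fed to a standard function such as SHA2 is reproducible outside Snowflake:

select
    hash(col1, col2) as multi_arg,               -- Snowflake's proprietary multi-argument hash
    hash(col1 || col2) as concatenated,          -- different result: a single string argument
    sha2(col1 || '|' || col2, 256) as portable   -- delimited concatenation + standard SHA-256
from table_name;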

Error in makeClassifTask - columns to join must specify "on="

I am getting an error here for makeClassifTask() from the mlr package.
task = makeClassifTask(data = data[,2:20441], target='Disease')
Entering this, I get the following error:
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
Error in `[.data.table`(data, target) :
When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
If someone could help me out it'd be great.
Given that you did not provide the data, I can only guess and suggest reading the documentation at https://mlr3book.mlr-org.com/tasks.html.
It looks like you left out the first column in your dataset, which might be your target. Hence makeClassifTask() cannot find your target column.
As Shreyash Gputa correctly pointed out, converting the data.table object to a data.frame solves the issue:
task = makeClassifTask(data = as.data.frame(data[,2:20441]), target='Disease')
Given of course that data[,2:20441] contains the target variable Disease...
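A sketch of the same fix using setDF() from the data.table package, which converts in place (assuming data is a data.table and Disease is among the retained columns):

library(data.table)
library(mlr)

setDF(data)  # convert the data.table to a plain data.frame in place
task = makeClassifTask(data = data[, 2:20441], target = 'Disease')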

Set nextval sequence data type to integer only

I have an issue running around my mind regarding the default for the 'id' field in my PostgreSQL database. Here is the syntax:
nextval('unsub_keyword_id_seq'::regclass)
However, I don't really understand it even after reading the documentation, and I would like the value to be an integer (digits only). I tried to alter the column by changing regclass to other OIDs, but each time it returns errors.
I'd really appreciate it if this can get solved soon.
Update:
Some questions about the column's data type came to mind after trial and error with the code that produces the id for the column:
Does integer (PostgreSQL's, in this case) have its own default length or not?
If I need to insert a long id, should I set the column length?
Kindly advise.
Sorry if my questions are confusing; your comments may help me improve them.
From the comments:
I need to insert an id with a length of 50, consisting of 2 letters with the rest numeric. The problem occurs because the data type is integer, so inserting the data is unsuccessful. Is it possible to insert my desired data while keeping the data type as integer?
If I understand this correctly, you probably need to format a string, e.g.
format('%s%s', 'XX', nextval('some_sequence_name'))
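A minimal sketch of wiring this up as a column default, reusing the sequence name from the question (the table name and the 50-character layout of 2 letters plus 48 padded digits are assumptions based on the comment above):

create sequence unsub_keyword_id_seq;

create table unsub_keyword (
    -- 'XX' plus 48 zero-padded digits gives a fixed length of 50
    id text primary key
        default format('XX%s', lpad(nextval('unsub_keyword_id_seq')::text, 48, '0'))
);

insert into unsub_keyword default values;  -- id becomes e.g. 'XX000...0001'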

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key containing both numbers and letters to track lots of agricultural product. Due to the way the weighing of product takes place at more than one facility, I have no other option but to keep the same base number and add letters to it to denote split portions of each lot. The problem is that after I create record number 99, record number 100 sorts up underneath 10. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value (for which I use "autonumber" as the data type) just to keep things sorted. Either way, I need the alphanumeric lot ID, so having 2 IDs for the same lot can be confusing for anyone entering values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, then you may try to sort by the string converted to a number, something like:
SELECT id, field1, field2, ...
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng; it does not fail on non-numeric input (it returns the numeric value of the leading digits, e.g. Val("100A") = 100).
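A sketch with hypothetical table and field names; the secondary sort key keeps letter suffixes such as 100A and 100B in order within each base number:

SELECT LotID
FROM Lots
ORDER BY Val(LotID), LotID;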
Why not format your key properly before saving, e.g. "0000099"? You would avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).
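A hedged sketch of that composite-key design in Access DDL (names hypothetical):

CREATE TABLE Lots (
    BaseNumber LONG NOT NULL,
    Portion TEXT(5) NOT NULL,
    CONSTRAINT pk_lots PRIMARY KEY (BaseNumber, Portion)
);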

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based on that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = case when old_content_type = 'a' then 1
                        when old_content_type = 'b' then 2
                        else 3
                   end;
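Putting the asker's whole plan together, a sketch (the table name and the 'a'/'b' labels are placeholders, as above):

alter table table_name rename column content_type to old_content_type;
alter table table_name add column content_type integer;

update table_name
set content_type = case old_content_type
                       when 'a' then 1
                       when 'b' then 2
                       else 3
                   end;

alter table table_name drop column old_content_type;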
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
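A sketch of the enum alternative, converting the original text column directly rather than going through integers (the type name and labels are placeholders):

create type content_kind as enum ('a', 'b', 'c');

alter table table_name
    alter column content_type type content_kind
    using content_type::content_kind;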
Ideally you'd have these fields referring to a table containing the definitions of type. This should be via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious (see the sketch after this list), but it breaks down when you have a table that requires many attributes.
You can use the Entity-attribute-value model, but beware that this is too easy to abuse and cause problems when things grow.
You can use, or refer to, my PET (Parameter Enumeration Tables) solution. This is a halfway house between 1 and 2.
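A sketch of option 1, the per-field lookup table enforced by a foreign key (all names are placeholders):

create table content_types (
    id integer primary key,
    name text not null unique
);

insert into content_types (id, name)
values (1, 'a'), (2, 'b'), (3, 'c');

alter table table_name
    add constraint fk_content_type
    foreign key (content_type) references content_types (id);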