Logic to convert string of words to number - hash

I am looking for logic that will help me convert a string to a number in Teradata and Hive.
It should be easy to implement in Teradata, as I don't have permission to deploy a UDF in TD. In Hive, if it is not simple, I can easily write a UDF.
My requirement: let's say I have columns sender_country and receiver_country. I want to generate a number for concat(sender_country, '_', receiver_country).
The number should always be the same whenever the same pair of countries appears again.
Below is an illustration:
UID  sender_country  receiver_country  concat  number
1    US              UK                US_UK   198760
2    FR              IN                FR_IN   146785
3    CH              RU                CH_RU   467892
4    US              UK                US_UK   198760
Every unique combination of countries should get its own unique value, and a repeated combination should always get the same value. In the example above, US_UK appears twice and has the same corresponding number both times.
I tried hashbucket(hashrow('concat')) in TD, but I don't know its equivalent implementation in Hive.
Similarly, Hive has a hash() function, but there is no equivalent function in TD.
I could not find any hash function that returns the same values in both TD and Hive either.

You can simply convert each character into a number:
Ascii(Substr(sender_country,1,1))*1000000+
Ascii(Substr(sender_country,2,1))*10000+
Ascii(Substr(receiver_country,1,1))*100+
Ascii(Substr(receiver_country,2,1))
This returns 85838575 for US, UK.
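For context, here is a sketch of how that expression might look inside a Hive query; the table name transfers and the alias pair_number are illustrative, ascii() and substr() are built-in Hive functions, and recent Teradata releases provide the same ASCII/SUBSTR functions (verify availability on your version):

SELECT
    uid,
    sender_country,
    receiver_country,
    CONCAT(sender_country, '_', receiver_country) AS country_pair,
    -- each character lands in its own pair of decimal digits, so the
    -- same pair of codes always yields the same number
    ASCII(SUBSTR(sender_country, 1, 1)) * 1000000
    + ASCII(SUBSTR(sender_country, 2, 1)) * 10000
    + ASCII(SUBSTR(receiver_country, 1, 1)) * 100
    + ASCII(SUBSTR(receiver_country, 2, 1)) AS pair_number
FROM transfers;

Because every character occupies a fixed pair of digit positions (uppercase letters are ASCII 65 to 90), two different pairs of two-letter country codes can never collide, and the same pair always produces the same number.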

Related

Adding columns using Google Sheets

Is there a way to total a column in Google Sheets when the numbers are imported from a different sheet, while also matching a pair of dates?
=SUMIFS($S$8:S47,$O$8:$O47,">="&T$51,$O$8:$O47,"<="&$V$51)
Most likely this is an issue with QUERY. QUERY likes to "assume" the type of the dataset (numeric/plain text), and 90% of the time it assumes wrongly, so you either end up with empty columns or with missing values. Use this in S6:
=IMPORTRANGE("1nC7e8za4_SjIAbPsRv4loQ8CHinwcp0J43LcyG0qopM", "August 2022!i3:j45")

How to handle NaNs in a pandas DataFrame integer column written to a PostgreSQL database

I have a pandas DataFrame with a "year" column. However, some rows have an np.NaN value due to an outer merge. The data type of the column in pandas is therefore converted to float64 instead of integer (integers cannot store NaNs?). Next, I want to store the DataFrame in a PostgreSQL database. For this I use:
df.to_sql()
Everything works fine, but my PostgreSQL column is now of type "double precision" and the np.NaN values are now [null]. This all makes sense, since the input column type was float64 and not an integer type.
I was wondering if there is a way to store the results in an integer-type column with [nans].
Example Notebook
Result of Ami's answer:
(integer cannot store NaNs?)
No, they cannot. If you look at the PostgreSQL numeric documentation, you can see that the number of bytes and the ranges are completely specified, and integer types have no way to store a NaN.
A common solution in this case is to decide, by convention, that some number logically represents a NaN. In your case, since it is a year, you might choose a negative value (or just -1) for that. Before writing, you could use
df.year = df.year.fillna(-1).astype(int)
Alternatively, you can define another column as year_is_none.
Alternatively, you can store them as floats.
These solutions range from most efficient to least efficient in terms of memory.
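As an illustration of the sentinel-value convention on the database side (the table name my_table is an assumption), the -1 values can be translated back to NULL at query time with standard SQL:

-- the column stays a plain integer in PostgreSQL; NULLIF maps the
-- -1 sentinel back to NULL whenever the "real" year is needed
SELECT NULLIF(year, -1) AS year
FROM my_table;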
You could also use:
df.year = df.year.fillna(-1)
or fill with 0 instead:
df.year = df.year.fillna(0)

Range values in Tableau

I want to visualise the Excel table below in Tableau.
When I add this table to Tableau, it shows the Salary values as String, so they appear under Dimensions rather than Measures, and I cannot make a proper graph from them.
How can I convert these Salary range values to Int?
As @Alexandru Porumb suggested, the best solution is to have a min_salary column and a max_salary column (unless you really have the actual salary available, which is even better).
If you don't want to revise the incoming data, you can get the same effect using the Split() function in a calculated field in Tableau to derive two integer fields from the original string field.
For example, you could define a calculated field called min_salary as INT(SPLIT([Salary], '-', 1)). Split() extracts part of a string based on a separator string. Int() converts the string to an integer.
You could simplify the way it sees the data and separate the salary column into Min and Max, thus you wouldn't have the hyphen that makes Tableau consider the entry as a string.
It's a simplistic idea, I know, but it may help until a better solution is provided.
Hope it helps

Ordered Data with Cassandra RandomPartitioner

I have about a billion pieces of data that I would like to store in Cassandra. The data items are ordered by time, and one of the main queries I'll be doing is to find the items between two time ranges, in order. I'd really prefer to use the RandomPartitioner, if at all possible. Is there a way to do this in Cassandra?
At first, since I'm coming from SQL, I assumed I should create each event as a row, but then it occurred to me that I was thinking about it the wrong way and I should really use columns. Columns in Cassandra seem to be ordered, but I'm confused as to just how ordered they are. If I use a time as the column name, is there a way for me to get all of the columns from one time to another in order?
Another thing I looked at was the 0.7 feature of secondary indices, but I've had trouble finding documentation for whether I can use these to view a range of things in order.
All I want is the Cassandra equivalent of this SQL: "Select * from Stuff where date > X and date < Y order by date asc". How can I do this?
The partitioner only affects the distribution of keys around the ring, not the order of columns within a key. Columns are always ordered according to the Column Comparator defined for the column family.
You can call get_slice with a SlicePredicate that specifies a SliceRange to get all the columns of a key within a range.
To model your data, you can create one row for each day (or other suitable time shard) and have a column for each piece of data. Something like:
"yyyy-mm-dd" : {                          # key, one row for each day
    timeStampMillis1:dataid1 : "value1"   # one column for each piece of data
    timeStampMillis2:dataid2 : "value2"
    timeStampMillis3:dataid3 : "value3"
}
The column names should be binary, using the binary comparator. The first 8 bytes are the timestamp, while the rest of the bytes are the id of the data.
Assuming X and Y are on the same day, to find all items between X and Y, do a get_slice on the day key with a SlicePredicate containing a SliceRange that specifies a start of X and a finish of Y+1. Both start and finish are byte arrays of 8 bytes.
To find data over multiple days, read from multiple keys.
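For comparison only, here is a rough sketch of the same one-row-per-day model expressed in CQL (which arrived in later Cassandra versions); the table and column names are illustrative:

-- the day is the partition key, so rows spread randomly around the ring,
-- while events inside a day are stored in timestamp order and can be sliced
CREATE TABLE events (
    day     text,        -- 'yyyy-mm-dd' shard, one partition per day
    ts      timestamp,   -- event time, kept in clustering order
    data_id text,
    value   text,
    PRIMARY KEY (day, ts, data_id)
);

SELECT * FROM events
WHERE day = '2011-03-15'
  AND ts >= '2011-03-15 08:00:00'
  AND ts <  '2011-03-15 17:00:00';

Reading data that spans multiple days still means querying multiple partition keys, exactly as described above.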

UNION with different data types in DB2 server

I have built a query which contains UNION ALL, but the two parts of it
do not have the same data type. I mean, I have to display one column, but the
two columns from which I get the data have different formats.
So, to give an example:
select a,b
from c
union all
select d,b
from e
a and d are numbers, but they have different formats: a's length is 15
and d's length is 13. There are no digits after the decimal point.
Using DIGITS, VARCHAR, INTEGER, and DECIMAL didn't work.
I always get the message: "Data conversion or data mapping error."
How can I convert these fields to the same format?
I have no DB2 experience, but can't you just cast a and d to the same type, one that is obviously large enough to handle both formats?
I used the CAST function to convert the columns to the same type (VARCHAR with a large length), so I could use UNION without problems. When I needed their original type back again, I used the same CAST function (this time converting the values to FLOAT), and I got the result I wanted.
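A minimal sketch of that workaround, reusing the table and column names from the example above (the VARCHAR length of 20 is an assumption; pick one wide enough for both columns):

-- cast both numeric columns to a common VARCHAR so UNION ALL accepts them
select cast(a as varchar(20)) as val, b
from c
union all
select cast(d as varchar(20)) as val, b
from e

-- when the original numeric values are needed again, cast back,
-- e.g. cast(val as float)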