Talend using Tmaps to convert 1M to 1000000 and 1K to 1000 - talend

I am trying to map a string column containing numeric values such as 10M and 10K into a column of another table. I need to map them as numbers, replacing 10M with 10000000 and 10K with 10000. What is the best way to do that? I am new to Talend, so any help would be appreciated.

You can use this expression in your tMap:
Relational.ISNULL(row1.col1 ) || "".equals(row1.col1 ) ? null : Integer.parseInt(StringHandling.CHANGE(StringHandling.CHANGE(row1.col1,"K","000"),"M","000000"))
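The replace-then-parse idea in that expression can be sketched outside Talend as well. The Python snippet below is only an illustration of the logic, not Talend code: the function name expand_suffix is hypothetical, and it assumes every value is a whole number followed by an optional K or M.

```python
from typing import Optional


def expand_suffix(value: Optional[str]) -> Optional[int]:
    """Turn '10K' into 10000 and '10M' into 10000000; None or '' gives None."""
    # Mirrors the ISNULL / "".equals guard in the tMap expression.
    if value is None or value == "":
        return None
    # Same idea as the nested StringHandling.CHANGE calls:
    # swap each suffix for the matching run of zeros, then parse.
    digits = value.replace("K", "000").replace("M", "000000")
    return int(digits)


print(expand_suffix("10M"))
print(expand_suffix("10K"))
```

In the tMap itself, Long.parseLong may be the safer choice if values such as 3000M can occur, since 3000000000 does not fit in a Java int.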

Changing a functional qSQL query to involve multiple columns in calculation KDB+/Q

I have a ? exec query like so:
t:([]Quantity:1 2 3;Price:4 5 6;date:2020.01.01 2020.01.02 2020.01.03);
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Quantity))]
to get me the sum of the Quantity in the given date range. I want to adjust this to get me the sum of the Notional in the date range; Quantity*Price. So the result should be (1x4)+(2x5)=14.
I tried things like the following
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Price*`Quantity))]
but couldn't get it to work. Any advice would be greatly appreciated!
I would advise in such a scenario to think about the qSql style query that you are looking for and then work from there.
So in this case you are looking, I believe, to do something like:
select sum Quantity*Price from t where date within 2020.01.01 2020.01.02
You can then run parse on this to break it into its functional form, i.e. the ? exec query you refer to.
q)parse"select sum Quantity*Price from t where date within 2020.01.01 2020.01.02"
?
`t
,,(within;`date;2020.01.01 2020.01.02)
0b
(,`Quantity)!,(sum;(*;`Quantity;`Price))
This is your functional form that you need; table, where clause, by and aggregation.
You can see your quantity here is just the sum of the multiplication of the two columns.
q)?[t;enlist(within;`date;(2020.01.01;2020.01.02));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]
Quantity
--------
14
You could also extend this to change the column as necessary and create a function for it too, if you so wish:
q)calcNtnl:{[sd;ed] ?[t;enlist(within;`date;(sd;ed));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]}
q)calcNtnl[2020.01.01;2020.01.02]
Quantity
--------
14
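To see why the aggregation has to be written in prefix form as (sum;(*;`Quantity;`Price)) rather than with an infix `Price*`Quantity, it may help to look at how such a tree gets evaluated. Below is a rough Python sketch of that idea. It is not how q works internally, and the names evaluate, where_within and FUNCS are made up for the illustration.

```python
# The example table t, one list per column. Dates are kept as strings;
# the yyyy.mm.dd layout makes plain string comparison order them correctly.
t = {
    "Quantity": [1, 2, 3],
    "Price": [4, 5, 6],
    "date": ["2020.01.01", "2020.01.02", "2020.01.03"],
}


def times(xs, ys):
    """Element-wise product of two columns."""
    return [x * y for x, y in zip(xs, ys)]


FUNCS = {"*": times, "sum": sum}


def where_within(table, column, low, high):
    """Keep only the rows whose value in `column` lies in [low, high]."""
    keep = [i for i, v in enumerate(table[column]) if low <= v <= high]
    return {name: [vals[i] for i in keep] for name, vals in table.items()}


def evaluate(tree, table):
    """Evaluate a prefix tree: either a column name or (function, arg, ...)."""
    if isinstance(tree, str):
        return table[tree]
    func, *args = tree
    return FUNCS[func](*(evaluate(arg, table) for arg in args))


filtered = where_within(t, "date", "2020.01.01", "2020.01.02")
total = evaluate(("sum", ("*", "Quantity", "Price")), filtered)
print(total)  # 14, i.e. (1*4)+(2*5)
```

The column names only become data once the tree is evaluated against the table, which is why multiplying the two symbols directly cannot work.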

sort data in hdb by using dbmaint.q in kdb

I am trying to sort 1 or 2 columns in a hdb in kdb but failed. This is the code I have
fncol[dbdir;`trade;`sym;xasc];
and got a length error when I called it. But I don't have a length error if I use this code
fncol[dbdir;`trade;`sym;asc];
However, this only sorts the sym column itself. I want the data in the other columns to be reordered according to the sym column as well.
In addition, I would like to apply the parted attribute to the sym column. I also tried to sort this way:
fncol[dbdir;`trade;`sym`ptime;xasc];
but that failed too.
You should always be careful with dbmaint.q if you are unsure what it is going to do. I gather from the fact asc worked after xasc that you are using a test hdb each time.
fncol should be used with unary functions, i.e. functions of 1 argument. Its use case is modifying individual columns. What you are trying to do is modify the entire table, since you want to sort the whole table relative to the sym column. Using .Q.dpft for each date is what you want, as outlined by Cathal in your follow-up question, "using .Q.dpft function to resave table".
When you run fncol[dbdir;`trade;`sym;xasc]; you are saving down a projection in place of the sym column in each date.
fncol[`:.;`trades;`sym;xasc];
select from trades where date = 2014.04.21
'length
[0] select from trades where date = 2014.04.21
q)get `:2014.04.21/trades/sym
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}[`p#`sym$`A..
// This is the k definition of xasc with the sym column as the first parameter.
q)xasc
k){$[$[#x;~`s=-2!(0!.Q.v y)x;0];.Q.ft[#[;*x;`s#]].Q.ord[<:;x]y;y]}
// Had you needed to fix your hdb, I managed to undo this using value and indexing to the sym col data.
fncol[`:.;`trades;`sym;{(value x)[1]}];
q)select from trades where date = 2014.04.21
date sym time src price size
------------------------------------------------------------
2014.04.21 AAPL 2014.04.21D08:00:12.155000000 N 25.31 2450
2014.04.21 AAPL 2014.04.21D08:00:42.186000000 N 25.32 289
2014.04.21 AAPL 2014.04.21D08:00:51.764000000 O 25.34 3167
asc will not break the hdb as it just takes 1 argument and saves down ONLY the sym column in ascending order not the table.
Is there any indication of what date is failing with a length error? It could be something wrong with one of the partitions.
Perhaps if you try to load one of the dates into memory and sort it manually, i.e.
`sym xasc select from trade where date=last date
that might indicate if there's a specific partition causing issues.
FYI, if you're interested in applying the p# attribute you should try setattrcol in dbmaint.q. I think the data will need to be sorted first though.
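The difference between sorting one column on its own (what asc does through fncol) and sorting the whole table by that column can be shown with a small Python sketch. This is not kdb+ code, and the three rows are made-up sample data used only for the illustration.

```python
# Hypothetical sample rows standing in for one date partition.
rows = [
    {"sym": "MSFT", "price": 30.1},
    {"sym": "AAPL", "price": 25.3},
    {"sym": "IBM", "price": 90.5},
]

# Sorting only the sym column: the other columns keep their old order,
# so each price now sits next to the wrong sym.
sym_only = sorted(r["sym"] for r in rows)
prices = [r["price"] for r in rows]
misaligned = list(zip(sym_only, prices))

# Sorting whole rows by sym: every column moves together.
by_sym = sorted(rows, key=lambda r: r["sym"])
aligned = [(r["sym"], r["price"]) for r in by_sym]

print(misaligned)  # [('AAPL', 30.1), ('IBM', 25.3), ('MSFT', 90.5)]
print(aligned)     # [('AAPL', 25.3), ('IBM', 90.5), ('MSFT', 30.1)]
```

The second form is the behaviour you want from a table-level sort, which is why a per-column tool is the wrong fit here.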

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a cassandra table using scala and apache-spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
import org.apache.spark.sql.functions.{col, count, max}
val mydf = df.groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
For each distinct combination of the four group-by columns, this computes the number of rows with a received_time value (row_count) and the latest received_time (max_received_time). To get one row per src_ip instead, group by src_ip alone.
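If it helps to see what the grouping produces, here is a plain-Python sketch of the same count-and-max aggregation. It is not Spark code, and the three rows and their integer timestamps are made-up sample data.

```python
from collections import defaultdict

# Hypothetical rows: (src_ip, src_port, dst_ip, dst_port, received_time).
rows = [
    ("10.0.0.1", 80, "10.0.0.9", 443, 100),
    ("10.0.0.1", 80, "10.0.0.9", 443, 250),
    ("10.0.0.2", 22, "10.0.0.9", 443, 175),
]

# Collect the received_time values of every group.
groups = defaultdict(list)
for src_ip, src_port, dst_ip, dst_port, received_time in rows:
    groups[(src_ip, src_port, dst_ip, dst_port)].append(received_time)

# One output row per group: how many rows it had and the latest timestamp.
summary = {
    key: {"row_count": len(times), "max_received_time": max(times)}
    for key, times in groups.items()
}

for key, stats in summary.items():
    print(key, stats)
```

Each key of summary corresponds to one row of the aggregated dataframe.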

AWS Athena: Handling big numbers

I have files on S3 where two columns contain only positive integers, which can be as large as 10^26. Unfortunately, according to the AWS docs, Athena only supports integer values up to 2^63-1 (approx 10^19). So at the moment these columns are represented as strings.
When it comes to filtering it is not that big of an issue, as I can use regex. For example, if I want to get all records between 5e^21 and 6e^21 my query would look like:
SELECT *
FROM database.table
WHERE (regexp_like(col_1, '^5[\d]{21}$'))
I have approx 300M rows (approx 12GB in parquet) and it takes about 7 seconds, so performance-wise it is OK.
However, sometimes I would like to perform some math operation on these two big columns, e.g. subtract one big column from another. Casting these records to DOUBLE wouldn't work due to approximation error. Ideally, I would want to stay within Athena. At the moment, I have about 100M rows that are greater than 2^63-1, but this number can grow in the future.
What would be the right way to approach problem of having numerical records that exceed available range? Also what are your thoughts on using regex for filtering? Is there a better/more appropriate way to do it?
You can cast numbers of the form 5e21 to an approximate 64bit double or an exact numeric 128bit decimal. First you'll need to remove the caret ^, with the replace function. Then a simple cast will work:
SELECT CAST(replace('5e^21', '^', '') as DOUBLE);
_col0
--------
5.0E21
or
SELECT CAST(replace('5e^21', '^', '') as DECIMAL);
_col0
------------------------
5000000000000000000000
If you are going to query this table often, I would rewrite it with the new data type to save processing time.
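The gap between the approximate double and the exact decimal matters for the subtraction mentioned in the question. Here is a small Python sketch of the effect. It is not Athena SQL, and it assumes a decimal type with 38 significant digits, which as far as I know is the maximum precision Athena's DECIMAL allows.

```python
from decimal import Decimal, getcontext

# Assumption: mirror a 38-digit decimal type.
getcontext().prec = 38

# Two values around 10^26, kept as strings like the columns in the table.
a = str(10**26 + 1)
b = str(10**26)

# A 64-bit double cannot tell these two values apart,
# so the difference collapses to zero.
float_diff = float(a) - float(b)

# The decimal type keeps every digit, so the difference is exact.
exact_diff = Decimal(a) - Decimal(b)

print(float_diff)  # 0.0
print(exact_diff)  # 1
```

So for arithmetic on these columns, the decimal cast is the one to use; the double cast is only suitable when an approximate magnitude is enough.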

Reading formula from a different collection

Here is my scenario:
I have 2 XLS files :
One with 5 columns : date, service, value1, value2, Value3
One with the formulas to apply: Service, Type_of_aggregation, columns_to_aggregate
According to the service you apply a formula (sum or avg of the values of a column.)
Questions:
- How can I tell MongoDB to retrieve and apply the specified formula?
Example:
We insert both XLS files into separate collections.
We somehow trigger the calculation, for example: for Service1, do a sum of all the values of column_value1.
We store the result in a new collection.
I hope you understand what I am trying to do, and I am hoping to get some help on the best way to do this.
Thank you,