How to split an address in Talend based on UPPER CASE values?

I want to split the below address from single column to multiple columns using talend.
Input
|ADDRESS|
|15 St. Patrick Rd NORTH WEST LONDON|
Expected Output
|ADDRESS_LINE1 | ADDRESS_LINE2 |
|15 St. Patrick Rd | NORTH WEST LONDON |

The following two regex expressions split your input ADDRESS as specified:
ADDRESS_LINE1 = StringHandling.TRIM(
    input.ADDRESS.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$1")
);
ADDRESS_LINE2 = StringHandling.TRIM(
    input.ADDRESS.replaceAll("^(.+?)(([A-Z]{2,}\\s*?)+)$", "$2")
);
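A quick way to sanity-check the pattern outside Talend is a Scala REPL session (my sketch, not Talend code; Talend expressions use the same Java regex semantics, and StringHandling.TRIM plays the role of .trim here):
val address = "15 St. Patrick Rd NORTH WEST LONDON"
val pattern = "^(.+?)(([A-Z]{2,}\\s*?)+)$"

// Group 1 lazily captures the shortest possible prefix; group 2 captures the
// trailing run of words made of 2+ capital letters, plus the whitespace between them.
val line1 = address.replaceAll(pattern, "$1").trim // "15 St. Patrick Rd"
val line2 = address.replaceAll(pattern, "$2").trim // "NORTH WEST LONDON"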

Related

How can I compare 2 tables in PostgreSQL?

I have a table named hotel with 2 columns: hotel_name, hotel_price
hotel_name | hotel_price
hotel1 | 5
hotel2 | 20
hotel3 | 100
hotel4 | 50
and another table named city that contains the columns: city_name, average_prices
city_name | average_prices
paris | 20
london | 30
rome | 75
madrid | 100
I want to find which hotel has a price that's more expensive than the average prices in the cities. For example, I want to end up with something like this:
hotel_name | city_name
hotel3 | paris --hotel3 is more expensive than the average price in paris
hotel3 | london --hotel3 is more expensive than the average price in london etc.
hotel3 | rome
hotel4 | paris
hotel4 | london
(I found the hotels that are more expensive than the average prices of the cities)
Any help would be valuable, thank you.
A simple join is all that is needed. Typically tables are joined on a defined relationship (PK/FK), but there is nothing requiring that. See fiddle.
select h.hotel_name, c.city_name
from hotel h
join city c
on h.hotel_price > c.average_prices;
However, while you can get the desired results, it's pretty meaningless. You cannot tell whether a particular hotel is even in a given city.

Update telephone number format to include country identifier

Being a beginner in SQL, I am trying to update the telephone fields to follow the format: '"+" + country identifier + telephone number'.
UPDATE public.contact
SET phone_number = CASE
        WHEN country_code = 'FR'
             AND phone_number NOT LIKE '+33%'
             AND phone_number <> NULL
        THEN CONCAT('+33', phone_number)
        WHEN country_code = 'GB'
             AND phone_number NOT LIKE '+44%'
             AND phone_number <> NULL
        THEN CONCAT('+44', phone_number)
        ELSE phone_number
    END;
I want to update the telephone number format to include the country identifier, e.g. 0606080905 -> +33606080905 if country_code = 'FR'. I am looking for a faster and less complex way than what I did.
You can do this with a regular expression using regexp_replace.
Imagine your data being table 'numbers':
+---------+--------------+
| country | phone        |
+---------+--------------+
| FR      | 0606080905   |
| FR      | +33606080906 |
| GB      | 0123456789   |
| GB      | +44987654321 |
| GB      | NULL         |
+---------+--------------+
Then the following update would replace the leading 0 with the country code +33 for all numbers that do not start with a +xx and have FR as country.
UPDATE numbers
SET phone = REGEXP_REPLACE(trim(phone), '^(0)', '+33')
WHERE country = 'FR'
Explained:
the ^ means start of the string
the (0) is the match that gets replaced (leading zero)
the +33 is the string that is used to replace it
the trim() is just added for safety, in case there are leading spaces
NULL phone numbers won't be affected, as they do not match
You could now do this as you did before, with a CASE WHEN or something similar for each of the different possibilities. But since the expression is always the same, an easier way is to keep your country codes and their numerical prefixes in a separate mapping table:
+---------+--------+
| country | prefix |
+---------+--------+
| FR      | +33    |
| GB      | +44    |
+---------+--------+
You could then do
UPDATE numbers n
SET phone = REGEXP_REPLACE(trim(phone), '^(0)', prefix)
FROM mapping m
WHERE m.country = n.country
and update all your numbers in one go:
+---------+--------------+
| country | phone        |
+---------+--------------+
| FR      | +33606080905 |
| FR      | +33606080906 |
| GB      | +44123456789 |
| GB      | +44987654321 |
| GB      | NULL         |
+---------+--------------+
EDIT: Previously, I had this needlessly complicated answer. You may need something like this if your phone number patterns are more diverse...
The following update would replace the leading 0 with the country code +33 for all numbers that do not start with a +xx and have FR as country.
UPDATE numbers
SET phone = REGEXP_REPLACE(trim(phone), '^(?<![+\d{2}])(0)', '+33')
WHERE country = 'FR'
Explained:
the (?<!...) construct is a negative lookbehind assertion, intended to make sure the regex only matches if the zero is not already preceded by a + country prefix
the (0) is the match that gets replaced
the +33 is the string that is used to replace it
the trim() is just added for safety, in case there are leading spaces
NULL phone numbers won't be affected, as they do not match
That's about as simple as it gets.
The only way I can imagine to speed up processing is to add a WHERE condition that avoids updating the rows that don't have to be modified.
You could also run several such statements in parallel, where each modifies a different part of the table.
As mentioned in the comment, <> NULL is never true.

How to filter a column in a data frame by the regex value of another column in the same data frame in PySpark

I am trying to filter a column in a data frame so that it matches the regex pattern given in another column:
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what is the movie that features Tom Cruise|artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+
I want to prune the data frame so that it only keeps rows whose query matches the value of the regex_patt column.
The final result should look like this:
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+
I was thinking of
df.filter(column('query').rlike('regex_patt'))
But rlike only accepts regex strings.
Now the question is, how to filter the "query" column based on the regex value of "regex_patt" column?
You could try this. Using an SQL expression allows you to pass columns as both the str and pattern arguments of regexp_extract:
from pyspark.sql import functions as F
df.withColumn("query1", F.expr("""regexp_extract(query, regex_patt)""")).filter(F.col("query1")!='').drop("query1").show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
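As a side note (my suggestion, not part of the original answer): rlike can also compare two columns when written as an SQL expression, which avoids the helper column. A minimal sketch in Scala; in PySpark the identical string can be passed to F.expr:
import org.apache.spark.sql.functions.expr

// Keep only rows whose query matches the regex stored in regex_patt.
// rlike performs a partial match, so no extracted helper column is needed.
df.filter(expr("query rlike regex_patt")).show(false)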

Dataframe search from list and all elements found in a new column in Scala

I have a df and I need to search whether it contains any elements from a list of keywords. If yes, I need to put all the matching keywords, separated by #, into a new column called foundornot.
My df is like
utid | description
123 | my name is harry and I live in newyork
234 | my neighbour is daniel and he plays hockey
The list is quite big, something like list = {harry, daniel, hockey, newyork}
The output should be like:
utid | description | foundornot
123 | my name is harry and I live in newyork | harry#newyork
234 | my neighbour is daniel and he plays hockey | daniel#hockey
The list is quite big, around 20k keywords. Also, in case nothing is found, print NF.
You can check, in a udf function, which elements of the list exist in each row's description column, and return the matching elements as a #-separated string, or the string NF if none match:
val list = List("harry","daniel","hockey","newyork")
import org.apache.spark.sql.functions._
def checkUdf = udf((strCol: String) => {
  val found = list.filter(strCol.contains(_))
  if (found.isEmpty) "NF" else found.mkString("#")
})
df.withColumn("foundornot", checkUdf(col("description"))).show(false)
which should give you
+----+------------------------------------------+-------------+
|utid|description |foundornot |
+----+------------------------------------------+-------------+
|123 |my name is harry and I live in newyork    |harry#newyork|
|234 |my neighbour is daniel and he plays hockey|daniel#hockey|
+----+------------------------------------------+-------------+
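A note on scale (my assumption; the original answer does not address it): with ~20k keywords, the list is captured in the udf closure and shipped with every task. A common mitigation is to broadcast it once per executor, along the lines of:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq(
  (123, "my name is harry and I live in newyork"),
  (234, "my neighbour is daniel and he plays hockey")
).toDF("utid", "description")

// Broadcast the big keyword list once instead of serializing it into every task.
val keywords = spark.sparkContext.broadcast(List("harry", "daniel", "hockey", "newyork"))

val checkUdf = udf { (desc: String) =>
  val found = if (desc == null) Nil else keywords.value.filter(desc.contains)
  if (found.isEmpty) "NF" else found.mkString("#")
}

df.withColumn("foundornot", checkUdf(col("description"))).show(false)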

Update dataset in spark-shell by breaking one element into multiple parts and inserting a row for each part

I have a use case where I am storing my data in a dataset. I have a column where a row can hold multiple values separated by a pipe (|). So, a typical row looks like this:
2016/01/01 1/XYZ PQR M|N|O
I want this row to be converted into 3 rows as follows:
2016/01/01 1/XYZ PQR M
2016/01/01 1/XYZ PQR N
2016/01/01 1/XYZ PQR O
Also, not all rows contain a pipe (|) in the last column; some rows can look like one of the above. I was trying to split the concerned column on the pipe (|), but it gives an error for the rows not containing a pipe (|). I couldn't think of any further solution.
What is the best way to achieve this using spark-shell in Scala?
For your use case you have to use both split and explode (as mentioned by @Pushkr).
df.withColumn("new", split($"col4", "[|:]+")).drop("col4").withColumn("col4", explode($"new")).drop("new").show
Here df is the DataFrame containing the 2016/01/01 1/XYZ PQR M|N|O data. To split by any delimiter, you have to build the pattern according to your requirements; in the code above I am using the [|:]+ pattern to split the string by either | or :.
For Example:
2016/01/01,1/XYZ,PQR,M|N|O
2016/02/02,2/ABC,DEF,P:Q:R
Will result in:
+-----------+------+----+----+
| col1| col2|col3|col4|
+-----------+------+----+----+
|2016/01/01 |1/XYZ |PQR | M |
|2016/01/01 |1/XYZ |PQR | N |
|2016/01/01 |1/XYZ |PQR | O |
|2016/02/02 |2/ABC |DEF | P |
|2016/02/02 |2/ABC |DEF | Q |
|2016/02/02 |2/ABC |DEF | R |
+-----------+------+----+----+
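For completeness, a self-contained sketch that can be pasted into spark-shell (where spark.implicits._ is already in scope); the column names col1..col4 and the extra delimiter-free row are my assumptions, added to show that such rows survive unchanged:
import org.apache.spark.sql.functions.{col, explode, split}

val df = Seq(
  ("2016/01/01", "1/XYZ", "PQR", "M|N|O"),
  ("2016/02/02", "2/ABC", "DEF", "P:Q:R"),
  ("2016/03/03", "3/GHI", "JKL", "S") // no delimiter at all
).toDF("col1", "col2", "col3", "col4")

// split yields a one-element array when no delimiter is present,
// so explode keeps such rows as a single row instead of failing.
df.withColumn("col4", explode(split(col("col4"), "[|:]+"))).show()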
I hope this helps!