I am quite new to regular expressions, so I am confused about how to replace numbers inside a string in which several id numbers (1 to 5 digits, so far) are separated by commas and enclosed in curly brackets.
I need to prefix each number with a fixed code like p1_ in order to distinguish different types of objects' ids in the future.
I have a Postgres database with a column "maintainance" in text format which can contain values like the following (cells CANNOT be null or empty):
+---------------+
| maintainance |
+---------------+
| {12541,2,4} |
+---------------+
| {12,131,9999} |
+---------------+
| {54} |
+---------------+
| {1} |
+---------------+
| {12500,65} |
+---------------+
and I need to replace the values, putting "p1_" before each number, like this:
+------------------------+
| maintainance |
+------------------------+
| {p1_12541,p1_2,p1_4} |
+------------------------+
| {p1_12,p1_131,p1_9999} |
+------------------------+
| {p1_54} |
+------------------------+
| {p1_1} |
+------------------------+
| {p1_12500,p1_65} |
+------------------------+
Can you please suggest how to write the replace command using regular expressions?
Thanks in advance
regexp_replace(col, '[0-9]+', 'p1_\&', 'g')
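As a usage sketch (your_table is a hypothetical table name; the column name comes from the question), the replacement could be applied with an UPDATE:
-- your_table is a placeholder for the real table name
UPDATE your_table
SET maintainance = regexp_replace(maintainance, '[0-9]+', 'p1_\&', 'g');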
I am migrating SQL statements from Greenplum to HiveSQL, supported with PySpark scripts. We are encountering an issue with double/decimal values: the values get changed while doing the sum of the double or decimal column.
The source data output is as below:
+------------------------+
| test_amount|
+------------------------+
| 6.987713783438851E8 |
| 1.42712398248124E10 |
| 6.808514167337805E8 |
| 1.5215888908771296E10 |
+------------------------+
The actual Greenplum output when doing the sum of the amount values is as follows:
2022-01-16 2 14970011203.1514433
2022-01-17 2 15896793850.6107666
Spark sum output:
+-------------+------+------------------------+
| _c0 | _c1 | _c2 |
+-------------+------+------------------------+
| 2022-01-16 | 2 | 1.4970011203156286E10 |
| 2022-01-17 | 2 | 1.5896740325505075E10 |
+-------------+------+------------------------+
Please help us.
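One likely cause is that Spark reads test_amount as a double, which cannot hold these sums exactly. A minimal sketch (assuming a PySpark DataFrame named df with the test_amount column above; the grouping columns are omitted and the precision/scale are illustrative) that casts to a decimal before summing:
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# Cast the double column to a fixed-precision decimal so the aggregate
# is not subject to floating-point rounding.
result = df.withColumn("amount_dec", F.col("test_amount").cast(DecimalType(38, 7)))
result.agg(F.sum("amount_dec").alias("total_amount")).show(truncate=False)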
Hello, I'm fairly new to Spark and I need help with this little exercise. I want to find certain values in another dataframe, but if those values aren't present I want to reduce the length of each value until I find a match. I have these dataframes:
----------------
|values_to_find|
----------------
| ABCDE |
| CBDEA |
| ACDEA |
| EACBA |
----------------
------------------
| list | Id |
------------------
| EAC | 1 |
| ACDE | 2 |
| CBDEA | 3 |
| ABC | 4 |
------------------
And I expect the next output:
--------------------------------
| Id | list | values_to_find |
--------------------------------
| 4 | ABC | ABCDE |
| 3 | CBDEA | CBDEA |
| 2 | ACDE | ACDEA |
| 1 | EAC | EACBA |
--------------------------------
For example, ABCDE isn't present, so I reduce its length by one (ABCD); again it doesn't match anything, so I reduce it again and this time I get ABC, which matches, so I use that value to join and form a new dataframe. There is no need to worry about duplicate values when reducing the length, but I need to find the exact match. Also, I would like to avoid using a UDF if possible.
I'm using a foreach to get every value in the first dataframe, and I can do a substring there (if there is no match), but I'm not sure how to look up these values in the 2nd dataframe. What's the best way to do it? I've seen tons of UDFs that could do the trick, but I want to avoid that as stated before.
df1.foreach { row =>
  val candidate = row.getString(0).substring(0, 4)
  // ...but how do I look candidate up in the second dataframe?
}
Edit: Those dataframes are examples; I have many more values. The solution should be dynamic... iterate over some values and find their match in another dataframe, with the catch that I need to reduce their length if they are not present.
Thanks for the help!
You can load the dataframe as a temporary view and write the SQL. Is this the first time you are implementing the above scenario in Spark, or did you already implement it in previous code (I mean, before Spark, had you implemented it in the legacy system)? With Spark you have the freedom to write a UDF in Scala or use SQL. Sorry, I don't have a solution handy, so I'm just giving a pointer.
The following will help you.
val dataDF1 = Seq((4,"ABC"),(3,"CBDEA"),(2,"ACDE"),(1,"EAC")).toDF("Id","list")
val dataDF2 = Seq(("ABCDE"),("CBDEA"),("ACDEA"),("EACBA")).toDF("compare")
dataDF1.createOrReplaceTempView("table1")
dataDF2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 on table1.list like concat('%',SUBSTRING(table2.compare,1,3),'%')").show()
Output:
+---+-----+-------+
| Id| list|compare|
+---+-----+-------+
| 4| ABC| ABCDE|
| 3|CBDEA| CBDEA|
| 2| ACDE| ACDEA|
| 1| EAC| EACBA|
+---+-----+-------+
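Note that SUBSTRING(table2.compare,1,3) hard-codes a length of 3, which only happens to fit the sample data. A sketch of a fully dynamic variant (reusing dataDF1 and dataDF2 from above, and assuming spark.implicits._ is in scope as in spark-shell) joins wherever a value starts with a list entry and keeps the longest such prefix per value:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// compare.startsWith(list) means some prefix of compare equals list;
// rank the matches so only the longest list entry per compare survives.
val joined = dataDF2.join(dataDF1, $"compare".startsWith($"list"))
val ranked = joined.withColumn("rn",
  row_number().over(Window.partitionBy($"compare").orderBy(length($"list").desc)))
ranked.filter($"rn" === 1).drop("rn").show()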
I want to do a single query that outputs an array of arrays of table rows. Think along the lines of <table><rowgroup><tr><tr><tr><rowgroup><tr><tr>. Is SQL capable of this? (specifically, as implemented in MariaDB, though migration to AWS RDS might occur one day)
The GROUP BY statement alone does not do this, it creates one row per group.
Here's an example of what I'm thinking of…
SELECT * FROM memes;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| puppies | gif |
| cats | jpeg |
| doggos | mp4 |
| horses | gif |
| chickens | gif |
| ducks | jpeg |
+------------+----------+
SELECT * FROM memes GROUP BY file_ext WITHOUT COLLAPSING GROUPS;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| cats | jpeg |
| ducks | jpeg |
+------------+----------+
| puppies | gif |
| horses | gif |
| chickens | gif |
+------------+----------+
| doggos | mp4 |
+------------+----------+
I've been using MySQL for ~20 years and have not come across this functionality before but maybe I've just been looking in the wrong place ¯\_(ツ)_/¯
I haven't seen an array rendering such as the one you want, but you can simulate it with multiple GROUP BY / GROUP_CONCAT() clauses.
For example:
select concat('[', group_concat(g), ']') as a
from (
  select concat('[', group_concat(file_name), ']') as g
  from memes
  group by file_ext
) x
Result:
a
---------------------------------------------------------
[[puppies,horses,chickens],[kittens,cats,ducks],[doggos]]
See running example at DB Fiddle.
You can tweak the delimiters, such as the comma, '[', and ']'.
SELECT ... ORDER BY file_ext will come close to your second output.
Using GROUP BY ... WITH ROLLUP would let you do subtotals under each group, which is not what you wanted either, but it would give you extra lines where you want the breaks.
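A quick sketch of both ideas against the memes table above (the rollup aggregate shown is just a row count):
-- Ordering alone keeps every row but clusters each file_ext together:
SELECT file_name, file_ext
FROM memes
ORDER BY file_ext;

-- WITH ROLLUP adds a summary row after each file_ext group,
-- marking where the breaks would fall:
SELECT file_ext, COUNT(*) AS files
FROM memes
GROUP BY file_ext WITH ROLLUP;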
I am trying to include null values in collect_list while using PySpark; however, the collect_list operation excludes nulls. I have looked into the following post: Pyspark - Retain null values when using collect_list. However, the answer given is not what I am looking for.
I have a dataframe df like this.
| id | family | date |
----------------------------
| 1 | Prod | null |
| 2 | Dev | 2019-02-02 |
| 3 | Prod | 2017-03-08 |
Here's my code so far:
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
This gives me an output like this:
| family | date |
-----------------------
| Prod |[2017-03-08]|
| Dev |[2019-02-02]|
What I really want is as follows:
| family | date |
-----------------------------
| Prod |[null, 2017-03-08]|
| Dev |[2019-02-02] |
Can someone please help me with this? Thank you!
A possible workaround for this could be to replace all null values with another value. (Perhaps not the best way to do this, but it's a solution nonetheless.)
import pyspark.sql.functions as f

df = df.na.fill("my_null")  # Replace null with "my_null"
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
Should give you:
| family | date |
-----------------------------
| Prod |[my_null, 2017-03-08]|
| Dev |[2019-02-02] |
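If the actual nulls need to survive in the array rather than a placeholder string, a sketch of another common workaround (same df and column names as above) is to collect structs, which collect_list does not drop, and then pull the field back out:
import pyspark.sql.functions as f

# struct(null) is itself non-null, so collect_list keeps it;
# extracting the field afterwards yields an array that still contains nulls.
result = (
    df.groupby("family")
      .agg(f.collect_list(f.struct("date")).alias("tmp"))
      .withColumn("entry_date", f.col("tmp.date"))
      .drop("tmp")
)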
I have a two-column Spark Scala DataFrame. The first column holds a single variable; the second one is an array of letters. What I am trying to do is find a way to code a tally (without using a for loop) of the variables in the arrays.
For example, this is what I have (I am sorry it's not that neat, this is my first Stack post). You have 5 computers, and each person is represented by a letter. I want to find a way to find out how many computers a person (A, B, C, D, E) has used.
+-----------------+--------------+
| id | [person] |
+-----------------+--------------+
| Computer 1 | [A,B,C,D] |
| Computer 2 | [A,B] |
| Computer 3 | [A,B,E] |
| Computer 4 | [A,C,D] |
| Computer 5 | [A,B,C,D,E] |
+-----------------+--------------+
What I would like to code up or asking if anyone has a solution would be something like this:
+---------+-----------+
| Person | [Count] |
+---------+-----------+
| A | 5 |
| B | 4 |
| C | 3 |
| D | 3 |
| E | 2 |
+---------+-----------+
Somehow count the people who are in arrays within the dataframe.
There's a function called explode which will expand the arrays into one row for each item:
+------------+--------+
| id         | person |
+------------+--------+
| Computer 1 | A      |
| Computer 1 | B      |
| Computer 1 | C      |
| Computer 1 | D      |
| ...        | ...    |
+------------+--------+
Then you can group by the person and count. Something like:
import org.apache.spark.sql.functions.explode

val df2 = df.select(explode($"person").as("person"))
val result = df2.groupBy($"person").count
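To get the tally in the order shown above, sorting by the count before displaying is enough (a small usage note for the result dataframe just built):
// Highest usage first, matching the expected output table.
result.orderBy($"count".desc).show()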