How to filter a column in a data frame by the regex value of another column in the same data frame in PySpark

I am trying to filter a column in a data frame so that it matches the regex pattern given in another column of the same data frame.
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+--------------------------------------+-------------+---------------------------------------------+
|query                                 |question_type|regex_patt                                   |
+--------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise        |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what movie features Tom Cruise        |artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia |artist_song  |(who|what).*(sing|sang|perform)              |
+--------------------------------------+-------------+---------------------------------------------+
I want to prune the data frame so that I only keep rows whose query matches the regex_patt column value.
The final result should look like this:
+--------------------------------------+-------------+---------------------------------------------+
|query                                 |question_type|regex_patt                                   |
+--------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise        |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia |artist_song  |(who|what).*(sing|sang|perform)              |
+--------------------------------------+-------------+---------------------------------------------+
I was thinking of
df.filter(column('query').rlike('regex_patt'))
But rlike only accepts a literal regex string, not a column.
Now the question is: how can I filter the "query" column based on the regex value of the "regex_patt" column?

You could try this. Using an SQL expression allows you to pass columns as both the string and the pattern of regexp_extract.
from pyspark.sql import functions as F

# group 0 is the whole match; an empty string means the query did not match its regex_patt
df.withColumn("query1", F.expr("regexp_extract(query, regex_patt, 0)")).filter(F.col("query1") != '').drop("query1").show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
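If you only need the filter and not the helper column, the comparison can also be handed to Spark SQL directly, since the rlike operator inside an expr string accepts a column reference on both sides. A minimal sketch against the df defined above:
from pyspark.sql import functions as F

# keep only rows whose query matches the regex stored in its own regex_patt column
df.filter(F.expr("query rlike regex_patt")).show(truncate=False)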

Related

pyspark dataframe: add a new indicator column with random sampling

I have a Spark dataframe with the following schema:
StructType(List(StructField(email_address,StringType,true), StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into control and test groups. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way that just adds a column instead, like the following:
+--------------------+----------+---------+
|       email_address|segment_id|group    |
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treatment|
|   xxxxxxx#gmail.com|       1.6|control  |
+--------------------+----------+---------+
Any help is appreciated; I am new to this world.
UPDATE:
I don't want to split the dataframe into two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose instead of two groups I want a single control group and two treatment groups:
+--------------------+----------+--------+
|       email_address|segment_id|group   |
+--------------------+----------+--------+
|xxxxxxxxxx#gmail.com|       1.1|treat_1 |
|   xxxxxxx#gmail.com|       1.6|control |
|     xxxxx#gmail.com|       1.6|treat_2 |
+--------------------+----------+--------+
You can split the Spark dataframe using randomSplit like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
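If you want a single indicator column instead of two dataframes (as the update asks), a hedged sketch is to draw one random number per row and bucket it; the helper column name _r and the 50/25/25 weights below are only illustrative:
from pyspark.sql import functions as F

# one random draw per row, materialized so every bucket sees the same value
df_with_rand = df_segment.withColumn("_r", F.rand(seed=0))

# two groups: roughly 50% control, 50% treatment
df_two = df_with_rand.withColumn(
    "group", F.when(F.col("_r") < 0.5, "control").otherwise("treatment")
).drop("_r")

# three groups: roughly 50% control, 25% treat_1, 25% treat_2
df_three = df_with_rand.withColumn(
    "group",
    F.when(F.col("_r") < 0.5, "control")
     .when(F.col("_r") < 0.75, "treat_1")
     .otherwise("treat_2"),
).drop("_r")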

Need some data after grouping on key in Spark/Scala

I have a problem in Spark (v2.2.2) / Scala (v2.11.8); I mostly work in the Scala/Spark functional style.
I have a list of persons with a rented date like below.
These are CSV files which I will convert into Parquet and read as dataframes.
Table: Person
+----+-----------+
|  ID|report_date|
+----+-----------+
| 123| 2011-09-25|
| 111| 2017-08-23|
| 222| 2018-09-30|
| 333| 2020-09-30|
| 444| 2019-09-30|
+----+-----------+
I want to find out the start_date of the address for the period the person rented it, by grouping on ID.
Table: Address
+----+----------+----------+
|  ID|start_date|close_date|
+----+----------+----------+
| 123|2008-09-23|2009-09-23|
| 123|2009-09-24|2010-09-23|
| 123|2010-09-24|2011-09-23|
| 123|2011-09-30|2012-09-23|
| 123|2012-09-24|      null|
| 111|2013-09-23|2014-09-23|
| 111|2014-09-24|2015-09-23|
| 111|2015-09-24|2016-09-23|
| 111|2016-09-24|2017-09-23|
| 111|2017-09-24|      null|
| 222|2018-09-24|      null|
+----+----------+----------+
Example: for ID 123 the rented date is 2011-09-20, which in the Address table falls in the period (start_date, close_date) = (2010-09-24, 2011-09-23) (row 3 for that ID). From here I have to fetch the start_date 2010-09-24.
I have to do this on the entire dataset by joining the tables, i.e. fetch start_date from the Address table into the Person table.
I also need to handle the case where close_date is null.
Sometimes the rented date will not fall into any of the periods; in that case we need to take the row where rented_date < close_date.
Thanks in advance.
First of all, regarding:
"I have a list of persons with a rented date like below. These are CSV files which I will convert into Parquet and read as dataframes."
There is no need to convert them; you can read the CSV directly with Spark:
spark.read.csv("path")
// or equivalently
spark.read.format("csv").load("path")
I am not sure what your expectation for the null fields is, so I would filter them out for now (assuming the raw address dataframe is called dfAddress):
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull)
Of course you now need to join them together, and since the data in Address is the relevant one I would do a left join.
val joinedDf = dfAddressNotNull.join(dfPerson, Seq("ID"), "left")
Now you have Addresses and Persons combined.
If you now filter like this
joinedDf.filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
you should have something like what you want to achieve.
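Putting the steps together, here is a hedged PySpark sketch of the join-and-filter that also keeps the rows with a null close_date by treating them as still open (the answer above is Scala; dfPerson and dfAddress are assumed to hold the two tables, with the dates as yyyy-MM-dd strings so they compare lexicographically):
from pyspark.sql import functions as F

# a null close_date means the period is still open, so substitute a far-future date
addr = dfAddress.withColumn("close_eff", F.coalesce(F.col("close_date"), F.lit("9999-12-31")))

result = (
    dfPerson.join(addr, on="ID", how="left")
            .filter((F.col("report_date") >= F.col("start_date")) &
                    (F.col("report_date") <= F.col("close_eff")))
            .select("ID", "report_date", "start_date")
)
result.show()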

Dataframe search from list and all elements found in a new column in Scala

I have a df and I need to search whether each row contains any elements from a list of keywords. If yes, I need to put all the found keywords, separated by #, in a new column called foundornot.
My df is like
utid | description
123 | my name is harry and I live in newyork
234 | my neighbour is daniel and he plays hockey
The list is quite big, something like list = {harry, daniel, hockey, newyork}.
the output should be like
utid | description | foundornot
123 | my name is harry and I live in newyork | harry#newyork
234 | my neighbour is daniel and he plays hockey | daniel#hockey
The list is quite big, around 20k keywords. Also, in case nothing is found, print NF.
Inside a udf function you can check, for each row of the description column, whether any element of the list exists in it, and return the matching elements as a #-separated string, or else the string NF:
val list = List("harry","daniel","hockey","newyork")
import org.apache.spark.sql.functions._
def checkUdf = udf((strCol: String) => if (list.exists(strCol.contains)) list.filter(strCol.contains(_)).mkString("#") else "NF")
df.withColumn("foundornot", checkUdf(col("description"))).show(false)
which should give you
+----+------------------------------------------+-------------+
|utid|description |foundornot |
+----+------------------------------------------+-------------+
|123 |my name is harry and i live in newyork |harry#newyork|
|234 |my neighbour is daniel and he plays hockey|daniel#hockey|
+----+------------------------------------------+-------------+
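Since the real list is around 20k keywords, a hedged alternative is to broadcast the keyword set once and look up the words of each description in it, rather than scanning all 20k keywords per row. This PySpark sketch matches whole words (unlike the substring contains above); spark, df and keywords are assumed from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

keywords = ["harry", "daniel", "hockey", "newyork"]   # ~20k entries in practice
kw_bc = spark.sparkContext.broadcast(set(keywords))   # ship the set to the executors once

@F.udf(returnType=StringType())
def found_or_not(description):
    # O(words per row) set lookups instead of O(keywords) contains checks
    hits = [w for w in (description or "").lower().split() if w in kw_bc.value]
    return "#".join(hits) if hits else "NF"

df.withColumn("foundornot", found_or_not(F.col("description"))).show(truncate=False)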

Spark: explode multiple columns of a row into multiple rows

I have a problem with converting one row with 3 columns into 3 rows.
For example:
ID | String   | colA | colB | colC
1  | sometext | 1    | 2    | 3
I need to convert it into:
ID | String   | resultColumn
1  | sometext | 1
1  | sometext | 2
1  | sometext | 3
I just have a DataFrame that matches the first schema (table):
val df: DataFrame
Note: I can do it using RDDs, but is there another way? Thanks.
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you further want to keep the column names, you can use a trick that consists in creating a column of arrays, where each inner array contains the value and the column name. First create your expression:
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you cannot use a generator on nested expressions, you need 2 selects here):
df.select($"ID", $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose though; there might be a more elegant way to do this.
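For what it's worth, a hedged, less verbose sketch of the same unpivot uses the SQL stack generator, which emits one row per (name, value) pair in a single select; the output column names columnName/resultColumn are chosen to match the answer above, and the same selectExpr string works from Scala. Shown here in PySpark:
# stack(n, name1, value1, name2, value2, ...) produces n rows per input row
df.selectExpr(
    "ID",
    "String",
    "stack(3, 'colA', colA, 'colB', colB, 'colC', colC) as (columnName, resultColumn)"
).show()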

Sane way to store different data types within the same column in Postgres?

I'm currently attempting to modify an existing API that interacts with a PostgreSQL database. Long story short, it essentially stores descriptors/metadata to determine where an actual 'asset' (typically a file of some sort) is stored on the server's hard disk.
Currently, it's possible to 'tag' these 'assets' with any number of arbitrary, not predefined, key-value pairs (e.g. uploadedBy, addedOn, assetType, etc.). These tags are stored in a separate table with a structure similar to the following:
+---------------+----------------+-------------+
|assetid (text) | tagid(integer) | value(text) |
|---------------+----------------+-------------|
|someStringValue| 1234 | someValue |
|---------------+----------------+-------------|
|aDiffStringKey | 1235 | a username |
|---------------+----------------+-------------|
|aDiffStrKey | 1236 | Nov 5, 1605 |
+---------------+----------------+-------------+
assetid and tagid are foreign keys from other tables. Think of the assetid representing a file and the tagid/value pair is a map of descriptors.
Right now, the API (which is in Java) creates all these key-value pairs as a Map object. This includes things like timestamps/dates. What we'd like to do is to somehow be able to store different types of data for the value in the key-value pair. Or at least, store it differently within the database, so that if we needed to, we could run queries checking date ranges and the like on these tags. However, if they're stored as text items in the db, then we'd have to a) know that this is actually a date/time/timestamp item, and b) convert it into something that we could actually run such a query on.
There is only one idea I could think of thus far without completely changing the layout of the db.
It is to expand the assettag table (shown above) with additional columns for the various types (numeric, text, timestamp), allow them to be null, and then, on insert, check the corresponding 'key' to figure out what type of data it really is. However, I can see a lot of problems with that sort of implementation.
Can any PostgreSQL-Ninjas out there offer a suggestion on how to approach this problem? I'm only recently getting thrown back into the deep-end of database interactions, so I admit I'm a bit rusty.
You've basically got two choices:
Option 1: A sparse table
Have one column for each data type, but only use the column that matches the data type you want to store. Of course this leads to most columns being null - a waste of space, but the purists like it because of the strong typing. It's a bit clunky having to check each column for null to figure out which datatype applies. Also, too bad if you actually want to store a null - then you must choose a specific value that "means null" - more clunkiness.
Option 2: Two columns - one for content, one for type
Everything can be expressed as text, so have a text column for the value, and another column (int or text) for the type, so your app code can restore the value into an object of the correct type. The good things are that you don't have lots of nulls, and, importantly, you can easily extend the types beyond SQL data types to application classes by storing their value as JSON and their type as the class name.
I have used option 2 several times in my career and it was always very successful.
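As an illustration of option 2 (not from the original answer), here is a minimal Python sketch of the app-side encode/decode, assuming a text value column plus a text type column and JSON as the serialization; the function names are made up for the example:
import json
from datetime import datetime

# encode: produce (value_text, type_name) for the two columns
def encode_tag_value(value):
    if isinstance(value, datetime):
        return json.dumps(value.isoformat()), "datetime"
    return json.dumps(value), type(value).__name__

# decode: restore the typed object from the two columns
def decode_tag_value(value_text, type_name):
    raw = json.loads(value_text)
    if type_name == "datetime":
        return datetime.fromisoformat(raw)
    return raw  # int, float, bool, str round-trip through JSON as-is

value_text, type_name = encode_tag_value(datetime(2019, 5, 31, 13, 51, 36))
print(decode_tag_value(value_text, type_name))   # 2019-05-31 13:51:36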
Another option, depending on what you're doing, could be to just have one value column but store some JSON around the value.
This could look something like:
{
"type": "datetime",
"value": "2019-05-31 13:51:36"
}
You could even go a step further and use a JSON or XML column type.
I'm not in any way a PostgreSQL ninja, but I think that instead of two columns (one for the value and one for the type) you could look at the hstore data type:
data type for storing sets of key/value pairs within a single
PostgreSQL value. This can be useful in various scenarios, such as
rows with many attributes that are rarely examined, or semi-structured
data. Keys and values are simply text strings.
Of course, you have to check how dates/timestamps convert into and out of this type and see if it is good for you.
You can use 2 different techniques:
If the type can vary for each tagid:
Define a main table with a data-table name and ID for every tagid-assetid combination, plus the actual data tables:
maintable:
+---------------+----------------+-----------------+---------------+
|assetid (text) | tagid(integer) | tablename(text) | table_id(int) |
|---------------+----------------+-----------------+---------------|
|someStringValue| 1234 | tablebool | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStringKey | 1235 | tablefloat | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStrKey | 1236 | tablestring | 123 |
+---------------+----------------+-----------------+---------------+
tablebool
+-------------+-------------+
| id(integer) | value(bool) |
|-------------+-------------|
| 123 | False |
+-------------+-------------+
tablefloat
+-------------+--------------+
| id(integer) | value(float) |
|-------------+--------------|
| 123 | 12.345 |
+-------------+--------------+
tablestring
+-------------+---------------+
| id(integer) | value(string) |
|-------------+---------------|
| 123 | 'text' |
+-------------+---------------+
In case every tagid has a fixed type:
Create a tagid description table:
tag descriptors
+---------------+----------------+-----------------+
|assetid (text) | tagid(integer) | tablename(text) |
|---------------+----------------+-----------------|
|someStringValue| 1234 | tablebool |
|---------------+----------------+-----------------|
|aDiffStringKey | 1235 | tablefloat |
|---------------+----------------+-----------------|
|aDiffStrKey | 1236 | tablestring |
+---------------+----------------+-----------------+
and the corresponding data tables:
tablebool
+-------------+----------------+-------------+
| id(integer) | tagid(integer) | value(bool) |
|-------------+----------------+-------------|
| 123 | 1234 | False |
+-------------+----------------+-------------+
tablefloat
+-------------+----------------+--------------+
| id(integer) | tagid(integer) | value(float) |
|-------------+----------------+--------------|
| 123 | 1235 | 12.345 |
+-------------+----------------+--------------+
tablestring
+-------------+----------------+---------------+
| id(integer) | tagid(integer) | value(string) |
|-------------+----------------+---------------|
| 123 | 1236 | 'text' |
+-------------+----------------+---------------+
All this is just the general idea; you should adapt it to your needs.