Generate a summarizing word based on a set of words - pyspark

I'm very new to NLP, so I have a theoretical question.
Let's say I have the following Spark dataframe:
+--+------------------------------------------+
|id| word_list|
+--+------------------------------------------+
| 1| apple, banana, lime, juice, cherry, peach|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|
| 3| cocoa, coffee, bottle, tea, water, juice|
+--+------------------------------------------+
For each id, I need to extract a generic word that describes the predominant set of semantically similar words in the word_list column. Desired output:
+--+------------------------------------------+----------+
|id| word_list| category|
+--+------------------------------------------+----------+
| 1| apple, banana, lime, juice, cherry, peach| fruit|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|vegetables|
| 3| cocoa, coffee, bottle, tea, water, juice| beverages|
+--+------------------------------------------+----------+
Is there any unsupervised NLP algorithm that can be used to get the desired output?
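One simple unsupervised direction is to embed the words with pre-trained vectors and pick whichever candidate category label sits closest to the list's centroid. A minimal sketch, assuming a small hand-picked candidate list and gensim's downloadable GloVe vectors (both are assumptions, not part of the question); on the Spark side this could then be applied with a UDF after splitting word_list:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")        # pre-trained word vectors (assumption)
candidates = ["fruit", "vegetables", "beverages"]  # hypothetical label set

def pick_category(words):
    known = [w for w in words if w in model]       # drop out-of-vocabulary words
    if not known:
        return None
    # n_similarity compares the centroid of one word set with another
    return max(candidates, key=lambda c: model.n_similarity(known, [c]))

pick_category(["apple", "banana", "lime", "juice", "cherry", "peach"])  # likely 'fruit'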

PySpark convert Dataframe to Dictionary

I got the following DataFrame:
>>> df.show(50)
+--------------------+-------------+----------------+----+
| User Hash ID| Word|sum(Total Count)|rank|
+--------------------+-------------+----------------+----+
|00095808cdc611fb5...| errors| 5| 1|
|00095808cdc611fb5...| text| 3| 2|
|00095808cdc611fb5...| information| 3| 3|
|00095808cdc611fb5...| department| 2| 4|
|00095808cdc611fb5...| error| 2| 5|
|00095808cdc611fb5...| data| 2| 6|
|00095808cdc611fb5...| web| 2| 7|
|00095808cdc611fb5...| list| 2| 8|
|00095808cdc611fb5...| recognition| 2| 9|
|00095808cdc611fb5...| pipeline| 2| 10|
|000ac87bf9c1623ee...|consciousness| 14| 1|
|000ac87bf9c1623ee...| book| 3| 2|
|000ac87bf9c1623ee...| place| 2| 3|
|000ac87bf9c1623ee...| mystery| 2| 4|
|000ac87bf9c1623ee...| mental| 2| 5|
|000ac87bf9c1623ee...| flanagan| 2| 6|
|000ac87bf9c1623ee...| account| 2| 7|
|000ac87bf9c1623ee...| world| 2| 8|
|000ac87bf9c1623ee...| problem| 2| 9|
|000ac87bf9c1623ee...| theory| 2| 10|
This shows, for each user, the 10 most frequent words they read.
I would like to create a dictionary, which then can be saved to a file, with the following format:
User : <top 1 word>, <top 2 word> .... <top 10 word>
To achieve this, I thought it might be more efficient to cut down the df as much as possible, before converting it. Thus, I tried:
>>> df.groupBy("User Hash ID").agg(collect_list("Word")).show(20)
+--------------------+--------------------+
| User Hash ID| collect_list(Word)|
+--------------------+--------------------+
|00095808cdc611fb5...|[errors, text, in...|
|000ac87bf9c1623ee...|[consciousness, b...|
|0038ccf6e16121e7c...|[potentials, orga...|
|0042bfbafc6646f47...|[fuel, car, consu...|
|00a19396b7bb52e40...|[face, recognitio...|
|00cec95a2c007b650...|[force, energy, m...|
|00df9406cbab4575e...|[food, history, w...|
|00e6e2c361f477e1c...|[image, based, al...|
|01636d715de360576...|[functional, lang...|
|01a778c390e44a8c3...|[trna, genes, pro...|
|01ab9ade07743d66b...|[packaging, car, ...|
|01bdceea066ec01c6...|[anthropology, de...|
|020c643162f2d581b...|[laser, electron,...|
|0211604d339d0b3db...|[food, school, ve...|
|0211e8f09720c7f47...|[privacy, securit...|
|021435b2c4523dd31...|[life, rna, origi...|
|0239620aa740f1514...|[method, image, d...|
|023ad5d85a948edfc...|[web, user, servi...|
|02416836b01461574...|[parts, based, ad...|
|0290152add79ae1d8...|[data, score, de,...|
+--------------------+--------------------+
From here, it should be more straightforward to generate that dictionary. However, I cannot be sure that by using this agg function I am guaranteed that the words are in the correct order. That is why I am hesitant and wanted to get some feedback on possibly better options.
Based on the answers provided here - collect_list by preserving order based on another variable - you can write the query below to make sure you have the top 5 in the correct order:
import pyspark.sql.functions as F

grouped_df = (
    df.groupBy("User Hash ID")
    .agg(F.sort_array(F.collect_list(F.struct("rank", "Word"))).alias("collected_list"))
    .withColumn("sorted_list", F.slice(F.col("collected_list.Word"), start=1, length=5))
    .drop("collected_list")
)
grouped_df.show(truncate=False)
First of all, if you go from a dataframe to a dictionary, you may face some memory issues, as you will bring all the content of the dataframe to your driver (a dictionary is a Python object, not a Spark object).
You are not that far from a working solution. I'd do it this way:
from pyspark.sql import functions as F

df.groupBy("User Hash ID").agg(
    F.collect_list(F.struct("Word", "sum(Total Count)", "rank")).alias("data")
)
This will create a data column containing your 3 fields, aggregated by user id.
Then, to go from a dataframe to a dict object, you can use, for example, toJSON or the Row method asDict().
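For example, a minimal sketch of that last step, assuming the aggregated frame above is assigned to a (hypothetical) variable named grouped and that collecting every user onto the driver fits in memory:
rows = grouped.collect()   # pulls everything onto the driver
top_words = {
    r["User Hash ID"]: [d["Word"] for d in sorted(r["data"], key=lambda d: d["rank"])]
    for r in rows
}
# e.g. top_words["00095808cdc611fb5..."] -> ['errors', 'text', 'information', ...]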

PySpark - Struggling to arrange the data by a specific format

I am working on outputting total deduplicated counts from a pre-aggregated frame as follows.
I currently have a data frame that displays like so. It's the initial structure and the point that I have gotten to by filtering out unneeded columns.
+---+------+
| ID|Source|
+---+------+
|101| Grape|
|101|Flower|
|102|   Bee|
|103| Peach|
|105|Flower|
+---+------+
We can see from the example above that 101 is found with both Grape and Flower. I would like to rearrange the format so that the distinct string values from the "Source" column become their own columns, as from there I can perform a groupBy for a specific arrangement of yes's and no's, like so:
+---+-----+------+---+-----+
| ID|Grape|Flower|Bee|Peach|
+---+-----+------+---+-----+
|101|  Yes|   Yes| No|   No|
|102|   No|    No|Yes|   No|
|103|   No|    No| No|  Yes|
+---+-----+------+---+-----+
I agree that creating this manually, as in the example above, would work, but I am working with 100M+ rows and need something more succinct.
What I've extracted so far is the list of distinct Source values, arranged into a Python list:
import re

dedupeTableColumnNames = dedupeTable.select('SOURCE').distinct().collect()
dedupeTableColumnNamesCleaned = re.findall(r"'([^']*)'", str(dedupeTableColumnNames))
That's just a pivot:
df.groupBy("id").pivot("source").count().show()
+---+------+------+------+------+
| id|   Bee|Flower| Grape| Peach|
+---+------+------+------+------+
|103|  null|  null|  null|     1|
|105|  null|     1|  null|  null|
|101|  null|     1|     1|  null|
|102|     1|  null|  null|  null|
+---+------+------+------+------+
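If the literal Yes/No labels from the desired layout are needed, the null/1 counts can be remapped afterwards. A minimal sketch (the variable names are illustrative):
from pyspark.sql import functions as F

pivoted = df.groupBy("id").pivot("source").count()
# turn "has at least one count" into Yes/No for every pivoted column
result = pivoted.select(
    "id",
    *[F.when(F.col(c).isNotNull(), "Yes").otherwise("No").alias(c)
      for c in pivoted.columns if c != "id"]
)
result.show()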

Spark Scala - Bitwise operation in filter

I have a source dataset with columns col1 and col2, where the col2 values are aggregated by a bitwise OR operation. I need to apply a filter on the col2 values to select records whose bits are on for 8, 4, or 2.
Initial source raw data:
val SourceRawData = Seq(
  ("Flag1", 2), ("Flag1", 4), ("Flag1", 8),
  ("Flag2", 8), ("Flag2", 16), ("Flag2", 32),
  ("Flag3", 2), ("Flag4", 32),
  ("Flag5", 2), ("Flag5", 8)
).toDF("col1", "col2")
SourceRawData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 2|
|Flag1| 4|
|Flag1| 8|
|Flag2| 8|
|Flag2| 16|
|Flag2| 32|
|Flag3| 2|
|Flag4| 32|
|Flag5| 2|
|Flag5| 8|
+-----+----+
Below is the aggregated source data based on the SourceRawData above, after collapsing the col1 values to a single row per col1 value; it is provided by another team, and the col2 values are aggregated by a bitwise OR operation (for example, Flag1's 2 | 4 | 8 = 14 and Flag2's 8 | 16 | 32 = 56). Note that I am providing the output here, not the actual aggregation logic.
val AggregatedSourceData = Seq(
  ("Flag1", 14L), ("Flag2", 56L), ("Flag3", 2L), ("Flag4", 32L), ("Flag5", 10L)
).toDF("col1", "col2")
AggregatedSourceData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag4| 32|
|Flag5| 10|
+-----+----+
Now I need to apply a filter on the aggregated dataset above to get the rows whose col2 values have any of the (8, 4, 2) bits on, as per the original source raw data values.
Expected output:
+-----+----+
|Col1 |Col2|
+-----+----+
|Flag1|14 |
|Flag2|56 |
|Flag3|2 |
|Flag5|10 |
+-----+----+
I tried something like the code below and it seems to produce the expected output, but I am unable to understand how it works. Is this the correct approach? If so, how does it work? (I am not that knowledgeable about bitwise operations, so I am looking for expert help to understand, please.)
val myfilter: Long = 2 | 4 | 8
AggregatedSourceData.filter($"col2".bitwiseAND(myfilter) =!= 0).show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag5| 10|
+-----+----+
I think you do not need bitwiseAND to filter; instead, just use contains/in with a set of the decimal representations of the bit patterns you want, or == to the decimal representation of the single bit pattern you want.
Also, if you try your existing calculations without Scala or Spark, you will see where you understood things wrong, e.g. use this:
https://www.rapidtables.com/calc/math/binary-calculator.html
You will find you defined your filter "wrong".
18 & 18 is 18
18 | 2 is 18
In your dataset the flag column will only have one value per row, so just filter the flag column for values that are in the set of numbers you want:
$"flag" === 18, or
$"flag".isin(18, 2, 20), for example
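For intuition about why the original bitwiseAND filter produces the output shown: myfilter is 2 | 4 | 8 = 14 (binary 1110), and value & mask is non-zero exactly when the value has at least one of those bits set. A plain-Python sketch of the same arithmetic (an illustration only, not Spark code):
mask = 2 | 4 | 8                    # 0b001110 == 14
for value in [14, 56, 2, 32, 10]:   # the aggregated col2 values
    keep = (value & mask) != 0
    print(value, format(value, "06b"), "keep" if keep else "drop")
# Only 32 (0b100000) shares no bits with 0b001110, so Flag4 is the only row dropped.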

Closest Date looking from One Column to another in PySpark Dataframe

I have a PySpark dataframe where the price of a commodity is given, but there is no data on when the commodity was bought; I just have a range of one year.
+---------+------------+----------------+----------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|
+---------+------------+----------------+----------------+
| Apple| 5| 2020-07-04| 2019-07-03|
| Banana| 3| 2020-07-03| 2019-07-02|
| Banana| 4| 2019-10-02| 2018-10-01|
| Apple| 6| 2020-01-20| 2019-01-19|
| Banana| 3.5| 2019-08-17| 2018-08-16|
+---------+------------+----------------+----------------+
I have another pyspark dataframe where I can see the market price and date of all commodities.
+----------+----------+------------+
| Date| Commodity|Market Price|
+----------+----------+------------+
|2020-07-01| Apple| 3|
|2020-07-01| Banana| 3|
|2020-07-02| Apple| 4|
|2020-07-02| Banana| 2.5|
|2020-07-03| Apple| 7|
|2020-07-03| Banana| 4|
+----------+----------+------------+
I want to find the date closest to the upper limit (UL) of the date range on which the Market Price (MP) of that commodity was less than or equal to the Buying Price (BP).
Expected output (for the top two rows):
+---------+------------+----------------+----------------+--------------------------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
+---------+------------+----------------+----------------+--------------------------------+
| Apple| 5| 2020-07-04| 2019-07-03| 2020-07-02|
| Banana| 3| 2020-07-03| 2019-07-02| 2020-07-02|
+---------+------------+----------------+----------------+--------------------------------+
Even though Apple was even cheaper on 2020-07-01 ($3), 2020-07-02 is the first date, going backwards from the Upper Limit (UL), on which MP <= BP, so I selected 2020-07-02.
How can I look backwards like this to fill in the probable buying date?
Try this with a conditional join and a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# first dataframe shown is df1, the market-price dataframe is df2
w = Window().partitionBy("Commodity")

df1.join(df2.withColumnRenamed("Commodity", "Commodity1"),
         F.expr("`Market Price` <= BuyingPrice and Date < Date_Upper_limit and Commodity == Commodity1"))\
   .drop("Market Price", "Commodity1")\
   .withColumn("max", F.max("Date").over(w))\
   .filter("max == Date").drop("max")\
   .withColumnRenamed("Date", "Closest Date to UL when MP <= BP")\
   .show()
#+---------+-----------+----------------+----------------+--------------------------------+
#|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
#+---------+-----------+----------------+----------------+--------------------------------+
#| Banana| 3.0| 2020-07-03| 2019-07-02| 2020-07-02|
#| Apple| 5.0| 2020-07-04| 2019-07-03| 2020-07-02|
#+---------+-----------+----------------+----------------+--------------------------------+

How to remove rows having irrelevant values

I have a dataframe and I want to remove two rows that have irrelevant values:
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
| url| address| name| online_order| book_table| rate| votes|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|https://www.zomat...|27th Main, 2nd Se...|Rock View Family ...| Yes| No| 3.3/5| 8|
|https://www.zomat...|1152, 22nd Cross,...| OMY Grill| Yes| No| 3.8/5| 34|
|xperience this pl...| ('Rated 3.0'| 'RATED\n Yummy ...| I got choco chip...|strangely they di...| ('Rated 3.0'|" """"RATED\n Th...|
|ing new to Bangalore|Chip����\x83����\...| ����\x83����\x83...| I was really hap...| Service was quic...| ('Rated 4.0'| 'RATED\n Visite...|
|https://www.zomat...|1086/A, Twin Tuli...| Wings Mama| Yes| No| null| 0|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
and my dataframe should look like this after removing the rows:
+--------------------+--------------------+--------------------+------------+----------+-----+-----+
| url| address| name|online_order|book_table| rate|votes|
+--------------------+--------------------+--------------------+------------+----------+-----+-----+
|https://www.zomat...|27th Main, 2nd Se...|Rock View Family ...| Yes| No|3.3/5| 8|
|https://www.zomat...|1152, 22nd Cross,...| OMY Grill| Yes| No|3.8/5| 34|
|https://www.zomat...|1086/A, Twin Tuli...| Wings Mama| Yes| No| null| 0|
+--------------------+--------------------+--------------------+------------+----------+-----+-----+
The best way is to filter out the rows using a regular expression:
expr = "^http.*"
df = df.filter(df["url"].rlike(expr))
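Equivalently, a simple prefix check avoids the regex (a sketch; it assumes only the malformed rows have a url that does not start with "http"):
from pyspark.sql import functions as F

df = df.filter(F.col("url").startswith("http"))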