How to remove row having different value in rows

How to remove row having different value in rows - pyspark

I have a dataframe and I want to remove two rows having not relevant value in my dataframe
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
| url| address| name| online_order| book_table| rate| votes|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
|https://www.zomat...|27th Main, 2nd Se...|Rock View Family ...| Yes| No| 3.3/5| 8|
|https://www.zomat...|1152, 22nd Cross,...| OMY Grill| Yes| No| 3.8/5| 34|
|xperience this pl...| ('Rated 3.0'| 'RATED\n Yummy ...| I got choco chip...|strangely they di...| ('Rated 3.0'|" """"RATED\n Th...|
|ing new to Bangalore|Chip����\x83����\...| ����\x83����\x83...| I was really hap...| Service was quic...| ('Rated 4.0'| 'RATED\n Visite...|
|https://www.zomat...|1086/A, Twin Tuli...| Wings Mama| Yes| No| null| 0|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+
and my dataframe is look like this after removing the rows
+--------------------+--------------------+--------------------+------------+----------+-----+-----+
| url| address| name|online_order|book_table| rate|votes|
+--------------------+--------------------+--------------------+------------+----------+-----+-----+
|https://www.zomat...|27th Main, 2nd Se...|Rock View Family ...| Yes| No|3.3/5| 8|
|https://www.zomat...|1152, 22nd Cross,...| OMY Grill| Yes| No|3.8/5| 34|
|https://www.zomat...|1086/A, Twin Tuli...| Wings Mama| Yes| No| null| 0|
+--------------------+--------------------+--------------------+------------+----------+-----+-----+

The best way to filter out the row using regular expression
expr = "^http.*"
df=df.filter(df["url"].rlike(expr))

Related

PySpark convert Dataframe to Dictionary

I got the following DataFrame:
>>> df.show(50)
+--------------------+-------------+----------------+----+
| User Hash ID| Word|sum(Total Count)|rank|
+--------------------+-------------+----------------+----+
|00095808cdc611fb5...| errors| 5| 1|
|00095808cdc611fb5...| text| 3| 2|
|00095808cdc611fb5...| information| 3| 3|
|00095808cdc611fb5...| department| 2| 4|
|00095808cdc611fb5...| error| 2| 5|
|00095808cdc611fb5...| data| 2| 6|
|00095808cdc611fb5...| web| 2| 7|
|00095808cdc611fb5...| list| 2| 8|
|00095808cdc611fb5...| recognition| 2| 9|
|00095808cdc611fb5...| pipeline| 2| 10|
|000ac87bf9c1623ee...|consciousness| 14| 1|
|000ac87bf9c1623ee...| book| 3| 2|
|000ac87bf9c1623ee...| place| 2| 3|
|000ac87bf9c1623ee...| mystery| 2| 4|
|000ac87bf9c1623ee...| mental| 2| 5|
|000ac87bf9c1623ee...| flanagan| 2| 6|
|000ac87bf9c1623ee...| account| 2| 7|
|000ac87bf9c1623ee...| world| 2| 8|
|000ac87bf9c1623ee...| problem| 2| 9|
|000ac87bf9c1623ee...| theory| 2| 10|
This shows some for each user the 10 most frequent words he read.
I would like to create a dictionary, which then can be saved to a file, with the following format:
User : <top 1 word>, <top 2 word> .... <top 10 word>
To achieve this, I thought it might be more efficient to cut down the df as much as possible, before converting it. Thus, I tried:
>>> df.groupBy("User Hash ID").agg(collect_list("Word")).show(20)
+--------------------+--------------------+
| User Hash ID| collect_list(Word)|
+--------------------+--------------------+
|00095808cdc611fb5...|[errors, text, in...|
|000ac87bf9c1623ee...|[consciousness, b...|
|0038ccf6e16121e7c...|[potentials, orga...|
|0042bfbafc6646f47...|[fuel, car, consu...|
|00a19396b7bb52e40...|[face, recognitio...|
|00cec95a2c007b650...|[force, energy, m...|
|00df9406cbab4575e...|[food, history, w...|
|00e6e2c361f477e1c...|[image, based, al...|
|01636d715de360576...|[functional, lang...|
|01a778c390e44a8c3...|[trna, genes, pro...|
|01ab9ade07743d66b...|[packaging, car, ...|
|01bdceea066ec01c6...|[anthropology, de...|
|020c643162f2d581b...|[laser, electron,...|
|0211604d339d0b3db...|[food, school, ve...|
|0211e8f09720c7f47...|[privacy, securit...|
|021435b2c4523dd31...|[life, rna, origi...|
|0239620aa740f1514...|[method, image, d...|
|023ad5d85a948edfc...|[web, user, servi...|
|02416836b01461574...|[parts, based, ad...|
|0290152add79ae1d8...|[data, score, de,...|
+--------------------+--------------------+
From here, it should be more straight forward to generate that dictionary However, I cannot be sure if by using this agg function I am guaranteed that the words are in the correct order! That is why I am hesitant and wanted to get some feedback on maybe better options

Based on answers provided here - collect_list by preserving order based on another variable
you can write below query to make sure you have top 5 in correct order
import pyspark.sql.functions as F
grouped_df = dft.groupby("userid") \
.agg(F.sort_array(F.collect_list(F.struct("rank", "word"))) \
.alias("collected_list")) \
.withColumn("sorted_list",F.slice(F.col("collected_list.word"),start=1,length=5)) \
.drop("collected_list")\
.show(truncate=False)

First of all, if you go from a dataframe to a dictionary, you may have to face some memory issue as you will bring all the content of the dataframe to your driver (dictionary is a python object, not a spark object).
You are not that far away from a working solution. I'd do it that way :
from pyspark.sql import functions as F
df.groupBy("User Hash ID").agg(
F.collect_list(F.struct("Word", "sum(Total Count)", "rank")).alias("data")
)
This will create a data column where you have your 3 fields, aggregated by user id.
Then, to go from a dataframe to a dict object, you can use for example toJSON or Row object method asDict

Scala Spark functions like group by, describe() returning incorrect result

I have using Scala Spark on intellij IDE to analyze a csv file having 672,112 records . File is available on the link - https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding
File name : kiva_loans.csv
I ran show() command to view few records and it is reading all columns correctly but when I apply group by on the column "repayment_interval", it displays value which appears to be data from other columns (column shift ) as shown below.
distinct values in the "repayment_interval" columns are
Monthly (More frequent)
irregular
bullet
weekly (less frequent)
For testing purpose, I searched for values given in the screenshot and put those rows in a separate file and tried to read that file using scala spark. It is showing all values in correct column and even groupby is returning correct values.
I am facing this issue with describe() function.
As shown in above image , column - id & "funded_amount" is numeric columns but not sure why describe() on them is giving string values for "min","max".
read csv command as below
val kivaloans=spark.read
//.option("sep",",")
.format("com.databricks.spark.csv")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
printSchema output after adding ".option("multiline","true")". It is reading few rows as header as shown in the highlighted yellow color.

It seems, there are new line characters in columns data. Hence, set property multiline as true.
val kivaloans=spark.read.format("com.databricks.spark.csv")
.option("multiline","true")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
Data summary is as follows after setting multiline as true:
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
|summary| id| funded_amount| loan_amount| activity| sector| use| country_code| country| region| currency| partner_id| posted_time| disbursed_time| funded_time| term_in_months| lender_count| tags| borrower_genders| repayment_interval| date|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
| count| 671205| 671205| 671205| 671205| 671205| 666977| 671197| 671205| 614441| 671186| 657699| 671195| 668808| 622890| 671196| 671199| 499834| 666957| 671191| 671199|
| mean| 993248.5937336581|785.9950611214159|842.3971066961659| null| null| 10000.0| null| null| null| null| 178.20274555550654| 162.01020408163265| 179.12244897959184| 189.3|13.74266332047713|20.588457578299735| 25.68553459119497| 26.4| 26.210526315789473| 27.304347826086957|
| stddev|196611.27542282813|1130.398941057504|1198.660072882945| null| null| NaN| null| null| null| null| 94.24892231613454| 78.65564973356628| 100.70555939905975| 125.87299363372507|8.631922222356161|28.458485403188924| 31.131029407317044| 35.87289875191111| 52.43279244938066| 41.99181173710449|
| min| 653047| 0.0| 25.0|Adult Care|AgricultuTo buy chicken.| ""fajas"" [wove...| 10 boxes of cream| 3x1 purlins| T-shaped brackets| among other prod...| among other item...| and pay for labour"| and cassava to m...| yeast| rice| milk| among other prod...|#Animals, #Biz Du...| #Elderly|
| 25%| 823364| 250.0| 275.0| null| null| 10000.0| null| null| null| null| 126.0| 123.0| 105.0| 87.0| 8.0| 7.0| 8.0| 8.0| 9.0| 6.0|
| 50%| 992996| 450.0| 500.0| null| null| 10000.0| null| null| null| null| 145.0| 144.0| 144.0| 137.0| 13.0| 13.0| 14.0| 15.0| 14.0| 17.0|
| 75%| 1163938| 900.0| 1000.0| null| null| 10000.0| null| null| null| null| 204.0| 177.0| 239.0| 201.0| 14.0| 24.0| 27.0| 31.0| 24.0| 34.0|
| max| 1340339| 100000.0| 100000.0| Wholesale| Wholesale|? provide a safer...| ZW| Zimbabwe| ?ZM?T| baguida| XOF| XOF| Yoro, Yoro| USD| USD| USD|volunteer_pick, v...|volunteer_pick, v...| weekly|volunteer_pick, v...|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+

Closest Date looking from One Column to another in PySpark Dataframe

I have a pyspark dataframe where price of Commodity is mentioned, but there is no data for when was the Commodity bought, I just have a range of 1 year.
+---------+------------+----------------+----------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|
+---------+------------+----------------+----------------+
| Apple| 5| 2020-07-04| 2019-07-03|
| Banana| 3| 2020-07-03| 2019-07-02|
| Banana| 4| 2019-10-02| 2018-10-01|
| Apple| 6| 2020-01-20| 2019-01-19|
| Banana| 3.5| 2019-08-17| 2018-08-16|
+---------+------------+----------------+----------------+
I have another pyspark dataframe where I can see the market price and date of all commodities.
+----------+----------+------------+
| Date| Commodity|Market Price|
+----------+----------+------------+
|2020-07-01| Apple| 3|
|2020-07-01| Banana| 3|
|2020-07-02| Apple| 4|
|2020-07-02| Banana| 2.5|
|2020-07-03| Apple| 7|
|2020-07-03| Banana| 4|
+----------+----------+------------+
I want to see the closest date to Upper limit of date when Market Price(MP) of that commodity < or = Buying Price(BP).
Expected Output (for 2 top columns):
+---------+------------+----------------+----------------+--------------------------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
+---------+------------+----------------+----------------+--------------------------------+
| Apple| 5| 2020-07-04| 2019-07-03| 2020-07-02|
| Banana| 3| 2020-07-03| 2019-07-02| 2020-07-02|
+---------+------------+----------------+----------------+--------------------------------+
Even though Apple was much lower on 2020-07-01 ($3), but since 2020-07-02 was the first date going backwards from Upper Limit (UL) of date when MP <= BP. So, I selected 2020-07-02.
How can I see backwards to fill date of probable buying?

Try this with conditional join and window function
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("Commodity")
df1\ #first dataframe shown being df1 and second being df2
.join(df2.withColumnRenamed("Commodity","Commodity1")\
, F.expr("""`Market Price`<=BuyingPrice and Date<Date_Upper_limit and Commodity==Commodity1"""))\
.drop("Market Price","Commodity1")\
.withColumn("max", F.max("Date").over(w))\
.filter('max==Date').drop("max").withColumnRenamed("Date","Closest Date to UL when MP <= BP")\
.show()
#+---------+-----------+----------------+----------------+--------------------------------+
#|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
#+---------+-----------+----------------+----------------+--------------------------------+
#| Banana| 3.0| 2020-07-03| 2019-07-02| 2020-07-02|
#| Apple| 5.0| 2020-07-04| 2019-07-03| 2020-07-02|
#+---------+-----------+----------------+----------------+--------------------------------+

spark scala transform a dataframe/rdd

I have a CSV file like below.
PK,key,Value
100,col1,val11
100,col2,val12
100,idx,1
100,icol1,ival11
100,icol3,ival13
100,idx,2
100,icol1,ival21
100,icol2,ival22
101,col1,val21
101,col2,val22
101,idx,1
101,icol1,ival11
101,icol3,ival13
101,idx,3
101,icol1,ival31
101,icol2,ival32
I want to transform this into the following.
PK,idx,key,Value
100,,col1,val11
100,,col2,val12
100,1,idx,1
100,1,icol1,ival11
100,1,icol3,ival13
100,2,idx,2
100,2,icol1,ival21
100,2,icol2,ival22
101,,col1,val21
101,,col2,val22
101,1,idx,1
101,1,icol1,ival11
101,1,icol3,ival13
101,3,idx,3
101,3,icol1,ival31
101,3,icol2,ival32
Basically I want to create the an new column called idx in the output dataframe which will be populated with the same value "n" as that of the row following the key=idx, value="n".

Here is one way using last window function with Spark >= 2.0.0:
import org.apache.spark.sql.functions.{last, when, lit}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("PK").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("idx", when($"key" === lit("idx"), $"Value"))
.withColumn("idx", last($"idx", true).over(w))
.orderBy($"PK")
.show
Output:
+---+-----+------+----+
| PK| key| Value| idx|
+---+-----+------+----+
|100| col1| val11|null|
|100| col2| val12|null|
|100| idx| 1| 1|
|100|icol1|ival11| 1|
|100|icol3|ival13| 1|
|100| idx| 2| 2|
|100|icol1|ival21| 2|
|100|icol2|ival22| 2|
|101| col1| val21|null|
|101| col2| val22|null|
|101| idx| 1| 1|
|101|icol1|ival11| 1|
|101|icol3|ival13| 1|
|101| idx| 3| 3|
|101|icol1|ival31| 3|
|101|icol2|ival32| 3|
+---+-----+------+----+
The code first creates a new column called idx which contains the value of Value when key == idx, or null otherwise. Then it retrieves the last observed idx over the defined window.

Pyspark groupBy Pivot Transformation

I'm having a hard time framing the following Pyspark dataframe manipulation.
Essentially I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and and are not leveraging Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
#loop over subcategory
listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
num = 1
for b in dfArraySub:
#renames all columns to append a number
for c in b.columns:
if c not in ['category','sub_category','date']:
column_name = str(c)+'_'+str(num)
b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
b = b.drop('sub_category')
num += 1
#if no df exists, create one and continually join new columns
try:
all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
except:
all_subs = b
#Fixes missing columns on union
try:
try:
diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
for d in diff_columns:
all_subs = all_subs.withColumn(d, lit(None))
all_cats = all_cats.union(all_subs)
except:
diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
for d in diff_columns:
all_cats = all_cats.withColumn(d, lit(None))
all_cats = all_cats.union(all_subs)
except Exception as e:
print e
all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!

Your output is not really logical, but we can achieve this result using the pivot function. You need to precise your rules otherwise I can see a lot of cases it may fails.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to remove row having different value in rows - pyspark

The best way to filter out the row using regular expression expr = "^http.*" df=df.filter(df["url"].rlike(expr))

Related

PySpark convert Dataframe to Dictionary

Scala Spark functions like group by, describe() returning incorrect result

Closest Date looking from One Column to another in PySpark Dataframe

spark scala transform a dataframe/rdd

Pyspark groupBy Pivot Transformation

Categories

Resources