How to count the different elements of a column? - PySpark

Hi, I am wondering how I can count the occurrences of each element across the whole data set.
I have the following column:
col
('a', 'b'),('a', 'c'),('b', 'c')
('g', 'h'),('a', 'c'),('a', 'b')
I want to count how many times each of the above pairs exists in the data set.
Output:
('a', 'b') 2
('a', 'c') 2
('b', 'c') 1
('g', 'h') 1
I know in pandas I can do this:
h=data['col'].str.findall(r'(\([^()]+\))').explode().value_counts()

Assuming your input data is a string (based on your regex), I'd suggest replacing the comma between pairs with another separator, which you can then use to split. I'm using a pipe here for demo purposes only; you can change it to any other unique pattern.
from pyspark.sql import functions as F

df = (spark
    .sparkContext
    .parallelize([
        ("('a', 'b'),('a', 'c'),('b', 'c')",),
        ("('g', 'h'),('a', 'c'),('a', 'b')",),
    ])
    .toDF(['col'])
)
# Output
# +--------------------------------+
# |col |
# +--------------------------------+
# |('a', 'b'),('a', 'c'),('b', 'c')|
# |('g', 'h'),('a', 'c'),('a', 'b')|
# +--------------------------------+
(df
    .withColumn('col', F.regexp_replace('col', r'\),\(', ')|('))
    .withColumn('col', F.explode(F.split('col', r'\|')))
    .groupBy('col')
    .count()
    .show(10, False)
)
# Output
# +----------+-----+
# |col |count|
# +----------+-----+
# |('g', 'h')|1 |
# |('a', 'c')|2 |
# |('b', 'c')|1 |
# |('a', 'b')|2 |
# +----------+-----+
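
If you'd rather avoid the intermediate separator, a minimal variation (same assumption that each row is a plain string of pairs) is to split directly on the comma that sits between a closing and an opening parenthesis, using lookarounds:

from pyspark.sql import functions as F

# split on the comma between ')' and '(' so each pair becomes its own element
(df
    .withColumn('col', F.explode(F.split('col', r'(?<=\)),(?=\()')))
    .groupBy('col')
    .count()
    .show(10, False)
)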

Related

Calculating difference between two dates in PySpark

Currently I'm working with a dataframe and need to calculate the number of days (as an integer) between two dates stored as timestamps.
I've opted for this solution:
from pyspark.sql.functions import lit, when, col, datediff
df1 = df1.withColumn("LD", datediff("MD", "TD"))
But when calculating a sum from a list of columns I get the error "Column is not iterable", which makes it impossible for me to calculate the sum of the rows based on column names:
col_list = ["a", "b", "c"]
df2 = df1.withColumn("My_Sum", sum([F.col(c) for c in col_list]))
How can I deal with it in order to calculate the difference between dates and then calculate the sum of the rows given the names of certain columns?
The datediff has nothing to do with the sum of a column. The PySpark SQL sum function takes in one column and calculates the sum of the rows in that column.
Here are a couple of ways to get the sum of a column from a list of columns using a list comprehension.
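For reference, here is a minimal hypothetical data_sdf the snippets below can run against (the functions as func import and the column names a, b, c are assumptions matching col_list; the totals shown in the outputs come from the answerer's larger sample):

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data; any numeric columns named a, b, c will do
data_sdf = spark.createDataFrame(
    [(45, 58, 125), (9, 99, 143), (33, 91, 146), (21, 85, 118), (30, 55, 101)],
    ['a', 'b', 'c'])
col_list = ['a', 'b', 'c']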
Single row output with the sum of the column
data_sdf. \
    select(*[func.sum(c).alias(c+'_sum') for c in col_list]). \
    show()
# +-----+-----+-----+
# |a_sum|b_sum|c_sum|
# +-----+-----+-----+
# | 1337| 3778| 6270|
# +-----+-----+-----+
The sum of all rows of the column, repeated in each row
from pyspark.sql.window import Window as wd

data_sdf. \
    select('*',
           *[func.sum(c).over(wd.partitionBy()).alias(c+'_sum') for c in col_list]
           ). \
    show(5)
# +---+---+---+-----+-----+-----+
# | a| b| c|a_sum|b_sum|c_sum|
# +---+---+---+-----+-----+-----+
# | 45| 58|125| 1337| 3778| 6270|
# | 9| 99|143| 1337| 3778| 6270|
# | 33| 91|146| 1337| 3778| 6270|
# | 21| 85|118| 1337| 3778| 6270|
# | 30| 55|101| 1337| 3778| 6270|
# +---+---+---+-----+-----+-----+
# only showing top 5 rows

PySpark, create line graph from a dataframe without a "category" on databricks

I'm running the following code on databricks:
dataToShow = jDataJoined.\
    withColumn('id', monotonically_increasing_id()).\
    filter(
        (jDataJoined.containerNumber == 'SUDU8108536')).\
    select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This gives me tabular data with the id, returnTemperature and supplyTemperature columns.
Now I would like to display a line graph with returnTemperature and supplyTemperature as categories.
As far as I understand, the display method in Databricks wants the category as its second argument, so basically what I should have is something like:
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the SQL functions create_map and explode:
from pyspark.sql import functions as F

# create an example df
l1 = [(1, 25.0, 27.0), (2, 24.0, 28.0)]
df = spark.createDataFrame(l1, ['id', 'returnTemperature', 'supplyTemperature'])

# create a map column which contains the values of returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
    F.lit('returnTemperature'), df.returnTemperature,
    F.lit('supplyTemperature'), df.supplyTemperature
))

# explode creates a new row for each element of the map
df = df.select('id', F.explode(df.mapCol).alias('temperatureCategory', 'value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
|  1|  returnTemperature| 25.0|
|  1|  supplyTemperature| 27.0|
|  2|  returnTemperature| 24.0|
|  2|  supplyTemperature| 28.0|
+---+-------------------+-----+
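
As an aside, the same unpivot can be sketched with the built-in stack SQL function via expr, without building a map column; the names below reuse the example above and the result has the same long id / temperatureCategory / value shape:

from pyspark.sql import functions as F

# rebuild the wide example input, then unpivot it with stack()
l1 = [(1, 25.0, 27.0), (2, 24.0, 28.0)]
wide = spark.createDataFrame(l1, ['id', 'returnTemperature', 'supplyTemperature'])

long_df = wide.select(
    'id',
    F.expr("stack(2, 'returnTemperature', returnTemperature, "
           "'supplyTemperature', supplyTemperature) as (temperatureCategory, value)")
)
long_df.show()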

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts for each query, where x is the number of Ids I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where("query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I start with an empty dataframe, add each of these results to it one by one, and return the newly created dataframe.
val queries = df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
  middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
  queryDF = middleDF.sort(col("count").desc).limit(howMany).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
  emptyDF = emptyDF.union(queryDF) // assuming emptyDF was created beforehand with a matching schema
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val howMany = 2

val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
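
Since the rest of this page is PySpark, here is a rough PySpark sketch of the same idea (the dataframe and column names are assumed to match the example above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

howMany = 2
w = Window.partitionBy('query').orderBy(F.col('count').desc())

newDF = (df
    .withColumn('rank', F.row_number().over(w))
    .where(F.col('rank') <= howMany)
    .groupBy('query')
    .agg(F.collect_list('Id').alias('Ids'),
         F.max('count').alias('count')))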

Reshaping an RDD from an array of arrays to unique columns in PySpark

I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it into unique columns with the count.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops, but I am looking for a way that can utilize Spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now and the code I'm using to do this is
data = data.applymap(lambda x: dict(x))  # convert the array of arrays into a dictionary
columns = list(data)
for i in columns:
    # for each column, use the dictionary to make a new Series and append it to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer,
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# First explode column1; this makes each [name, score] pair a separate row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))

# Then separate the exploded column1 into the name and its count
df = df.withColumn('column1_separated', F.col('column1')[0])
df = df.withColumn('count', F.col('column1')[1].cast(IntegerType()))

# Finally, pivot the df so each unique name becomes its own column
df = df.groupby('Users').pivot('column1_separated').sum('count')
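
For anyone trying this end to end, here is a minimal hypothetical input matching the sample in the question (scores stored as strings, hence the cast above). After the pivot, missing names come back as nulls, which can be replaced with df.fillna(0) if a dense matrix is needed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample matching the question's table
df = spark.createDataFrame(
    [('user1', [['name1', '4'], ['name2', '5']]),
     ('user2', [['name1', '2'], ['name3', '1']])],
    ['Users', 'column1'])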

Reading a CSV file into PySpark that contains key:value pairs, such that the key becomes the column and the value becomes its data

I am a beginner of Spark. Please help me out with a solution.
The CSV file contains text in the form of key:value pairs delimited by commas, and in some lines the keys (or columns) may be missing.
I have loaded this file into a single column of a dataframe. I want to split these keys out into columns, with their associated values as the data of those columns. When some columns are missing from a line, I want to add the column anyway with dummy (null) data.
Dataframe
+----------------------------------------------------------------+
| _c0 |
+----------------------------------------------------------------+
|name:Pradnya,IP:100.0.0.4, college: SDM, year:2018 |
|name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018 |
+----------------------------------------------------------------+
I want the output in this form
+---------+-------------+---------+----------+------+
| name    | IP          | College | Semester | year |
+---------+-------------+---------+----------+------+
| Pradnya | 100.0.0.4   | SDM     | null     | 2018 |
| Ram     | 100.10.10.5 | BVB     | IV       | 2018 |
+---------+-------------+---------+----------+------+
Thanks.
PySpark won't recognize the key:value pairing out of the box. One workaround is to convert the file into JSON format and then read the JSON file.
content of raw.txt:
name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018
The following code will create the JSON file:
import json

with open('raw.json', 'w') as outfile:
    json.dump([dict([p.split(':') for p in l.split(',')]) for l in open('raw.txt')], outfile)
Now you can create the PySpark dataframe using the following code:
df = spark.read.format('json').load('raw.json')
If you know all field names and the keys/values do not contain embedded delimiters, then you can probably convert the key/value lines into Row objects through the RDD's map function.
from pyspark.sql import Row

# assumes you already defined a SparkSession named `spark`
sc = spark.sparkContext

# initialize the RDD
rdd = sc.textFile("key-value-file")

# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']

# build a Row object from one line
def setRow(x):
    # convert the line into key/value tuples; strip spaces and lowercase the key
    z = dict((k.strip().lower(), v.strip()) for e in x.split(',') for k, v in [e.split(':')])
    # make sure all columns show up in the Row object
    return Row(**dict((c, z[c] if c in z else None) for c in map(str.lower, columns)))

# map lines to Row objects and then convert the result to a dataframe
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college| ip| name|semester|year|
#+-------+-----------+-------+--------+----+
#| SDM| 100.0.0.4|Pradnya| null|2018|
#| BVB|100.10.10.5| Ram| IV|2018|
#+-------+-----------+-------+--------+----+
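
For completeness, here is a DataFrame-only sketch using the built-in str_to_map SQL function (available via expr on Spark 2.x+). It assumes the raw lines were loaded into a column named _c0 as in the question, and that the only cleanup needed is trimming whitespace around the delimiters:

from pyspark.sql import functions as F

# normalize whitespace around the delimiters, then parse each line into a map
parsed = (df
    .withColumn('_c0', F.regexp_replace('_c0', r'\s*:\s*', ':'))
    .withColumn('_c0', F.regexp_replace('_c0', r'\s*,\s*', ','))
    .withColumn('kv', F.expr("str_to_map(_c0, ',', ':')")))

# keys missing from a line simply come back as null
keys = ['name', 'IP', 'college', 'semester', 'year']
parsed.select([parsed['kv'][k].alias(k) for k in keys]).show()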