Counting words after grouping records - pyspark

Note: Although the provided answer works, it can get rather slow on larger data sets. Take a look at this for a faster solution.
I have a data frame consisting of labelled documents, such as this one:
df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])
What I want is to group the data frame by label and compute a simple word count for each group. My problem is that I'm not sure how to do this in PySpark. As a first step I would split the text and get each document as a list of tokens:
from collections import Counter
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

def get_tokens(text):
    if text is None:
        return list()
    return text.lower().split()

udf_get_tokens = F.udf(get_tokens, ArrayType(StringType()))

df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .show()
Gives
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[hello, how, are,...|
| 1|[hello, how, are,...|
| 2|[hello, are, you,...|
| 2|[hello, how, is, it]|
| 3|[hello, how, are,...|
| 3|[hello, how, are,...|
| 4|[hello, how, is, ...|
+-----+--------------------+
I know how to do a word count over the entire data frame, but I don't know how to proceed with groupBy() or reduceByKey().
I was thinking about partially counting the words in the data frame:
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items())))\
    .toDF(schema=['label', 'text'])\
    .show()
which gives:
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[[are,2], [hello,...|
| 1|[[are,1], [hello,...|
| 2|[[are,1], [hello,...|
| 2|[[how,1], [it,1],...|
| 3|[[are,1], [hello,...|
| 3|[[are,1], [hello,...|
| 4|[[you,1], [today,...|
+-----+--------------------+
but how can I aggregate this?

You should use pyspark.ml.feature.Tokenizer to split the text instead of a udf. (Also, depending on what you are doing, you may find StopWordsRemover useful; a short sketch follows the Tokenizer example below.)
For example:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokens = tokenizer.transform(df_)
tokens.show(truncate=False)
+-----+---------------------------+----------------------------------+
|label|text |tokens |
+-----+---------------------------+----------------------------------+
|1 |hello how are are you today|[hello, how, are, are, you, today]|
|1 |hello how are you |[hello, how, are, you] |
|2 |hello are you here |[hello, are, you, here] |
|2 |how is it |[how, is, it] |
|3 |hello how are you |[hello, how, are, you] |
|3 |hello how are you |[hello, how, are, you] |
|4 |hello how is it you today |[hello, how, is, it, you, today] |
+-----+---------------------------+----------------------------------+
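If you also want to drop common stop words before counting, a minimal sketch using StopWordsRemover with its default English stop word list could look like this (note that on toy data like the above, most tokens are stop words, so the filtered column will be mostly empty):
from pyspark.ml.feature import StopWordsRemover

# Remove the default English stop words from the token lists produced by the Tokenizer
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
filtered = remover.transform(tokens)
filtered.select("label", "filtered_tokens").show(truncate=False)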
Then you can explode() the tokens, and do a groupBy() to get the count for each word:
import pyspark.sql.functions as f
token_counts = tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token").count()\
    .orderBy("label", "token")
token_counts.show(truncate=False, n=10)
+-----+-----+-----+
|label|token|count|
+-----+-----+-----+
|1 |are |3 |
|1 |hello|2 |
|1 |how |2 |
|1 |today|1 |
|1 |you |2 |
|2 |are |1 |
|2 |hello|1 |
|2 |here |1 |
|2 |how |1 |
|2 |is |1 |
+-----+-----+-----+
only showing top 10 rows
If you want all of the tokens and counts on one row per label, just do another groupBy() with pyspark.sql.functions.collect_list() and combine the token and count columns using pyspark.sql.functions.struct():
tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token")\
    .count()\
    .groupBy("label")\
    .agg(f.collect_list(f.struct(f.col("token"), f.col("count"))).alias("text"))\
    .orderBy("label")\
    .show(truncate=False)
+-----+----------------------------------------------------------------+
|label|text |
+-----+----------------------------------------------------------------+
|1 |[[hello,2], [how,2], [are,3], [today,1], [you,2]] |
|2 |[[you,1], [hello,1], [here,1], [are,1], [it,1], [how,1], [is,1]]|
|3 |[[are,2], [you,2], [how,2], [hello,2]] |
|4 |[[today,1], [hello,1], [it,1], [you,1], [how,1], [is,1]] |
+-----+----------------------------------------------------------------+
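If you would rather end up with a real map column than an array of structs, and you are on Spark 2.4+, you can wrap the collected array with pyspark.sql.functions.map_from_entries(). A rough sketch, assuming the aggregation above (everything up to the .agg(...) call) is assigned to a dataframe called grouped instead of being chained straight into .show():
# map_from_entries turns the array of (token, count) structs into a map<token, count> column
grouped.withColumn("text", f.map_from_entries("text"))\
    .orderBy("label")\
    .show(truncate=False)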

Related

Filtering empty partitions in RDD

Is there a way to filter empty partitions in RDD? I have some empty partitions after partitioning and I can't use them in action method.
I use Apache Spark in Scala
This is my sample data
val sc = spark.sparkContext
val myDataFrame = spark.range(20).toDF("mycol").repartition($"mycol")
myDataFrame.show(false)
Output :
+-----+
|mycol|
+-----+
|19 |
|0 |
|7 |
|6 |
|9 |
|17 |
|5 |
|1 |
|10 |
|3 |
|12 |
|8 |
|11 |
|2 |
|4 |
|13 |
|18 |
|14 |
|15 |
|16 |
+-----+
In the above code, when you repartition on a column, 200 partitions are created because spark.sql.shuffle.partitions = 200. Most of them are unused or empty, since the data is just 20 numbers (we are trying to fit 20 numbers into 200 partitions, so most of the partitions are empty :-)).
1) Prepare a long accumulator variable to quickly count the non-empty partitions.
2) Add all non-empty partitions to the accumulator variable, as in the example below.
val nonEmptyPartitions = sc.longAccumulator("nonEmptyPartitions")
myDataFrame.foreachPartition(partition =>
if (partition.length > 0) nonEmptyPartitions.add(1))
3) Drop the empty partitions, i.e. coalesce down to the number of non-empty partitions (coalesce, so little to no shuffle).
4) Print the partition counts.
val finalDf = myDataFrame.coalesce(nonEmptyPartitions.value.toInt)
println(s"nonEmptyPart : ${nonEmptyPartitions.value.toInt}")
println(s"df.rdd.partitions.length : ${myDataFrame.rdd.getNumPartitions}")
println(s"finalDf.rdd.partitions.length : ${finalDf.rdd.getNumPartitions}")
Result:
nonEmptyPart : 20
df.rdd.partitions.length : 200
finalDf.rdd.partitions.length : 20
Proof that only 20 of the 200 original partitions actually hold any data (the groupBy below returns a row only for the non-empty partitions):
myDataFrame.withColumn("partitionId", org.apache.spark.sql.functions.spark_partition_id)
  .groupBy("partitionId")
  .count
  .show
Result, record count per (non-empty) partition:
+-----------+-----+
|partitionId|count|
+-----------+-----+
|128 |1 |
|190 |1 |
|140 |1 |
|164 |1 |
|5 |1 |
|154 |1 |
|112 |1 |
|107 |1 |
|4 |1 |
|49 |1 |
|69 |1 |
|77 |1 |
|45 |1 |
|121 |1 |
|143 |1 |
|58 |1 |
|11 |1 |
|150 |1 |
|68 |1 |
|116 |1 |
+-----------+-----+
Note:
spark_partition_id is used here for demo/debug purposes only, not for production use.
I reduced the 200 partitions (created by the repartition on a column) to 20 non-empty partitions.
Conclusion:
You got rid of the extra empty partitions that don't hold any data and avoided unnecessarily scheduling dummy tasks for them.
From the little info you provide, I can think of two options. Use mapPartitions and just catch empty iterators, returning them as-is, while working on the non-empty ones:
rdd.mapPartitions { case iter => if(iter.isEmpty) { iter } else { ??? } }
Or you can use repartition to get rid of the empty partitions:
rdd.repartition(10) // or any proper number
If you don't know the number of distinct values within the column and wish to avoid empty partitions, you can use countApproxDistinct():
df.repartition(df.rdd.countApproxDistinct().toInt)
If you wish to filter out the existing empty partitions and repartition, you can use the solution suggested by Sasa,
OR:
df.repartition(df.mapPartitions(part => List(part.length).iterator).collect().count(_ != 0))
However, in the latter case the partitions may or may not contain records grouped by value.

Map values of a column with ArrayType based on values from another dataframe in PySpark

What I have:
+--------+-----+-------+-----+---------+
|ids     |items|item_id|value|timestamp|
+--------+-----+-------+-----+---------+
|[A,B,C] |1.0  |1      |5    |100      |
|[A,B,D] |1.0  |2      |6    |90       |
|[D]     |0.0  |3      |7    |80       |
|[C]     |0.0  |4      |8    |80       |
+--------+-----+-------+-----+---------+

+---+------+
|ids|id_num|
+---+------+
|A  |1     |
|B  |2     |
|C  |3     |
|D  |4     |
+---+------+
What I want:
| ids |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3] |
|[4] |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension, you can return [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
I can't think of a different method, but to get a parallelized dictionary, you can just use the toJSON method. It will require further processing depending on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))
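For example, a hedged sketch of that further processing, assuming the df used above is the reference dataframe from the question with the columns ids and id_num:
# Collect the mapping rows into a plain Python dict, e.g. {'A': 1, 'B': 2, ...},
# which can then serve as map_dict in the UDF answer above
map_dict = dict(df_json.map(lambda d: (d["ids"], d["id_num"])).collect())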

How to Reverse arrangement DataFrame in Apache Spark

How can I reverse this DataFrame using Scala?
I saw the sort functions, but they require a specific column; I only want to reverse the rows.
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|1 | james |any |
|3 | marry |some |
|2 | john |some |
|5 | tom |any |
+---+--------+-----+
to:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|5 | tom |any |
|2 | john |some |
|3 | marry |some |
|1 | james |any |
+---+--------+-----+
You can add a column with an increasing id using monotonically_increasing_id()
and then sort by it in descending order:
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val dff = Seq(
  (1, "james", "any"),
  (3, "marry", "some"),
  (2, "john", "some"),
  (5, "tom", "any")
).toDF("id", "name", "note")

dff.withColumn("index", monotonically_increasing_id())
  .sort($"index".desc)
  .drop($"index")
  .show(false)
Output:
+---+-----+----+
|id |name |note|
+---+-----+----+
|5 |tom |any |
|2 |john |some|
|3 |marry|some|
|1 |james|any |
+---+-----+----+
You could do something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val reverseDf = df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
  .orderBy($"row_num".desc)
  .drop("row_num")
Or refer to this instead of row_number.

SQL Select Unique Values Each Column

I'm looking to select unique values from each column of a table and output the results into a single table. Take the following example table:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |A |"red" |
|3 |"apples" |B |"blue" |
+------+---------------+------+---------------+
the ideal output would be:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |B |"blue" |
|3 | | | |
+------+---------------+------+---------------+
Thank you!
Edit: My actual table has many more columns, so ideally the SQL query can be done via a SELECT * as opposed to 4 individual select queries within the FROM statement.
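Since the surrounding posts are Spark-flavoured, here is a rough, illustrative PySpark sketch of one possible approach rather than plain SQL (the function name distinct_values_per_column is hypothetical, not a definitive solution): number the distinct values of each column independently, then full-outer-join the per-column results on that number so that shorter columns pad out with nulls.
from pyspark.sql import functions as F, Window

def distinct_values_per_column(df):
    # Build one small frame of distinct values per column, numbered 1..n,
    # then full-outer-join them all on the row number.
    result = None
    for c in df.columns:
        w = Window.orderBy(c)
        col_df = df.select(c).distinct().withColumn("rn", F.row_number().over(w))
        result = col_df if result is None else result.join(col_df, on="rn", how="full_outer")
    return result.orderBy("rn").drop("rn")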

Splitting a list of JSON key/value pairs into columns of a row in a Dataset

I have a column with a list of key/value objects:
+----+--------------------------------------------------------------------------------------------+
|ID | Settings |
+----+--------------------------------------------------------------------------------------------+
|1 | [{"key":"key1","value":"val1"}, {"key":"key2","value":"val2"}, {"key":"key3","value":"val3"}] |
+----+--------------------------------------------------------------------------------------------+
Is it possible to split this list of objects into its own row?
As such:
+----+------+-------+-------+
|ID | key1 | key2 | key3 |
+----+------+-------+-------+
|1 | val1 | val2 | val3 |
+----+------+-------+-------+
I've tried exploding and placing the result into a struct:
case class Setting(key: String, value: String)

val newDF = df.withColumn("setting", explode($"settings"))
  .select($"id", from_json($"setting", Encoders.product[Setting].schema) as 'settings)
which gives me:
+------+------------------------------+
|ID |settings |
+------+------------------------------+
|1 |[key1,val1] |
|1 |[key2,val2] |
|1 |[key3,val3] |
+------+------------------------------+
And from here I can access specific rows by settings.key, but it's not quite what I need. I need to access multiple keys in the one row of data.
You are almost there. If you already have this:
+------+------------------------------+
|ID |settings |
+------+------------------------------+
|1 |[key1,val1] |
|1 |[key2,val2] |
|1 |[key3,val3] |
+------+------------------------------+
Now you can use pivot to reshape the data:
newDF.groupBy($"ID")
  .pivot("settings.key")
  .agg(first("settings.value"))
Group by ID and use pivot; agg with first takes the first value, but you can use any other aggregate function here.
Output:
+---+----+----+----+
|ID |key1|key2|key3|
+---+----+----+----+
|1 |val1|val2|val3|
+---+----+----+----+
Hope this helps!