How to apply WordNet Lemmatizer on a PySpark DataFrame? - pyspark

I'm trying to apply WordNet lemmatization to one of my DataFrame columns.
My dataframe looks like this:
+--------------------+-----+
| removed|stars|
+--------------------+-----+
|[today, second, t...| 1.0|
|[ill, first, admi...| 4.0|
|[believe, things,...| 1.0|
|[great, lunch, to...| 4.0|
|[weve, huge, slim...| 5.0|
|[plumbsmart, prov...| 5.0|
So each row contains a list of tokens, and I want to lemmatize each token.
I've tried with:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df_lemma= df_removed.select(lemmatizer.lemmatize('removed'))
df_lemma.show()
I did not get any error message, but my dataframe did not change.
+--------------------+
| removed|
+--------------------+
|[today, second, t...|
|[ill, first, admi...|
|[believe, things,...|
|[great, lunch, to...|
|[weve, huge, slim...|
|[plumbsmart, prov...|
Is there any error in my code? How should I apply lemmatizer?
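Note that lemmatizer.lemmatize('removed') lemmatizes the literal string 'removed' (which comes back unchanged), so the select simply returns the original column. To lemmatize every token in the array column, one way is to wrap the lemmatizer in a UDF; here is a minimal sketch, assuming the NLTK WordNet data is available on the workers:
from nltk.stem import WordNetLemmatizer
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def lemmatize_tokens(tokens):
    # Instantiate inside the function so the lemmatizer is created on the workers.
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens] if tokens else tokens

lemmatize_udf = F.udf(lemmatize_tokens, ArrayType(StringType()))

df_lemma = df_removed.withColumn("removed", lemmatize_udf("removed"))
df_lemma.show()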

Related

arithmetic operations with length function

I have this Dataframe
+----+----+------+------+------+----+
|key1|key2| col1| col2| col3|col4|
+----+----+------+------+------+----+
| 5| d|value7|value8|value9| 20|
+----+----+------+------+------+----+
And I am trying to do something like this:
df2.withColumn("new",repeat(lit("0"), 10-length(col("col3")) )).show()
But I get this error message: TypeError: Column is not iterable
I would like to know if there is any way to do a subtraction or an addition using length(col("col3")).
Using repeat as a SQL function (via expr) instead of the Python function works:
from pyspark.sql import functions as F
df.withColumn('new', F.expr('repeat("0", 10-length(col3))')).show()
Output:
+------+-----+
| col3| new|
+------+-----+
| hello|00000|
|value9| 0000|
+------+-----+
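If the end goal is to left-pad col3 itself to a width of 10 (an assumption about the use case, not something the question states), the built-in lpad does that in one step; a small sketch:
from pyspark.sql import functions as F

# Pads col3 on the left with "0" until it is 10 characters long.
df.withColumn("padded", F.lpad(F.col("col3"), 10, "0")).show()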

Spark RDD to Dataframe

Below is the data in a file
PREFIX|Description|Destination|Num_Type
1|C1|IDD|NA
7|C2|IDDD|NA
20|C3|IDDD|NA
27|C3|IDDD|NA
30|C5|IDDD|NA
I am trying to read it and convert it into a dataframe.
val file=sc.textFile("/user/cloudera-scm/file.csv")
val list=file.collect.toList
list.toDF.show
+--------------------+
| value|
+--------------------+
|PREFIX|Descriptio...|
| 1|C1|IDD|NA|
| 7|C2|IDDD|NA|
| 20|C3|IDDD|NA|
| 27|C3|IDDD|NA|
| 30|C5|IDDD|NA|
+--------------------+
I am not able to convert this to a dataframe with the exact table form.
Let's first consider your code.
// reading a potentially big file
val file=sc.textFile("/user/cloudera-scm/file.csv")
// collecting everything to the driver
val list=file.collect.toList
// converting a local list to a dataframe (this gives a single string column, not the table you want)
list.toDF.show
There are ways to make your code work, but the logic itself is awkward. You are reading the data with the executors, pulling all of it onto the driver simply to convert it into a dataframe (and send it back to the executors). That's a lot of network communication, and the driver will most likely run out of memory for any reasonably large dataset.
What you can do instead is read the data directly as a dataframe like this (the driver does nothing and there is no unnecessary IO):
spark.read
.option("sep", "|") // specify the delimiter
.option("header", true) // to tell spark that there is a header
.option("inferSchema", true) // optional, infer the types of the columns
.csv(".../data.csv").show
+------+-----------+-----------+--------+
|PREFIX|Description|Destination|Num_Type|
+------+-----------+-----------+--------+
| 1| C1| IDD| NA|
| 7| C2| IDDD| NA|
| 20| C3| IDDD| NA|
| 27| C3| IDDD| NA|
| 30| C5| IDDD| NA|
+------+-----------+-----------+--------+
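For reference, the equivalent read in PySpark (a sketch; the path is illustrative) would be:
df = (spark.read
      .option("sep", "|")            # specify the delimiter
      .option("header", True)        # the first line is a header
      .option("inferSchema", True)   # optional, infer the column types
      .csv("/user/cloudera-scm/file.csv"))
df.show()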

What is the correct way to calculate average using pyspark.sql functions?

In a pyspark dataframe, I have a time series of different events and I want to calculate the average count of events by month. What is the correct way to do that using the pyspark.sql functions?
I have a feeling that this requires agg, avg, window partitioning, but I couldn't make it work.
I have grouped the data by event and month and obtained something like this:
+------+-----+-----+
| event|month|count|
+------+-----+-----+
|event1| 1| 1023|
|event2| 1| 1009|
|event3| 1| 1002|
|event1| 2| 1012|
|event2| 2| 1023|
|event3| 2| 1017|
|event1| 3| 1033|
|event2| 3| 1011|
|event3| 3| 1004|
+------+-----+-----+
What I would like to have is this:
+------+-------------+
| event|avg_per_month|
+------+-------------+
|event1| 1022.6666|
|event2| 1014.3333|
|event3| 1007.6666|
+------+-------------+
What is the correct way to accomplish this?
This should help you get the desired result:
df = spark.createDataFrame(
    [('event1', 1, 1023),
     ('event2', 1, 1009),
     ('event3', 1, 1002),
     ('event1', 2, 1012),
     ('event2', 2, 1023),
     ('event3', 2, 1017),
     ('event1', 3, 1033),
     ('event2', 3, 1011),
     ('event3', 3, 1004)],
    ["event", "month", "count"])
Example 1:
from pyspark.sql import functions as F

df.groupBy("event").\
    agg(F.avg("count").alias("avg_per_month")).\
    show()
Example 2:
df.groupBy("event").\
    agg({'count': 'avg'}).\
    withColumnRenamed("avg(count)", "avg_per_month").\
    show()
Note that alias must be applied to the avg expression itself (or the column renamed afterwards); aliasing the whole dataframe would not rename the aggregated column.
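Since the question mentions window partitioning, here is a sketch of that variant too; it keeps one row per event/month and attaches the per-event average to every row instead of collapsing the groups:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Average of "count" over all months of the same event, added as a new column.
w = Window.partitionBy("event")
df.withColumn("avg_per_month", F.avg("count").over(w)).show()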

Spark: Flatten simple multi-column DataFrame

How to flatten a simple (i.e. no nested structures) dataframe into a list?
My problem set is detecting all the node pairs that have been changed/added/removed from a table of node pairs.
This means I have a "before" and "after" table to compare. Combining the before and after dataframe yields rows that describe where a pair appears in one dataframe but not the other.
Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1 |after.id2 |
+-----------+-----------+-----------+-----------+
| null| null| E2| E3|
| B3| B1| null| null|
| I1| I2| null| null|
| A2| A3| null| null|
| null| null| G3| G4|
The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:
{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}
Potential approaches:
Union all the columns separately and distinct
flatMap and distinct
map and flatten
Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or another, would be the simplest?
Other notes
Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
distinct in the resulting list is nice to have, but not required; assuming it is trivial with the distinct operation
Try the following: convert all rows to Seqs, collect them on the driver, then flatten the data and remove the null value:
val df = Seq(("A","B"),(null,"A")).toDF
val result = df.rdd.map(_.toSeq.toList)
.collect().toList.flatten.toSet - null
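For a variant that stays distributed until the very end, here is a PySpark sketch of the "union all the columns and distinct" idea from the question, assuming df is the combined before/after dataframe shown above (the backticks are needed because the column names contain dots):
from pyspark.sql import functions as F

# Put every id column into one array, explode it to one node per row,
# drop the nulls and deduplicate.
node_cols = [F.col(f"`{c}`") for c in df.columns]
nodes = (df.select(F.explode(F.array(*node_cols)).alias("node"))
           .where(F.col("node").isNotNull())
           .distinct())
nodes.show()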

How to write a large RDD to local disk through the Scala spark-shell?

Through a Scala spark-shell, I have access to an Elasticsearch db using the elasticsearch-hadoop-5.5.0 connector.
I generate my RDD by passing the following command in the spark-shell:
val myRdd = sc.esRDD("myIndex/type", myESQuery)
myRdd contains 2.1 million records across 15 partitions. I have been trying to write all the data to text file(s) on my local disk, but when I try to run operations that convert the RDD to an array, like myRdd.collect(), I overload my Java heap.
Is there a way to export the data incrementally (e.g. 100k records at a time) so that I never overload my system memory?
When you use saveAsTextFile you can pass your file path as "file:///path/to/output" to have it saved locally.
Another option is to use rdd.toLocalIterator, which will allow you to iterate over the RDD on the driver. You can then write each line to a file. This method avoids pulling all the records in at once.
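A quick PySpark sketch of those two options (the paths are illustrative, and myRdd stands in for the RDD built above):
# Option 1: let Spark write the partitions as text files to local disk.
myRdd.saveAsTextFile("file:///path/to/output")

# Option 2: stream records to the driver one partition at a time
# and write them to a single local file, never holding the full RDD in memory.
with open("/path/to/output.txt", "w") as f:
    for record in myRdd.toLocalIterator():
        f.write(f"{record}\n")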
In case someone needs to do this in PySpark (to avoid overwhelming their driver), here's a complete example:
# ========================================================================
# Convenience functions for generating DataFrame Row()s w/ random ints.
# ========================================================================
import random
from pyspark.sql import Row

NR, NC = 100, 10  # Number of Row()s; number of columns.
fn_row = lambda x: Row(*[random.randint(*x) for _ in range(NC)])
fn_df = (lambda x, y: spark.createDataFrame([fn_row(x) for _ in range(NR)])
                      .toDF(*[f'{y}{c}' for c in range(NC)]))
# ========================================================================
Generate a DataFrame with 100-Rows of 10-Columns; containing integer values between [1..100):
>>> myDF = fn_df((1,100),'c')
>>> myDF.show(5)
+---+---+---+---+---+---+---+---+---+---+
| c0| c1| c2| c3| c4| c5| c6| c7| c8| c9|
+---+---+---+---+---+---+---+---+---+---+
| 72| 88| 74| 81| 68| 80| 45| 32| 49| 29|
| 78| 6| 55| 2| 23| 84| 84| 84| 96| 95|
| 25| 77| 64| 89| 27| 51| 26| 9| 56| 30|
| 16| 16| 94| 33| 34| 86| 49| 16| 21| 86|
| 90| 69| 21| 79| 63| 43| 25| 82| 94| 61|
+---+---+---+---+---+---+---+---+---+---+
Then, using DataFrame.toLocalIterator(), "stream" the DataFrame Row by Row, applying whatever post-processing is desired. This avoids overwhelming Spark driver memory.
Here, we simply print() the Rows to show that each is the same as above:
>>> it = myDF.toLocalIterator()
>>> for _ in range(5): print(next(it)) # Analogous to myDF.show(5)
>>>
Row(c0=72, c1=88, c2=74, c3=81, c4=68, c5=80, c6=45, c7=32, c8=49, c9=29)
Row(c0=78, c1=6, c2=55, c3=2, c4=23, c5=84, c6=84, c7=84, c8=96, c9=95)
Row(c0=25, c1=77, c2=64, c3=89, c4=27, c5=51, c6=26, c7=9, c8=56, c9=30)
Row(c0=16, c1=16, c2=94, c3=33, c4=34, c5=86, c6=49, c7=16, c8=21, c9=86)
Row(c0=90, c1=69, c2=21, c3=79, c4=63, c5=43, c6=25, c7=82, c8=94, c9=61)
And if you wish to "stream" DataFrame Rows to a local file, perhaps transforming each Row along the way, you can use this template:
>>> it = myDF.toLocalIterator()  # Refresh the iterator here.
>>> with open('/tmp/output.txt', mode='w') as f:
...     for row in it:
...         print(row, file=f)