How to use forEachPartition on pyspark dataframe? - pyspark

I am trying to use the foreachPartition() method with PySpark on an RDD that has 8 partitions. My custom function tries to generate a string output for a given string input. Here is the code:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import pandas as pd
import datetime
def compute_sentiment_score(text):
    client = language.LanguageServiceClient()
    document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT, language='en')
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return str(sentiment.score)

def compute_sentiment_magnitude(text):
    client = language.LanguageServiceClient()
    document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT, language='en')
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return str(sentiment.magnitude)
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/path-to-file.json"
imdb_reviews = pd.read_csv('imdb_reviews.csv', header=None, names=['input1', 'input2'], encoding= "ISO-8859-1")
imdb_reviews.head()
input1 input2
0 first think another Disney movie, might good, ... 1
1 Put aside Dr. House repeat missed, Desperate H... 0
2 big fan Stephen King's work, film made even gr... 1
3 watched horrid thing TV. Needless say one movi... 0
4 truly enjoyed film. acting terrific plot. Jeff... 1
spark_imdb_reviews = spark.createDataFrame(imdb_reviews) # create spark dataframe
spark_imdb_reviews.printSchema()
root
|-- input1: string (nullable = true)
|-- input2: long (nullable = true)
My custom functions are compute_sentiment_score and compute_sentiment_magnitude, shown above.
Here is how I try to use the foreachPartition() method -
create_rdd = spark_imdb_reviews.select("input1").rdd # create RDD
print(create_rdd.getNumPartitions()) # print the partitions
print(create_rdd.take(1)) # display data
new_rdd = create_rdd.foreachPartition(compute_sentiment_score) # compute score
Which gives this output and an error -
8
[Row(input1="first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play part better. Christopher Lloyd hilarious perfect part. Tony Danza believable Mel Clark. can't help, enjoy movie! give 10/10!")]
File "<ipython-input-106-e3fd65ce75cc>", line 3, in compute_sentiment_score
TypeError: <itertools.chain object at 0x11ab7f198> has type itertools.chain, but expected one of: bytes, unicode

There are two similar functions:
RDD.foreachPartition and
RDD.mapPartitions
Both functions expect another function as a parameter (here compute_sentiment_score). That function gets the content of a partition passed to it in the form of an iterator. So the text parameter in the question is actually an iterator that can be used inside compute_sentiment_score.
The difference between foreachPartition and mapPartitions is that foreachPartition is a Spark action while mapPartitions is a transformation. This means the code called by foreachPartition is executed immediately and the RDD remains unchanged, while mapPartitions can be used to create a new RDD. To store the calculated sentiment score, mapPartitions should be used.
def compute_sentiment_score(itr_text):
    # set up the things that are expensive and should be prepared only once per partition
    client = language.LanguageServiceClient()
    # run the loop for each row of the partition
    for text in itr_text:
        # 'text' is a Row; use the attribute matching your column name
        # (text.value here, text.input1 for the question's DataFrame)
        document = types.Document(content=text.value, type=enums.Document.Type.PLAIN_TEXT, language='en')
        sentiment = client.analyze_sentiment(document=document).document_sentiment
        yield (text.value, sentiment.score)

df_with_score = df.rdd.mapPartitions(compute_sentiment_score)
df_with_score.foreach(print)
In this example, client = language.LanguageServiceClient() is called once per partition. The number of partitions probably has to be reduced, for example with coalesce, so that only a few clients are created.
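If the scores should end up in a DataFrame again, here is a minimal sketch of how the pieces could fit together (my sketch, not part of the original answer; it assumes compute_sentiment_score reads text.input1 instead of text.value, yields (text, score) tuples, and that a SparkSession is active):
# illustrative wiring only, based on the question's spark_imdb_reviews DataFrame
scored_rdd = (spark_imdb_reviews
              .select("input1")
              .rdd
              .coalesce(2)  # fewer partitions -> fewer LanguageServiceClient instances
              .mapPartitions(compute_sentiment_score))
df_with_score = scored_rdd.toDF(["input1", "sentiment_score"])  # back to a DataFrame
df_with_score.show(5)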

Related

Resiliency property of RDD

I have started studying PySpark and for some reason I'm not able to get the concept of the resiliency property of RDDs. My understanding is that an RDD is a data structure like a DataFrame in pandas and is immutable. But I wrote the code shown below and it works.
file = sc.textFile('file name')
filterData = file.map(lambda x: x.split(','))
filterData = filterData.reduceByKey(lambda x,y: x+y)
filterData = filterData.sortBy(lambda x: x[1])
result = filterData.collect()
Doesn't this violate the immutability property? As you can see, I'm modifying the same RDD again and again.
The file is a CSV with 2 columns: column 1 is an id and column 2 is just some integer.
Can you guys please explain where I'm going wrong with my understanding?
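A quick way to see what is actually happening (an illustrative sketch, not part of the original question): each transformation returns a brand-new RDD object and the Python variable is simply rebound to it; none of the earlier RDDs are mutated.
file = sc.textFile('file name')
mapped = file.map(lambda x: x.split(','))
# two different RDDs with different ids; 'file' itself was never changed
print(file.id(), mapped.id())
print(file.first())  # the original RDD is still fully usable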

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach 1:
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach 2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one uses map and then reduce, whereas the second one filters and does a count.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line=>line.contains(keyword)).isEmpty()
Benefit: the search can be stopped once one occurrence of the keyword has been found.
see also Spark: Efficient way to test if an RDD is empty
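Since the thread starts out in PySpark, a rough Python equivalent of that answer (my sketch; it assumes the same file and keyword as in the question) would be:
keyword = "my_keyword"
rdd = sparkContext.textFile("test_file.txt")
# isEmpty() only has to find a single matching element, so it can stop early
word_found = not rdd.filter(lambda line: keyword in line).isEmpty()
print("Found" if word_found else "Not Found")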

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrence from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int,Map[Int,Int]]() //of the form Map(phrase -> Map(phrasePosition -> word))
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
var matrix = Nd4j.create(windows.length*words.length).reshape(windows.length,words.length)
for (row <- 0 until matrix.shape(0)) {
  for (column <- 0 until matrix.shape(1)) {
    // +1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark and use its implementation of PCA etc., so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- 0 until matrix.shape(0)) {
  dfSeq += Row(windows(row), matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a Windows x Words org.apache.spark.sql.DataFrame as output; the input is a Windows x Words org.nd4j.linalg.api.ndarray.INDArray.
Thanks in advance for any help you can offer.
Ok, so after several days' work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like guava, the .data() method brings everything onto the heap, which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use hdfs etc. to handle the library itself.
I also considered using Breeze, which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row <- 0 until windows.length) {
  for (column <- 0 until words.length) {
    // rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
//create a Spark DenseMatrix (it is column-major, hence the transpose before flattening)
val rows : Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length, words.length, rows)
One of the main operations that I needed Nd4j for was matrix.T.dot(matrix), but it turns out that you can't multiply two matrices of type org.apache.spark.mllib.linalg.DenseMatrix together; one of them (A) has to be an org.apache.spark.mllib.linalg.distributed.RowMatrix, and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix : RowMatrix = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt, rowMatrix.numCols().toInt, rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length)
  .foldLeft(df)((df, num) => df.withColumn(words(num), $"Rows".getItem(num)))
  .drop("Rows")
Would love to hear improvements and suggestions on this, thanks.

Spark flushing Dataframe on show / count

I am trying to print the count of a DataFrame, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure. Otherwise the results are simply non-deterministic. While an escape hatch is provided in the latest versions (it is possible to hint to Spark that the function shouldn't be re-executed), that won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go - Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on such an object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
Your DataFrame is a distributed dataset, and mutating state such as change_cnt from inside a UDF gives unpredictable results, since each node works on its own copy of the closure and the UDF can be re-executed every time the DataFrame is re-evaluated (count(), show(), etc.). Read the documentation about RDD closures below. It is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd
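The classic illustration of that closures pitfall, shown here in PySpark since the thread started there (this is essentially the counter example from the linked guide, not code from this question):
counter = 0
rdd = sc.parallelize(range(10))

# WRONG: each executor increments its own copy of 'counter';
# the updates never make it back to the driver
def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)
print("Counter value:", counter)  # prints 0 on the driver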

How to use existing trained model using LinearRegressionModel to work with SparkStreaming and predict data with it [duplicate]

I have the following code:
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
Then the blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!
This is a conceptual question...
Imagine you have a big cluster composed of many workers, let's say n workers, and those workers store the partitions of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:
Where will that data be printed out?
What node has priority and what partition?
If all nodes are running in parallel, who will be printed first?
How would this print queue be created?
Those are too many questions, and they are essentially why you don't see the output: a println inside a map/reduce operation runs on the executors, so whatever it prints goes to each executor's stdout, not to your driver console (the same goes for printing from code that updates accumulators or uses broadcast variables).
This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD because they are built to have millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
To see this, you can run this Scala code, for example:
// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// Nothing appears on the driver console: map is a lazy transformation (no action is
// called here), and even when an action runs, the println output goes to the executors
rdd.map(printStuff)

// But you can print on the driver by bringing a few elements back first:
rdd.take(10).foreach(println)
I was able to work around it by making a utility function:
object PrintUtiltity {
  def print(data: String) = {
    println(data)
  }
}