I've just encountered this line of code:
imputer = Imputer(strategy="median", inputCols=imputeCols, outputCols=imputeCols)
From the documentation, I don't understand the role of outputCols. I know that if I remove that optional parameter, then I get an error message related to multiple imputation.
I think what we're doing is rewriting the original columns with this code.
(I've just started to learn pyspark)
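For context, outputCols just names the columns the imputed values get written to, one per entry in inputCols; passing the same list for both means the transformed DataFrame overwrites the original columns, which matches your reading. A minimal sketch (with hypothetical column names and an existing df) that keeps the originals and writes the imputed values to new columns instead:

from pyspark.ml.feature import Imputer

imputeCols = ["age", "income"]  # hypothetical column names

# Distinct output names keep the original columns untouched.
imputer = Imputer(
    strategy="median",
    inputCols=imputeCols,
    outputCols=[c + "_imputed" for c in imputeCols],
)
imputed_df = imputer.fit(df).transform(df)

# Passing the same list for inputCols and outputCols (as in the question)
# simply overwrites the original columns with their imputed versions.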
Related
I am creating unit tests for a Scala application using ScalaTest. I have the actual and expected results as Datasets. When I verified manually, both the data and the schema match between the actual and expected datasets.
Actual Dataset= actual_ds
Expected Dataset = expected_ds
When I execute the command below, it returns false:
assert(actual_ds.equals(expected_ds))
Could anyone suggest what the reason could be? And is there any other built-in function to compare Datasets in Scala?
Use one of the libraries designed for Spark tests: spark-fast-tests, spark-testing-base, or spark-test.
They are quite easy to use, and with their help it's easy to compare two datasets and get a nicely formatted message in the output.
You may start with spark-fast-tests (usage is covered in its README) and check the others if it does not suit your needs (for example, if you need different output formatting).
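A minimal sketch with spark-fast-tests, assuming the library is on your test classpath and that actual_ds / expected_ds are built inside the test (names follow its README, so double-check them against the version you use):

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.scalatest.funsuite.AnyFunSuite

class DatasetEqualitySpec extends AnyFunSuite with DatasetComparer {
  test("actual dataset matches expected dataset") {
    // build actual_ds and expected_ds here, then:
    // compares schemas and rows, and prints a readable diff when they differ
    assertSmallDatasetEquality(actual_ds, expected_ds)
  }
}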
That .equals() is inherited from Java's Object.equals (Dataset does not override it), so it compares references and it is expected that the assert fails.
I would start testing the two datasets with:
assert(actual_ds.schema == expected_ds.schema)
assert(actual_ds.count() == expected_ds.count())
And then checking this question: DataFrame equality in Apache Spark
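Beyond those two asserts, one common follow-up check is a symmetric except, which surfaces rows present on only one side:

// isEmpty on Dataset is available from Spark 2.4; use count() == 0 on older versions
val onlyInActual   = actual_ds.except(expected_ds)
val onlyInExpected = expected_ds.except(actual_ds)
assert(onlyInActual.isEmpty && onlyInExpected.isEmpty)

Note that except is set-based, so the count() check above is still needed to catch duplicate rows.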
I am getting the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
on this line:
result = df.select('student_age').rdd.flatMap(lambda x: x).collect()
'student_age' is a column name. It was running fine until last week, but now I get this error.
Does anyone have any insights on that?
Using collect is dangerous for this very reason: it pulls the entire result back to the driver, so it's prone to OutOfMemory errors. I suggest removing it. You also do not need to use an RDD for this; you can do it with a DataFrame:
from pyspark.sql.functions import explode

result = df.select(explode(df['student_age']))  # returns a DataFrame (explode expects an array/map column)
# write the rest of the code against the DataFrame instead of a collected array
If nothing else changed, the data likely did, and it finally outgrew what fits in memory.
It's also possible that you have new 'bad' data that is throwing an error.
Either way, you could likely prove it by finding the OOM in the logs, or prove the data is bad by printing it:
def f(row):
    print(row.student_age)

result.foreach(f)  # used for simple stuff that doesn't require heavy initialization
If that works, you may want to break your code down to use foreachPartition. This will let you do the math on each value in the memory of each executor. The only trick is that, because the function below runs on the executors, you cannot reference anything inside it that uses the SparkContext (plain Python code only, not PySpark).
def f(rows):
    # initialize a database connection here
    for row in rows:
        print(row.student_age)  # do stuff with student_age
    # close the database connection here

result.foreachPartition(f)  # used for things that need heavy initialization
Spark foreachPartition vs foreach | what to use?
This issue is solved; here is the answer:
result = [i[0] for i in df.select('student_age').toLocalIterator()]
Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is very simple: I have to perform all of the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is that the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I try to do transformations on orderItemFile1, for the filter/map operations I am able to access fields with _.order_id, but the same does not work with groupBy:
case class orderItemDetails(order_id_order_item: Int, item_desc: String, qty: Int, sale_value: Int)

val orderItemFile1 = ss.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]

orderItemFile1.filter(_.order_id_order_item > 100) // works fine
orderItemFile1.map(_.order_id_order_item.toInt)    // works fine

// Error. Inside groupBy I am unable to access it as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)

// The line below works, but how does this provide the compile-time safety promised
// by the Dataset API? I can pass any wrong column name here as well, and it will be
// caught only at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
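A rough sketch of what that could look like for your case class (assuming import ss.implicits._ is in scope; the per-group qty sum is only for illustration):

// groupByKey keeps the typed API: _.order_id_order_item is checked at compile time,
// unlike a column name passed as a string.
val grouped = orderItemFile1.groupByKey(_.order_id_order_item)

// simple per-key count
val countsPerOrder = grouped.count()

// or compute something per group with mapGroups, e.g. total qty per order
val qtyPerOrder = grouped.mapGroups { (orderId, items) =>
  (orderId, items.map(_.qty).sum)
}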
My understanding of how Spark distributes code to the nodes running it is only cursory, and I cannot get my code to run successfully within Spark's mapPartitions API when I want to instantiate a class for each partition, with an argument.
The code below worked perfectly, up until I evolved the class MyWorkerClass to require an argument:
val result: DataFrame =
  inputDF.as[Foo].mapPartitions(sparkIterator => {
    // (1) initialize heavy class instance once per partition
    val workerClassInstance = MyWorkerClass(bar)
    // (2) provide an iterator using a function from that class instance
    new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
  })
The code above worked perfectly well up to the point when I had (or chose) to add a constructor argument to my class MyWorkerClass. The passed argument value turns out to be null in the worker, instead of the real value of bar. Somehow the serialization of the argument fails to work as intended.
How would you go about this?
Additional Thoughts/Comments
I'll avoid adding the bulky code of CloseableIteratorForSparkMapPartitions; it merely provides a Spark-friendly iterator and may not even be the most elegant implementation at that.
As I understand it, the constructor argument is not being correctly passed to the Spark worker because of how Spark captures state when serializing the closure to send for execution on the worker. However, instantiating the class does seamlessly make the heavy-to-load assets inside it available to the function provided on the last line of my code above, and the class did seem to be instantiated once per partition, which is actually a valid, if not the key, use case for using mapPartitions instead of map.
It's the passing of an argument to its instantiation that I am having trouble enabling or working around. In my case this argument is a value only known after the program has started running (even though it is invariant throughout a single execution of my job; it's actually a program argument). I do need it passed along for the initialization of the class.
I tried tinkering with this by providing a function that instantiates MyWorkerClass with its input argument, rather than instantiating it directly as above, but that did not solve matters.
The root symptom is not an exception, but simply that the value of bar when MyWorkerClass is instantiated is null, instead of the actual value of bar that is known in the scope of the code surrounding the snippet included above!
* one related old Spark issue discussion here
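A self-contained toy sketch of the workaround usually suggested for this kind of symptom: copy the runtime argument into a local val right before the mapPartitions call, so the closure captures the value itself rather than a field of an enclosing object (which may be reconstructed, rather than serialized, on the executors). All names below are illustrative, not taken from the question:

import org.apache.spark.sql.{Dataset, SparkSession}

// toy per-partition worker that needs a constructor argument known only at runtime
class Worker(val bar: String) extends Serializable {
  def process(n: Int): String = s"$bar:$n"
}

object CaptureLocalValue {
  def run(spark: SparkSession, ds: Dataset[Int], bar: String): Dataset[String] = {
    import spark.implicits._

    // copy the argument into a local val so the closure serializes the value,
    // not a reference to whatever object `bar` originally belonged to
    val localBar = bar

    ds.mapPartitions { it =>
      val worker = new Worker(localBar)  // one instance per partition
      it.map(worker.process)
    }
  }
}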
I am trying to implement an ML pipeline in Spark using Scala, and I used the sample code available on the Spark website. I am converting my RDD[LabeledPoint] into a DataFrame using the functions available in the SQLContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter, such as Vector[Double], Vector[String], etc. You just need to specify the type of data your Vector will store.
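For example, with the plain Scala collection type that comment refers to:

// the element type has to be spelled out (or inferable) for a Scala Vector
val prob: Vector[Double] = Vector(0.1, 0.2, 0.7)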
As a side note: the single-argument overloaded version of createDataFrame() you use appears to be experimental; keep that in mind if you are planning to use it in a long-term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.
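For reference, a sketch along the lines of the spark.ml documentation example (training_df is assumed to be your training DataFrame with the columns these stages expect, e.g. "text" and "label"):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// the Pipeline only has something to fit once its stages are set
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training_df)  // training_df comes from the question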