MyPy on PySpark error: "DataFrameLike" has no attribute "values" - pyspark

I am developing a program in PySpark 3.2.1.
Mypy == 0.950
One of the operations requires transforming the contents of a small DataFrame into a list.
The code is:
result = df.select("col1","col2","col3").toPandas().values.tolist()
I need to convert it to a list because I then broadcast the information, and a PySpark broadcast variable can't hold a DataFrame.
For this code I get the following mypy error:
error: "DataFrameLike" has no attribute "values"
Is there something I might do to avoid the mypy error?

This is working fine for me.
>>> df=spark.read.option('header','true').csv("C:/Users/pc/Desktop/myfile.txt")
>>> df
DataFrame[col1: string, col2: string, col3: string]
>>> result = df.select("col1","col2","col3").toPandas().values.tolist()
>>> result
[['1', '100', '1001'], ['2', '200', '2002'], ['3', '300', '1421'], ['4', '400', '24214'], ['5', '500', '14141']]
What is Mypy here?
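To actually avoid the mypy error while keeping the same result, here is a minimal sketch (assuming pandas is installed and df is the DataFrame from the question): either skip pandas entirely, or cast the toPandas() result to the concrete pandas type.
from typing import cast

import pandas as pd

# Option 1: collect through Spark and avoid pandas altogether;
# each Row converts cleanly to a plain Python list.
result = [list(row) for row in df.select("col1", "col2", "col3").collect()]

# Option 2: keep toPandas() but tell mypy the concrete return type,
# since the pyspark annotations only promise the looser DataFrameLike.
pdf = cast(pd.DataFrame, df.select("col1", "col2", "col3").toPandas())
result = pdf.values.tolist()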

Related

Convert udf over multiple columns in scala spark

I have the following code in pyspark which works fine.
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import udf, array
prod_cols = udf(lambda arr: float(arr[0])*float(arr[1]), DoubleType())
finalDf = finalDf.withColumn('click_factor', prod_cols(array('rating', 'score')))
Now I tried similar code in Scala.
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf = finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
Somehow the second code doesn't give the right answers; it always returns null or zero.
Can you help me get the right Scala code? Essentially I just need code to multiply two columns, allowing for the fact that score or rating may be null.
Pass only non-null values to the UDF.
Change the code below
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
to
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf
  .withColumn("rating", $"rating".cast("double")) // Ignore this line if column data type is already double
  .withColumn("score", $"score".cast("double"))   // Ignore this line if column data type is already double
  .withColumn("cl_rate",
    when(
      $"rating".isNotNull && $"score".isNotNull,
      prod_cols($"rating", $"score")
    ).otherwise(lit(null).cast("double"))
  )

Pass RDD in scala function. Output Dataframe

Say I have the below CSV and many more like it.
val csv = sc.parallelize(Array(
"col1, col2, col3",
"1, cat, dog",
"2, bird, bee"))
I would like to apply the below functions to the RDD to convert it to a DataFrame with the desired logic below. I keep running into the compiler error "not found: value DataFrame".
How can I correct this?
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
In most cases I would read CSV files directly as a DataFrame using Spark's core functionality, but I am unable to in this case.
Any/all help is appreciated.
In order not to get "error: not found: value DataFrame", you must add the following import:
import org.apache.spark.sql.DataFrame
and your method declaration should be like this:
def udf(fName: RDD[String]): DataFrame = { ...

Count Distinct in dtypes of a pyspark dataframe

I need a function to get something like this from a pyspark dataframe:
Variables types:
Numeric: 4
Categorical: 4
Date: 1
Let's create a dummy DataFrame in our PySpark shell.
>>> rdd = sc.parallelize([['x',1,'y',2,1.1]])
>>> df = spark.createDataFrame(rdd, schema=['Col1','Col2','Col3','Col4','Col5'])
Here are the column types for df
>>> df
DataFrame[Col1: string, Col2: bigint, Col3: string, Col4: bigint, Col5: double]
As per the documentation (https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dtypes), the dtypes attribute on a Spark DataFrame gives you "all column names and their data types as a list".
>>> print(df.dtypes)
[('Col1', 'string'), ('Col2', 'bigint'), ('Col3', 'string'), ('Col4', 'bigint'), ('Col5', 'double')]
You can now leverage the native Python Counter Library to get your desired output
>>> from collections import Counter
>>> data_types = df.dtypes
>>> dict(Counter(dict(data_types).values()))
{'string': 2, 'bigint': 2, 'double': 1}
You should easily be able to convert these two lines into a function that meets your end requirements.
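For example, here is a minimal sketch of such a function that also groups the type names into the three buckets from the question; the mapping of Spark type names to Numeric/Categorical/Date below is an assumption, so adjust it to your own definitions.
from collections import Counter

# Assumed grouping of Spark SQL type names; tweak these sets as needed.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}
DATE_TYPES = {"date", "timestamp"}

def count_variable_types(df):
    """Count columns per high-level group based on df.dtypes."""
    counts = Counter()
    for _, dtype in df.dtypes:
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            counts["Numeric"] += 1
        elif dtype in DATE_TYPES:
            counts["Date"] += 1
        else:
            counts["Categorical"] += 1
    return dict(counts)

print(count_variable_types(df))  # e.g. {'Categorical': 2, 'Numeric': 3} for the dummy df above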
Hope this helps!

Spark UDF not working: how to specify the column on which to apply it?

Let's say I have my DataFrame, with a given column named "X". I want to understand why the first code doesn't work whereas the second one does. To me, the two versions look equivalent.
On the one hand, this doesn't work:
val dataDF = sqlContext
.read
.parquet(input_data)
.select(
"XXX", "YYY", "III"
)
.toDF(
"X", "Y", "I"
)
.groupBy(
"X", "Y"
)
.agg(
sum("I").as("sum_I")
)
.orderBy(desc("sum_I"))
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(dataDF("sum_I")))
.drop("sum_I")
dataDF.show(50, false)
IntelliJ doesn't compile my code and I have the following error:
Error:(88, 67) recursive value dataDF needs type
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(dataDF("sum_I")))
On the other hand, this works if I change the given line to this:
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(col("sum_I")))
All I did was replace the reference to my DataFrame column with the more generic function "col". I don't understand the difference, and especially why it does not accept the first version (the one that goes through the DataFrame's name).
You're trying to use dataDF before you're done defining it: dataDF is the result of the entire expression starting with sqlContext.read and ending with .drop("sum_I"), so you can't use it within that expression.
You can solve this by simply referencing the column without using the DataFrame, e.g. using the col function from org.apache.spark.sql.functions:
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(col("sum_I")))

Convert RDD of Lists to Dataframe

I am trying to convert an RDD of lists to a DataFrame in Spark.
RDD:
['ABC', 'AA', 'SSS', 'color-0-value', 'AAAAA_VVVV0-value_1', '1', 'WARNING', 'No test data for negative population! Re-using negative population for non-backtest.']
['ABC', 'SS', 'AA', 'color-0-SS', 'GG0-value_1', '1', 'Temp', 'After, date differences are outside tolerance (10 days) 95.1% of the time']
This is the content of the RDD: multiple lists.
How can I convert this to a DataFrame? Currently it ends up as a single column, but I need multiple columns.
Dataframe
+--------------+
| _1|
+--------------+
|['ABC', 'AA...|
|['ABC', 'SS...|
Just use Row.fromSeq, together with an explicit schema (an RDD[Row] has no implicit Encoder, so plain toDF will not compile):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType((1 to 8).map(i => StructField(s"_$i", StringType)))
spark.createDataFrame(rdd.map(x => Row.fromSeq(x)), schema)
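Since the sample rows are printed as Python lists, the RDD may well be a PySpark one; if so, a minimal sketch is to let createDataFrame split each inner list into its own column (the long message strings are shortened here).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([
    ['ABC', 'AA', 'SSS', 'color-0-value', 'AAAAA_VVVV0-value_1', '1', 'WARNING', 'msg1'],
    ['ABC', 'SS', 'AA', 'color-0-SS', 'GG0-value_1', '1', 'Temp', 'msg2'],
])

# Each inner list becomes one row; columns are inferred as _1 ... _8.
# Pass a list of names as the schema argument if you want real column names.
df = spark.createDataFrame(rdd)
df.show(truncate=False)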