How can I normalize a DataFrame in PySpark? - pyspark

I'm trying to normalize a user-item matrix, and I want to use this formula:
(df.values - df.values.min()) / (df.values.max() - df.values.min())
but applied to a Spark DataFrame.

You can wrap the logic in a function and reuse it elsewhere. A Spark DataFrame has no .values attribute, so the min and max have to be computed with an aggregation first (the column name passed in below, e.g. "rating", is just an example):
import pyspark.sql.functions as F

def compute_function(df, col_name):
    # global min and max of the column, then (x - min) / (max - min) per row
    stats = df.agg(F.min(col_name).alias("mn"), F.max(col_name).alias("mx")).first()
    return df.withColumn(
        "new_column",
        (F.col(col_name) - F.lit(stats["mn"])) / (F.lit(stats["mx"]) - F.lit(stats["mn"]))
    )

df = compute_function(df, "rating")
df.show()
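For illustration, here is a minimal, self-contained run of the function above on a made-up DataFrame; the user, item and rating column names and the values are my own assumptions, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy user-item ratings (assumed schema), just to show the output range.
toy = spark.createDataFrame(
    [(1, "a", 2.0), (1, "b", 5.0), (2, "a", 8.0)],
    ["user", "item", "rating"],
)
compute_function(toy, "rating").show()
# new_column is (rating - 2.0) / (8.0 - 2.0), i.e. 0.0, 0.5 and 1.0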

Related

Convert breeze.linalg.DenseMatrix[Double] to a DataFrame in Scala

I have a breeze.linalg.DenseMatrix[Double] as follows; I want to convert it to a DataFrame:
breeze.linalg.DenseMatrix[Double] =
0.009748169568491553 3.04248345416453E-4 -0.0018493112842201912 8.200326863261204E-4
3.0424834541645305E-4 0.00873118653317929 6.352723194418622E-4 1.84118791655692E-5
-0.001849311284220191 6.35272319441862E-4 0.008553284420541575 -6.407982513791382E-4
8.200326863261203E-4 1.8411879165568983E-5 -6.407982513791378E-4 0.008413484758510377
Is there any way I can do that?
After a couple of tries, I was able to create a DataFrame that contains the flattened information of the matrix and create a temp view, so that it can be accessed from Python as a DataFrame.
In Scala:
// covarianceMatrix (in scala)
val c = covarianceMatrix.toArray.toSeq
val covarianceMatrix_df = c.toDF("number")
covarianceMatrix_df.createOrReplaceTempView("covarianceMatrix_df")
In Python:
import numpy as np

covarianceMatrix_df = spark.sql('''SELECT * FROM covarianceMatrix_df''')
covarianceMatrix_pd = covarianceMatrix_df.toPandas()
nrows = int(np.sqrt(len(covarianceMatrix_pd)))
covarianceMatrix_pd = covarianceMatrix_pd.to_numpy().reshape((nrows, nrows))
covarianceMatrix_pd
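One caveat worth flagging as an aside (this rests on my understanding of Breeze internals, so verify it for your version): Breeze stores a DenseMatrix in column-major order, so the flattened values come back column by column, while NumPy's reshape defaults to row-major. For a symmetric covariance matrix the two layouts coincide, but for a general matrix you would reshape in Fortran order, roughly like this:

import numpy as np

flat = covarianceMatrix_df.toPandas().to_numpy().ravel()
n = int(np.sqrt(len(flat)))
# order='F' reads the flat values column by column, matching Breeze's
# assumed column-major storage; for a symmetric matrix this is
# equivalent to the row-major reshape above.
matrix = flat.reshape((n, n), order="F")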

Length of dataframe inside UDF function

I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p} // actually a more complex function, but let's keep it simple
The third parameter, cumsum_p, is a cumulative sum of p, where p is derived from the length of the group it is computed over, because this UDF will then be used in a group-by.
I came up with this solution, which is almost OK:
val uudf = udf{(value: Int, lag: Int, cumsum_p: Double) => value + lag + cumsum_p}
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
  uudf(col("sale_qty"),
       lead("sale_qty", 1).over(w),
       sum(lit(1 / length_group)).over(w))
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")), the created column contains only 1 element, which leads to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
  .withColumn("cnt", count($"sale_qty").over())
  .withColumn("needThat",
    uudf(col("sale_qty"),
         lead("sale_qty", 1).over(w),
         sum(lit(1) / $"cnt").over(w))
  ).show()
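Since the surrounding page is about PySpark, here is a rough PySpark sketch of the same fix, with the UDF's toy arithmetic inlined as plain column expressions; the column names follow the question, everything else is illustrative:

from pyspark.sql import Window, functions as F

w = Window.orderBy("sale_qty")
result = (
    df
    # materialize the group size first, over an all-rows window
    .withColumn("cnt", F.count("sale_qty").over(Window.partitionBy()))
    .withColumn(
        "needThat",
        F.col("sale_qty")
        + F.lead("sale_qty", 1).over(w)
        + F.sum(F.lit(1) / F.col("cnt")).over(w),
    )
)
result.show()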

Transforming a column and update the DataFrame

So, what I'm doing below is dropping column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting DataFrames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd
        .map(lambda x: (x.ID, json.loads(x.A))
             if x.A is not None else (x.ID, None))
        .toDF()
        .withColumnRenamed('_1', 'ID')
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
The thing I dislike about this is, of course, the overhead I incur because I have to do the withColumnRenamed operations.
With pandas, all I'd do is something like this:
import json
import numpy as np
import pandas as pd

pdf = pd.DataFrame([json.dumps([0] * np.random.randint(5, 10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So is there an easier way than what I'm doing in my first code snippet?
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd \
    .map(
        lambda row: tuple(
            [row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
             for c in cols]
        )
    ) \
    .toDF(cols)
*I haven't actually tested this code, but I think this should work.
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
Where my_transformation_function() can be a udf or a pyspark sql function.
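As a concrete (hypothetical) sketch of such a function, a UDF that applies json.loads in place could look like this; json_loads_udf is a name I made up:

import json
from pyspark.sql import functions as F

# Returns the parsed value (as a string-typed column here, since no
# return type is declared), or None when the input is null.
json_loads_udf = F.udf(lambda s: json.loads(s) if s is not None else None)
df = df_data.withColumn("A", json_loads_udf(F.col("A")))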
From what I could understand, is it something like this you are trying to achieve?
import pyspark.sql.functions as F
import json
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)
cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] +
                    [col for col in cols if col != 'A'])

How to convert a DataFrame with String into a DataFrame with Vectors in Scala (Spark 2.0)

I have a DataFrame with a column named KFA containing a string with square brackets on both ends. There are 4 double values inside this string. I would like to convert it into a DataFrame of vectors.
This is the first element of the DataFrame:
> dataFrame1.first()
res130: org.apache.spark.sql.Row = [[.00663 .00197 .29809 .0034]]
Could you help me convert it into a dense vector with 4 double values?
I have tried this command
dataFrame1.select("KFA")
.map((x=>x.mkString("").replace("]","").replace("[","").split(" ")))
.rdd.map(x=>Vectors.dense(x(0).toDouble,x(1).toDouble,x(2).toDouble,x(3).toDouble,x(4).toDouble))
This looks very clumsy and unreadable. Could you suggest any other ways of doing this?
Here is an option with Regular expression:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val p = "[.0-9]+".r
val rddVec = dataFrame1.select("KFA")
  .rdd
  .map(x => Vectors.dense(p.findAllIn(x(0).toString).map(_.toDouble).toArray))
// rddVec: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[49] at map at <console>:39

rddVec.collect
// res43: Array[org.apache.spark.mllib.linalg.Vector] =
//   Array([0.00663,0.00197,0.29809,0.0034], [0.00663,0.00197,0.29809,0.0034])
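For reference, here is a comparable sketch in PySpark, assuming KFA is a plain string column formatted like the sample above; the regex and the approach mirror the Scala answer, the names are mine:

import re
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

pattern = re.compile(r"[.0-9]+")
# Parse every number in the bracketed string into a DenseVector.
to_vector = F.udf(lambda s: Vectors.dense([float(v) for v in pattern.findall(s)]),
                  VectorUDT())
vec_df = dataFrame1.select(to_vector(F.col("KFA")).alias("KFA_vec"))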

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my DataFrame. I was trying out the map function, but I believe it loops through each row of a DataFrame.
c) There is a similar question on SO - here. And while I liked the solution (using aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher-order function like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
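A tiny, self-contained illustration of the Python version; the data and the column names a and b are made up:

toy = spark.createDataFrame(
    [(1.0, None), (2.0, 4.0), (None, 6.0)],
    "a double, b double",
)
imputer = Imputer(inputCols=toy.columns,
                  outputCols=["{}_imputed".format(c) for c in toy.columns])
# a_imputed fills the missing a with 1.5, b_imputed fills the missing b with 5.0
imputer.fit(toy).transform(toy).show()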
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2:
# filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]

# compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill:
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k, v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs
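The same four steps can be wrapped into a small reusable helper (Python 3; the function name fill_with_mean is mine, and it assumes all columns are numeric, as above):

from pyspark.sql import DataFrame

def fill_with_mean(df: DataFrame) -> DataFrame:
    # Steps 1-2: compute the per-column means in a single aggregation.
    col_avgs = df.agg({c: "mean" for c in df.columns}).collect()[0].asDict()
    # Step 3: strip the surrounding "avg(...)" from the generated column names.
    col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
    # Step 4: fill missing values with the column means.
    return df.fillna(col_avgs)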