How to get keys from pyspark SparseVector - pyspark

I conducted a tf-idf transform and now I want to get the keys and values from the result.
I am using the following UDF code to get the values:

def extract_values_from_vector(vector):
    return vector.values.tolist()

extract_values_from_vector_udf = udf(lambda vector: extract_values_from_vector(vector), ArrayType(DoubleType()))
extract = rescaledData.withColumn("extracted_values", extract_values_from_vector_udf("features"))
So if the SparseVector looks like:
features=SparseVector(123241, {20672: 4.4233, 37393: 0.0, 109847: 3.7096, 118474: 5.4042})
extracted_values in my extract will look like:
[4.4233, 0.0, 3.7096, 5.4042]
My question is: how can I get the keys of the SparseVector dictionary, such as keys = [20672, 37393, 109847, 118474]?
I am trying the following code, but it doesn't work:

def extract_keys_from_vector(vector):
    return vector.indices.tolist()

extract_keys_from_vector_udf = spf.udf(lambda vector: extract_keys_from_vector(vector), ArrayType(DoubleType()))

The result it gives me is: [null, null, null, null]
Can someone help?
Many thanks in advance!

Since the answer is in the comments above, I thought that I would take this time (while waiting to write a parquet of course) to write down the answer.
from pyspark.sql.types import *
from pyspark.sql import functions as F

def extract_keys_from_vector(vector):
    return vector.indices.tolist()

feature_extract = F.UserDefinedFunction(lambda vector: extract_keys_from_vector(vector), ArrayType(IntegerType()))
df = df.withColumn("features", feature_extract(F.col("features")))

Related

Inconsistent behavior of pyspark code depending on order of line execution

I am running the code through Jupyter on EMR, PySpark version 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two dataframes is the same, and the code to generate them is identical (other than the two features not being included in the VectorAssembler inputCols for df_bad).
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression
def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below, it works, and each block completes in 10-20 seconds:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function that applies OLS regression and returns just the model parameters. I used groupby and applyInPandas, but it's taking too much time. Is there a more efficient way to do this?
Note: I didn't really need a groupby (all of my features have many levels), but since applyInPandas cannot be used without one, I created a dummy 'group' column with the same value of 1 for every row.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)

result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])

def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table

# adding a new column to apply groupby
df = df.withColumn('group', lit(1))
# applying the function
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample:

index    coef
x        0.183246073
y        0.770680628
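For reference (not part of the original post), the same fit can be reproduced locally on the small sample with statsmodels alone; this is a quick sanity check of what the applyInPandas version returns, assuming the pdf defined above:

import pandas as pd
import statsmodels.api as sm

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})

# OLS without an intercept, matching the ols() function above
model = sm.OLS(pdf[['z']], pdf[['x', 'y']]).fit()
print(model.params)  # x ~ 0.183246, y ~ 0.770681, matching the output sample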

How UDF function works in pyspark with dates as arguments?

I started in the PySpark world some time ago and I'm racking my brain with an algorithm. Initially I want to create a function that calculates the difference in months between two dates. I know there is a function for that (months_between), but it works a little differently from what I want: I want to extract the months from the two dates and subtract them without taking the days into account, only the month and the year. The point is, I can do this by manipulating the base, creating new columns with the months and subtracting, but I want to do this as a UDF, like below:
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

base_study = spark.createDataFrame([("1", "2009-01-31", "2007-01-31"), ("2", "2009-01-31", "2011-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A", f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B", f.to_date(base_study["B"], 'yyyy-MM-dd'))

def intckSasFunc(RecentDate, PreviousDate):
    RecentDate = f.month("RecentDate")
    PreviousDate = f.month("PreviousDate")
    months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    return months_diff

intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())
base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A')))
What am I doing wrong?
Another question: when I pass parameters to a UDF, are they sent one by one, or is the entire column passed? And is that column a Series?
Thank you!
I found a solution and improved it to handle missing values too.
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

base_study = spark.createDataFrame([("1", None, "2015-01-01"), ("2", "2015-01-31", "2015-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A", f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B", f.to_date(base_study["B"], 'yyyy-MM-dd'))

def intckSasFunc(RecentDate, PreviousDate):
    # both dates must be present, otherwise return None (null)
    if RecentDate is not None and PreviousDate is not None:
        return (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    else:
        return None

intckSasFuncUDF = f.udf(lambda x, y: intckSasFunc(x, y), IntegerType())

display(base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A'))))
For those who have the same doubts I had: the function processes one record at a time, as if it were a normal Python function. I couldn't use pyspark.sql functions inside this UDF; it raises an error. Those functions apparently operate only on PySpark columns, while inside the UDF the transformation happens row by row.
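For comparison, here is a minimal sketch (not part of the original answer) of the same month difference computed with built-in column functions instead of a UDF; it assumes the base_study DataFrame defined above, and nulls propagate automatically, so the missing-value case is handled as well:

# Column-expression alternative to the UDF above (assumes base_study from the answer)
months_diff = (f.year(f.col('B')) - f.year(f.col('A'))) * 12 + (f.month(f.col('B')) - f.month(f.col('A')))
base_study.withColumn('Result', months_diff).show()
# Any null in A or B yields a null Result, mirroring the UDF's missing-value handling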

Split one row into multiple rows of dataframe

I want to convert one row of a dataframe into multiple rows. If the hour is the same, the row should not be split, but if the hours differ, the row should be split into multiple rows according to the difference in hours. I am fine with a solution using dataframe functions or a Hive query.
Input Table or Dataframe
Expected Output Table or Dataframe
Please help me find a way to produce the expected output.
The easiest solution for such a simple schema is to use Dataset.flatMap after defining case classes for the input and output schema.
A simple UDF solution would return a sequence and then you can use functions.explode. Far less clean and efficient than using flatMap.
Last but not least, you could create your own table-generating UDF but that would be extreme overkill for this problem.
You can implement your own logic inside the map operation and use flatMap to achieve this.
The following is a crude way in which I have implemented the solution; you can improve it as per your needs.
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.time.{Duration, LocalDateTime}
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer
import sparkSession.sqlContext.implicits._

val df = Seq(("john", "2/9/2018", "2/9/2018 5:02", "2/9/2018 5:12"),
  ("smit", "3/9/2018", "3/9/2018 6:12", "3/9/2018 8:52"),
  ("rick", "4/9/2018", "4/9/2018 23:02", "5/9/2018 2:12")
).toDF("UserName", "Date", "start_time", "end_time")

val rdd = df.rdd.map(row => {
  val result = new ArrayBuffer[Row]()
  val formatter1 = DateTimeFormatter.ofPattern("d/M/yyyy H:m")
  val formatter2 = DateTimeFormatter.ofPattern("d/M/yyyy H:mm")
  val d1 = LocalDateTime.parse(row.getAs[String]("start_time"), formatter1)
  val d2 = LocalDateTime.parse(row.getAs[String]("end_time"), formatter1)
  if (d1.getHour == d2.getHour) result += row
  else {
    val hoursDiff = Duration.between(d1, d2).toHours.toInt
    result += Row.fromSeq(Seq(
      row.getAs[String]("UserName"),
      row.getAs[String]("Date"),
      row.getAs[String]("start_time"),
      d1.plus(1, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
    for (index <- 1 until hoursDiff) {
      result += Row.fromSeq(Seq(
        row.getAs[String]("UserName"),
        row.getAs[String]("Date"),
        d1.plus(index, ChronoUnit.HOURS).withMinute(0).format(formatter1),
        d1.plus(1 + index, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
    }
    result += Row.fromSeq(Seq(
      row.getAs[String]("UserName"),
      row.getAs[String]("Date"),
      d2.withMinute(0).format(formatter2),
      row.getAs[String]("end_time")))
  }
  result
}).flatMap(_.toIterator)

rdd.collect.foreach(println)
and finally, your result is as follows:
[john,2/9/2018,2/9/2018 5:02,2/9/2018 5:12]
[smit,3/9/2018,3/9/2018 6:12,3/9/2018 7:00]
[smit,3/9/2018,3/9/2018 7:0,3/9/2018 8:00]
[smit,3/9/2018,3/9/2018 8:00,3/9/2018 8:52]
[rick,4/9/2018,4/9/2018 23:02,5/9/2018 0:00]
[rick,4/9/2018,5/9/2018 0:0,5/9/2018 1:00]
[rick,4/9/2018,5/9/2018 1:0,5/9/2018 2:00]
[rick,4/9/2018,5/9/2018 2:00,5/9/2018 2:12]

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrence from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int, Map[Int, Int]]() // of the form Map(phrase -> Map(phrasePosition -> word))
val words = ArrayBuffer("word_1", "word_2", "word_3", ..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1", "$phrase,$phrasePosition_2", ..."$phrase,$phrasePosition_n")

var matrix = Nd4j.create(windows.length * words.length).reshape(windows.length, words.length)

for (row <- 0 until matrix.shape(0)) {
  for (column <- 0 until matrix.shape(1)) {
    // +1 to (row, column) if the word occurs at the phrase, phrasePosition indicated by window_n.
  }
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark and use its implementation of PCA etc., so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window: String, word_1: Double, word_2: Double, ...etc)

val dfSeq = ArrayBuffer[Row]()
for (row <- 0 until matrix.shape(0)) {
  dfSeq += Row(windows(row), matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window", "word_1", "word_2", ...etc)
but the number of windows and words is determined at runtime. I'm looking for a Windows x Words org.apache.spark.sql.DataFrame as output; the input is a Windows x Words org.nd4j.linalg.api.ndarray.INDArray.
Thanks in advance for any help you can offer.
Ok, so after several days' work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea, for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like guava, the .data() method brings everything onto the heap, which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use HDFS etc. to handle the library itself.
I did also consider using Breeze which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors, Vector, Matrix, DenseMatrix, DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))

// iterate through 'rows' and 'columns' to fill it:
for (row <- 0 until windows.length) {
  for (column <- 0 until words.length) {
    // rowSeq(row)(column) += 1 if the word occurs at the phrase, phrasePosition indicated by window_n.
  }
}

// create a Spark DenseMatrix
val rows: Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length, words.length, rows)
One of the main operations that I needed Nd4j for was matrix.T.dot(matrix), but it turns out that you can't multiply two matrices of type org.apache.spark.mllib.linalg.DenseMatrix together; one of them (A) has to be an org.apache.spark.mllib.linalg.distributed.RowMatrix and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:

val rowMatrix: RowMatrix = transposeAndDotDenseMatrix(matrix)
// get a DataFrame from the RowMatrix via a DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt, rowMatrix.numCols().toInt, rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")

// then separate the columns:
val df2 = (0 until words.length).foldLeft(df)((df, num) =>
  df.withColumn(words(num), $"Rows".getItem(num)))
  .drop("Rows")
Would love to hear improvements and suggestions on this, thanks.