I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. The vector is actually a tf-idf vector, and I want to reorder it so that in my new dataset the features appear in a different order. Is there any way to do this using Scala?
This is my code for generating the tf-idf vectors:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("rawFeatures")
.fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()
Perhaps this is useful-
Load the test data
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()
/**
* +---------------------+
* |features |
* +---------------------+
* |(5,[1,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|
* |[4.0,0.0,0.0,6.0,7.0]|
* +---------------------+
*
* root
* |-- features: vector (nullable = true)
*/
Shuffle the vector
val shuffleVector = udf((vector: Vector) =>
  // note: this shuffles the values of each row's vector independently
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray)
)
val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()
/**
* +---------------------+---------------------+
* |features |shuffled_vector |
* +---------------------+---------------------+
* |(5,[1,3],[1.0,7.0]) |[1.0,0.0,0.0,0.0,7.0]|
* |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
* |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
* +---------------------+---------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
You can also use the above udf to create a Transformer and put it in a pipeline; a rough sketch follows below.
Please make sure to import org.apache.spark.ml.linalg._ (the ml Vector classes, not mllib) and scala.collection.mutable for WrappedArray.
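A minimal sketch of such a Transformer (the class name VectorShuffler, the hard-coded features/shuffled_features column names, and the absence of configurable params are assumptions for illustration, not part of the original answer):
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper around the shuffle udf so it can sit in a Pipeline
class VectorShuffler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("vectorShuffler"))

  private val shuffleVector = udf((vector: Vector) =>
    Vectors.dense(scala.util.Random.shuffle(vector.toArray.toSeq).toArray)
  )

  // adds a "shuffled_features" column next to the input "features" column
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("shuffled_features", shuffleVector(col("features")))

  override def transformSchema(schema: StructType): StructType =
    schema.add("shuffled_features", SQLDataTypes.VectorType, nullable = true)

  override def copy(extra: ParamMap): VectorShuffler = defaultCopy(extra)
}
Usage would then be, for example, val shuffled = new VectorShuffler().transform(rescaledData), or the instance can be added to a Pipeline's stages.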
Update-1: convert the shuffled vector to sparse
val shuffleVectorToSparse = udf((vector: Vector) =>
Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray).toSparse
)
val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()
/**
* +---------------------+-------------------------------+
* |features |shuffled_vector |
* +---------------------+-------------------------------+
* |(5,[1,3],[1.0,7.0]) |(5,[0,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
* |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0]) |
* +---------------------+-------------------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
Related
I have a column which is a wrapped array of structs, each containing an integer and a double value.
The schema looks like this:
|-- pricing_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
So, whenever this column's value is [[0,0.0]] I need to change it to an empty array.
[[0,0.0]] -> [[]].
How can I do this using a map, or using a DataFrame?
Try this-
spark>=2.4
val df = Seq(Seq((0, 0.0)), Seq((1, 2.2))).toDF("pricing_data")
df.show(false)
df.printSchema()
/**
* +------------+
* |pricing_data|
* +------------+
* |[[0, 0.0]] |
* |[[1, 2.2]] |
* +------------+
*
* root
* |-- pricing_data: array (nullable = true)
* | |-- element: struct (containsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: double (nullable = false)
*/
df.withColumn("pricing_data", expr(
"TRANSFORM(pricing_data, x -> if(x._1=0 and x._2=0.0, named_struct('_1', null, '_2', null), x))"
))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
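If the goal is literally an empty array (rather than a struct whose fields are null), a FILTER-based variant may be closer to what was asked; a sketch for Spark >= 2.4:
// Sketch: drop the (0, 0.0) element entirely instead of nulling its fields
df.withColumn("pricing_data",
  expr("FILTER(pricing_data, x -> NOT (x._1 = 0 AND x._2 = 0.0))"))
  .show(false)
For the first row this leaves an empty array ([]) rather than [[,]].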
spark<2.4
// spark<2.4
import scala.collection.mutable
import org.apache.spark.sql.Row

val dataType = df.schema("pricing_data").dataType
val replace = udf((arrayOfStruct: mutable.WrappedArray[Row]) => {
arrayOfStruct.map(row => {
val map = row.getValuesMap(row.schema.map(_.name))
if(map("_1")==0 && map("_2") == 0.0) {
Row.fromTuple((null, null))
} else row
})
}, dataType)
df.withColumn("pricing_data", replace($"pricing_data"))
.show(false)
/**
* +------------+
* |pricing_data|
* +------------+
* |[[,]] |
* |[[1, 2.2]] |
* +------------+
*/
I have a DataFrame with the following schema :
root
|-- user_id: string (nullable = true)
|-- user_loans_arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- loan_date: string (nullable = true)
| | |-- loan_amount: string (nullable = true)
|-- new_loan: struct (nullable = true)
| |-- loan_date : string (nullable = true)
| |-- loan_amount : string (nullable = true)
I want to use a UDF which takes user_loans_arr and new_loan as inputs and adds the new_loan struct to the existing user_loans_arr. Then, from user_loans_arr, delete all the elements whose loan_date is older than 12 months.
Thanks in advance.
If Spark >= 2.4 then you don't need a UDF; check the example below -
Load the input data
val df = spark.sql(
"""
|select user_id, user_loans_arr, new_loan
|from values
| ('u1', array(named_struct('loan_date', '2019-01-01', 'loan_amount', 100)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100)),
| ('u2', array(named_struct('loan_date', '2020-01-01', 'loan_amount', 200)), named_struct('loan_date',
| '2020-01-01', 'loan_amount', 100))
| T(user_id, user_loans_arr, new_loan)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +-------+-------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+-------------------+-----------------+
* |u1 |[[2019-01-01, 100]]|[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200]]|[2020-01-01, 100]|
* +-------+-------------------+-----------------+
*
* root
* |-- user_id: string (nullable = false)
* |-- user_loans_arr: array (nullable = false)
* | |-- element: struct (containsNull = false)
* | | |-- loan_date: string (nullable = false)
* | | |-- loan_amount: integer (nullable = false)
* |-- new_loan: struct (nullable = false)
* | |-- loan_date: string (nullable = false)
* | |-- loan_amount: integer (nullable = false)
*/
Process per the requirement below:
take user_loans_arr and new_loan as inputs, add the new_loan struct to the existing user_loans_arr, then delete from user_loans_arr all the elements whose loan_date is older than 12 months.
spark >= 2.4
df.withColumn("user_loans_arr",
expr(
"""
|FILTER(array_union(user_loans_arr, array(new_loan)),
| x -> months_between(current_date(), to_date(x.loan_date)) < 12)
""".stripMargin))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
spark < 2.4
// spark < 2.4
import java.time._
import scala.collection.mutable
import org.apache.spark.sql.Row

val outputSchema = df.schema("user_loans_arr").dataType
val add_and_filter = udf((userLoansArr: mutable.WrappedArray[Row], loan: Row) => {
(userLoansArr :+ loan).filter(row => {
val loanDate = LocalDate.parse(row.getAs[String]("loan_date"))
val period = Period.between(loanDate, LocalDate.now())
period.getYears * 12 + period.getMonths < 12
})
}, outputSchema)
df.withColumn("user_loans_arr", add_and_filter($"user_loans_arr", $"new_loan"))
.show(false)
/**
* +-------+--------------------------------------+-----------------+
* |user_id|user_loans_arr |new_loan |
* +-------+--------------------------------------+-----------------+
* |u1 |[[2020-01-01, 100]] |[2020-01-01, 100]|
* |u2 |[[2020-01-01, 200], [2020-01-01, 100]]|[2020-01-01, 100]|
* +-------+--------------------------------------+-----------------+
*/
You need to pass your array and struct columns to the UDF as an array or a struct. I prefer passing them as a struct.
Inside the UDF you can manipulate the elements and return an array type.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
tst_1=(tst.withColumn("arr",F.array('col1','col2'))).withColumn("str",F.struct('col3','col4'))
# udf to return an array
@udf(ArrayType(StringType()))
def fn(row):
    # empty result when the second array element is greater than col4,
    # otherwise the array values followed by the struct's values
    if row.arr[1] > row.str.col4:
        res = []
    else:
        res = list(row.arr) + list(row.str.asDict().values())
    return res
# calling udf with a struct of array and struct column
tst_fin = tst_1.withColumn("res",fn(F.struct('arr','str')))
The result is
tst_fin.show()
+----+----+----+----+------+------+------------+
|col1|col2|col3|col4| arr| str| res|
+----+----+----+----+------+------+------------+
| 1| 2| 3| 4|[1, 2]|[3, 4]|[1, 2, 4, 3]|
| 3| 4| 5| 1|[3, 4]|[5, 1]| []|
| 5| 6| 7| 8|[5, 6]|[7, 8]|[5, 6, 8, 7]|
| 7| 8| 9| 2|[7, 8]|[9, 2]| []|
+----+----+----+----+------+------+------------+
This example treats everything as an int. Since your dates are strings, inside your UDF you will have to use Python's datetime functions for the comparison.
I have been told that EXCEPT is a very costly operation and one should always try to avoid using EXCEPT.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use a left anti join, as below -
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/
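Applied to the use case in the question, a sketch (keeping the question's <table_name> placeholder). Note that except() additionally de-duplicates rows and uses null-safe comparison, so the two are interchangeable only when duplicates and nulls are not a concern for your data:
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table("<table_name>")
val myFilteredDataframe = rawDataDf.where(myFilter)

// rows of rawDataDf that have no matching row (on all columns) in myFilteredDataframe
val allOthersDataframe =
  rawDataDf.join(myFilteredDataframe, rawDataDf.columns.toSeq, "leftanti")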
I am very new to Spark. I have to perform string manipulation operations and create a new column in a Spark DataFrame. I created UDFs for the string manipulation, but for performance reasons I want to do this without UDFs. Below are my code and output. Could you please help me do this in a better way?
object Demo2 extends Context {
import org.apache.spark.sql.functions.udf
def main(args: Array[String]): Unit = {
import sparkSession.sqlContext.implicits._
val data = Seq(
("bankInfo.SBI.C_1.Kothrud.Pune.displayInfo"),
("bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo"),
("bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo"),
("bankInfo.HDFC.C_4.Deccan.Pune.displayInfo")
)
val df = data.toDF("Key")
println("Input Dataframe")
df.show(false)
//get local_address
val get_local_address = udf((key: String) => {
val first_index = key.indexOf(".")
val tmp_key = key.substring(first_index + 1)
val last_index = tmp_key.lastIndexOf(".")
val local_address = tmp_key.substring(0, last_index)
local_address
})
//get address
val get_address = udf((key: String) => {
val first_index = key.indexOf(".")
val tmp_key = key.substring(first_index + 1)
val last_index1 = tmp_key.lastIndexOf(".")
val tmp_key1 = tmp_key.substring(0, last_index1)
val last_index2 = tmp_key1.lastIndexOf(".");
val first_index1 = tmp_key1.lastIndexOf(".", last_index2 - 1);
val address = tmp_key1.substring(0, first_index1) + tmp_key1.substring(last_index2)
address
})
val df2 = df
.withColumn("Local Address", get_local_address(df("Key")))
.withColumn("Address", get_address(df("Key")))
println("Output Dataframe")
df2.show(false)
}
}
Input Dataframe
+----------------------------------------------+
|Key |
+----------------------------------------------+
|bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |
|bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |
|bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|
|bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |
+----------------------------------------------+
Output Dataframe
+----------------------------------------------+-------------------------+---------------+
|Key |Local Address |Address |
+----------------------------------------------+-------------------------+---------------+
|bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |SBI.C_1.Kothrud.Pune |SBI.C_1.Pune |
|bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |ICICI.C_2.TilakRoad.Pune |ICICI.C_2.Pune |
|bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|Axis.C_3.Santacruz.Mumbai|Axis.C_3.Mumbai|
|bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |HDFC.C_4.Deccan.Pune |HDFC.C_4.Pune |
+----------------------------------------------+-------------------------+---------------+
Since you have a fixed-size array, you can turn it into a struct and then concat the fields as required -
Load the test data provided
val data =
"""
|Key
|bankInfo.SBI.C_1.Kothrud.Pune.displayInfo
|bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo
|bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo
|bankInfo.HDFC.C_4.Deccan.Pune.displayInfo
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +----------------------------------------------+
* |Key |
* +----------------------------------------------+
* |bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |
* |bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |
* |bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|
* |bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |
* +----------------------------------------------+
*
* root
* |-- Key: string (nullable = true)
*/
Derive the columns from the fixed-format string column
df1.select($"key", split($"key", "\\.").as("x"))
.withColumn("bankInfo",
expr(
"""
|named_struct('name', element_at(x, 2), 'cust_id', element_at(x, 3),
| 'branch', element_at(x, 4), 'dist', element_at(x, 5))
""".stripMargin))
.select($"key",
concat_ws(".", $"bankInfo.name", $"bankInfo.cust_id", $"bankInfo.branch", $"bankInfo.dist")
.as("Local_Address"),
concat_ws(".", $"bankInfo.name", $"bankInfo.cust_id", $"bankInfo.dist")
.as("Address"))
.show(false)
/**
* +----------------------------------------------+-------------------------+---------------+
* |key |Local_Address |Address |
* +----------------------------------------------+-------------------------+---------------+
* |bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |SBI.C_1.Kothrud.Pune |SBI.C_1.Pune |
* |bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |ICICI.C_2.TilakRoad.Pune |ICICI.C_2.Pune |
* |bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|Axis.C_3.Santacruz.Mumbai|Axis.C_3.Mumbai|
* |bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |HDFC.C_4.Deccan.Pune |HDFC.C_4.Pune |
* +----------------------------------------------+-------------------------+---------------+
*/
df1.select($"key", split($"key", "\\.").as("x"))
.withColumn("bankInfo",
expr("named_struct('name', x[1], 'cust_id', x[2], 'branch', x[3], 'dist', x[4])"))
.select($"key",
concat_ws(".", $"bankInfo.name", $"bankInfo.cust_id", $"bankInfo.branch", $"bankInfo.dist")
.as("Local_Address"),
concat_ws(".", $"bankInfo.name", $"bankInfo.cust_id", $"bankInfo.dist")
.as("Address"))
.show(false)
/**
* +----------------------------------------------+-------------------------+---------------+
* |key |Local_Address |Address |
* +----------------------------------------------+-------------------------+---------------+
* |bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |SBI.C_1.Kothrud.Pune |SBI.C_1.Pune |
* |bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |ICICI.C_2.TilakRoad.Pune |ICICI.C_2.Pune |
* |bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|Axis.C_3.Santacruz.Mumbai|Axis.C_3.Mumbai|
* |bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |HDFC.C_4.Deccan.Pune |HDFC.C_4.Pune |
* +----------------------------------------------+-------------------------+---------------+
*/
Check the code below.
scala> df.show(false)
+----------------------------------------------+
|Key |
+----------------------------------------------+
|bankInfo.SBI.C_1.Kothrud.Pune.displayInfo |
|bankInfo.ICICI.C_2.TilakRoad.Pune.displayInfo |
|bankInfo.Axis.C_3.Santacruz.Mumbai.displayInfo|
|bankInfo.HDFC.C_4.Deccan.Pune.displayInfo |
+----------------------------------------------+
scala> val maxLength = df.select(split($"key","\\.").as("keys")).withColumn("length",size($"keys")).select(max($"length").as("length")).map(_.getAs[Int](0)).collect.head
maxLength: Int = 6
scala> val address_except = Seq(0,3,maxLength-1)
address_except: Seq[Int] = List(0, 3, 5)
scala> val local_address_except = Seq(0,maxLength-1)
local_address_except: Seq[Int] = List(0, 5)
scala> def parse(column: Column,indexes:Seq[Int]) = (0 to maxLength).filter(i => !indexes.contains(i)).map(i => column(i)).reduce(concat_ws(".",_,_))
parse: (column: org.apache.spark.sql.Column, indexes: Seq[Int])org.apache.spark.sql.Column
scala> df.select(split($"key","\\.").as("keys")).withColumn("local_address",parse($"keys",local_address_except)).withColumn("address",parse($"keys",address_except)).show(false)
+-----------------------------------------------------+-------------------------+---------------+
|keys |local_address |address |
+-----------------------------------------------------+-------------------------+---------------+
|[bankInfo, SBI, C_1, Kothrud, Pune, displayInfo] |SBI.C_1.Kothrud.Pune |SBI.C_1.Pune |
|[bankInfo, ICICI, C_2, TilakRoad, Pune, displayInfo] |ICICI.C_2.TilakRoad.Pune |ICICI.C_2.Pune |
|[bankInfo, Axis, C_3, Santacruz, Mumbai, displayInfo]|Axis.C_3.Santacruz.Mumbai|Axis.C_3.Mumbai|
|[bankInfo, HDFC, C_4, Deccan, Pune, displayInfo] |HDFC.C_4.Deccan.Pune |HDFC.C_4.Pune |
+-----------------------------------------------------+-------------------------+---------------+
I have a CSV file and I want to create a new minute timestamp column as shown below
Actual:
Col1, Col2
1.19185711131486, 0.26615071205963
-1.3598071336738, -0.0727811733098497
-0.966271711572087, -0.185226008082898
-0.966271711572087, -0.185226008082898
-1.15823309349523, 0.877736754848451
-0.425965884412454, 0.960523044882985
Expected:
Col1, Col2, ts
1.19185711131486, 0.26615071205963, 00:00:00
-1.3598071336738, -0.0727811733098497, 00:01:00
-0.966271711572087, -0.185226008082898, 00:02:00
-0.966271711572087, -0.185226008082898, 00:03:00
-1.15823309349523, 0.877736754848451, 00:04:00
-0.425965884412454, 0.960523044882985, 00:05:00
Thanks in advance!
Perhaps this is useful -
val data =
"""
|Col1, Col2
|1.19185711131486, 0.26615071205963
|-1.3598071336738, -0.0727811733098497
|-0.966271711572087, -0.185226008082898
|-0.966271711572087, -0.185226008082898
|-1.15823309349523, 0.877736754848451
|-0.425965884412454, 0.960523044882985
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- Col1: double (nullable = true)
* |-- Col2: double (nullable = true)
*
* +------------------+-------------------+
* |Col1 |Col2 |
* +------------------+-------------------+
* |1.19185711131486 |0.26615071205963 |
* |-1.3598071336738 |-0.0727811733098497|
* |-0.966271711572087|-0.185226008082898 |
* |-0.966271711572087|-0.185226008082898 |
* |-1.15823309349523 |0.877736754848451 |
* |-0.425965884412454|0.960523044882985 |
* +------------------+-------------------+
*/
df.withColumn("ts",
date_format(to_timestamp((row_number().over(Window.orderBy(df.columns.map(col): _*)) - 1).cast("string"),
"mm")
, "00:mm:00"))
.show(false)
/**
* +------------------+-------------------+--------+
* |Col1 |Col2 |ts |
* +------------------+-------------------+--------+
* |-1.3598071336738 |-0.0727811733098497|00:00:00|
* |-1.15823309349523 |0.877736754848451 |00:01:00|
* |-0.966271711572087|-0.185226008082898 |00:02:00|
* |-0.966271711572087|-0.185226008082898 |00:03:00|
* |-0.425965884412454|0.960523044882985 |00:04:00|
* |1.19185711131486 |0.26615071205963 |00:05:00|
* +------------------+-------------------+--------+
*/
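One caveat: Window.orderBy over the data columns re-sorts the rows (visible in the output above), so the minutes do not follow the original file order from the question. A sketch of a variant that numbers rows by input order instead; it assumes the CSV is read as-is (monotonically_increasing_id grows with input order but is not consecutive, hence the extra row_number):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df.withColumn("row_id", monotonically_increasing_id())   // increases with input order
  .withColumn("ts",
    date_format(
      to_timestamp((row_number().over(Window.orderBy(col("row_id"))) - 1).cast("string"), "mm"),
      "00:mm:00"))                                        // same formatting trick as above
  .drop("row_id")
  .show(false)
Like the snippet above, the "mm" parsing only covers the first 60 rows; beyond that a different timestamp construction is needed.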