I'm using Scala on Spark and I have a dense matrix like this:
vp
res63: org.apache.spark.ml.linalg.DenseMatrix =
-0.26035262239241047 -0.9349256695883184
0.08719326360909431 -0.06393629243008418
0.006698866707269257 0.04124873027993731
0.011979122705128064 -0.005430767154896149
0.049075485175059275 0.04810618828561001
0.001605411530424612 0.015016736357799364
0.9587151228619724 -0.2534046936998956
-0.04668498310146597 0.06015550772431999
-0.022360873382091598 -0.22147143481166376
-0.014153052584280682 -0.025947327705852636
I want to use VectorAssembler to create a feature column, so I transform vp:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val c = vp.toArray.toSeq
val vp_df = c.toDF("number")
vp_df.createOrReplaceTempView("vp_df")
val vp_list = vp_df.collect.map(_.toSeq).flatten
val vp_string = vp_list.map(_.toString)
vp_string
res64: Array[String] = Array(-0.26035262239241047, 0.08719326360909431, 0.006698866707269257, 0.011979122705128064, 0.049075485175059275, 0.001605411530424612, 0.9587151228619724, -0.04668498310146597, -0.022360873382091598, -0.014153052584280682, -0.9349256695883184, -0.06393629243008418, 0.04124873027993731, -0.005430767154896149, 0.04810618828561001, 0.015016736357799364, -0.2534046936998956, 0.06015550772431999, -0.22147143481166376, -0.025947327705852636)
Then I use VectorAssembler:
val assembler = new VectorAssembler().setInputCols(vp_string).setOutputCol("features")
val output = assembler.transform(vp_df)
output.select("features").show(false)
But I get an error and I don't understand why:
IllegalArgumentException: -0.26035262239241047 does not exist. Available: number
I don't know how this is possible; I've used VectorAssembler several times and this is the first time I've seen this.
Your "number" column is already in an array of double, so all you need is to convert this column into a dense vector.
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vectors

// UDF that turns an array-of-doubles column into a dense ml Vector
val arrayToVectorUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})
vp_df.withColumn("vector", arrayToVectorUDF(col("number")))
Update: I misunderstood your code.
The number column is a DoubleType column, so all you need to do is pass the column name to the vector assembler.
import org.apache.spark.ml.linalg.DenseMatrix
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val data = (1 to 20).map(_.toDouble).toArray
val dm = new DenseMatrix(2, 10, data)
val vp_df = dm.toArray.toSeq.toDF("number")
val assembler = new VectorAssembler().setInputCols(Array("number")).setOutputCol("features")
val output = assembler.transform(vp_df)
output.select("features").show(false)
+--------+
|features|
+--------+
|[1.0] |
|[2.0] |
|[3.0] |
|[4.0] |
|[5.0] |
|[6.0] |
|[7.0] |
|[8.0] |
|[9.0] |
|[10.0] |
|[11.0] |
|[12.0] |
|[13.0] |
|[14.0] |
|[15.0] |
|[16.0] |
|[17.0] |
|[18.0] |
|[19.0] |
|[20.0] |
+--------+
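As an aside, if what you actually want is one feature vector per matrix row (your vp is 10 x 2, so ten rows with two features each), you can skip VectorAssembler and build the vectors straight from the matrix rows. A minimal sketch, assuming vp is the ml DenseMatrix from your question and spark.implicits._ is in scope (as in spark-shell):
import spark.implicits._

// rowIter walks the matrix row by row, yielding one ml Vector per row
val features_df = vp.rowIter.toSeq.map(Tuple1.apply).toDF("features")
features_df.show(false)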
Related
I have a table which has a column containing an array, like this:
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with UDF as follows -
import org.apache.spark.sql.functions.udf

def append_list = (arr: Seq[String], s: String) => {
  arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List", $"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a UDF? Thanks.
In Spark 2.4 or later, a combination of array and concat should do the trick:
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
Compared to the UDF, though, I wouldn't expect serious performance gains here.
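If you prefer SQL-style syntax, the same Spark 2.4+ trick can be written with expr; a small sketch:
import org.apache.spark.sql.functions.expr

df.withColumn("New_List", expr("concat(Subject_List, array(New_Subject))")).show(false)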
Another UDF-free option is to explode the list, union it with the new subject, and rebuild the array with collect_list:
import org.apache.spark.sql.functions.{collect_list, explode}

val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

val final_df = df.withColumn("exploded", explode($"Subject_List")).select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID").agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)
var columnnames = "callStart_t,callend_t" // Timestamp column names are dynamic input.
scala> df1.show()
+------+-----------+--------+----------+
|  name|callStart_t|personid| callend_t|
+------+-----------+--------+----------+
| Bindu| 1080602418|       2|1080602419|
|Raphel| 1647964576|       5|1647964576|
|   Ram| 1754536698|       9|1754536699|
+------+-----------+--------+----------+
Code which I tried:
val newDf = df1.withColumn("callStart_Time", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_Time", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
Here, I don't want new columns for the conversion (from_unixtime, then to_utc_timestamp); I want to convert the existing columns in place.
Example Output
+------+-------------------+--------+-------------------+
|  name|        callStart_t|personid|          callend_t|
+------+-------------------+--------+-------------------+
| Bindu|1970-01-13 04:40:02|       2|1970-01-13 04:40:02|
|Raphel|1970-01-20 06:16:04|       5|1970-01-20 06:16:04|
|   Ram|1970-01-21 11:52:16|       9|1970-01-21 11:52:16|
+------+-------------------+--------+-------------------+
Note: The Timestamp column names are dynamic.
How do I get each column dynamically?
Just use the same name for the column and it will replace it:
val newDf = df1.withColumn("callStart_t", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_t", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
To make it dynamic, just use the relevant string. For example:
val colName = "callend_t"
val newDf = df.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
For multiple columns you can do:
val columns=Seq("callend_t", "callStart_t")
val newDf = columns.foldLeft(df1){ case (curDf, colName) => curDf.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))}
Note: as stated in the comments, the division by 1000 is not needed.
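Since the question says the timestamp column names arrive as a comma-separated string, here is a sketch that splits that string and reuses the same foldLeft (dropping the /1000, per the note above):
import org.apache.spark.sql.functions.{col, from_unixtime, to_utc_timestamp}

val columnnames = "callStart_t,callend_t" // dynamic input, as in the question
val dynamicCols = columnnames.split(",").map(_.trim)
val newDf = dynamicCols.foldLeft(df1) { (curDf, colName) =>
  curDf.withColumn(colName, to_utc_timestamp(from_unixtime(col(colName)), "Europe/Berlin"))
}
newDf.show(false)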
I want to convert myMap = Map(Col_1 -> 1, Col_2 -> 2, Col_3 -> 3) into a Spark Scala DataFrame with the keys as columns and the values as column values. I am not getting the expected result; please check my code and provide a solution.
var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap:Map[String,String] = Map.empty[String,String]
for ((k,v) <- myMap){
println(k+"->"+v)
finalBufferList += v
//finalDfColumnList += "\""+k+"\""
finalDfColumnList += k
}
val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result:
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
If you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you should create an RDD[Row] using the values as
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then you create a schema using the keys as
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
then finally use the createDataFrame function to create the dataframe as
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
finally you should have
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful.
But remember, all of this is overkill if you are working with a small dataset.
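For what it's worth, on Spark 2.x you can do the same thing through a SparkSession and skip the explicit RDD. A sketch, assuming a session named spark is in scope:
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
// createDataFrame accepts a java.util.List[Row] plus a schema
val df = spark.createDataFrame(Seq(Row.fromSeq(myMap.values.toSeq)).asJava, schema)
df.show(false)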
I am new to Apache Spark and Scala. I have a data set like this, which I read from a CSV file and convert into an RDD using Scala:
+-----------+-----------+----------+
| recent    | Freq      | Monitor  |
+-----------+-----------+----------+
| 1         | 1234      | 199090   |
| 4         | 2553      | 198613   |
| 6         | 3232      | 199090   |
| 1         | 8823      | 498831   |
| 7         | 2902      | 890000   |
| 8         | 7991      | 081097   |
| 9         | 7391      | 432370   |
| 12        | 6138      | 864981   |
| 7         | 6812      | 749821   |
+-----------+-----------+----------+
I want to calculate the z-score, i.e. standardize the data. So I am calculating the z-score for each column and then trying to combine them to get a standard scale.
Here is my code for calculating the z-score for the first column:
val scores = sorted.map(_.split(",")(0).toDouble).cache
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / count)
val zscore = scores.map(x => math.round((x - mean) / stddev))
How do I calculate this for each column? Or is there any other way to normalize or standardize the data?
My requirement is to assign the rank (or scale).
Thanks
If you want to standardize the columns, you can use the StandardScaler class from Spark MLlib. The data should be in the form of RDD[Vector], where Vector is part of the MLlib linalg package. You can choose to use the mean, the standard deviation, or both to standardize your data.
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
val data = sc.parallelize(Array(
Array(1.0,2.0,3.0),
Array(4.0,5.0,6.0),
Array(7.0,8.0,9.0),
Array(10.0,11.0,12.0)))
// Converting RDD[Array] to RDD[Vectors]
val features = data.map(a => Vectors.dense(a))
// Creating a Scaler model that standardizes with both mean and SD
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
// Scale features using the scaler model
val scaledFeatures = scaler.transform(features)
This scaledFeatures RDD contains the Z-score of all columns.
Hope this answer helps. Check the Documentation for more info.
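If you would rather stay with the manual approach from your question but cover every column at once, Statistics.colStats gives you the per-column mean and variance in a single pass. A sketch reusing the features RDD built above (note that colStats returns the sample variance, i.e. with n - 1 in the denominator):
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Vectors

val summary = Statistics.colStats(features) // one pass over the data
val means   = summary.mean.toArray
val stddevs = summary.variance.toArray.map(math.sqrt)

// z-score every column of every row
val zScores = features.map { v =>
  Vectors.dense(v.toArray.zipWithIndex.map { case (x, i) => (x - means(i)) / stddevs(i) })
}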
You may want to use the code below to perform standard scaling on the required columns. VectorAssembler is used to select the columns that need to be transformed, and the StandardScaler also lets you choose whether to scale with the mean, the standard deviation, or both.
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature
import org.apache.spark.ml.feature.StandardScaler
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/hadoop/data/your_dataset.csv")
df.show(Int.MaxValue)
val assembler = new VectorAssembler().setInputCols(Array("recent","Freq","Monitor")).setOutputCol("features")
val transformVector = assembler.transform(df)
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)
val scalerModel = scaler.fit(transformVector)
val scaledData = scalerModel.transform(transformVector)
scaledData.show(20, false)
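If you then want the scaled values back as plain columns (one per input column), Spark 3.0+ offers vector_to_array; a sketch, assuming the same scaledData as above:
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

val inputCols = Array("recent", "Freq", "Monitor") // same order as in the VectorAssembler
val withArr = scaledData.withColumn("scaledArr", vector_to_array(col("scaledFeatures")))
val unpacked = inputCols.zipWithIndex.foldLeft(withArr) { case (d, (c, i)) =>
  d.withColumn(s"${c}_scaled", col("scaledArr")(i))
}
unpacked.select("recent_scaled", "Freq_scaled", "Monitor_scaled").show(false)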
I am new to Spark. Can you give me any idea what the problem is with the code below?
val rawData="""USA | E001 | ABC DE | 19850607 | IT | $100
UK | E005 | CHAN CL | 19870512 | OP | $200
USA | E003 | XYZ AB | 19890101 | IT | $250
USA | E002 | XYZ AB | 19890705 | IT | $200"""
val sc = ...
val data= rawData.split("\n")
val rdd= sc.parallelize(data)
val data1=rdd.flatMap(line=> line.split(" | "))
val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
data2.saveAsTextFile("./sample_data1_output")
Here, .sortByKey(false) is not working and the compiler gives me an error:
[error] /home/admin/scala/spark-poc/src/main/scala/SparkApp.scala:26: value sortByKey is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
The question is: how do I get a MappedRDD? Or, on what object should I call sortByKey()?
Spark provides additional operations, like sortByKey(), on RDDs of pairs. These operations are available through a class called PairRDDFunctions and Spark uses implicit conversions to automatically perform the RDD -> PairRDDFunctions wrapping.
To import the implicit conversions, add the following lines to the top of your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
This is discussed in the Spark programming guide's section on Working with key-value pairs.
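For completeness, here is a sketch of the pipeline after that change. (On Spark 1.3 and later the pair-RDD implicits live on the RDD object itself, so the extra import only matters on older versions.) As a side note, you probably also want map rather than flatMap, and to split on the literal pipe character, so that each record stays one array of fields:
import org.apache.spark.SparkContext._ // only needed on Spark < 1.3

val data1 = rdd.map(line => line.split("\\|").map(_.trim)) // one Array[String] per record
val data2 = data1.map(arr => (arr(2), arr.mkString(","))).sortByKey(false) // separator is up to you
data2.saveAsTextFile("./sample_data1_output")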