How to convert matrix to DataFrame with scala/spark?

I have a matrix whose number of columns and rows is unknown.
One example matrix is:
[5,1.3]
[1,5.2]
I want to convert it to a DataFrame; the column names can be arbitrary. How can I achieve this?
This is my expected result:
+---+---+
| _1| _2|
+---+---+
|  5|1.3|
|  1|5.2|
+---+---+

I suggest you convert the matrix to an RDD and then convert the RDD to a DataFrame. It is not an elegant way, but it works fine in Spark 2.0.0.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd.RDD

object mat2df {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mat2df").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val values = Array(5, 1, 1.3, 5.2)
    val mat = Matrices.dense(2, 2, values).asInstanceOf[DenseMatrix]

    def toRDD(m: Matrix): RDD[Vector] = {
      val columns = m.toArray.grouped(m.numRows)
      val rows = columns.toSeq.transpose
      val vectors = rows.map(row => new DenseVector(row.toArray))
      sc.parallelize(vectors)
    }

    val mat_rows = toRDD(mat) // matrix to RDD
    val mat_rdd = mat_rows.map(_.toArray).map { case Array(p0, p1) => (p0, p1) }
    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
    val df = spark.createDataFrame(mat_rdd) // RDD to DataFrame
    df.show()
  }
}
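Running this should print a two-column DataFrame with the default tuple column names _1 and _2, matching the expected result shown in the question.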

Another option is to build the schema explicitly and name each output column from a prefix:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def matrixToDataFrame(sc: SparkContext, matrix: Matrix, m_nodeColName: String): DataFrame = {
  // Each matrix column becomes one DataFrame row
  val rdd = sc.parallelize(matrix.colIter.toSeq).map(x => Row.fromSeq(x.toArray.toSeq))
  val sqlContext = new SQLContext(sc)
  // One DoubleType field per matrix row, named <prefix>_<index>
  var schema = new StructType()
  for (i <- 0 until matrix.numRows) {
    schema = schema.add(StructField(m_nodeColName + "_" + i.toString, DoubleType, true))
  }
  sqlContext.createDataFrame(rdd, schema)
}
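A minimal usage sketch for matrixToDataFrame (assuming the mllib Matrices factory and an existing SparkContext named sc). Because the function iterates colIter, each matrix column ends up as one DataFrame row, so show() should print something like:
import org.apache.spark.mllib.linalg.Matrices

// Column-major values, i.e. the matrix [[5.0, 1.3], [1.0, 5.2]]
val mat = Matrices.dense(2, 2, Array(5.0, 1.0, 1.3, 5.2))
val df = matrixToDataFrame(sc, mat, "col")
df.show()
// +-----+-----+
// |col_0|col_1|
// +-----+-----+
// |  5.0|  1.0|
// |  1.3|  5.2|
// +-----+-----+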

Related

scala spark dataframe modify column with udf return value

I have a Spark DataFrame which has a timestamp field, and I want to convert this to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into a generic logic where any timestamp needs to be converted, I am not able to get it working. The issue is how I can assign the return value from the UDF back to the DataFrame column.
Below is the code snippet
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate()
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "", "manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "", "manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": "", "manufacture_ts":"2017-10-16 00:00:00"}"""
)))

val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this from a DataFrame so that it is called on all columns which need to be converted to long.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import scala.collection.JavaConversions._

  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(StructField("rowkey", StringType, true)
    , StructField("order_receipt_dt", LongType, true)
    , StructField("maturity_dt", StringType, true))
  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
  import org.apache.spark.sql.functions._
  fieldType.toLowerCase match {
    case "timestamp" => convertTimeStamp(dataFrame(name))
    case _ => dataFrame.col(name)
  }
}
Maybe your UDF crashed because the timestamp is null. You can:
use unix_timestamp instead of a UDF, or make your UDF null-safe (see the sketch after the example below);
only apply it on fields which need to be converted.
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
.foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+
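If you would rather keep a UDF, here is a minimal null-safe sketch (assuming the same df with java.sql.Timestamp columns ts1 and ts2; note that getTime returns milliseconds, while unix_timestamp returns seconds):
import org.apache.spark.sql.functions.{col, udf}

// Returning Option[Long] makes the UDF null-safe: a null timestamp becomes a null
// column value instead of throwing a NullPointerException
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)
}

val safeDF = df.withColumn("ts1", convertTimeStampSafe(col("ts1")))
safeDF.show()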

nanoseconds truncated while inserting data into hive from Spark

While inserting data into a Hive TimestampType column from Spark, nanoseconds are truncated. Does anyone have a solution for it? I have tried writing to ORC and CSV format on Hive.
CSV: it appeared as 2018-03-20T13:04:20.123Z
ORC: 2018-03-20 13:04:20.123456
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructField
import java.util.Date
import org.apache.spark.sql.Row
import java.math.{BigDecimal,MathContext,RoundingMode}
/**
 * Main class to read Order, Route and Trade records and convert them to ORC file format
 * @author Shefali.Nema
 * @since 1.0.0
 */
object testDateAndDecimal {
  def main(args: Array[String]): Unit = {
    execute
  }

  private def execute: Unit = {
    val sparkConf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Define DataTypes
    val datetimestring: String = "2018-03-20 13:04:20.123456789"
    val dt = java.sql.Timestamp.valueOf(datetimestring)
    //val DecimalType = DataTypes.createDecimalType(18, 8)

    // Define Values
    val id = 1
    //System.out.println(new BigDecimal("135.69")); // 135.69
    val price = new BigDecimal("1234567890.1234567899")
    System.out.println("\n###################################################price###################################" + price + "\n")
    System.out.println("\n###################################################dt###################################" + dt + "\n")

    val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", TimestampType, true) :: StructField("price", DecimalType(18, 8), true) :: Nil)
    val values = List(id, dt, price)
    val row = Row.fromSeq(values)

    // Create `RDD` from `Row`
    val rdd = sc.makeRDD(List(row))
    val orcFolderName = "testDecimal"
    val hiveRowsDF = sqlContext.createDataFrame(rdd, schema)
    hiveRowsDF.write.mode(org.apache.spark.sql.SaveMode.Append).orc(orcFolderName)
  }
}

Error with spark Row.fromSeq for a text file

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object fixedLength {
  def main(args: Array[String]) {

    def getRow(x: String): Row = {
      val columnArray = new Array[String](4)
      columnArray(0) = x.substring(0, 3)
      columnArray(1) = x.substring(3, 13)
      columnArray(2) = x.substring(13, 18)
      columnArray(3) = x.substring(18, 22)
      Row.fromSeq(columnArray)
    }

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

    val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val fruits = sc.textFile("in/fruits.txt")

    val schemaString = "id,fruitName,isAvailable,unitPrice"
    val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
    val schema = StructType(fields)

    val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)

    df.show() // Error
    println("End of the program")
  }
}
I'm getting an error on the df.show() command.
My file content is:
56 apple TRUE 0.56
45 pear FALSE1.34
34 raspberry TRUE 2.43
34 plum TRUE 1.31
53 cherry TRUE 1.4
23 orange FALSE2.34
56 persimmon FALSE23.2
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
Can you please help?
You are creating the RDD the old way, with SparkContext(conf):
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
whereas you are creating the DataFrame the new way, using SparkSession:
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
Ultimately, you are mixing an RDD created with the old SparkContext API with a DataFrame created through the new SparkSession.
I would suggest you use only one of them; I guess that is the reason for the issue.
Update
Doing the following should work for you:
def getRow(x: String): Row = {
  val columnArray = new Array[String](4)
  columnArray(0) = x.substring(0, 3)
  columnArray(1) = x.substring(3, 13)
  columnArray(2) = x.substring(13, 18)
  columnArray(3) = x.substring(18, 22)
  Row.fromSeq(columnArray)
}

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()

val fruits = spark.sparkContext.textFile("in/fruits.txt")

val schemaString = "id,fruitName,isAvailable,unitPrice"
val fields = schemaString.split(",").map(field => StructField(field, StringType, nullable = true))
val schema = StructType(fields)

val df = spark.createDataFrame(fruits.map { x => getRow(x) }, schema)
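With the RDD now coming from spark.sparkContext, i.e. the same session that creates the DataFrame, df.show() should no longer fail with the ClassCastException.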

Scala - Groupby and Max on pair RDD

I am new to Spark/Scala and want to find the max salary in each department.
Dept,Salary
Dept1,1000
Dept2,2000
Dept1,2500
Dept2,1500
Dept1,1700
Dept2,2800
I implemented the code below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object MaxSalary {
  val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))

  case class Dept(dept_name: String, Salary: Int)

  val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
  val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
  val a = recs.max()???????
}
but I am stuck on how to implement the group by and max functions. I am using a pair RDD.
Thanks
This can be done using RDDs with the following code:
val emp = sc.textFile("file:///home/user/Documents/dept.txt")
.mapPartitionsWithIndex( (idx, row) => if(idx==0) row.drop(1) else row )
.map(x => (x.split(",")(0).toString, x.split(",")(1).toInt))
val maxSal = emp.reduceByKey(math.max(_,_))
Should give you:
Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
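Since the question asks about group by plus max: the same result could also be written with groupByKey, although reduceByKey is generally preferred because it combines values per key on each partition before shuffling. A small sketch reusing the emp pair RDD from above:
// Group all salaries per department, then take the max of each group
val maxSalGrouped = emp.groupByKey().mapValues(_.max)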
If you are using Datasets, here is the solution:
case class Dept(dept_name : String, Salary : Int)
val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._
val file = "resources/ip.csv"
val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt )).toDS()
recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()
Output:
+---------+------------+
|dept_name|max_solution|
+---------+------------+
| Dept2| 2800|
| Dept1| 2500|
+---------+------------+

How to get Vector in DataFrame

I get some feature vectors using the Spark ML TF-IDF algorithm. Now I want to get the Vector in the "idfFeatures" column.
My code is:
val vectors = allDF.select("idfFeatures").map {
  case Row(vector: Vector) =>
    vector
}
vectors.foreach(println(_))
There is an error in the console:
Error:(38, 24) type Vector takes type parameters
case Row(vector: Vector) =>
^
If I change Vector to String, there is another error:
scala.MatchError: [(262144,[622,4200,7303,8501......,2.1972245773362196,1.2809338454620642])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at scala.TFIDFTest2$$anonfun$1.apply(TFIDFTest2.scala:37)
How can I get the Vector?
Spark 1.x:
import org.apache.spark.mllib.linalg.Vector
Spark 2.0:
import org.apache.spark.ml.linalg.Vector
Example:
// https://spark.apache.org/docs/latest/ml-features.html#tf-idf
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
rescaledData.select("features").rdd.map { case Row(v: Vector) => v}.first
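If you need every row's vector rather than just the first one, the same pattern can keep the whole column as an RDD of vectors. A small sketch reusing rescaledData from the example above:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Map each Row to its Vector and keep all of them as an RDD[Vector]
val featureVectors = rescaledData.select("features").rdd.map { case Row(v: Vector) => v }
featureVectors.collect().foreach(println)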