Convert arbitrary number of columns to Vector - scala

How to convert a group of arbitrary columns to a Mllib Vector?
Basically, I have first column of my DataFrame with a fixed name and then a number of arbitrary named columns each having Double values inside.
like so:
name | a | b | c |
val1 | 0.0 | 1.0 | 1.0 |
val2 | 2.0 | 1.0 | 5.0 |
Could be any number of columns. I need to get a DataSet of the following:
final case class ValuesRow(name: String, values: Vector)

This can be done in a simple way using VectorAssembler. The columns that are to be merged into a Vector are used as input, in this case all columns except the first.
val df = spark.createDataFrame(Seq(("val1", 0, 1, 1), ("val2", 2, 1, 5)))
.toDF("name", "a", "b", "c")
val columnNames = df.columns.drop(1) // drop the name column
val assembler = new VectorAssembler()
.setInputCols(columnNames)
.setOutputCol("values")
val df2 = assembler.transform(df).select("name", "values").as[ValuesRow]
The result will be a dataset containing the name and values columns:
+----+-------------+
|name| values|
+----+-------------+
|val1|[0.0,1.0,1.0]|
|val2|[2.0,1.0,5.0]|
+----+-------------+

Here's one way to do it:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
val ds = Seq(
("val1", 0.0, 1.0, 1.0),
("val2", 2.0, 1.0, 5.0)
).toDF("name", "a", "b", "c").
as[(String, Double, Double, Double)]
val colList = ds.columns
val keyCol = colList(0)
val valCols = colList.drop(1)
def arrToVec = udf(
(s: Seq[Double]) => new DenseVector(s.toArray)
)
ds.select(
col(keyCol), arrToVec( array(valCols.map(x => col(x)): _*) ).as("values")
).show
// +----+-------------+
// |name| values|
// +----+-------------+
// |val1|[0.0,1.0,1.0]|
// |val2|[2.0,1.0,5.0]|
// +----+-------------+

Related

Spark create a dataframe from multiple lists/arrays

So, I have 2 lists in Spark(scala). They both contain the same number of values. The first list a contains all strings and the second list b contains all Long's.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2=StructType(
Array(
StructField("check_name", StringType, true),
StructField("metric", DecimalType(38,0), true)
)
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?
You can create an RDD[Row] and convert to Spark dataframe with the given schema:
val df = spark.createDataFrame(
sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+
Using Dataset:
import spark.implicits._
case class Schema2(a: String, b: Long)
val el = (a zip b) map { case (a, b) => Schema2(a, b)}
val df = spark.createDataset(el).toDF()

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using UDF is most simple way to achieve custom tasks here.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(_._1).map(_._2)})
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this can be an approach using Dataset:
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[a, b, c]|
| 4| 1| 3|[b, c, a]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c")
.withColumn("newcol", sorted_column_names($"column_map"))

Spark SQL Split or Extract words from String of Words

I have a spark Dataframe like Below.I'm trying to split the column into 2 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp",split(col("content"),"\\[")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s/col$i)): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid
Assuming a String to represent an Array of Words. Got your request. You can optimize the number of dataframes as well to reduce load on system. If there are more than 9 cols etc. you may need to use c00, c01, etc. for c10 etc. Or just use integer as name for columns. leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
Update:
Based on original title stating array of words. See other answer.
If new, then a few things here. Can also be done with dataset and map I assume. Here is a solution using DFs and rdd's. I might well investigate a complete DS in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode, can be done more elegeantly with def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))```
The next solution provides an easy and scalable way to manage your column names and indices. It is based on a map which determines the column name/index relation. The map will also help us to handle both the index of the extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema).show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums in this example we will extract the cols "c0" -> 0 and "c2" -> 2
Next we create the schema from the keys of the map
The one map (which you already have) will split the line by |, the second one will create a Row containing the values that correspond to each item of dictColums.values
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
As below,if you don't want to write repeated x(i),you can process it in a loop. Example 1:
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list with start&end and the numbers to be excluded, see Example 2 of below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set, however, it gives me an error saying grouping expressions sequence is empty Is it possible to find the number of distinct values in each row's array? So that way in our same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
Without UDF and using RDD conversion and back to DF for posterity:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
Solution above better, as stated more so for posterity.
You can take help for udf and you can do like this.
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF=udf{(str:String)=>val strArr=str.split(":"); strArr(0)+":"+strArr(1).split(",").distinct.length.toString}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:3|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:3|
+-----------+-------------+