I am extracting Ngrams from a Spark 2.2 dataframe column using Scala, thus (trigrams in this example):
val ngram = new NGram().setN(3).setInputCol("incol").setOutputCol("outcol")
How do I create an output column that contains all of 1 to 5 grams? So it might be something like:
val ngram = new NGram().setN(1:5).setInputCol("incol").setOutputCol("outcol")
but that doesn't work.
I could loop through N and create new dataframes for each value of N but this seems inefficient. Can anyone point me in the right direction, as my Scala is ropey?
If you want to combine these into vectors you can rewrite Python answer by zero323.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
def buildNgrams(inputCol: String = "tokens",
outputCol: String = "features", n: Int = 3) = {
val ngrams = (1 to n).map(i =>
new NGram().setN(i)
.setInputCol(inputCol).setOutputCol(s"${i}_grams")
)
val vectorizers = (1 to n).map(i =>
new CountVectorizer()
.setInputCol(s"${i}_grams")
.setOutputCol(s"${i}_counts")
)
val assembler = new VectorAssembler()
.setInputCols(vectorizers.map(_.getOutputCol).toArray)
.setOutputCol(outputCol)
new Pipeline().setStages((ngrams ++ vectorizers :+ assembler).toArray)
}
val df = Seq((1, Seq("a", "b", "c", "d"))).toDF("id", "tokens")
Result
buildNgrams().fit(df).transform(df).show(1, false)
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |id |tokens |1_grams |2_grams |3_grams |1_counts |2_counts |3_counts |features |
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |1 |[a, b, c, d]|[a, b, c, d]|[a b, b c, c d]|[a b c, b c d]|(4,[0,1,2,3],[1.0,1.0,1.0,1.0])|(3,[0,1,2],[1.0,1.0,1.0])|(2,[0,1],[1.0,1.0])|[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]|
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
This could be simpler with a UDF:
val ngram = udf((xs: Seq[String], n: Int) =>
(1 to n).map(i => xs.sliding(i).filter(_.size == i).map(_.mkString(" "))).flatten)
spark.udf.register("ngram", ngram)
val ngramer = new SQLTransformer().setStatement(
"""SELECT *, ngram(tokens, 3) AS ngrams FROM __THIS__"""
)
ngramer.transform(df).show(false)
// +---+------------+----------------------------------+
// |id |tokens |ngrams |
// +---+------------+----------------------------------+
// |1 |[a, b, c, d]|[a, b, c, d, ab, bc, cd, abc, bcd]|
// +---+------------+----------------------------------+
Related
I have a data frame with two Array column, trying to create a new column by joining both A and B sequentially.
val df = Seq((Seq("a","b","c"),Seq("d","5","6"))).toDF("A","B")
Expected output:
C: ["a d", "b 5", "c 6"]
I am exploring both the arrays and join it again using "import org.apache.spark.sql.functions.array" function but it's not giving expected result.
Got the expected result using the arrays_zip function as like below:
import org.apache.spark.sql.functions.arrays_zip
val output = df.withColumn(
"zipped", arrays_zip($"A", $"B")
)
I think Spark not have ready function for it. You can use user defined function here zip for example:
import spark.implicits._
def zipFunc: (Seq[String], Seq[String]) => Seq[String] = (x: Seq[String], y: Seq[String]) =>
x.zip(y).map{ case (xi, yi) => s"$xi $yi"}
val df = Seq(
(Seq("a","b","c"), Seq("d","5","6"))
).toDF("A","B")
df.printSchema()
val zipUDF = spark.udf.register("zipUdf", zipFunc)
df.withColumn("C", zipUDF($"A", $"B")).show()
prints:
+---------+---------+---------------+
| A| B| C|
+---------+---------+---------------+
|[a, b, c]|[d, 5, 6]|[a d, b 5, c 6]|
+---------+---------+---------------+
I have a spark Dataframe like Below.I'm trying to split the column into 2 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp",split(col("content"),"\\[")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s/col$i)): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid
Assuming a String to represent an Array of Words. Got your request. You can optimize the number of dataframes as well to reduce load on system. If there are more than 9 cols etc. you may need to use c00, c01, etc. for c10 etc. Or just use integer as name for columns. leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
Update:
Based on original title stating array of words. See other answer.
If new, then a few things here. Can also be done with dataset and map I assume. Here is a solution using DFs and rdd's. I might well investigate a complete DS in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode, can be done more elegeantly with def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
I have a dataframe of type
[value: array<struct<_1:string,_2:string>>]
I want to add a new column to this dataframe, which includes the length of all the unique elements retrieved by expanding all the tuples in each row. My primary purpose is to drop the row when this length is greater than a certain value.
What I have so far is just the length for each row - code shown below
val size = inputDF.rdd.map(_.getSeq[Row](0)).map(x => {
val aSet = scala.collection.mutable.Set[String]()
x.map {
case Row(aa: String, bb: String) =>
aSet += aa
aSet += bb
}
(aSet.size)
})
However when I try to add this as a new column to the inputDF data, it does not work.
Sample inputDF is:
val inputDF = Seq(
(Array(("A","B"))),
(Array(("C","D"),("C","E"),("D","F"),("F","G"),("G","H"))),
(Array(("C","D"))),
(Array(("P","Q"),("R","S"),("T","U"),("T","V")))
).toDF
And the expected column to be appended to this has values - 2,6,2,7
If you're using Spark version 2.4.0 or greater, then you can do the same without using UDF (which is supposed to be more optimized solution):
scala> inputDF.selectExpr("*", "size(array_distinct(flatten(transform(value, (v, i) -> array(v._1, v._2))))) as count").show(false)
+----------------------------------------+-----+
|value |count|
+----------------------------------------+-----+
|[[A, B]] |2 |
|[[C, D], [C, E], [D, F], [F, G], [G, H]]|6 |
|[[C, D]] |2 |
|[[P, Q], [R, S], [T, U], [T, V]] |7 |
+----------------------------------------+-----+
Read more about Apache Spark higher order functions:
https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html
I would suggest using a UDF that folds the struct elements into a Set and returns its size, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.Row
val df = Seq(
Seq(("a", "b"), ("b", "c")),
Seq(("d", "e"), ("f", "g"), ("g", "h")),
Seq(("i", "j"))
).toDF("c1")
val distinctElemCount = udf{ (arr: Seq[Row]) =>
arr.foldLeft(Set.empty[String])(
(acc, r) => acc + r.getString(0) + r.getString(1)
).size
}
df.withColumn("count", distinctElemCount($"c1")).show(false)
// +------------------------+-----+
// |c1 |count|
// +------------------------+-----+
// |[[a, b], [b, c]] |3 |
// |[[d, e], [f, g], [g, h]]|5 |
// |[[i, j]] |2 |
// +------------------------+-----+
I got requirement to collapse the rows and have wrappedarray. here is original data and expected result. need to do it in spark scala.
Original Data:
Column1 COlumn2 Units UnitsByDept
ABC BCD 3 [Dept1:1,Dept2:2]
ABC BCD 13 [Dept1:5,Dept3:8]
Expected Result:
ABC BCD 16 [Dept1:6,Dept2:2,Dept3:8]
It would probably be best to use DataFrame APIs for what you need. If you prefer using row-based functions like reduceByKey, here's one approach:
Convert the DataFrame to a PairRDD
Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept
Convert the resulting RDD back to a DataFrame:
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val df = Seq(
("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
((c1, c2), (u, ubd))
}.
reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
map{ case ((c1, c2), (u, ubd)) =>
val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
groupBy(_._1).mapValues(_.map(_._2).sum).
map{ case (d, u) => d + ":" + u }
( c1, c2, u, aggUBD)
}
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Array[String]) =>
Row(c1, c2, u, ubd)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
val temp = sqlContext.sql(s"SELECT A, B, C, (CASE WHEN (D) in (1,2,3) THEN ((E)+0.000)/60 ELSE 0 END) AS Z from TEST.TEST_TABLE")
val temp1 = temp.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
instead of the above code which is doing the computation(case evaluation) on hive layer I would like to have the transformation done in scala. How would I do it?
Is it possible to do the same while filling the data inside Map?
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
val tempTransform = temp.map(row => {
val z = List[Double](1, 2, 3).contains(row.getDouble(3)) match {
case true => row.getDouble(4) / 60
case _ => 0
}
Row(row.getShort(0), Row.getString(1), Row.getDouble(2), z)
})
val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
you can use this syntax as well
new_df = old_df.withColumn('target_column', udf(df.name))
as reffered by this example
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
In your case, execute sql which be dataframe like below
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
and apply withColumn with case or when otherwise or if needed spark udf
, call scala function logic instead of hiveudf