How to decode HTML entities in Spark-scala? - scala

I have a spark code to read some data from a database.
One of the columns (of type string) named "title" contains the following data.
+-------------------------------------------------+
|title |
+-------------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------------+
I'd like to remove the HTML entities and decode it to look as given below.
+-------------------------------------------+
|title |
+-------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------+
There is a library "html-enitites" for node.js that does exactly what I am looking for,
but i am unable to find something similar for spark-scala.
What would be good approach to do this?

You can use org.apache.commons.lang.StringEscapeUtils with a help of UDF to achieve this.
import org.apache.commons.lang.StringEscapeUtils;
val decodeHtml = (html:String) => {
StringEscapeUtils.unescapeHtml(html);
}
val decodeHtmlUDF = udf(decodeHtml)
df.withColumn("title", decodeHtmlUDF($"title")).show()
/*
+--------------------+
| title|
+--------------------+
| Example sentence |
| Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/

Related

How to create Scala trait which stores data from other columns in dataset and then create new dataset with column storing the trait in Scala?

I am new to Scala and am currently studying datasets for Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (see below). In the new dataset, I aim to have a new column which contains a Scala trait Seq[order_summary]. The Scala trait stores data the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.
I have implemented input_dataset.groupyBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine different columns together. However, I would like to use a Scala trait instead and am also stuck with matching the name to the ticket number. Would anyone know how to resolve this or point me towards the right direction?
Input dataset: input_dataset
Name Type is String. Ticket Number Type is Int
+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam| 123 | AB |
|Adam| 456 | AC |
|Adam| 789 | AD |
|Bob | 1234 | BA |
|Bob | 5678 | BB |
|Sam | 987 | CA |
|Sam | 654 | CB |
|Sam | 321 | CC |
|Sam | 876 | CD |
+----+---------------+-------------+
Output dataset
Name Type is String. Purchase Order Summary is a trait, Seq[order_summary]
+----+-----------------------------------------------------+
|Name| Purchase Order Summary |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD)) |
|Bob |((Bob,1234,BA),(Bob,5678,BB)) |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+
Pretty sure Spark has a map method.
So you could just create a case class
case class PurchaseOrderSummary(name: String, ticketNum: Long, seatNum: Int)
and instantiate it inside a map from your DF, then collect it into a list.
df.map(row => PurchaseOrderSummary(row.getString(0), row.getLong(1), row.getInt(2))).collectAsList
collectAsList should retrieve data from the RDD and transform it to a scala List[PurchaseOrderSummary].

How do I explode a nested Struct in Spark using Scala

I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it put data in 2 columns in my table for that what I am doing is
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can be only used once and I get error in the select so how do I use explode/or something else to achieve what I am trying to.
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of stackoverflow threads but none of it helped me. It is possible that such question is already answered but my understanding of scala is very less which might have made me not understand the answer. If this is a duplicate then someone could direct me to the correct answer. Any help is appreciated.
First, explode the array in a temporary struct-column, then unpack it:
val serialNumberEvents = snDump
.withColumn("tmp",explode((col("snetlist"))))
.select(
col("eventid"),
col("lotnumber"),
col("companyid"),
// unpack struct
col("tmp.serialnumber").as("serialN"),
col("tmp.eventtime").as("serialT")
)
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather then Array(Struct). You can simply get this by also creating an array instead of a struct out of your columns
Another approach, if needing to explode twice, is as follows - for another example, but to demonstrate the point:
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer in response to your assertion you can only explode once.

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to be able to only take the x Ids with the highest counts. x is number of these I want which is determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable number is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of scala all together and completely wastes a lot of runtime. Being new, I am unsure about the best ways to go about this
My current method is to first get a distinct list of the query column and create an iterator. Second I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where("query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe and add each of these one by one to the empty dataframe and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using Window-Functions to get the rank, then groupBy to aggrgate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
.withColumn("rank",row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
.where($"rank"<=howMany)
.groupBy($"query")
.agg(
collect_list($"Id").as("Ids"),
max($"count").as("count")
)

Generating all possible combinations from a Data Frame in Apache Spark

I'm trying to do something quite simple where I have 2 arrays that have been converted into a Data Frame, and I want to show all possible combinations. So for example my output at the moment looks something like this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| Second | P |
+-----------|-----------+
However what I'm actually looking for is this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| First | P |
| Second | T |
| Second | P |
+-----------|-----------+
So far I've got some fairly straight forward code to map my arrays into columns but being quite new to using both Scala and Spark I'm not sure how I'd grab all those combinations. Here is what I have so far:
val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")
val xs = Array(firstColumnValues, secondColumnValues).transpose
val mapped = sparkContext.parallelize(xs).map(ys => Row(ys(0), ys(1)))
val df = mapped.toDF("A", "B")
df.show
...
case class Row(first: String, second: String)
Thanks in advance for any help
In Spark 2.3
val firstColumnValues = sc.parallelize(Array("First", "Second")).toDF("A")
val secondColumnValues = sc.parallelize(Array("T", "P")).toDF("B")
val fullouter = firstColumnValues.crossJoin(secondColumnValues).show

Spark explode multiple columns of row in multiple rows

I have a problem with converting one row using three 3 columns into 3 rows
For example:
<pre>
<b>ID</b> | <b>String</b> | <b>colA</b> | <b>colB</b> | <b>colC</b>
<em>1</em> | <em>sometext</em> | <em>1</em> | <em>2</em> | <em>3</em>
</pre>
I need to convert it into:
<pre>
<b>ID</b> | <b>String</b> | <b>resultColumn</b>
<em>1</em> | <em>sometext</em> | <em>1</em>
<em>1</em> | <em>sometext</em> | <em>2</em>
<em>1</em> | <em>sometext</em> | <em>3</em>
</pre>
I just have dataFrame which is connected with first schema(table).
val df: dataFrame
Note: I can do it using RDD, but do we have other way? Thanks
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
I you further want to keep the column names, you can use a trick that consists in creating a column of arrays that contains the array of the value and the name. First create your expression
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you can not use generator on nested expressions, you need 2 select here)
df.select($"ID, $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose though, there might be more elegant way to do this.