How to find similar rows by matching column values spark? - scala

So i have a data set like
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data in the DF like this
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a| att-b| att-c| att-d| att-e| att-f| att-g| att-h| att-i| att-j| customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the comapreColumns function.
which compares the columns of the two dataframes(userDF and flattenedDF) and returns a new DF as sample output.
how to do that? Like, compare each row's and column in flattenedDF with userDF and count++ if they match? e.g att-a with att-a att-b with att-b.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
dataFrame.filter($"customer" === customerID).toDF()
}
def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
val userDF = dataFrame.transform(getCustomer(customerID))
userDF.printSchema()
userDF
}
Sample Output:
+--------------------+-----------+
| customer | similarity_score |
+--------------------+-----------+
|customer-1 | -1 | its the same as the reference customer so to ignore '-1'
|customer-12 | 2 |
|customer-3 | 2 |
|customer-44 | 5 |
|customer-5 | 1 |
|customer-6 | 10 |
Thanks

Related

How do I explode a nested Struct in Spark using Scala

I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it put data in 2 columns in my table for that what I am doing is
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can be only used once and I get error in the select so how do I use explode/or something else to achieve what I am trying to.
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of stackoverflow threads but none of it helped me. It is possible that such question is already answered but my understanding of scala is very less which might have made me not understand the answer. If this is a duplicate then someone could direct me to the correct answer. Any help is appreciated.
First, explode the array in a temporary struct-column, then unpack it:
val serialNumberEvents = snDump
.withColumn("tmp",explode((col("snetlist"))))
.select(
col("eventid"),
col("lotnumber"),
col("companyid"),
// unpack struct
col("tmp.serialnumber").as("serialN"),
col("tmp.eventtime").as("serialT")
)
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather then Array(Struct). You can simply get this by also creating an array instead of a struct out of your columns
Another approach, if needing to explode twice, is as follows - for another example, but to demonstrate the point:
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer in response to your assertion you can only explode once.

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to be able to only take the x Ids with the highest counts. x is number of these I want which is determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable number is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of scala all together and completely wastes a lot of runtime. Being new, I am unsure about the best ways to go about this
My current method is to first get a distinct list of the query column and create an iterator. Second I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where("query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe and add each of these one by one to the empty dataframe and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using Window-Functions to get the rank, then groupBy to aggrgate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
.withColumn("rank",row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
.where($"rank"<=howMany)
.groupBy($"query")
.agg(
collect_list($"Id").as("Ids"),
max($"count").as("count")
)

Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me Cartesian product error (I made sure to define the join as "left").
I tried and failed to pivot summaryDF (as there are no aggregations to do??) then join both dataframes together.
This is the sort of thing I've tried, and always getting a NullPointerException. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because its small and a "constant" lookup table
summaryBroadcast
.value
.foreach{ x =>
// searchValue = Value (e.g. `U`),
// replaceValue = ValueDescription (e.g. `Unknown`),
val field = x(0).toString
val searchValue = x(1).toString
val replaceValue = x(2).toString
// error catching as summary data does not exactly mapping onto field names
// the joys of business people working in Excel...
try {
// I'm using regexp_replace because I'm lazy
valuesDF = valuesDF
.withColumn( attribute, regexp_replace(col(attribute), searchValue, replaceValue ))
}
catch {case _: Exception =>
null
}
}
Any ideas? Advice? Thanks.
First, we'll need a function that executes a join of valuesDf with summaryDf by Value and the respective pair of Favourite* and Field:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
sourceDf.as("src") // alias it to help selecting appropriate columns in the result
// the join
.join(summaryDf, $"Value" === col(colName) && $"Field" === colName, "left")
// we do not need the original `Favourite*` column, so drop it
.drop(colName)
// select all previous columns, plus the one that contains the match
.select("src.*", "ValueDesc")
// rename the resulting column to have the name of the source one
.withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
In case a value from valuesDf does not match with anything in summaryDf, the resulting cell in this solution will contain null. If you want just to replace it with Unknown value, instead of .select and .withColumnRenamed lines above use:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)

How to append List[String] to every row of DataFrame?

After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String]=(lvalue1, lvalue2, lvalue3, ...)
And I have a Dataframe with n values:
dfield 1 | dfield 2 | dfield 3
___________________________
dvalue1 | dvalue2 | dvalue3
dvalue1 | dvalue2 | dvalue3
I want to append the values of the List at the beggining of my Dataframe, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield4 | dfield5 | dfield6
__________________________________________________________
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
I have found something using a UDF. Could be this correct for my purpose?
Regards.
TL;DR Use select or withColumn with lit function.
I'd use lit function with select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + colNames.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = colNames.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+

How to rank the data set having multiple columns in Scala?

I have data set like this which i am fetching from csv file but how to
store in Scala to do the processing.
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234| 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
Answering your question here is the solution, this code reads a csv and order by the third column
object CSVDemo extends App {
println("recent, freq, monitor")
val bufferedSource = io.Source.fromFile("./data.csv")
val list: Array[Array[String]] = (bufferedSource.getLines map { line => line.split(",").map(_.trim) }).toArray
val newList = list.sortBy(_(2))
newList map { line => println(line.mkString(" ")) }
bufferedSource.close
}
you read the file and you parse it to an Array[Array[String]], then you order by the third column, and you print
Here I am using the list and try to normalize each column at a time and then concatenating them. Is there any other way to iterate column wise and normalize them. Sorry my coding is very basic.
val col1 = newList.map(line => line.head)
val mi = newList.map(line => line.head).min
val ma = newList.map(line => line.head).max
println("mininumn value of first column is " +mi)
println("maximum value of first column is : " +ma)
// calculate scale for the first column
val scale = col1.map(x => math.round((x.toInt - mi.toInt) / (ma.toInt - mi.toInt)))
println("Here is the normalized range of first column of the data")
scale.foreach(println)