Check if a set of a field values is mapped against single value of another field in dataframe - scala

Consider the below dataframe with store and books available:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S2 | B11 | 11$ |
| S1 | B15 | 29$ | <<
| S2 | B10 | 25$ |
| S2 | B16 | 30$ |
| S1 | B09 | 21$ | <
| S3 | B15 | 22$ |
+-----------+------+-------+
Suppose we need to find the stores which have two books namely, B11 and B15. Here, the answer is S1 as it stores both books.
One way of doing it is to find intersection of the stores having book B11 with the stores having book B15 using below command:
val df_select = df.filter($"book" === "B11").select("storename")
.join(df.filter($"book" === "B15").select("storename"), Seq("storename"), "inner")
which contains the name of stores having both.
But instead I want a table
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S1 | B15 | 29$ | <<
| S1 | B09 | 21$ | <
+-----------+------+-------+
which contains all records related to that fulfilling store. Note that B09 is not left out. (Use case : the user can explore some other books as well in the same store)
We can do this by doing another intersection of above result with original dataframe:
df_select.join(df, Seq("storename"), "inner")
But, I see scalability and readability issue with step 1 as I have to keep on joining one dataframe to another if the number of books are more than 2. Lots of pain to do and that's error-prone too. Is there a more elegant way to do the same? Something like:
val storewise = Window.partitionBy("storename")
df.filter($"book".contains{"B11", "B15"}.over(storewise))

Found a simple solution using array_except function.
Add required set-of-field-values as an array in a new column, req_books
Add a column, all_books, storing all the books stored in a store using Window.
Using above two columns find if the store misses any required book, and filter them out if it misses anything.
Drop the excess columns created.
Code:
val df1 = df.withColumn("req_books", array(lit("B11"), lit("B15")))
.withColumn("all_books", collect_set('book).over(Window.partitionBy('storename)))
df1.withColumn("missing_books", array_except('req_books, 'all_books))
.filter(size('missing_books) === 0)
.drop('missing_book).drop('all_books).drop('req_books).show

Using Window Functions to create array of all values and check if it contains all the necessary values.
val bookList = List("B11", "B15") //list of books to search
def arrayContainsMultiple(bookList: Seq[String]) = udf((allBooks: WrappedArray[String]) => allBooks.intersect(bookList).sorted.equals(bookList.sorted))
val filteredDF = input
.withColumn("allBooks", collect_set($"books").over(Window.partitionBy($"storename")))
.filter(arrayContainsMultiple(bookList)($"allBooks"))
.drop($"allBooks")

Related

pyspark - left join with random row matching the key

I am looking to a way to join 2 dataframes but with random rows matching the key. This strange request is due to a very long calculation to generate positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID is unique (around 7k).
+--------+-------+
| areaID | count |
+--------+-------+
| A | 10 |
| B | 30 |
| C | 1 |
| D | 25 |
| E | 18 |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows for each areaID with 2 positions columns x (float) and y (float). This dataframe is around 7 millions rows.
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.0 | 0 |
| A | 0.1 | 0.7 |
| A | 0.3 | 1 |
| A | 0.1 | 0.3 |
| ... | | |
| E | 3.15 | 4.17 |
| E | 3.14 | 4.22 |
+--------+------+------+
I would like to end with a dataframe like:
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.1 | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A | 0.0 | 0.18 | < row 2/10
| A | 0.09 | 0.22 | < row 3/10
| ... | | |
| E | 3.14 | 4.22 | < row 1/18
| ... | | |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by areaID and sample count rows of this dataframe. The problem is that this is quite slow with 7k load/filtering/sampling processes.
The second approach is to do an outer join on areaID, then shuffle the dataframe (but seems quite complex), apply a rank and keep when the rank <= count but I don't like the approch to load a lot a data to filter afterward.
I am wondering if there is a way to do it using a "random" left join ? In that case, I'll duplicate each row count times and apply it.
Many thanks in advance,
Nicolas
One can interpret the question as stratified sampling of the second dataframe where the number of samples to be taken from each subpopulation is given by the first dataframe.
There is Spark function for stratified sampling.
df1 = ...
df2 = ...
#first calculate the fraction for each areaID based on the required number
#given in df1 and the number of rows for the areaID in df2
fractionRows = df2.groupBy("areaId").agg(F.count("areaId").alias("count2")) \
.join(df1, "areaId") \
.withColumn("fraction", F.col("count") / F.col("count2")) \
.select("areaId", "fraction") \
.collect()
fractions = {f[0]:f[1] for f in fractionRows}
#now run the statified samling
df2.stat.sampleBy("areaID", fractions).show()
There is caveat with this approach: as the sampling done by Spark is a random process, the exact number of rows given in the first dataframe will not always be met exactly.
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as filter with a random variable indicating whether to keep to row or not. Returning multiple copies of a single row will therefore not work.
A similar idea can be used to support fractions > 1.0: instead of using a filter, an udf is created that returns an array. The array contains one entry per copy of the row that should be contained in the result. After applying the udf, the array column is exploded and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T
fractions = {'A': 1.5, 'C': 0.5}
def ff(stratum,x):
fraction = fractions.get(stratum, 0.0)
ret=[]
while fraction >= 1.0:
ret.append("x")
fraction = fraction - 1
if x < fraction:
ret.append("x")
return ret
f=F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()
seed=42
df2.withColumn("r", F.rand(seed)) \
.withColumn("r",f("areaID", F.col("r")))\
.withColumn("r", F.explode("r")) \
.drop("r") \
.show()

Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me Cartesian product error (I made sure to define the join as "left").
I tried and failed to pivot summaryDF (as there are no aggregations to do??) then join both dataframes together.
This is the sort of thing I've tried, and always getting a NullPointerException. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because its small and a "constant" lookup table
summaryBroadcast
.value
.foreach{ x =>
// searchValue = Value (e.g. `U`),
// replaceValue = ValueDescription (e.g. `Unknown`),
val field = x(0).toString
val searchValue = x(1).toString
val replaceValue = x(2).toString
// error catching as summary data does not exactly mapping onto field names
// the joys of business people working in Excel...
try {
// I'm using regexp_replace because I'm lazy
valuesDF = valuesDF
.withColumn( attribute, regexp_replace(col(attribute), searchValue, replaceValue ))
}
catch {case _: Exception =>
null
}
}
Any ideas? Advice? Thanks.
First, we'll need a function that executes a join of valuesDf with summaryDf by Value and the respective pair of Favourite* and Field:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
sourceDf.as("src") // alias it to help selecting appropriate columns in the result
// the join
.join(summaryDf, $"Value" === col(colName) && $"Field" === colName, "left")
// we do not need the original `Favourite*` column, so drop it
.drop(colName)
// select all previous columns, plus the one that contains the match
.select("src.*", "ValueDesc")
// rename the resulting column to have the name of the source one
.withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
In case a value from valuesDf does not match with anything in summaryDf, the resulting cell in this solution will contain null. If you want just to replace it with Unknown value, instead of .select and .withColumnRenamed lines above use:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)

Best way to join multiples small tables with a big table in Spark SQL

I'm doing a join multiples tables using spark sql. One of the table is very big and the others are small (10-20 records). really I want to replace values in the biggest table using others tables that contain pairs of key-value.
i.e.
Bigtable:
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| A1 | B1 | C1 | D1 | ....
| A2 | B1 | C2 | D2 | ....
| A1 | B1 | C3 | D2 | ....
| A2 | B2 | C3 | D1 | ....
| A1 | B2 | C2 | D1 | ....
.
.
.
.
.
Table2:
| Col 1 | Col 2
----------------
| A1 | 1a
| A2 | 2a
Table3:
| Col 1 | Col 2
----------------
| B1 | 1b
| B2 | 2b
Table3:
| Col 1 | Col 2
----------------
| C1 | 1c
| C2 | 2c
| C3 | 3c
Table4:
| Col 1 | Col 2
----------------
| D1 | 1d
| D2 | 2d
Expected table is
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| 1a | 1b | 1c | 1d | ....
| 2a | 1b | 2c | 2d | ....
| 1a | 1b | 3c | 2d | ....
| 2a | 2b | 3c | 1d | ....
| 1a | 2b | 2c | 1d | ....
.
.
.
.
.
My question is; which is best way to join the tables. (Think that there are 100 or more small tables)
1) Collecting the small dataframes, to transforming it to maps, broadcasting the maps and transforming the big datataframe in one only step
bigdf.transform(ds.map(row => (small1.get(row.col1),.....)
2) Broadcasting the tables and making join using select method.
spark.sql("
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)")
3) Broadcasting the tables and Concatenate multiples joins
bigtable.join(broadcast(small1), bigtable('col1') ==small1('col1')).join...
Thanks in advance
You might do:
broadcast all small tables (automaticaly done by setting spark.sql.autoBroadcastJoinThreshold slightly superior to the small table number of rows)
run a sql query that join the big table such
val df = spark.sql("
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)")
EDIT:
Choosing between sql and spark "dataframe" syntax:
The sql syntax is more readable, and less verbose than the spark syntax (for a database user perspective.)
From a developper perspective, dataframe syntax might be more readeble.
The main advantage of using the "dataset" syntax, is the compiler will be able to track some error. Using any string syntax such sql or columns name (col("mycol")) will be spotted at run time.
Best way, as already written in answers, to broadcast all small tables. It can also be done in SparkSQL using BROADCAST hint:
val df = spark.sql("""
select /*+ BROADCAST(t2, t3) */
*
from bigtable t1
left join small1 t2 using(id1)
left join small2 t3 using(id2)
""")
If the data in your small tables is less than the threshold size and physical files for your data is in parquet format then spark will automatically broadcast the small tables but if you are reading the data from some other data sources like sql, PostgreSQL etc. then some times spark does not broadcast the table automatically.
If you know that the tables are small sized and the size of table is not expected to increase ( In case of lookup tables) you can explicitly broadcast the data frame or table and in this way you can efficiently join a larger table with the small tables.
you can verify that the small table is getting broadcasted using the explain command on the data frame or you can do that from Spark UI also.

How to get back aggregate values across 2 dimensions using Python Cubes?

Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my datatables in pictorial form:
The same in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use cubes to be able to do a display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to do sort by store, say descending order for, S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store','Product'],
order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])
I didn't get what I want.
I got it like this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination and if the products not sold in that store it won't show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including with product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table

How to rank the data set having multiple columns in Scala?

I have data set like this which i am fetching from csv file but how to
store in Scala to do the processing.
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234| 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
Answering your question here is the solution, this code reads a csv and order by the third column
object CSVDemo extends App {
println("recent, freq, monitor")
val bufferedSource = io.Source.fromFile("./data.csv")
val list: Array[Array[String]] = (bufferedSource.getLines map { line => line.split(",").map(_.trim) }).toArray
val newList = list.sortBy(_(2))
newList map { line => println(line.mkString(" ")) }
bufferedSource.close
}
you read the file and you parse it to an Array[Array[String]], then you order by the third column, and you print
Here I am using the list and try to normalize each column at a time and then concatenating them. Is there any other way to iterate column wise and normalize them. Sorry my coding is very basic.
val col1 = newList.map(line => line.head)
val mi = newList.map(line => line.head).min
val ma = newList.map(line => line.head).max
println("mininumn value of first column is " +mi)
println("maximum value of first column is : " +ma)
// calculate scale for the first column
val scale = col1.map(x => math.round((x.toInt - mi.toInt) / (ma.toInt - mi.toInt)))
println("Here is the normalized range of first column of the data")
scale.foreach(println)