Concatenate two arrays element with all possible combinations - pyspark

I have a Hive table with:
| row                      | column                                          |
| ------------------------ | ----------------------------------------------- |
| null                     | ["black", "blue", "orange"]                     |
| ["mom", "dad", "sister"] | ["amazon", "fiipkart", "meesho", "jiomart", ""] |
Using Spark SQL, I would like to create a new column with an array of all possible combinations:
| row        | column          | output                                 |
| ---------- | --------------- | -------------------------------------- |
| null       | ["b", "s", "m"] | ["b", "s", "m"]                        |
| ["1", "2"] | ["a", "b", ""]  | ["1_a", "1_b", "1", "2_a", "2_b", "2"] |

There are two ways to implement this.
The first way uses array transformations:
df
// we first replace a null row with an array containing one empty string
.withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
// we create an array of 1s whose length is the row size times the column size
.withColumn("repeated", array_repeat(lit(1), size(col("row")) * size(col("column"))))
// we build an index pair for each position; cycling i modulo each size visits
// every (row, column) pair here because the two sizes are coprime
.withColumn("indexes", expr("transform(repeated, (x, i) -> array(i % size(row), i % size(column)))"))
// we concatenate the two elements with an underscore
.withColumn("concat", expr("transform(indexes, (x, i) -> concat_ws('_', row[x[0]], column[x[1]]))"))
// we remove a leading or trailing underscore (if found)
.withColumn("output", expr("transform(concat, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Output:
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|row |column |repeated |indexes |concat |output |
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|[] |[b, s, m]|[1, 1, 1] |[[0, 0], [0, 1], [0, 2]] |[_b, _s, _m] |[b, s, m] |
|[1, 2]|[a, b, ] |[1, 1, 1, 1, 1, 1]|[[0, 0], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]|[1_a, 2_b, 1_, 2_a, 1_b, 2_]|[1_a, 2_b, 1, 2_a, 1_b, 2]|
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
The second way uses explode and other functions:
df
// we first replace a null row with an array containing one empty string
.withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
// we create a unique ID to group by and collect later
.withColumn("id", monotonically_increasing_id())
// we explode the column and the row arrays
.withColumn("column", explode(col("column")))
.withColumn("row", explode(col("row")))
// we combine the two values with an underscore as separator
.withColumn("output", concat_ws("_", col("row"), col("column")))
// we group by the ID again and collect sets
.groupBy("id").agg(
  collect_set("row").as("row"),
  collect_set("column").as("column"),
  collect_set("output").as("output")
)
.drop("id")
// we remove a leading or trailing underscore (e.g. _1 or 1_)
.withColumn("output", expr("transform(output, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Final output:
+------+---------+--------------------------+
|row |column |output |
+------+---------+--------------------------+
|[1, 2]|[b, a, ] |[2, 1_a, 2_a, 1, 1_b, 2_b]|
|[] |[s, m, b]|[s, m, b] |
+------+---------+--------------------------+
I left the other columns in so you can see what is happening. Good luck!
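The question is tagged pyspark while the snippets above are Scala; as a rough sketch of the same idea in PySpark (assuming Spark 2.4+ for the higher-order array functions, and the column names from the question), something like this should work:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, expr, lit, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(None, ["b", "s", "m"]), (["1", "2"], ["a", "b", ""])],
    "row array<string>, column array<string>",
)

result = (
    df
    # replace a null row with an array holding one empty string
    .withColumn("row", when(col("row").isNull(), array(lit(""))).otherwise(col("row")))
    # nested transform builds every pair, flatten collapses the nesting,
    # regexp_replace trims the stray underscores
    .withColumn("output", expr("""
        transform(
            flatten(transform(row, x -> transform(column, y -> concat_ws('_', x, y)))),
            z -> regexp_replace(z, '(_$)|(^_)', '')
        )
    """))
)
result.show(truncate=False)
The nested transform/flatten is just a Spark SQL way to express the cross product of two arrays, so the whole expression can also be used directly in a SQL statement.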

One straightforward and easy solution is to use a custom UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def combinations(a, b):
    c = []
    for x in a:
        for y in b:
            if not x:
                c.append(y)
            elif not y:
                c.append(x)
            else:
                c.append(f"{x}_{y}")
    return c

# declare the return type so the result is an array column, not a string
udf_combination = udf(combinations, ArrayType(StringType()))

df = spark.createDataFrame([
    [["1", "2"], ["a", "b", ""]]
], ["row", "column"])

df.withColumn("res", udf_combination(col("row"), col("column"))).show(truncate=False)
# +------+--------+--------------------------+
# |row |column |res |
# +------+--------+--------------------------+
# |[1, 2]|[a, b, ]|[1_a, 1_b, 1, 2_a, 2_b, 2]|
# +------+--------+--------------------------+
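Since the question asks for Spark SQL, the same UDF can presumably also be registered for SQL use; combine_arrays and the temp view name t below are just illustrative names:
from pyspark.sql.types import ArrayType, StringType

# register the Python function so it can be called from SQL
spark.udf.register("combine_arrays", combinations, ArrayType(StringType()))
df.createOrReplaceTempView("t")
spark.sql("SELECT row, column, combine_arrays(row, column) AS output FROM t").show(truncate=False)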

Related

Minizinc : how to check if a row exists in an 2D array

I have a 2D array like this:
array[1..3,1..3] of var 1..3: t;
How can I write a constraint to check whether some row of t matches [1,2,3] (for example)?
I wrote it like this, but it doesn't work; it returns a type error:
constraint [1,2,3] in t;
Here is one way to check that [1,2,3] is in t:
array[1..3,1..3] of var 1..3: t;
var bool: check;
constraint
    exists(i in 1..3) (
        t[i,..] == [1,2,3]
    ) <-> check
;
Some solutions:
...
t =
[| 3, 2, 3
| 3, 3, 3
| 3, 3, 3
|];
check = false;
----------
t =
[| 1, 2, 3
| 3, 3, 3
| 3, 3, 3
|];
check = true;
----------
t =
[| 2, 1, 2
| 3, 3, 3
| 3, 3, 3
|];
check = false;
----------
...
If you want to ensure that [1,2,3] is in t, you can skip the check variable:
array[1..3,1..3] of var 1..3: t;
constraint
    exists(i in 1..3) (
        t[i,..] == [1,2,3]
    )
;
Some solutions:
t =
[| 1, 2, 3
| 3, 3, 3
| 3, 3, 3
|];
----------
t =
[| 1, 2, 3
| 2, 3, 3
| 3, 3, 3
|];
----------
t =
[| 1, 2, 3
| 1, 3, 3
| 3, 3, 3
|];
----------
...

Merge two rows by conditional variable

Suppose I have a data frame like the following:
a <- data.frame(A = c("01.01.2000", "02.01.2000", "03.01.2000", "01.01.2000.1", "04.01.2000", "04.01.2000.1"),
                B = c(1, NA, 2, 1, 4, NA),
                C = c(NA, NA, 4, NA, NA, 1))
Now I want to merge every row that has ".1" at the end of A into its "counterpart" without the .1. So in this example I want to join 01.01.2000.1 to 01.01.2000 and 04.01.2000.1 to 04.01.2000 (it is always this pattern, .1 at the end).
If both rows have a value, the values should be comma-separated; if one of them is NA, the other one should be taken.
The desired output is:
b <- data.frame(A = c("01.01.2000", "02.01.2000", "03.01.2000", "01.01.2000.1", "04.01.2000"),
                B = c("1,1", NA, 2, 1, 4),
                C = c(NA, NA, 4, NA, 1))
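The question is in R, but as a rough illustration of the described merge rule (strip a trailing ".1", then comma-join the non-NA values of the rows that share the same date), a pandas sketch might look like this; it is not an R answer, and the helper name merge_values is made up:
import pandas as pd

a = pd.DataFrame({
    "A": ["01.01.2000", "02.01.2000", "03.01.2000", "01.01.2000.1", "04.01.2000", "04.01.2000.1"],
    "B": [1, None, 2, 1, 4, None],
    "C": [None, None, 4, None, None, 1],
})

# strip a trailing ".1" so each ".1" row maps onto its counterpart
a["key"] = a["A"].str.replace(r"\.1$", "", regex=True)

def merge_values(s):
    # keep non-NA values and comma-join them if there is more than one
    vals = [f"{v:g}" for v in s.dropna()]
    return ",".join(vals) if vals else None

b = a.groupby("key", sort=False)[["B", "C"]].agg(merge_values).reset_index()
print(b)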

Distinct Word count per line

This is a little different from the common word count program. I am trying to get the distinct word count per line.
Input:
Line number one has six words
Line number two has two words
Expected output:
line1 => (Line,1),(number,1),(one,1),(has,1),(six,1),(words,1)
line2 => (Line,1),(number,1),(two,2),(has,1),(words,1)
Can anyone please guide me?
You can do this with the DataFrame built-in functions explode, split, collect_set and groupBy.
//input data
val df=Seq("Line number one has six words","Line number two has has two words").toDF("input")
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn("words",explode(split($"input","\\s+"))) //split by space and explode
.groupBy("input","words") //group by on both columns
.count()
.withColumn("line_word_count",struct($"words",$"count")) //create struct
.groupBy("input") //grouping by input data column
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)
Result:
+---------------------------------+------------------------------------------------------------------+
|input |line_word_count |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[[one, 1], [has, 1], [six, 1], [number, 1], [words, 1], [Line, 1]]|
|Line number two has has two words|[[has, 2], [two, 2], [words, 1], [number, 1], [Line, 1]] |
+---------------------------------+------------------------------------------------------------------+
If you are expecting line numbers, then use the concat and monotonically_increasing_id functions:
df.withColumn("line",concat(lit("line"),monotonically_increasing_id()+1))
.withColumn("words",explode(split($"input","\\s+")))
.groupBy("input","words","line")
.count()
.withColumn("line_word_count",struct($"words",$"count"))
.groupBy("line")
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)
Result:
+-----+------------------------------------------------------------------+
|line |line_word_count |
+-----+------------------------------------------------------------------+
|line1|[[one, 1], [has, 1], [six, 1], [words, 1], [number, 1], [Line, 1]]|
|line2|[[has, 2], [two, 2], [number, 1], [words, 1], [Line, 1]] |
+-----+------------------------------------------------------------------+
Note: with larger datasets, to keep the generated line numbers consecutive we need to do .repartition(1) first.
Here is another way, using the RDD API:
val rdd = df.withColumn("output", split($"input", " ")).rdd.map(l => (
l.getAs[String](0),
l.getAs[Seq[String]](1).groupBy(identity).mapValues(_.size).map(identity))
)
val dfCount = spark.createDataFrame(rdd).toDF("input", "output")
Not a big fan of using UDF, but it can also be done like this:
import org.apache.spark.sql.functions.udf
val mapCount: Seq[String] => Map[String, Integer] = _.groupBy(identity).mapValues(_.size)
val countWordsUdf = udf(mapCount)
df.withColumn("output", countWordsUdf(split($"input", " "))).show(false)
Gives:
+---------------------------------+------------------------------------------------------------------+
|input |output |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[number -> 1, Line -> 1, has -> 1, six -> 1, words -> 1, one -> 1]|
|Line number two has has two words|[number -> 1, two -> 2, Line -> 1, has -> 2, words -> 1] |
+---------------------------------+------------------------------------------------------------------+
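For completeness, here is a rough PySpark sketch of the same explode/groupBy/collect_set approach (using the answer's sample sentences); this is an illustration rather than the original poster's code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Line number one has six words",), ("Line number two has has two words",)], ["input"]
)

(df.withColumn("words", F.explode(F.split(F.col("input"), r"\s+")))   # split on whitespace and explode
   .groupBy("input", "words").count()                                 # per-line word counts
   .withColumn("line_word_count", F.struct("words", "count"))
   .groupBy("input")
   .agg(F.collect_set("line_word_count").alias("line_word_count"))
   .show(truncate=False))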

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 1]),
                        IndexedRow(1, [3, 6, 9, 12]),
                        ])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row IDs. I want to convert this data to a RowMatrix, where the columns and rows are laid out as in the following mat:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
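With the transposed RowMatrix, the columnSimilarities() calls from the original goal should presumably follow directly, for example:
# column similarities on the transposed matrix (the threshold argument is optional)
exact = transposed_rowmatrix_features.columnSimilarities()
approx = transposed_rowmatrix_features.columnSimilarities(0.05)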

GWT Flow Table fill cell

In GWT, I am trying to get a FlexTable with buttons in the grid so that clicking one of the buttons leads to something happening. However, when I look at the result in the browser, I get the following:
+-+-+ +-+
|a|b| |c|
+-+-+ +-+
|e|
+-+-+-+-+-+
|d|h|i|f|g|
+-+-+-+-+-+
|j|
+-+
|k|
+-+
whereas what I want is this:
+-+---+---+
|a| b | c |
+-+---+-+-+
| | e | | |
| +-+-+ | |
| |h|i|f|g|
|d+-+-+ | |
| | j | | |
| +---+-+-+
| | k |
+-+-------+
I tried using table.getFlexCellFormatter().setColSpan(row, col, span) as recommended in the API, but this led to the first result above. Any ideas on how I should proceed?
@Stefan: my code is
public MyPanel() {
    Button b;
    FlexTable ft = new FlexTable();
    ClickHandler click = new ClickHandler() {
        @Override
        public void onClick(ClickEvent event) {
            // do something here with the button that was clicked
        }
    };
    b = new Button("a");
    b.addClickHandler(click);
    ft.setWidget(0, 0, b);
    // do the same for button "b" at (0,1), "c" at (0,3), "d" at (1,0), "e" at (1,1),
    // "f" at (1,3), "g" at (1,4), "h" at (2,1), "i" at (2,2), "j" at (3,1), "k" at (4,1)
    ft.getFlexCellFormatter().setColSpan(0, 1, 2);
    ft.getFlexCellFormatter().setColSpan(0, 3, 2);
    ft.getFlexCellFormatter().setRowSpan(1, 0, 4);
    ft.getFlexCellFormatter().setColSpan(1, 1, 2);
    ft.getFlexCellFormatter().setRowSpan(1, 3, 3);
    ft.getFlexCellFormatter().setRowSpan(1, 4, 3);
    ft.getFlexCellFormatter().setColSpan(3, 1, 2);
    ft.getFlexCellFormatter().setColSpan(4, 1, 4);
}
Setting the colspan or rowspan on a table cell doesn't change the adjacent cells' indexes; it just shoves them out of the way. For example, to achieve this table:
+-+-+
|a|b|
| +-+
| |c|
+-+-+
We use:
ft.setWidget(0, 0, a);
ft.setWidget(0, 1, b);
ft.setWidget(1, 0, c); // !!!
ft.getFlexCellFormatter().setRowSpan(0,0,2);
See, widget c goes into the first cell that belongs to the second row, because the rowspanned "a" cell doesn't count toward that row's cell indexes. It may help to see things this way if you consider the HTML that would be generated:
<table>
  <tr>
    <td rowspan="2">a</td><td>b</td>
  </tr>
  <tr>
    <td>c</td>
  </tr>
</table>
Adjust your cell indexes and everything should line up the way you want them to:
ft.setWidget(0, 0, a);
ft.setWidget(0, 1, b);
ft.setWidget(0, 2, c);
ft.setWidget(1, 0, d);
ft.setWidget(1, 1, e);
ft.setWidget(1, 2, f);
ft.setWidget(1, 3, g);
ft.setWidget(2, 0, h);
ft.setWidget(2, 1, i);
ft.setWidget(3, 0, j);
ft.setWidget(4, 0, k);
ft.getFlexCellFormatter().setColSpan(0, 1, 2);
ft.getFlexCellFormatter().setColSpan(0, 2, 2);
ft.getFlexCellFormatter().setRowSpan(1, 0, 4);
ft.getFlexCellFormatter().setColSpan(1, 1, 2);
ft.getFlexCellFormatter().setRowSpan(1, 2, 3);
ft.getFlexCellFormatter().setRowSpan(1, 3, 3);
ft.getFlexCellFormatter().setColSpan(3, 0, 2);
ft.getFlexCellFormatter().setColSpan(4, 0, 4);