I have a Hive table with:
| row | column |
| --------------------------- | --------------------------- |
| null | ["black", "blue", "orange"] |
| ["mom", "dad", "sister"] | ["amazon", "flipkart", "meesho", "jiomart", ""] |
Using Spark SQL, I would like to create a new column with an array of all possible combinations:
| row | column | output |
| --------------------------- | ------------------ | -------------------------------------- |
| null | ["b", "s", "m"] | ["b", "s", "m"] |
| ["1", "2"] | ["a", "b", ""] | ["1_a", "1_b", "1", "2_a", "2_b", "2"] |
There are two ways to implement this.
The first way uses array transformations:
df
  // replace a null row with an array containing an empty string
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // create an array of 1s whose length is row size times column size
  .withColumn("repeated", array_repeat(lit(1), size(col("row")) * size(col("column"))))
  // build index pairs (into row and into column) from the element position i
  // (note: (i % size(row), i % size(column)) covers every pair only when the two sizes are coprime)
  .withColumn("indexes", expr("transform(repeated, (x, i) -> array(i % size(row), i % size(column)))"))
  // concatenate the indexed elements with an underscore
  .withColumn("concat", expr("transform(indexes, (x, i) -> concat_ws('_', row[x[0]], column[x[1]]))"))
  // strip a leading or trailing underscore left by empty elements
  .withColumn("output", expr("transform(concat, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Output:
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|row |column |repeated |indexes |concat |output |
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|[] |[b, s, m]|[1, 1, 1] |[[0, 0], [0, 1], [0, 2]] |[_b, _s, _m] |[b, s, m] |
|[1, 2]|[a, b, ] |[1, 1, 1, 1, 1, 1]|[[0, 0], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]|[1_a, 2_b, 1_, 2_a, 1_b, 2_]|[1_a, 2_b, 1, 2_a, 1_b, 2]|
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
The second way uses explode and aggregation functions:
df
  // replace a null row with an array containing an empty string
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // create a unique ID so we can group the rows back together later
  .withColumn("id", monotonically_increasing_id())
  // explode the column and row arrays
  .withColumn("column", explode(col("column")))
  .withColumn("row", explode(col("row")))
  // combine row and column with an underscore separator
  .withColumn("output", concat_ws("_", col("row"), col("column")))
  // group by the ID again and collect sets
  .groupBy("id").agg(
    collect_set("row").as("row"),
    collect_set("column").as("column"),
    collect_set("output").as("output")
  )
  .drop("id")
  // strip a leading or trailing underscore (e.g. _1 or 1_)
  .withColumn("output", expr("transform(output, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Final output:
+------+---------+--------------------------+
|row |column |output |
+------+---------+--------------------------+
|[1, 2]|[b, a, ] |[2, 1_a, 2_a, 1, 1_b, 2_b]|
|[] |[s, m, b]|[s, m, b] |
+------+---------+--------------------------+
I left the intermediate columns in the first output in case you want to see what is happening. Good luck!
One straightforward and easy solution is to use a custom UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def combinations(a, b):
    # note: a null row would need extra handling here (e.g. a = a or [""])
    c = []
    for x in a:
        for y in b:
            if not x:
                c.append(y)
            elif not y:
                c.append(x)
            else:
                c.append(f"{x}_{y}")
    return c

# declare the return type so Spark produces an array column rather than a string
udf_combination = udf(combinations, ArrayType(StringType()))

df = spark.createDataFrame([
    [["1", "2"], ["a", "b", ""]]
], ["row", "column"])

df.withColumn("res", udf_combination(col("row"), col("column"))).show(truncate=False)
# +------+--------+--------------------------+
# |row |column |res |
# +------+--------+--------------------------+
# |[1, 2]|[a, b, ]|[1_a, 1_b, 1, 2_a, 2_b, 2]|
# +------+--------+--------------------------+
I want to write a Python 3.7 function that takes a sorted list of numbers and a number n, the maximum number of times each integer may be repeated. It should modify the list in place so that any number repeated more than n times is cut down to n repeats, using O(1) space, with no additional data structures allowed (e.g. set()). The special case n = 1 removes duplicates. Example:
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 1)
print(dup_list)
[1, 2, 3, 7, 12]
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 2)
print(dup_list)
[1, 1, 2, 3, 7, 7, 12]
dup_list = [1, 1, 1, 2, 3, 7, 7, 7, 7, 12]
dedup(dup_list, n = 3)
print(dup_list)
[1, 1, 1, 2, 3, 7, 7, 7, 12]
Case n = 1 is easy; the code is below (it is taken from Elements of Programming Interviews, 2008, page 49, except for the last line, return dup_list[:write_index]):
def dedup(dup_list):
    if not dup_list:
        return 0
    write_index = 1
    for i in range(1, len(dup_list)):
        if dup_list[write_index - 1] != dup_list[i]:
            dup_list[write_index] = dup_list[i]
            write_index += 1
    return dup_list[:write_index]
This should work:
def dedup2(dup_list, n):
    count = 1
    list_len = len(dup_list)
    i = 1
    while i < list_len:
        if dup_list[i - 1] != dup_list[i]:
            count = 1
        else:
            count += 1
        if count > n:
            # remove the extra repeat, then re-check the current position
            del dup_list[i]
            i -= 1
            list_len -= 1
        i += 1
    return dup_list
print(dedup2([1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 8, 9], 1))
With an unordered array of Ints as such:
let numbers = [4, 3, 1, 5, 2]
Is it possible in Swift, using .sorted { }, to order the array with one item prioritised and placed at the first index of the array, so that instead of returning [1, 2, 3, 4, 5] we could get [3, 1, 2, 4, 5]?
You can declare a function like this:
func sort(_ array: [Int], prioritizing n: Int) -> [Int] {
    var copy = array
    let pivot = copy.partition { $0 != n }
    copy[pivot...].sort()
    return copy
}
It uses the partition(by:) function.
You could use it like so:
let numbers = [4, 3, 1, 5, 2]
let specialNumber = 3
sort(numbers, prioritizing: specialNumber) //[3, 1, 2, 4, 5]
Here are some test cases:
sort([3, 3, 3], prioritizing: 3) //[3, 3, 3]
sort([9, 4, 1, 5, 2], prioritizing: 3) //[1, 2, 4, 5, 9]
Here is an alternative solution that uses sorted(by:) only:
let numbers = [4, 3, 1, 5, 2]
let vipNumber = 3
let result = numbers.sorted {
    ($0 == vipNumber ? Int.min : $0) < ($1 == vipNumber ? Int.min : $1)
}
print(result) //[3, 1, 2, 4, 5]
I have two arrays (type: int) of different lengths. How could I find the closest number in array b for each number in array a? (The following does not work, probably because of a syntax error.)
int: m;
int: n;
array [1..m] of int: a;
array [1..n] of int: b;
array[1..m] of int: results;
results = [abs(a[i] - b[j])| i in 1..m, j in 1..n];
solve minimize results;
output ["Solution: ", show(results)];
(It always helps to post a complete model with as much information as possible, e.g. the values of "m" and "n" and other known/fixed values. Mentioning the error messages also helps in general.)
There are a couple of unknown things in your model, so I have to guess a little.
I guess that "results" really should be a single decision variable, and not an array as you defined it. Then you can write
var int: results = sum([abs(a[i] - b[j])| i in 1..m, j in 1..n]);
or
var int: results;
...
constraint results = sum([abs(a[i] - b[j])| i in 1..m, j in 1..n]);
Also, as it stands the model is not especially interesting, since it just defines two constant arrays "a" and "b" (which must be filled with constant values). I assume that at least one of them is meant to be an array of decision variables. An array of decision variables must be declared with "var int" (or better: something like "var 1..size", where 1..size is the domain of possible values in the array).
Here is an example of a working model, which may or may not be something like what you have in mind:
int: m = 10;
int: n = 10;
array [1..m] of int: a = [1,2,3,4,5,6,7,8,9,10];
array [1..n] of var 1..10: b;
var int: results = sum([abs(a[i] - b[j])| i in 1..m, j in 1..n]);
solve minimize results;
output [
"Solution: ", show(results),"\n",
"a: ", show(a), "\n",
"b: ", show(b), "\n",
];
Update 2015-11-19:
I'm not sure I've understood the requirements completely, but here is a variant. Note that the sum loop doesn't use the "b" array at all, just "a" and "results". To ensure that the values in "results" are selected from "b", the domain of "results" is simply the set of values in "b".
int: m = 10;
int: n = 10;
array [1..m] of int: a = [1,2,3,4,5,6,7,8,9,10];
array [1..n] of int: b = [5,6,13,14,15,16,17,18,19,20];
% decision variables
% values only from b
array[1..m] of var {b[i] | i in 1..n}: results;
var int: z; % to minimize
constraint
z >= 0 /\
z = sum(i in 1..m) (
sum(j in 1..m) (abs(a[i]-results[j]))
% (abs(a[i]-results[i])) % alternative interpretation (see below)
)
;
solve minimize z;
output [
"z: ", show(z), "\n",
"results: ", show(results),"\n",
"a: ", show(a), "\n",
"b: ", show(b), "\n",
];
Gecode has this for the optimal solution:
z: 250
results: [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
a: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b: [5, 6, 13, 14, 15, 16, 17, 18, 19, 20]
Another solver (Opturion CPX) has a solution more similar to your variant:
z: 250
results: [6, 6, 5, 5, 5, 6, 6, 6, 5, 5]
a: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b: [5, 6, 13, 14, 15, 16, 17, 18, 19, 20]
Note that both solutions have the same optimal objective value ("z") of 250.
There is, however, an alternative interpretation of the requirement (from your comment):
for each element in a, select a corresponding value from b - this
value has to be the closest in value to each element in a.
where each value in "results" corresponds just the the value in "a" with the same index ("i") , i.e.
% ...
constraint
z >= 0 /\
z = sum(i in 1..m) (
(abs(a[i]-results[i]))
)
;
The solution then is (Gecode):
z: 19
results: [5, 5, 5, 5, 5, 6, 6, 6, 6, 13]
a: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b: [5, 6, 13, 14, 15, 16, 17, 18, 19, 20]
The last value in "results" (13) is then chosen since it's nearer to 10 (last element in "a").
Update 2 (2015-11-20)
Regarding the second comment: here is a 2D model (not the 3D version you wrote about). It's based on the second interpretation of the model above. Extending it to larger dimensions is just a matter of changing the dimensions and adding the loop variables.
Note that this assumes, perhaps contrary to your original question, that the dimensions of "a" and "results" are identical. If they are not, then the second interpretation cannot be the one you intend. Also, I changed the values in "a" and "b" to make it more interesting. :-)
int: m = 3;
int: n = 3;
array [1..m,1..n] of int: a = [|1,2,3|4,5,6|7,8,9|];
array [1..m,1..n] of int: b = [|5,6,13|14,15,16|7,18,19|];
% decision variables
% values only from b
array[1..m,1..n] of var {b[i,j] | i in 1..m, j in 1..n}: results;
var int: z;
constraint
z >= 0 /\
z = sum(i in 1..m, j in 1..n) (
(abs(a[i,j]-results[i,j]))
)
;
solve minimize z;
output [ "z: ", show(z), "\n" ]
++["results:"]++
[
if j = 1 then "\n" else " " endif ++
show_int(2,results[i,j])
| i in 1..m, j in 1..n
]
++["\na:"]++
[
if j = 1 then "\n" else " " endif ++
show_int(2,a[i,j])
| i in 1..m, j in 1..n
]
++["\nb:"]++
[
if j = 1 then "\n" else " " endif ++
show_int(2,b[i,j])
| i in 1..m, j in 1..n
];
One optimal solution of this model is:
z: 13
results:
5 5 5
5 5 6
7 7 7
a:
1 2 3
4 5 6
7 8 9
b:
5 6 13
14 15 16
7 18 19
This is my first day using Scala. I am trying to make a string of the number of times each digit is represented in a string. For instance, the number 4310227 would return "1121100100" because 0 appears once, 1 appears once, 2 appears twice, and so on...
def pow(n:Int) : String = {
  val cubed = (n * n * n).toString
  val digits = 0 to 9
  val str = ""
  for (a <- digits) {
    println(a)
    val b = cubed.count(_==a.toString)
    println(b)
  }
  return cubed
}
and it doesn't seem to work. I would like some Scala-flavoured reasons why, and to know whether I should even be going about it in this manner. Thanks!
When you iterate over strings, which is what you are doing when you call String#count(), you are working with Chars, not Strings. You don't want to compare these two with ==, since they aren't the same type of object.
One way to solve this problem is to call Char#toString() before performing the comparison, e.g., amend your code to read cubed.count(_.toString==a.toString).
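Applying that fix to the question's function gives something like the following sketch (it also returns the built-up count string instead of cubed, since that is what the question asks for):
// the question's function with the Char/String comparison fixed (a sketch)
def pow(n: Int): String = {
  val cubed = (n * n * n).toString
  // count occurrences of each digit 0-9 in the cubed number's string form
  val counts = for (a <- 0 to 9) yield cubed.count(_.toString == a.toString)
  counts.mkString
}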
As Rado and cheeken said, you're comparing a Char with a String, which will never be equal. An alternative to cheeken's answer of converting each character to a string is to create a range of chars, i.e. '0' to '9':
val digits = '0' to '9'
...
val b = cubed.count(_ == a)
Note that if you want the Int that a Char represents, you can call char.asDigit.
Aleksey's, Ren's and Randall's answers are something you will want to strive towards as they separate out the pure solution to the problem. However, given that it's your first day with Scala, depending on what background you have, you might need a bit more context before understanding them.
Fairly simple:
scala> ("122333abc456xyz" filter (_.isDigit)).foldLeft(Map.empty[Char, Int]) ((histo, c) => histo + (c -> (histo.getOrElse(c, 0) + 1)))
res1: scala.collection.immutable.Map[Char,Int] = Map(4 -> 1, 5 -> 1, 6 -> 1, 1 -> 1, 2 -> 2, 3 -> 3)
This is perhaps not the fastest approach, because intermediate datatypes like String and Char are used, but it is one of the simplest:
def countDigits(n: Int): Map[Int, Int] =
  n.toString.groupBy(x => x) map { case (n, c) => (n.asDigit, c.size) }
Example:
scala> def countDigits(n: Int): Map[Int, Int] = n.toString.groupBy(x => x) map { case (n, c) => (n.asDigit, c.size) }
countDigits: (n: Int)Map[Int,Int]
scala> countDigits(12345135)
res0: Map[Int,Int] = Map(5 -> 2, 1 -> 2, 2 -> 1, 3 -> 2, 4 -> 1)
Where myNumAsString is a String, e.g. "15625":
myNumAsString.groupBy(x => x).map(x => (x._1, x._2.length))
Result = Map(2 -> 1, 5 -> 2, 1 -> 1, 6 -> 1)
ie. A map containing the digit with its corresponding count.
What this does is group the characters of your string by value (so for the initial string "15625", it produces a map of 1 -> "1", 2 -> "2", 6 -> "6", and 5 -> "55"). The second part then maps each value to the count of how many times it occurs.
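To make the intermediate step concrete, here is the same pipeline split in two (a sketch; map entry order is unspecified):
val grouped = "15625".groupBy(x => x)                // e.g. Map('1' -> "1", '5' -> "55", '6' -> "6", '2' -> "2")
val counts = grouped.map(x => (x._1, x._2.length))   // e.g. Map('1' -> 1, '5' -> 2, '6' -> 1, '2' -> 1)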
The counts for these hundred digits happen to fit into a hex digit each. (In the transcript below, r is assumed to be a scala.util.Random.)
scala> val is = for (_ <- (1 to 100).toList) yield r.nextInt(10)
is: List[Int] = List(8, 3, 9, 8, 0, 2, 0, 7, 8, 1, 6, 9, 9, 0, 3, 6, 8, 6, 3, 1, 8, 7, 0, 4, 4, 8, 4, 6, 9, 7, 4, 6, 6, 0, 3, 0, 4, 1, 5, 8, 9, 1, 2, 0, 8, 8, 2, 3, 8, 6, 4, 7, 1, 0, 2, 2, 6, 9, 3, 8, 6, 7, 9, 5, 0, 7, 6, 8, 7, 5, 8, 2, 2, 2, 4, 1, 2, 2, 6, 8, 1, 7, 0, 7, 6, 9, 5, 5, 5, 3, 5, 8, 2, 5, 1, 9, 5, 7, 2, 3)
scala> (new Array[Int](10) /: is) { case (a, i) => a(i) += 1 ; a } map ("%x" format _) mkString
warning: there were 1 feature warning(s); re-run with -feature for details
res7: String = a8c879caf9
scala> (new Array[Int](10) /: is) { case (a, i) => a(i) += 1 ; a } sum
warning: there were 1 feature warning(s); re-run with -feature for details
res8: Int = 100
I was going to point out that no one used a char range, but now I see Kristian did.
def pow(n: Int): String = {
  val cubed = (n * n * n).toString
  val cnts = for (a <- '0' to '9') yield cubed.count(_ == a)
  (cnts map (c => ('0' + c).toChar)).mkString
}