Merge two rows by conditional variable - merge

Suppose I have a data frame like the following:
a <- data.frame(A = c("01.01.2000", "02.01.2000", "03.01.2000", "01.01.2000.1", "04.01.2000", "04.01.2000.1"),
                B = c(1, NA, 2, 1, 4, NA),
                C = c(NA, NA, 4, NA, NA, 1))
Now I want to merge every row whose A value ends in ".1" into its "counterpart" row without the .1. So in this example I want to join 01.01.2000.1 to 01.01.2000 and 04.01.2000.1 to 04.01.2000 (the pattern is always a trailing .1).
If both rows have values, they should be comma separated, and if one is NA it shall take the other one.
The desired output is:
b <- data.frame(A = c("01.01.2000", "02.01.2000", "03.01.2000", "01.01.2000.1", "04.01.2000"),
                B = c("1,1", NA, 2, 1, 4),
                C = c(NA, NA, 4, NA, 1))

Related

Concatenate two arrays' elements with all possible combinations

I have a hive table with
| row                      | column                                          |
| ------------------------ | ----------------------------------------------- |
| null                     | ["black", "blue", "orange"]                     |
| ["mom", "dad", "sister"] | ["amazon", "fiipkart", "meesho", "jiomart", ""] |
Using Spark SQL, I would like to create a new column with an array of all possible combinations:
| row        | column          | output                                 |
| ---------- | --------------- | -------------------------------------- |
| null       | ["b", "s", "m"] | ["b", "s", "m"]                        |
| ["1", "2"] | ["a", "b", ""]  | ["1_a", "1_b", "1", "2_a", "2_b", "2"] |
Two ways to implement this:
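Both ways below assume a DataFrame df with two array<string> columns named row and column. A minimal sketch of how such a df could be built (the sample values are taken from the expected-output table above; a SparkSession with spark.implicits._ in scope is assumed):
import spark.implicits._
val df = Seq(
  (null.asInstanceOf[Seq[String]], Seq("b", "s", "m")),
  (Seq("1", "2"), Seq("a", "b", ""))
).toDF("row", "column")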
The first way includes array transformations:
df
  // we first replace the null value with an array containing a single empty string
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // we create an array of 1s whose length is the product of the row size and the column size
  .withColumn("repeated", array_repeat(lit(1), size(col("row")) * size(col("column"))))
  // we build index pairs (i % size(row), i % size(column)); with the sizes in this example this cycles through every combination
  .withColumn("indexes", expr("transform(repeated, (x, i) -> array(i % size(row), i % size(column)))"))
  // we concatenate the indexed elements with an underscore
  .withColumn("concat", expr("transform(indexes, (x, i) -> concat_ws('_', row[x[0]], column[x[1]]))"))
  // we remove a leading or trailing underscore (if found)
  .withColumn("output", expr("transform(concat, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Output:
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|row |column |repeated |indexes |concat |output |
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
|[] |[b, s, m]|[1, 1, 1] |[[0, 0], [0, 1], [0, 2]] |[_b, _s, _m] |[b, s, m] |
|[1, 2]|[a, b, ] |[1, 1, 1, 1, 1, 1]|[[0, 0], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]|[1_a, 2_b, 1_, 2_a, 1_b, 2_]|[1_a, 2_b, 1, 2_a, 1_b, 2]|
+------+---------+------------------+------------------------------------------------+----------------------------+--------------------------+
The second way includes explode and other functions:
df
  // we first replace the null value with an array containing a single empty string
  .withColumn("row", when(col("row").isNull, Array("")).otherwise(col("row")))
  // we create a unique ID to group by and collect on later
  .withColumn("id", monotonically_increasing_id())
  // we explode the columns and the rows
  .withColumn("column", explode(col("column")))
  .withColumn("row", explode(col("row")))
  // we combine row and column with an underscore as separator
  .withColumn("output", concat_ws("_", col("row"), col("column")))
  // we group by the ID again and collect sets
  .groupBy("id").agg(
    collect_set("row").as("row"),
    collect_set("column").as("column"),
    collect_set("output").as("output")
  )
  .drop("id")
  // we remove a leading or trailing underscore (e.g. _1 or 1_)
  .withColumn("output", expr("transform(output, x -> regexp_replace(x, '(_$)|(^_)', ''))"))
Final output:
+------+---------+--------------------------+
|row |column |output |
+------+---------+--------------------------+
|[1, 2]|[b, a, ] |[2, 1_a, 2_a, 1, 1_b, 2_b]|
|[] |[s, m, b]|[s, m, b] |
+------+---------+--------------------------+
I left other columns in case you want to see what is happening, good luck!
One straightforward and easy solution is to use a custom UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

def combinations(a, b):
    c = []
    for x in a:
        for y in b:
            if not x:
                c.append(y)
            elif not y:
                c.append(x)
            else:
                c.append(f"{x}_{y}")
    return c

# declare the return type so the result is an array<string> instead of a plain string
udf_combination = udf(combinations, ArrayType(StringType()))

df = spark.createDataFrame([
    [["1", "2"], ["a", "b", ""]]
], ["row", "column"])

df.withColumn("res", udf_combination(col("row"), col("column"))).show(truncate=False)
# +------+--------+--------------------------+
# |row |column |res |
# +------+--------+--------------------------+
# |[1, 2]|[a, b, ]|[1_a, 1_b, 1, 2_a, 2_b, 2]|
# +------+--------+--------------------------+

Concat column values in a dataframe

I have a csv file that looks like the following
name, state, a, b, c, d, ..., x
Jon, NY, 1, 4, 6, 2, 6
Eric, CA, 5, 3, 1, 5, 6
Chris,LA, 4, 4, 3, 1, 5
and I want the following result (one column with concatenated fields, including the header names):
concate-fields
"name=Jon, state=NY, a=1, b=4, c=6, d= 2, ... x=6"
"name=Eric, state=CA, a=5, b=3, c=1, d= 5, ... x=6"
"name=Chris, state=LA, a=4, b=4, c=3, d= 1, ... x=5"
There can be many headers from a to x, so these should be appended in a generic way.
I now have
import org.apache.spark.sql.functions.{concat, lit}
val lp = sample.select(concat(lit("name="), $"name", lit(",state="), $"state"))
display(lp)
But I have trouble adding the same for column a->x (as this needs to be done in a generic way)
You can dynamically create a SQL expression to concat the columns by mapping over df.columns as shown below.
val df = // Read CSV
df.withColumn("concate-fields", expr(s"concat(${df.columns.map(col=>s"'$col=', nvl($col,'null'),','").mkString("").dropRight(4)})"))
.withColumn("concate-fields", concat(lit("\""),col("concate-fields"),lit("\"")))

Distinct Word count per line

This is a little different from the common word count program. I am trying to get the distinct word count per line.
Input:
Line number one has six words
Line number two has two words
Expected output:
line1 => (Line,1),(number,1),(one,1),(has,1),(six,1),(words,1)
line2 => (Line,1),(number,1),(two,2),(has,1),(words,1)
Can anyone please guide me.
By using the DataFrame built-in functions explode, split, collect_set and groupBy:
// input data
val df = Seq("Line number one has six words", "Line number two has has two words").toDF("input")
df.withColumn("words",explode(split($"input","\\s+"))) //split by space and explode
.groupBy("input","words") //group by on both columns
.count()
.withColumn("line_word_count",struct($"words",$"count")) //create struct
.groupBy("input") //grouping by input data column
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)
Result:
+---------------------------------+------------------------------------------------------------------+
|input |line_word_count |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[[one, 1], [has, 1], [six, 1], [number, 1], [words, 1], [Line, 1]]|
|Line number two has has two words|[[has, 2], [two, 2], [words, 1], [number, 1], [Line, 1]] |
+---------------------------------+------------------------------------------------------------------+
If you are expecting line numbers, then use the concat and monotonically_increasing_id functions:
df.withColumn("line",concat(lit("line"),monotonically_increasing_id()+1))
.withColumn("words",explode(split($"input","\\s+")))
.groupBy("input","words","line")
.count()
.withColumn("line_word_count",struct($"words",$"count"))
.groupBy("line")
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)
Result:
+-----+------------------------------------------------------------------+
|line |line_word_count |
+-----+------------------------------------------------------------------+
|line1|[[one, 1], [has, 1], [six, 1], [words, 1], [number, 1], [Line, 1]]|
|line2|[[has, 2], [two, 2], [number, 1], [words, 1], [Line, 1]] |
+-----+------------------------------------------------------------------+
Note: in case of larger datasets, to keep the generated IDs consecutive we need to do .repartition(1) first.
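If you need guaranteed consecutive line numbers without collapsing everything to a single partition, a hedged alternative sketch is to number the lines with RDD zipWithIndex (which assigns consecutive 0-based indices) before splitting into words; the column names above and spark.implicits._ are assumed:
import org.apache.spark.sql.functions.{concat, lit}

// zipWithIndex assigns consecutive 0-based indices across partitions
val numbered = df.rdd
  .zipWithIndex
  .map { case (row, idx) => (row.getString(0), idx + 1) }
  .toDF("input", "n")
  .withColumn("line", concat(lit("line"), $"n".cast("string")))
The same explode/groupBy/collect_set pipeline can then be applied to numbered.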
Here is another way using RDD API:
val rdd = df.withColumn("output", split($"input", " ")).rdd.map(l => (
  l.getAs[String](0),
  l.getAs[Seq[String]](1).groupBy(identity).mapValues(_.size).map(identity)
))
val dfCount = spark.createDataFrame(rdd).toDF("input", "output")
Not a big fan of using UDF, but it can also be done like this:
import org.apache.spark.sql.functions.udf
val mapCount: Seq[String] => Map[String, Integer] = _.groupBy(identity).mapValues(_.size)
val countWordsUdf = udf(mapCount)
df.withColumn("output", countWordsUdf(split($"input", " "))).show(false)
Gives:
+---------------------------------+------------------------------------------------------------------+
|input |output |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words |[number -> 1, Line -> 1, has -> 1, six -> 1, words -> 1, one -> 1]|
|Line number two has has two words|[number -> 1, two -> 2, Line -> 1, has -> 2, words -> 1] |
+---------------------------------+------------------------------------------------------------------+

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 1]),
                        IndexedRow(1, [3, 6, 9, 12])])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the rows and columns are arranged as in the following mat example:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)

transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()

Is there a "for" syntax for flatmap?

Is there a "for" syntax for
c flatMap (x => d flatMap (y => f(x, y)))
?
Because I've used Haskell in the past, I keep expecting the "for" syntax in Scala to mimic the "do" syntax of Haskell. This is probably an unrealistic expectation. In Haskell, I could write:
do x <- c
   y <- d
   f(x, y)
You could also map the last result onto itself (i.e. map(identity)).
Using the same example as dhg:
val c = 1 to 3
val d = 4 to 6
def f(x: Int, y: Int) = Vector(x,y)
for {
  x <- c
  y <- d
  z <- f(x, y)
} yield z
// Vector(1, 4, 1, 5, 1, 6, 2, 4, 2, 5, 2, 6, 3, 4, 3, 5, 3, 6)
Which corresponds to:
c flatMap ( x => d flatMap (y => f(x,y) map (identity) ) )
You can just flatten the result:
val c = 1 to 3
val d = 4 to 6
def f(x: Int, y: Int) = Vector(x,y)
c flatMap ( x => d flatMap (y => f(x,y) ) )
// Vector(1, 4, 1, 5, 1, 6, 2, 4, 2, 5, 2, 6, 3, 4, 3, 5, 3, 6)
(for { x <- c; y <- d } yield f(x,y)).flatten
// Vector(1, 4, 1, 5, 1, 6, 2, 4, 2, 5, 2, 6, 3, 4, 3, 5, 3, 6)
Presumably this is a much less frequently used case, since it is necessarily less common for the output of the for to be flattenable. And sticking .flatten on the end is easy enough, so having a special syntax for it seems unnecessarily complicated.
Flattening may impact performance, but I think scalac is clever enough to translate
for {
  x <- c
  y <- d
  z <- f(x, y)
} yield z
into
c flatMap { x => d flatMap { y => f(x,y) } }
It is annoying that the 'for' syntax is not as convenient as the 'do' notation (having to write _ <- someExpression instead of just someExpression in a for fills my heart with sadness).
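To make that last remark concrete, here is a small hypothetical sketch of the pattern being complained about: an effect-only step inside a Scala for-comprehension has to be bound to _, whereas Haskell's do lets you write the expression on its own line:
// log is a hypothetical helper returning Option[Unit] so it can appear as a generator
def log(msg: String): Option[Unit] = Some(println(msg))

for {
  x <- Option(1)
  _ <- log(s"got $x") // cannot simply write log(s"got $x") on its own line
  y <- Option(2)
} yield x + y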