Scala/Spark - How to get first elements of all sub-arrays

I have the following DataFrame in Spark (I'm using Scala):
[[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]]
I want to get a DataFrame containing only the first Int of each sub-array, something like:
[1003014, 15, 754, 1029530, 3066, 1066440, ...]
That is, keeping only x[0] for each sub-array x of the array listed above.
I'm new to Scala, and couldn't find the right anonymous map function.
Thanks in advance for any help

For Spark >= 2.4, you can use the higher-order function transform with a lambda function to extract the first element of each sub-array.
scala> df.show(false)
+----------------------------------------------------------------------------------------+
|arrays |
+----------------------------------------------------------------------------------------+
|[[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785], [1029530.0, 0.880922]]|
+----------------------------------------------------------------------------------------+
scala> df.select(expr("transform(arrays, x -> x[0])").alias("first_array_elements")).show(false)
+-----------------------------------+
|first_array_elements |
+-----------------------------------+
|[1003014.0, 15.0, 754.0, 1029530.0]|
+-----------------------------------+
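If you need the ids back as integers rather than doubles (the sub-arrays above are array<double>, so x[0] comes out as a double), you could cast inside the lambda. A minimal sketch against the same df; the alias first_ids is just illustrative and the output shown is approximate:
df.select(expr("transform(arrays, x -> cast(x[0] as int))").alias("first_ids")).show(false)
// +---------------------------+
// |first_ids                  |
// +---------------------------+
// |[1003014, 15, 754, 1029530]|
// +---------------------------+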
Spark < 2.4
Explode the initial array and then aggregate with collect_list to collect the first element of each sub-array:
df.withColumn("exploded_array", explode(col("arrays")))
  .agg(collect_list(col("exploded_array")(0)))
  .show(false)
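On the single-row DataFrame shown above, this collects the first elements of all exploded rows into one list. The result would look roughly like this (the auto-generated column name may differ):
+-----------------------------------+
|collect_list(exploded_array[0])    |
+-----------------------------------+
|[1003014.0, 15.0, 754.0, 1029530.0]|
+-----------------------------------+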
EDIT:
In case the array contains structs rather than sub-arrays, just change the access method, using dot notation for struct fields:
val transform_expr = "transform(arrays, x -> x.canonical_id)"
df.select(expr(transform_expr).alias("first_array_elements")).show(false)
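For illustration, a minimal self-contained sketch of the struct case (Spark >= 2.4, as above); the sample data and the field name canonical_id are made up to match the expression:
val structDf = spark.sql("""
  SELECT array(named_struct('canonical_id', 1003014, 'score', 0.95),
               named_struct('canonical_id', 15, 'score', 0.94)) AS arrays
""")
structDf.select(expr("transform(arrays, x -> x.canonical_id)").alias("first_array_elements")).show(false)
// roughly: [1003014, 15]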

Using Spark 2.4:
val df = Seq(
  Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))
).toDF("arrs")
df.show()
+--------------------+
| arrs|
+--------------------+
|[[1.0, 2.0], [3.0...|
+--------------------+
df
  .select(expr("transform(arrs, x -> x[0])").as("arr_first"))
  .show()
+----------+
| arr_first|
+----------+
|[1.0, 3.0]|
+----------+

Related

Flatten array of arrays (different dimensions) of a sql.dataframe.DataFrame in pyspark

I have a pyspark.sql.dataframe.DataFrame which is something like this:
+---------------------------+--------------------+--------------------+
|collect_list(results) | userid | page |
+---------------------------+--------------------+--------------------+
| [[[roundtrip, fal...|13482f06-9185-47f...|1429d15b-91d0-44b...|
+---------------------------+--------------------+--------------------+
Inside the collect_list(results) column there is an array of length 2, and its elements are also arrays (the first has length 1 and the second length 9).
Is there a way to flatten this array of arrays into a single array of length 10 using pyspark?
Thanks!
You can flatten an array of arrays using pyspark.sql.functions.flatten (available in Spark 2.4+). For example, this will create a new column called results with the flattened values, assuming your DataFrame variable is called df.
import pyspark.sql.functions as F
...
df.withColumn('results', F.flatten('collect_list(results)'))
For a version that works before Spark 2.4 (but not before 1.3), you could try to explode the dataset you obtained before grouping, thereby unnesting one level of the array, then call groupBy and collect_list. Like this:
from pyspark.sql.functions import collect_list, explode
df = spark.createDataFrame([("foo", [1,]), ("foo", [2, 3])], schema=("foo", "bar"))
df.show()
# +---+------+
# |foo| bar|
# +---+------+
# |foo| [1]|
# |foo|[2, 3]|
# +---+------+
(df.select(
    df.foo,
    explode(df.bar))
   .groupBy("foo")
   .agg(collect_list("col"))
   .show())
# +---+-----------------+
# |foo|collect_list(col)|
# +---+-----------------+
# |foo| [1, 2, 3]|
# +---+-----------------+

spark scala cartesian product of each element in a column

I have a dataframe which looks like this:
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
The desired output is this:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and I think that converting df to an RDD and then using flatMap with a cartesian product is the right approach for this problem. However, I could not combine them together.
Thanks,
It looks like you are trying to compute combinations rather than a cartesian product; please check my understanding.
This is in PySpark, but the only Python-specific thing is the UDF; the rest is just DataFrame operations.
The process is:
1. Create the DataFrame.
2. Define a UDF to get all pairs of combinations, ignoring order.
3. Use the UDF to convert the array into an array of structs, one struct per pair.
4. Explode the results to get one row per pair of structs.
5. Select each struct field and the original column 1 into the desired result columns.
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [
        ("a", ["p1", "p2", "p3"]),
        ("b", ["p1", "p4"]),
    ],
    ["col1", "col2"],
)
# define a udf that takes an array and returns an array of structs of two strings
@F.udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
    return list(combinations(x, 2))
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+
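Since the question asks for Scala, here is a rough Scala equivalent using a Scala UDF built on combinations. This is a sketch, not tested against your exact data; the helper name pairsUdf and the sample data are just illustrative:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("a", Seq("p1", "p2", "p3")),
  ("b", Seq("p1", "p4"))
).toDF("col1", "col2")

// UDF returning all unordered pairs of the array's elements as (_1, _2) structs
val pairsUdf = udf { (xs: Seq[String]) =>
  xs.combinations(2).map(c => (c(0), c(1))).toSeq
}

df.select(col("col1"), explode(pairsUdf(col("col2"))).as("combo"))
  .selectExpr("combo._1 as col1", "combo._2 as col2", "col1 as col3")
  .show()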

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, i.e. they all have the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a SparkContext to use createDataFrame. I only want to transform the existing DataFrame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
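As a side note, if you happen to be on Spark 3.0 or later, the built-in vector_to_array function can replace the UDF above. A sketch under that assumption, reusing df and sqlExpr from the snippet above:
import org.apache.spark.ml.functions.vector_to_array

df.withColumn("featuresArr", vector_to_array($"features"))
  .select(sqlExpr : _*)
  .show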

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
// assuming the usual imports:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
A couple of regular expressions should give you the required result.
df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output DataFrame:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
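For a quick end-to-end check, a self-contained sketch using only the first sample line (output approximate):
import org.apache.spark.sql.functions._
import spark.implicits._

val sample = Seq(
  "(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})"
).toDF("value")

sample.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
// +----+-----+
// |  id|value|
// +----+-----+
// |4056|    1|
// +----+-----+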
I hope the answer is helpful

How to perform arithmetic operation on two separate dataframes in Apache Spark?

I have two dataframes as follows, each with only one row and one column. Each holds a different numeric value.
How do I perform or achieve division or other arithmetic operation on those two dataframe values?
Please help.
First, if these DataFrames contain a single record each, any further use of Spark is likely wasteful (Spark is intended for large data sets; small ones are processed faster locally). So, you can simply collect these one-record values using first() and go on from there:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
val v1: Double = df1.first().getAs[Double](0)
val v2: Double = df2.first().getAs[Double](0)
val sum = v1 + v2
If, for some reason, you do want to use DataFrames all the way, you can use crossJoin to join the records together and then apply any arithmetic operation:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
df1.crossJoin(df2)
.select($"col1" + $"col2" as "sum")
.show()
// +---+
// |sum|
// +---+
// |5.5|
// +---+
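Division, which the question also asks about, works the same way; a one-line variation of the same crossJoin (output approximate):
df1.crossJoin(df2)
  .select($"col1" / $"col2" as "ratio")
  .show()
// +------------------+
// |             ratio|
// +------------------+
// |0.5714285714285714|
// +------------------+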
If you have dataframes as
scala> df1.show(false)
+------+
|value1|
+------+
|2 |
+------+
scala> df2.show(false)
+------+
|value2|
+------+
|2 |
+------+
You can get the value by doing the following
scala> df1.take(1)(0)(0)
res3: Any = 2
But the returned type is Any, so type casting is needed before doing arithmetic operations:
scala> df1.take(1)(0)(0).asInstanceOf[Int]*df2.take(1)(0)(0).asInstanceOf[Int]
res8: Int = 4
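A slightly more type-safe variant, assuming the columns really are Ints as the cast above suggests, is to use getAs on the first row instead of casting from Any:
val v1 = df1.first().getAs[Int]("value1")
val v2 = df2.first().getAs[Int]("value2")
val product = v1 * v2   // 4 for the sample data above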