Concat of all data frame columns using fold, reduce with Spark / Scala - scala

The following works fine with a dynamic column generation:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
val input = sc.parallelize(Seq(
("a", "5a", "7w", "9", "a12", "a13")
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columns_to_concat = input.columns
input.select(concat(columns_to_concat.map(c => col(c)): _*).as("concat_column")).show(false)
returns:
+-------------+
|concat_column|
+-------------+
|a5a7w9a12a13 |
+-------------+
How can I do this with foldLeft, reduceLeft - whilst retaining the dynamic column generation?
I always get either an error, or a null value returned. Whilst concat suffices, I am curious as to how fold, etc. could work.

It is definitely not the way to go*, but if you treat it as a programming exercise:
import org.apache.spark.sql.functions.{col, concat, lit}
columns_to_concat.map(col(_)).reduce(concat(_, _))
or
columns_to_concat.map(col(_)).foldLeft(lit(""))(concat(_, _))
* Because
It is a convoluted solution for something that is already provided by a high-level API.
Because it requires additional work from the planner / optimizer to flatten the recursive expression, not to mention that the expressions don't use tail recursion and can simply overflow the stack.
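For completeness, a minimal end-to-end sketch of both variants, assuming the same input and columns_to_concat defined above:
import org.apache.spark.sql.functions.{col, concat, lit}
// reduce: pairwise concat, no neutral element needed
input.select(columns_to_concat.map(col(_)).reduce(concat(_, _)).as("concat_column")).show(false)
// foldLeft: lit("") serves as the neutral element the fold starts from
input.select(columns_to_concat.map(col(_)).foldLeft(lit(""))(concat(_, _)).as("concat_column")).show(false)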

Related

How to split column in Spark Dataframe to multiple columns

In my case, how can I split a column containing StringType with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1','2','3']?
This is relatively simple with a UDF:
import org.apache.spark.sql.functions.udf
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
.show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy solution with Spark built-in functions (other than using split in combination with explode etc. and then re-aggregating).
You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is a native approach; no UDF is applied.
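If you want integers rather than strings (as the UDF above produced), a hedged variant, still assuming Spark 2.4+, is to cast inside the lambda:
df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> CAST(substring_index(x, '-', 1) AS INT))"))
  // CAST makes the result an array<int> instead of array<string>
  .show(false)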

Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark [duplicate]

This question already has answers here:
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
(3 answers)
Closed 5 years ago.
So I have a variable data which is an RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a dataset from the RDD.
I try the following, sc is my SparkContext:
import org.apache.spark.sql.SQLContext
val sqc = new SQLContext(sc)
val lines = sqc.createDataset(data)
And I get the two following errors:
Error:(12, 34) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing sqlContext.implicits._ Support for
serializing other types will be added in future releases.
val lines = sqc.createDataset(data)
Error:(12, 34) not enough arguments for method createDataset:
(implicit evidence$4:
org.apache.spark.sql.Encoder[Array[String]])org.apache.spark.sql.Dataset[Array[String]].
Unspecified value parameter evidence$4.
val lines = sqc.createDataset(data)
Sure, I understand I need to pass an Encoder argument; however, what would it be in this case, and how do I import Encoders? When I try it myself, it says that createDataset does not take that as an argument.
There are similar questions, but they do not answer how to use the encoder argument. If my RDD is a RDD[String] it works perfectly fine, however in this case it is RDD[Array[String]].
All of the comments on the question are trying to tell you the following.
You say you have an RDD[Array[String]], which I can create by doing the following:
val rdd = sc.parallelize(Seq(Array("a", "b"), Array("d", "e"), Array("g", "f"), Array("e", "r"))) // rdd: org.apache.spark.rdd.RDD[Array[String]]
Now, converting the rdd to a DataFrame is just a matter of calling .toDF, but before that you need to import the sqlContext's implicits._ as below:
val sqc = new SQLContext(sc)
import sqc.implicits._
rdd.toDF().show(false)
You should have dataframe as
+------+
|value |
+------+
|[a, b]|
|[d, e]|
|[g, f]|
|[e, r]|
+------+
Isn't this all simple?
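To answer the createDataset part directly, a minimal sketch, assuming the same rdd and sqc as above: with sqc.implicits._ in scope, an implicit Encoder[Array[String]] is available, so createDataset compiles without passing an encoder explicitly.
import sqc.implicits._ // already in scope from above; shown for completeness
val lines = sqc.createDataset(rdd) // Dataset[Array[String]]
lines.show(false)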

how to create an extra column in DataFrame based on a simple condition in spark

I have a dataframe and I'd like to add an extra column to it based on a simple condition, which basically says whether the value of another column is equal to a given string or not. I know I can create a UDF, register it and then use it, however I think there must be an easier way of doing it. This is the pseudocode of what I'm about to do:
df.withColumn("extra", if (col("a) == "str" 1 else 2))
You are pretty much there:
scala> val df = Seq((1,2), (3,3), (4,5)).toDF("a", "b")
scala> df.show
+-+-+
|a|b|
+-+-+
|1|2|
|3|3|
|4|5|
+-+-+
scala> df.withColumn("New", when($"a" === $"b", "equal").otherwise("not")).show
+-+-+-----+
|a|b| New|
+-+-+-----+
|1|2| not|
|3|3|equal|
|4|5| not|
+-+-+-----+
Note that you will need functions and implicits imported for the above to work.
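Applied to the original pseudocode (comparing a column to a literal string), a minimal sketch would be:
import org.apache.spark.sql.functions.{col, when}
// when/otherwise takes the place of the if/else
df.withColumn("extra", when(col("a") === "str", 1).otherwise(2))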

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use these kinds of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
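The individual sums can then be read back from the returned Row; a minimal sketch, assuming integer-typed columns (the sum of an integral column comes back as a Long):
// adjust the type parameter to your schema (e.g. Double for double columns)
val sumCol1 = sums.getAs[Long]("sum_col1")
val sumCol2 = sums.getAs[Long]("sum_col2")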
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats on a column. I think it returns stats for all columns if you call describe() with no arguments.
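A minimal sketch (assuming a column named "steps"): describe returns a DataFrame whose stat values are strings, so a single statistic can be pulled out like this:
import org.apache.spark.sql.functions.col
val stats = df.describe("steps")
// the "summary" column holds the stat name ("count", "mean", "stddev", "min", "max")
stats.filter(col("summary") === "mean").select("steps").show()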
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").first().getAs[Long]("stepsSum")
println("steps sum = " + sum) //prints 28

Scala spark Select as not working as expected

Hope someone can help. Fairly certain this is something I'm doing wrong.
I have a dataframe called uuidvar with 1 column called 'uuid' and another dataframe, df1, with a number of columns, one of which is also 'uuid'. I would like to select from df1 all of the rows which have a uuid that appears in uuidvar. Now, having the same column names is not ideal, so I tried to do it with
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")
However, when I show uuidselection, I have 2 columns called "uuid". Furthermore, if I try to select the specific columns I want, I am told
cannot resolve 'uuidvar' given input columns
or similar depending on what I try and select.
I have tried to make it simpler and just do
val uuidvar2=uuidvar.select("uuid").as("uuidvar")
and this doesn't rename the column in uuidvar.
Does 'as' not operate as I am expecting it to, am I making some other fundamental error or is it broken?
I'm using Spark 1.5.1 and Scala 2.10.
Answer
You can't use as when specifying the join criterion.
First, use withColumnRenamed to rename the column before the join.
Second, use the generic col function for accessing columns by name (instead of using the dataframe's apply method, e.g. df1(<columnname>)).
case class UUID1 (uuid: String)
case class UUID2 (uuid: String, b: Int)

class UnsortedTestSuite2 extends SparkFunSuite {

  configuredUnitTest("SO - uuid") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val uuidvar = sc.parallelize( Seq(
      UUID1("cafe-babe-001"),
      UUID1("cafe-babe-002"),
      UUID1("cafe-babe-003"),
      UUID1("cafe-babe-004")
    )).toDF()

    val df1 = sc.parallelize( Seq(
      UUID2("cafe-babe-001", 1),
      UUID2("cafe-babe-002", 2),
      UUID2("cafe-babe-003", 3)
    )).toDF()

    val uuidselection = df1.join(uuidvar.withColumnRenamed("uuid", "another_uuid"), col("uuid") === col("another_uuid"), "right_outer")
    uuidselection.show()
  }
}
delivers
+-------------+----+-------------+
| uuid| b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001| 1|cafe-babe-001|
|cafe-babe-002| 2|cafe-babe-002|
|cafe-babe-003| 3|cafe-babe-003|
| null|null|cafe-babe-004|
+-------------+----+-------------+
Comment
.select("*") does not have any effect. So
df.select("*") =^= df
I've always used the withColumnRenamed api to rename columns:
Take this table as an example:
| Name | Age |
df.withColumnRenamed("Age", "newAge").show()
| Name | newAge |
So to make it work with your code, something like this should work:
val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection = df1.join(uuidvar_another, df1("uuid") === uuidvar_another("another_uuid"), "right_outer").select("*")
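A hedged follow-up: if you only want df1's columns back after the join, you could drop the renamed key column afterwards:
// drops the join key coming from uuidvar; unmatched rows from the right outer join remain as nulls
uuidselection.drop("another_uuid").show()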