How to fetch the value and type of each column of each row in a dataframe? - scala

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.

If I understand your question correctly, then the following should be your solution.
val df = Seq(
  (12345, "fnord"),
  (42, "baz")
).toDF("foo", "bar")
This creates the dataframe you already have:
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
The next step is to extract the dataType of each field from the schema of the DataFrame:
val fieldTypesList = df.schema.map(struct => struct.dataType)
Next, convert each DataFrame row into a list of values and pair each value with its dataType from the list created above:
// Turn each row into a list of string values (note: this parsing assumes the
// values themselves contain no commas or brackets)
val dfList = df.rdd.map(row => row.toString().replace("[", "").replace("]", "").split(",").toList)
// Pair each value with the dataType at the same position (zipping by position
// is safer than indexOf, which breaks when a row contains duplicate values)
val tuples = dfList.map(list => list.zip(fieldTypesList))
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
Which you can iterate over and work with programmatically.
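As a side note (not part of the original answer), the string parsing above is fragile if values contain commas or brackets; here is a sketch that zips each row's values with the schema fields directly instead:
val fieldTypes = df.schema.fields.map(_.dataType).toList
val typedRows = df.rdd.map(row => row.toSeq.toList.zip(fieldTypes))
typedRows.collect().foreach(println)
// List((12345,IntegerType), (fnord,StringType))
// List((42,IntegerType), (baz,StringType))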

Related

How to convert Seq[Row] to a DataFrame in Scala

Is there any way to convert Seq[Row] into a dataframe in Scala?
I have a dataframe and a list of strings that holds the weight of each row in the input dataframe. I want to build a DataFrame that will include all rows with unique weights.
I was able to filter unique rows and append them to a Seq[Row], but I want to build a dataframe.
This is my code. Thanks in advance.
import scala.collection.mutable.HashSet

def dataGenerator(input: DataFrame, vals: List[String]): Dataset[Row] = {
  val valitr = vals.iterator
  var testdata = Seq[Row]()
  var valset = HashSet[String]()
  if (valitr != null) {
    input.collect().foreach { r =>
      val valnxt = valitr.next()
      if (!valset.contains(valnxt)) {
        valset += valnxt
        testdata = testdata :+ r
      }
    }
  }
  //logic to convert testdata as DataFrame and return
}
You said that 'val is calculated using fields from inputdf itself'. If this is the case then you should be able to make a new dataframe with a new column for the 'val' like this:
+------+------+
|item |weight|
+------+------+
|item 1|w1 |
|item 2|w2 |
|item 3|w2 |
|item 4|w3 |
|item 5|w4 |
+------+------+
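For example, something along these lines (a hypothetical sketch; col_a, col_b and the concat_ws expression are placeholders for however 'val' is actually computed from the input columns):
// Hypothetical: derive the weight from existing input columns
val withWeight = input.withColumn("weight", concat_ws("_", $"col_a", $"col_b"))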
This is the key thing. Then you will be able to work on the dataframe instead of doing a collect.
What is bad about doing collect? Well, there is no point in going to the trouble and overhead of using a distributed big data processing framework just to pull all the data into the memory of one machine. See here: Spark dataframe: collect() vs select()
When you have the input dataframe how you want it, as above, you can get the result. Here is a way that works, which groups the data by the weight column and picks the first item for each grouping.
val result = input
.rdd // get underlying rdd
.groupBy(r => r.get(1)) // group by "weight" field
.map(x => x._2.head.getString(0)) // get the first "item" for each weight
.toDF("item") // back to a dataframe
Then you get only the first item in case of a duplicated weight:
+------+
|item |
+------+
|item 1|
|item 2|
|item 4|
|item 5|
+------+
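As an aside (my addition, not part of the original answer), a similar result can usually be had without dropping to the RDD API, though which row survives per weight is then not guaranteed:
val result = input
  .dropDuplicates("weight")   // keeps one (arbitrary) row per distinct weight
  .select("item")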

How to merge two or more columns into one?

I have a streaming Dataframe that I want to calculate min and avg over some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
|    1|    2|
+-----+-----+
|   24|   55|
|   20|   51|
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+-----------+-----------+
| result(1)| result(2)|
+-----------+-----------+
|20 ,22 | 51,53 |
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct shows when you want to reference a single field in the struct: you can use its name (not an index).
q.select("structCol.name")
What you want to do is merge multiple values together into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you:
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+
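If you later need the min and avg back as separate values (a sketch of mine, assuming the aggregated DataFrame is called result and the array order is min then avg as above), you can index into the array; backticks escape the parentheses in the column name:
result.select(
  col("`result(1)`").getItem(0).as("min_1"),
  col("`result(1)`").getItem(1).as("avg_1"))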

How to create a DataFrame from List?

I want to create a DataFrame df that should look as simple as this:
+----------+----------+
| timestamp| col2|
+----------+----------+
|2018-01-11| 123|
+----------+----------+
This is what I do:
val values = List(List("timestamp", "2018-01-11"),List("col2","123")).map(x =>(x(0), x(1)))
val df = values.toDF
df.show()
And this is what I get:
+---------+----------+
| _1| _2|
+---------+----------+
|timestamp|2018-01-11|
| col2| 123|
+---------+----------+
What's wrong here?
Use
val df = List(("2018-01-11", "123")).toDF("timestamp", "col2")
toDF expects the input list to contain one entry per resulting Row
Each such entry should be a case class or a tuple
It does not expect column "headers" in the data itself (to name columns - pass names as arguments of toDF)
If you don't know the names of the columns statically, you can use the following syntax sugar
.toDF(columnNames: _*)
Where columnNames is the List with the names.
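For example, a small sketch using the names from your question:
val columnNames = List("timestamp", "col2")
val df = List(("2018-01-11", "123")).toDF(columnNames: _*)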
EDIT (sorry, I missed that you had the headers glued to each column).
Maybe something like this could work:
val values = List(
  List("timestamp", "2018-01-11"),
  List("col2", "123")
)
val heads = values.map(_.head)   // extracts the headers of the columns
val cols  = values.map(_.tail)   // extracts the columns without headers
val rows  = cols(0).zip(cols(1)) // zips the two columns into a list of rows
rows.toDF(heads: _*)
This would work if the "values" contained two longer lists, but it does not generalize to more lists.
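A sketch of mine (assuming all inner lists have the same length and that all columns are strings) that generalizes to any number of lists by transposing and building an explicit schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val heads  = values.map(_.head)                             // column names
val rows   = values.map(_.tail).transpose.map(Row.fromSeq)  // one Row per line of data
val schema = StructType(heads.map(StructField(_, StringType, nullable = true)))
// spark is the active SparkSession (e.g. in spark-shell)
val df     = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)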

Spark (scala) dataframes - Check whether strings in column contain any items from a set

I'm pretty new to Scala and Spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")
).toDF("id", "words")
// dictionary Set of words to check
val dict = Set("foo", "bar", "baaad")
Now, I am trying to create a third column with the results of a comparison to see whether the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But I get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But I get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column while mixing different data types works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
| 1| foo| true|
| 2|barrio| true|
| 3|gitten| false|
| 4| baa| false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
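As an aside (my note, not from the answer above): on Spark 2.2+ a two-argument version can also be made to work by wrapping the set as a literal column with typedLit:
val checker = udf { (s: String, words: Seq[String]) => words.exists(s.contains(_)) }
df.withColumn("word_check", checker($"words", typedLit(dict.toSeq))).show()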
If your dict is large, you should not just reference it in your udf, because the entire dict is sent over the network for every task. I would broadcast your dict in combination with a udf:
import org.apache.spark.broadcast.Broadcast

def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}

df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")
df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")

Apache Spark DataFrame apply custom operation after GroupBy

I have 2 columns, say id and value; id is of type Int and value is of type List[String].
Ids are repeated, so to make them unique I apply groupBy("id") on my DataFrame. Now my problem is that I want to append the values to each other, and the value column must be distinct.
Example: I have data like
+---+---+
| id| v |
+---+---+
| 1|[a]|
| 1|[b]|
| 1|[a]|
| 2|[e]|
| 2|[b]|
+---+---+
and I want my output like this
+---+-----+
| id|    v|
+---+-----+
|  1|[a,b]|
|  2|[e,b]|
+---+-----+
I tried this:
val uniqueDF = df.groupBy("id").agg(collect_list("v"))
uniqueDF.map { row =>
  (row.getInt(0), row.getSeq[String](1).toList.distinct)
}
Can I do the same after groupBy(), say in agg() or something? I do not want to apply a map operation.
Thanks.
val uniqueDF = df.groupBy("id").agg(collect_set("v"))
A Set will only keep unique values.
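One caveat (my note, not part of the answer): if v is an array column as in the example, collect_set gathers distinct arrays rather than distinct elements. A sketch that collects distinct elements instead:
val uniqueDF = df
  .withColumn("v", explode($"v"))   // one row per element of the array
  .groupBy("id")
  .agg(collect_set("v").as("v"))    // distinct elements per id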