How to convert Seq[Row] to a DataFrame in Scala

Is there any way to convert a Seq[Row] into a DataFrame in Scala?
I have a DataFrame and a list of strings that holds the weight of each row in the input DataFrame. I want to build a DataFrame that includes only the rows with unique weights.
I was able to filter the unique rows and append them to a Seq[Row], but I want to build a DataFrame.
This is my code. Thanks in advance.
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.collection.mutable.HashSet

// `val` is a reserved word in Scala, so the weights parameter is named `vals` here.
def dataGenerator(input: DataFrame, vals: List[String]): Dataset[Row] = {
  val valitr = vals.iterator       // one weight per row of `input`, in order
  var testdata = Seq[Row]()        // rows whose weight has not been seen before
  val valset = HashSet[String]()   // weights seen so far
  input.collect().foreach { r =>
    val valnxt = valitr.next()
    if (!valset.contains(valnxt)) {
      valset += valnxt
      testdata = testdata :+ r
    }
  }
  // logic to convert testdata to a DataFrame and return it
  ???
}
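The missing conversion itself (the placeholder above) can be done with createDataFrame, reusing the input's schema. A minimal sketch, assuming the collected rows should keep exactly the input's columns:

import scala.collection.JavaConverters._

// Build a DataFrame from the collected rows, reusing the schema of the input DataFrame.
val resultDF = input.sparkSession.createDataFrame(testdata.asJava, input.schema)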

You said that 'val is calculated using fields from inputdf itself'. If this is the case then you should be able to make a new dataframe with a new column for the 'val' like this:
+------+------+
|item  |weight|
+------+------+
|item 1|w1    |
|item 2|w2    |
|item 3|w2    |
|item 4|w3    |
|item 5|w4    |
+------+------+
This is the key thing. Then you will be able to work on the dataframe instead of doing a collect.
What is bad about doing a collect? Well, there is no point in going to the trouble and overhead of using a distributed big-data processing framework just to pull all the data into the memory of one machine. See here: Spark dataframe: collect() vs select()
When you have the input dataframe how you want it, as above, you can get the result. Here is a way that works, which groups the data by the weight column and picks the first item for each grouping.
val result = input
  .rdd                                // get the underlying RDD
  .groupBy(r => r.get(1))             // group by the "weight" field
  .map(x => x._2.head.getString(0))   // take the first "item" for each weight
  .toDF("item")                       // back to a DataFrame (needs import spark.implicits._)
Then you get only the first item in case of duplicated weights:
+------+
|item  |
+------+
|item 1|
|item 2|
|item 4|
|item 5|
+------+
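If you prefer to stay in the DataFrame API end to end, a similar result can be had with dropDuplicates; note that which row survives for each duplicated weight is not guaranteed, unlike the groupBy/head version above. A minimal sketch, assuming the input already has the item and weight columns shown earlier:

import org.apache.spark.sql.DataFrame

// Keep one arbitrary row per distinct weight, then project the item column.
def uniqueByWeight(input: DataFrame): DataFrame =
  input.dropDuplicates("weight").select("item")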

Related

How can I split a column off of a DataFrame, but keep its association with the initial DataFrame?

I have a dataframe dataDF that is:
+-------+------+-----+-----+-----------+
|TEST_PK| COL_1|COL_2|COL_3|h_timestamp|
+-------+------+-----+-----+-----------+
|      1| apple|   10| 1.79|       1111|
|      1| apple|   11| 1.79|       1114|
|      2|banana|   15| 1.79|       1112|
|      2|banana|   16| 1.79|       1115|
|      3|orange|    7| 1.79|       1113|
+-------+------+-----+-----+-----------+
And I need to run this function:
operation(row, h_timestamp)
On each row, but row cannot contain h_timestamp, so my first thought is to split the dataframe like:
val columns = dataDF.drop("h_timestamp")
val timestamp = dataDF.select("h_timestamp")
But that doesn't help when I want to perform the operation on every row like:
dataDF.map(row => {
...
val rowWithoutTimestamp = ???
val timestamp = ???
operation(rowWithoutTimestamp, timestamp)
...
})
But now those two dataframes are not linked and I don't know how to get the right timestamp for each row. The TEST_PK column is not necessarily unique.
Is there a way to use .drop() or .select() on just a row or some other way to do this?
Edit: Also, the table could have any number of columns, but will always have the timestamp column and at least one more that is not the timestamp
Since you have what appears to be a primary key column, just fork the timestamp with the id column into its own dataframe to re-join it later.
val tsDF = dataDF.select("TEST_PK", "h_timestamp")
Then, drop the column from dataDF, do your operation, and re-join the h_timestamp back onto a new dataframe.
val finalDF = postopDF.join(tsDF, "TEST_PK")
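Putting those two snippets together, a hedged sketch of the whole fork/re-join flow might look like this; postopDF stands in for whatever per-row processing you run, and it must keep TEST_PK so the join can re-attach the timestamp:

val tsDF = dataDF.select("TEST_PK", "h_timestamp")   // key + timestamp, kept aside
val noTsDF = dataDF.drop("h_timestamp")              // rows without the timestamp

// ... run your per-row processing on noTsDF here, producing postopDF ...
val postopDF = noTsDF                                // placeholder for that processing

val finalDF = postopDF.join(tsDF, "TEST_PK")         // timestamp re-attached by key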
Update
The sample code is helpful; you should be able to essentially disassemble your row and rebuild a new row with the desired values, with something like this:
dataDF.map { row =>
  val rowWithoutTimestamp = Row(
    row.getAs[Long]("TEST_PK"),
    row.getAs[String]("COL_1"),
    row.getAs[Long]("COL_2"),
    row.getAs[Double]("COL_3")
  )
  val timestamp = row.getAs[Long]("h_timestamp")
  val result = operation(rowWithoutTimestamp, timestamp)
  Row(result, timestamp)
} // note: mapping a DataFrame to Row requires supplying an Encoder[Row] (e.g. RowEncoder with the output schema)
Of course, I'm not certain what your operation() returns, so it may be necessary to disassemble result into individual values and compose a new row with those and the timestamp.
Update 2
OK, here is a more generic method. It wraps "all columns except h_timestamp" into a struct and maps over the (struct, ts) tuple. It's actually more elegant than the previous solution anyway.
val cols = dataDF.drop("h_timestamp").columns.toSeq   // every column name except the timestamp

dataDF
  .select(struct(cols.map(c => col(c)): _*).as("row_no_ts"), $"h_timestamp")
  .map { row =>
    val rowWithoutTimestamp = row.getAs[Row]("row_no_ts")
    val timestamp = row.getAs[Long]("h_timestamp")
    operation(rowWithoutTimestamp, timestamp)
  }
I'm not sure if you are mapping to just the output of operation() or some combination with the timestamp again, but both are available to modify to suit your needs.
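If operation() returns a simple value (say a Double, an assumption here) and you want the timestamp next to it, a hedged variation is to map to a tuple, for which Spark can derive an encoder from spark.implicits._; this reuses cols from above and the usual org.apache.spark.sql.functions imports:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._   // tuple encoder and the $-column syntax

val resultDF = dataDF
  .select(struct(cols.map(c => col(c)): _*).as("row_no_ts"), $"h_timestamp")
  .map { row =>
    val ts = row.getAs[Long]("h_timestamp")
    (operation(row.getAs[Row]("row_no_ts"), ts), ts)   // keep (result, timestamp)
  }
  .toDF("result", "h_timestamp")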

Spark: How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+-----------------+
|              key|
+-----------------+
|Monthly Beginning|
|Annual Percentage|
+-----------------+
df2
+------+-----------------------------+
|   key|                        Value|
+------+-----------------------------+
|  Date|                     1/1/2018|
|  Date| Monthly Beginning on Tuesday|
|Number|Annual Percentage Rate for...|
|Number|                         17.5|
+------+-----------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work, and I'm not surprised; I think it helps show what I am trying to do in case my previous explanation was not sufficient. Thank you for your help ahead of time.
Convert the first dataframe df1 to a List[String], then create a UDF and apply the filter condition.
Spark shell:
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
|   key|   Value|
+------+--------+
|  Date|1/1/2018|
|Number|    17.5|
+------+--------+
Scala IDE:
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key: String, Value: String)

val df2Ds = df2.as[s]   // needs import spark.implicits._ for the encoder of `s`
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str: String): Boolean = {
  // df1List is the List[String] of keys built earlier
  for (i <- df1List) {
    if (str.contains(i))
      return false
  }
  true
}

df2Ds.filter(s => check(s.Value)).collect
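If df1List were large, broadcasting it would keep Spark from shipping a copy of it with every task. A sketch of the same filter with a broadcast variable, assuming `spark` is the SparkSession and df1List is the key list from above:

val keysB = spark.sparkContext.broadcast(df1List)

df2Ds.filter(r => !keysB.value.exists(key => r.Value.contains(key))).show()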

Dynamically pass arguments to a function in Scala

I have records as strings with 1000 comma-delimited fields in a dataframe, like:
"a,b,c,d,e.......upto 1000" - 1st record
"p,q,r,s,t ......upto 1000" - 2nd record
I am using the solution suggested in this Stack Overflow question:
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(
    $"_tmp".getItem(0).as("col1"),
    $"_tmp".getItem(1).as("col2"),
    $"_tmp".getItem(2).as("col3"))
  .drop("_tmp")
However, in my case I have 1000 columns, which I have in a JSON schema and can retrieve like:
val column_seq: Seq[String] = Schema_func.map(_.name)
for (i <- 0 to column_seq.length - 1) { println(i + " " + column_seq(i)) }
which returns like
0 col1
1 col2
2 col3
3 col4
Now I need to pass all these indexes and column names to the DataFrame code below
df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(
    $"_tmp".getItem(0).as("col1"),
    $"_tmp".getItem(1).as("col2"),
    $"_tmp".getItem(2).as("col3"))
  .drop("_tmp")
specifically in the part
$"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"),
As I can't write out the long statement with all 1000 columns, is there an effective way to pass all these arguments from the above-mentioned JSON schema to the select function, so that I can split the columns, add the headers, and then convert the DF to parquet?
You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
val columns: Seq[Column] = Schema_func.map(_.name)
.zipWithIndex // attach index to names
.map { case (name, index) => $"_tmp".getItem(index) as name }
val result = df
.withColumn("_tmp", split($"columnToSplit", "\\."))
.select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// |  a|  b|  c|  d|  e|
// +---+---+---+---+---+
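For completeness, the snippet above assumes the usual imports; with those in place, the result can be written straight to parquet, which is the final step the question mentions. The output path below is just a placeholder:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.split
import spark.implicits._   // `spark` is the active SparkSession; gives $"..." and .toDF

// Write the split-out columns as parquet (placeholder path).
result.write.parquet("/tmp/split_records.parquet")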

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, i.e. they all have the same length. I also have a list of column names, Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a Spark context to use createDataFrame. I only want to transform the existing data frame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features             |
//+---+---------------------+
//|1  |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
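On Spark 3.0 and later there is also a built-in helper that replaces the hand-written UDF; a sketch, assuming that version is available:

import org.apache.spark.ml.functions.vector_to_array

// Built-in VectorUDT -> array<double> conversion (Spark 3.0+), then the same select as above.
val dfArr2 = df.withColumn("featuresArr", vector_to_array($"features"))
dfArr2.select(sqlExpr: _*).show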

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo   | bar   |
+-------+-------+
| 12345 | fnord |
| 42    | baz   |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, then the following should be your solution.
val df = Seq(
(12345, "fnord"),
(42, "baz"))
.toDF("foo", "bar")
This creates the dataframe you already have.
+-----+-----+
|  foo|  bar|
+-----+-----+
|12345|fnord|
|   42|  baz|
+-----+-----+
The next step is to extract the dataType of each column from the dataframe's schema into a list.
val fieldTypesList = df.schema.map(struct => struct.dataType)
Then convert each dataframe row into a list of values on the RDD and map each value to its dataType from the list created above.
val dfList = df.rdd.map(row => row.toString().replace("[","").replace("]","").split(",").toList)
val tuples = dfList.map(list => list.map(value => (value, fieldTypesList(list.indexOf(value)))))
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
This is something you can iterate over and work with programmatically.
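The Row-to-string parsing above is fragile (it breaks if a value itself contains a comma, and indexOf misbehaves on duplicate values), so here is a sturdier sketch that zips each row's values with the schema's types directly:

// Pair every value with the simple name of its column's data type, e.g. "int", "string".
val fieldTypes = df.schema.map(_.dataType.simpleString)

val tuples = df.rdd.map { row =>
  row.toSeq.zip(fieldTypes).map { case (value, tpe) => (s"$value", tpe) }
}

tuples.collect().foreach(println)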