I have a dataframe like this:
| dog | cat |
| -------- | -------- |
| Cell 1 | Cell 2 |
| Cell 3 | Cell 4 |
And a list like this:
dog, bulldog
cat, persian
I would like to create a function that finds the name of each column in the list and substitutes it with the second element (bulldog, persian).
So the final result should be:
| bulldog | persian |
| -------- | -------- |
| Cell 1 | Cell 2 |
| Cell 3 | Cell 4 |
You need to look up each original column name in the pre-defined list that you have shown. It's easier to create a Map out of it so lookups can be performed:
val list: List[(String, String)] = List(("dog", "bulldog"), ("cat", "persian"))
val columnMap = list.toMap
// columnMap: scala.collection.immutable.Map[String,String] = Map(dog -> bulldog, cat -> persian)
val originalCols = df.columns
val renamedCols = originalCols.map { c =>
  if (columnMap.contains(c)) s"$c as ${columnMap(c)}" else c
}
println(renamedCols.mkString(", "))
// dog as bulldog, cat as persian
df.selectExpr(renamedCols: _*).show(false)
// +-------+-------+
// |bulldog|persian|
// +-------+-------+
// |Cell 1 |Cell 2 |
// |Cell 3 |Cell 4 |
// +-------+-------+
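If you prefer to rename the columns in place instead of building selectExpr strings, the same columnMap can drive withColumnRenamed. A minimal sketch, assuming the df and columnMap defined above:
val renamedDf = columnMap.foldLeft(df) { case (acc, (oldName, newName)) =>
  // withColumnRenamed silently ignores names that are not present in the DataFrame
  acc.withColumnRenamed(oldName, newName)
}
renamedDf.show(false)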
I am loading an MLDataTable from a given .csv file. The data type of each column is inferred automatically from the content of the input file.
I need predictable, explicit types when I process the table later.
How can I enforce a certain type when loading a file or alternatively change the type in a second step?
Simplified Example:
import Foundation
import CreateML
// file.csv:
//
// value1,value2
// 1.5,1
let table = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/file.csv"))
print(table.columnTypes)
// actual output:
// ["value2": Int, "value1": Double] <--- type for value2 is 'Int'
//
// wanted output:
// ["value2": Double, "value1": Double] <--- how can I make it 'Double'?
Use MLDataColumn's map(to:) method to derive a new column from the existing one with the desired underlying type:
let squaresArrayInt = (1...5).map{$0 * $0}
var table = try! MLDataTable(dictionary: ["Ints" : squaresArrayInt])
print(table)
let squaresColumnDouble = table["Ints"].map(to: Double.self)
table.addColumn(squaresColumnDouble, named: "Doubles")
print(table)
Produces the following output:
Columns:
Ints integer
Rows: 5
Data:
+----------------+
| Ints |
+----------------+
| 1 |
| 4 |
| 9 |
| 16 |
| 25 |
+----------------+
[5 rows x 1 columns]
Columns:
Ints integer
Doubles float
Rows: 5
Data:
+----------------+----------------+
| Ints | Doubles |
+----------------+----------------+
| 1 | 1 |
| 4 | 4 |
| 9 | 9 |
| 16 | 16 |
| 25 | 25 |
+----------------+----------------+
[5 rows x 2 columns]
I have a dataframe like this:
|-----+-----+-------+---------|
| foo | bar | fox | cow |
|-----+-----+-------+---------|
| 1 | 2 | red | blue | // row 0
| 1 | 2 | red | yellow | // row 1
| 2 | 2 | brown | green | // row 2
| 3 | 4 | taupe | fuschia | // row 3
| 3 | 4 | red | orange | // row 4
|-----+-----+-------+---------|
I need to group the records by "foo" and "bar" and then perform some magical computation on "fox" and "cow" to produce "badger", which may insert or delete records:
|-----+-----+-------+---------+---------|
| foo | bar | fox | cow | badger |
|-----+-----+-------+---------+---------|
| 1 | 2 | red | blue | zebra |
| 1 | 2 | red | blue | chicken |
| 1 | 2 | red | yellow | cougar |
| 2 | 2 | brown | green | duck |
| 3 | 4 | red | orange | peacock |
|-----+-----+-------+---------+---------|
(In this example, row 0 has been split into two "badger" values, and row 3 has been deleted from the final output.)
My best approach so far looks like this:
val groups = df.select("foo", "bar").distinct
groups.flatMap(row => {
val (foo, bar): (String, String) = (row(0), row(1))
val group: DataFrame = df.where(s"foo == '$foo' AND bar == '$bar'")
val rowsWithBadgers: List[Row] = makeBadgersFor(group)
rowsWithBadgers
})
This approach has a few problems:
It's clumsy to match on foo and bar individually. (A utility method can clean that up, so not a big deal.)
It throws an Invalid tree: null\nnull error because of the nested operation in which I try to refer to df from inside groups.flatMap. Don't know how to get around that one yet.
I'm not sure whether this mapping and filtering actually leverages Spark distributed computation correctly.
Is there a more performant and/or elegant approach to this problem?
This question is very similar to Spark DataFrame: operate on groups, but I'm including it here because 1) it's not clear if that question requires addition and deletion of records, and 2) the answers in that question are out-of-date and lacking detail.
I don't see a way to accomplish this with groupBy and a user-defined aggregate function, because an aggregation function aggregates to a single row. In other words,
udf(<records with foo == 'foo' && bar == 'bar'>) => [foo,bar,aggregatedValue]
I need to possibly return two or more different rows, or zero rows after analyzing my group. I don't see a way for aggregation functions to do this -- if you have an example, please share.
A user-defined aggregate function (an Aggregator) could be used.
The single row returned per group can contain a list.
Then you can explode the list into multiple rows and reconstruct the columns.
The aggregator:
import org.apache.spark.sql.{DataFrame, Encoder, TypedColumn}
import org.apache.spark.sql.Encoders.kryo
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.explode
// spark.implicits._ should also be in scope for the $"..." column syntax used below
case class StuffIn(foo: BigInt, bar: BigInt, fox: String, cow: String)
case class StuffOut(foo: BigInt, bar: BigInt, fox: String, cow: String, badger: String)
object StuffOut {
def apply(stuffIn: StuffIn): StuffOut = new StuffOut(stuffIn.foo,
stuffIn.bar, stuffIn.fox, stuffIn.cow, "dummy")
}
object MultiLineAggregator extends Aggregator[StuffIn, Seq[StuffOut], Seq[StuffOut]] {
def zero: Seq[StuffOut] = Seq[StuffOut]()
def reduce(buffer: Seq[StuffOut], stuff: StuffIn): Seq[StuffOut] = {
  // makeBadgersForDummy stands in for your own logic and may emit zero or more rows
  makeBadgersForDummy(buffer, stuff)
}
def merge(b1: Seq[StuffOut], b2: Seq[StuffOut]): Seq[StuffOut] = {
b1 ++: b2
}
def finish(reduction: Seq[StuffOut]): Seq[StuffOut] = reduction
def bufferEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
def outputEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
}
The call:
val multiLineAgg: TypedColumn[StuffIn, Seq[StuffOut]] = MultiLineAggregator.toColumn
// ds is assumed to be a Dataset[StuffIn], e.g. obtained with df.as[StuffIn]
val res: DataFrame =
ds.groupByKey(x => (x.foo, x.bar))
.agg(multiLineAgg)
.map(_._2)
.withColumn("value", explode($"value"))
.withColumn("foo", $"value.foo")
.withColumn("bar", $"value.bar")
.withColumn("fox", $"value.fox")
.withColumn("cow", $"value.cow")
.withColumn("badger", $"value.badger")
.drop("value")
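An alternative that avoids the intermediate Seq column and the explode step is groupByKey followed by flatMapGroups, which may emit zero, one, or many rows per group. A sketch, reusing the StuffIn/StuffOut case classes above and assuming ds is the same Dataset[StuffIn], makeBadgersFor(rows: Seq[StuffIn]): Seq[StuffOut] is your own logic, and spark is your SparkSession:
import spark.implicits._

val res2 = ds.groupByKey(x => (x.foo, x.bar))
  .flatMapGroups { case (_, rows) =>
    // return any number of StuffOut rows for this (foo, bar) group
    makeBadgersFor(rows.toSeq)
  }
  .toDF()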
I have a Dataframe:
| subcategory | subcategory_label | category |
| 00EEE | 00EEE FFF | Drink |
| 0000EEE | 00EEE FFF | Fruit |
| 0EEE | 000EEE FFF | Meat |
from which I need to remove the leading 0's from every column in the Dataframe, to get a result like this:
| subcategory | subcategory_label | category |
| EEE | EEE FFF | Drink |
| EEE | EEE FFF | Fruit |
| EEE | EEE FFF | Meat |
So far, I am able to remove the leading 0's from one column using
df.withColumn("subcategory", regexp_replace(df("subcategory"), "^0*", "")).show
How can I remove the leading 0's from all the columns of the dataframe in one go?
With this as the provided dataframe:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF |0000EE 000FF |ABC |
+-----------+-----------------+--------+
You can create a regexp_replace for all the columns. Something like:
val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )
And then use select, since it takes a varargs of type Column:
df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF |EE 000FF |ABC |
+-----------+-----------------+--------+
EDIT:
Defining a function to return a regexp_replaced sequence is straightforward:
/**
 * @param origCols total cols in the DF, pass `df.columns`
 * @param replacedCols `Seq` of columns for which expression is to be generated
 * @return `Seq[org.apache.spark.sql.Column]` Spark SQL expression
 */
def createRegexReplaceZeroes(origCols : Seq[String], replacedCols: Seq[String] ) = {
origCols.map{ c =>
if(replacedCols.contains(c)) regexp_replace(col(c), "^0*", "" ).as(c)
else col(c)
}
}
This function returns a Seq[org.apache.spark.sql.Column].
Now, store the columns you want to replace in an Array:
val removeZeroes = Array( "subcategory", "subcategory_label" )
Then call the function with removeZeroes as the second argument. This returns the regexp_replace expressions for the columns listed in removeZeroes and plain column references for the rest:
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
You can use a UDF to do the same.
I feel it looks more elegant.
scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]
scala> df.show()
+------------+
| cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+
scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
| cols| newCols|
+------------+------------+
|000012340023| 12340023|
|000123400023| 123400023|
|001234000230| 1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+
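Another option, if you prefer not to build a full select list, is to fold over the target columns with withColumn. A sketch, assuming the original df from the question and the removeZeroes array defined in the earlier answer:
val cleaned = removeZeroes.foldLeft(df) { (acc, c) =>
  // overwrite each listed column with its leading zeros stripped
  acc.withColumn(c, regexp_replace(col(c), "^0*", ""))
}
cleaned.show(false)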
After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String] = List(lvalue1, lvalue2, lvalue3, ...)
And I have a Dataframe with n values:
dfield 1 | dfield 2 | dfield 3
___________________________
dvalue1 | dvalue2 | dvalue3
dvalue1 | dvalue2 | dvalue3
I want to append the values of the List at the beginning of my Dataframe, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield4 | dfield5 | dfield6
__________________________________________________________
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
I have found something using a UDF. Could this be correct for my purpose?
Regards.
TL;DR Use select or withColumn with lit function.
I'd use lit function with select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + values.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = values.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+
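If the original column names can stay as they are, a shorter variant (just a sketch, assuming the same values and dataset as above, org.apache.spark.sql.functions._ in scope, and the hypothetical names lfield 1..3 for the new columns) is to select the literal columns followed by everything else:
val litCols = values.zipWithIndex.map { case (v, idx) => lit(v).as(s"lfield ${idx + 1}") }
// prepend the literal columns, then keep all existing columns in their original order
dataset.select(litCols ++ dataset.columns.map(col): _*).show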
I have a data set like this, which I am fetching from a CSV file, but how do I store it in Scala to do the processing?
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234| 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
Answering your question, here is the solution. This code reads a CSV and orders it by the third column:
object CSVDemo extends App {
println("recent, freq, monitor")
val bufferedSource = io.Source.fromFile("./data.csv")
val list: Array[Array[String]] = (bufferedSource.getLines map { line => line.split(",").map(_.trim) }).toArray
val newList = list.sortBy(_(2))
newList map { line => println(line.mkString(" ")) }
bufferedSource.close
}
You read the file, parse it into an Array[Array[String]], order it by the third column, and print the result.
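Note that sortBy(_(2)) compares the third column as strings, so the ordering is lexicographic. If a numeric ordering is wanted, a small tweak (a sketch, assuming the file has no header row and the third column always parses as an Int) is:
val numericSorted = list.sortBy(_(2).toInt)
numericSorted foreach { line => println(line.mkString(" ")) }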
Here I am using the list to try to normalize each column one at a time and then concatenate them. Is there any other way to iterate column-wise and normalize them? Sorry, my coding is very basic.
val col1 = newList.map(line => line.head.toDouble)
val mi = col1.min
val ma = col1.max
println("minimum value of first column is " + mi)
println("maximum value of first column is: " + ma)
// min-max normalize the first column to the range [0, 1]
val scale = col1.map(x => (x - mi) / (ma - mi))
println("Here is the normalized range of first column of the data")
scale.foreach(println)
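To answer the column-wise part: one way (a sketch, assuming every field in newList parses as a number) is to transpose the array so that each inner array is a column, normalize each column, and transpose back:
val numeric: Array[Array[Double]] = newList.map(_.map(_.toDouble))
val normalizedCols = numeric.transpose.map { column =>
  val mi = column.min
  val ma = column.max
  // min-max normalization of a single column
  column.map(x => (x - mi) / (ma - mi))
}
// back to row orientation, matching the shape of the original data
val normalizedRows = normalizedCols.transpose
normalizedRows.foreach(row => println(row.mkString(" ")))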