I have added a new column seq_col containing a unique sequence using:
val df2 = dfFromRDD1.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val window = Window.orderBy(col("monotonically_increasing_id"))
val df3_consecutiveval = df2.withColumn("seq_col", row_number().over(window)).drop(col("monotonically_increasing_id"))
df3_consecutiveval.show()
dataframe:
col1 col2 seq_col
a aa 1
b ff 2
c rr 3
d yy 4
e tt 5
Now I want the values in that new column to be generated from a specified start and increment, as in the example below:
Ex:
Start = 100
increment = 3
dataframe:
col1 col2 seq_col
a aa 100
b ff 103
c rr 106
d yy 109
e tt 112
You can define a udf that calculates the id with the given logic, for instance in this case:
val step = 3          // increment by 3
val startOffset = 100 // you want it to start with 100
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step)
df.withColumn("seq_col", calculateId(row_number().over(window)))
This worked for me using some random dataframe.
The above answer is technically correct, but you should avoid using udfs whenever possible for performance reasons. This case is so simple that basic arithmetic will do the trick:
scala> val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> val start = 100
start: Int = 100
scala> val increment = 3
increment: Int = 3
scala> df.withColumn("seq_col", monotonically_increasing_id() * increment + start).show
+----+----+-------+
|col1|col2|seq_col|
+----+----+-------+
| a| aa| 100|
| b| ff| 103|
| c| rr| 106|
| d| yy| 109|
| e| tt| 112|
+----+----+-------+
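Note that monotonically_increasing_id() only yields consecutive values when the data sits in a single partition, as it does in this small example. If that is a concern, the same arithmetic can be applied on top of the row_number() approach from the question; here is a sketch reusing df, start and increment from the session above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// materialize the id first (as in the question), then order a window by it;
// an unpartitioned window pulls all rows into one partition, which is fine for small data
val withId = df.withColumn("mono_id", monotonically_increasing_id())
val window = Window.orderBy(col("mono_id"))

// row_number() is 1-based, so subtract 1 before scaling
withId
  .withColumn("seq_col", (row_number().over(window) - 1) * increment + start)
  .drop("mono_id")
  .show()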
Related
I am creating a new attribute x as the concatenation of attributes a + b + c. If the total length of x is less than 10, then I want to prefix only attribute c with 0s. How can I do this?
val x = when(length(concat($"a", $"b", $"c")) < 10,
    concat($"a", $"b", lpad($"c", 10, "0")))
  .otherwise(concat($"a", $"b", $"c"))
The above won't work, as column c alone is padded with 0s to length 10, whereas I want the overall length after the concat to be 10. Please suggest.
You can use an SQL expression for the lpad:
val df = Seq(
  ("aaa", "bbb", "cccc"),
  ("a", "b", "c"),
  ("a", "b", "1234567890123"),
  ("a", "b", "")
).toDF("a", "b", "c")

df.withColumn("x", when(length(concat($"a", $"b", $"c")) < 10,
    concat($"a", $"b", expr("lpad(c, 10 - char_length(a) - char_length(b), '0')")))
    .otherwise(concat($"a", $"b", $"c")))
  .show()
Output:
+---+---+-------------+---------------+
| a| b| c| x|
+---+---+-------------+---------------+
|aaa|bbb| cccc| aaabbbcccc|
| a| b| c| ab0000000c|
| a| b|1234567890123|ab1234567890123|
| a| b| | ab00000000|
+---+---+-------------+---------------+
Logically what you want is the following:
val ab = concat($"a", $"b")
val x = concat(ab, lpad($"c", lit(10) - length(ab), lit("0")))
// ^ Not possible ^ Not possible
Unfortunately this isn't possible because the Spark Scala lpad function has the signature str: Column, len: Int, pad: String and you can't provide Column objects as the len and pad parameters.
If you examine the underlying Catalyst StringLPad expression type, it doesn't require constants for the len and pad parameters. This means we can define our own version of lpad that allows Column values to be passed as len and pad, so that the length and padding string can vary for each row.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.StringLPad
def lpad_var(str: Column, len: Column, pad: Column) =
new Column(StringLPad(str.expr, len.expr, pad.expr))
val ab = concat($"a", $"b")
val x = concat(ab, lpad_var($"c", lit(10) - length(ab), lit("0")))
Here is an example output with some edge cases:
val df = Seq(
("a", "bb", "c"),
("aa", "bbb", "c"),
("", "", "")
).toDF("a", "b", "c")
df.select(x.as("x")).show()
// +----------+
// | x|
// +----------+
// |abb000000c|
// |aabbb0000c|
// |0000000000|
// +----------+
So I have a huge data frame which is a combination of individual tables. It has an identifier column at the end which specifies the table number, as shown below:
+----------------------------+
| col1 col2 .... table_num |
+----------------------------+
| x y 1 |
| a b 1 |
| . . . |
| . . . |
| q p 2 |
+----------------------------+
(original table)
I have to split this into multiple smaller dataframes based on table_num. The number of tables combined to create this is pretty large, so it's not feasible to create the disjoint subset dataframes individually. I was thinking that a for loop iterating from the min to the max value of table_num could achieve this task, but I can't seem to do it. Any help is appreciated.
This is what I came up with
for (x <- min(table_num) to max(table_num)) {
  var df(x) = spark.sql("select * from df1 where state = x")
  df(x).collect()
}
but I don't think the declaration is right.
So essentially what I need is DataFrames that look like this:
+-----------------------------+
| col1 col2 ... table_num |
+-----------------------------+
| x y 1 |
| a b 1 |
+-----------------------------+
+------------------------------+
| col1 col2 ... table_num |
+------------------------------+
| xx xy 2 |
| aa bb 2 |
+------------------------------+
+-------------------------------+
| col1 col2 ... table_num |
+-------------------------------+
| xxy yyy 3 |
| aaa bbb 3 |
+-------------------------------+
... and so on ...
(how I would like the Dataframes split)
In Spark, Arrays can hold almost any data type. When declared as vars you can dynamically add and remove elements from them. Below I isolate the table nums into their own array so I can easily iterate through them. Once isolated, I go through a while loop to add each table as a unique element to the DF holder array. To query the elements of the array use DFHolderArray(n-1), where n is the position you want to query, with 0 being the first element.
//This will turn the distinct table nums into a queryable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect
//Build the iterator
var iterator = 1
//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF
//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF)
//loop while you have not reached the end of the array
while (iterator <= tableIDArray.length) {
  //Call the table that is stored in that location of the array
  tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
  //Fluff
  interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
  //If logic to overwrite or append the DF
  DFHolderArray = if (iterator == 1) {
    Array(interimDF)
  } else {
    DFHolderArray ++ Array(interimDF)
  }
  iterator = iterator + 1
}
//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
The approach is to collect all unique keys and build the respective data frames. I added some functional flavor to it.
Sample dataset:
name,year,country,id
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7746
Borussia Dortmund,2014,Germany,7746
Borussia Mönchengladbach,2014,Germany,7746
Schalke 04,2014,Germany,7746
Schalke 04,2014,Germany,7753
Lazio,2014,Germany,7753
Code:
val df = spark.read.format(source = "csv")
.option("header", true)
.option("delimiter", ",")
.option("inferSchema", true)
.load("groupby.dat")
import spark.implicits._
import org.apache.spark.sql.functions.col
import scala.collection.mutable.ListBuffer
//collect data for each key into a data frame
val uniqueIds = df.select("id").distinct().map(x => x.mkString.toInt).collect()
// List buffer to hold separate data frames
var dataframeList: ListBuffer[org.apache.spark.sql.DataFrame] = ListBuffer()
println(uniqueIds.toList)
// filter data
uniqueIds.foreach(x => {
val tempDF = df.filter(col("id") === x)
dataframeList += tempDF
})
//show individual data frames
for (tempDF1 <- dataframeList) {
tempDF1.show()
}
One approach would be to write the DataFrame as partitioned Parquet files and read them back into a Map, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", 1), ("c", "d", 1), ("e", "f", 1),
("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")
val filePath = "/path/to/parquet/files"
df.write.partitionBy("table_num").parquet(filePath)
val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)
val dfMap = ( for { n <- tableNumList } yield
(n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
).toMap
To access the individual DataFrames from the Map:
dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | a| b| 1|
// | c| d| 1|
// | e| f| 1|
// +---+---+---------+
dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | g| h| 2|
// | i| j| 2|
// +---+---+---------+
I am loading a file into a dataframe in Spark on Databricks:
spark.sql("""select A,X,Y,Z from fruits""")
A X Y Z
1E5 1.000 0.000 0.000
1U2 2.000 5.000 0.000
5G6 3.000 0.000 10.000
I need the output as:
A D
1E5 X 1
1U2 X 2, Y 5
5G6 X 3, Z 10
I was able to find a solution.
Each column name can be joined with its value, and then all the values can be joined into one column, separated by commas:
// imports
import org.apache.spark.sql.functions.{col, concat, lit, when}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// data
val df = Seq(
("1E5", 1.000, 0.000, 0.000),
("1U2", 2.000, 5.000, 0.000),
("5G6", 3.000, 0.000, 10.000))
.toDF("A", "X", "Y", "Z")
// action
val columnsToConcat = List("X", "Y", "Z")
val columnNameValueList = columnsToConcat.map(c =>
when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
.otherwise("")
)
val valuesJoinedByComaColumn = columnNameValueList.reduce((a, b) =>
when(org.apache.spark.sql.functions.length(a) =!= 0 && org.apache.spark.sql.functions.length(b) =!= 0, concat(a, lit(", "), b))
.otherwise(concat(a, b))
)
val result = df.withColumn("D", valuesJoinedByComaColumn)
.drop(columnsToConcat: _*)
Output:
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
This solution is similar to the one proposed by stack0114106, but is more explicit.
Check this out:
scala> val df = Seq(("1E5",1.000,0.000,0.000),("1U2",2.000,5.000,0.000),("5G6",3.000,0.000,10.000)).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: double ... 2 more fields]
scala> df.show()
+---+---+---+----+
| A| X| Y| Z|
+---+---+---+----+
|1E5|1.0|0.0| 0.0|
|1U2|2.0|5.0| 0.0|
|5G6|3.0|0.0|10.0|
+---+---+---+----+
scala> val newcol = df.columns.drop(1).map( x=> when(col(x)===0,lit("")).otherwise(concat(lit(x),lit(" "),col(x).cast("int").cast("string"))) ).reduce( (x,y) => concat(x,lit(", "),y) )
newcol: org.apache.spark.sql.Column = concat(concat(CASE WHEN (X = 0) THEN ELSE concat(X, , CAST(CAST(X AS INT) AS STRING)) END, , , CASE WHEN (Y = 0) THEN ELSE concat(Y, , CAST(CAST(Y AS INT) AS STRING)) END), , , CASE WHEN (Z = 0) THEN ELSE concat(Z, , CAST(CAST(Z AS INT) AS STRING)) END)
scala> df.withColumn("D",newcol).withColumn("D",regexp_replace(regexp_replace('D,", ,",","),", $", "")).drop("X","Y","Z").show(false)
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
I have the following problem:
A DataFrame containing col1 with strings A, B, or C.
A second col2 with an Integer.
And three other columns col3, col4 and col5 (these columns are also named A, B, and C).
Thus,
col1 - col2 - A (col3) - B (col4) - C (col5)
|--------------------------------------------
A      6
B      5
C      6

should become

col1 - col2 - A (col3) - B (col4) - C (col5)
|--------------------------------------------
A      6      6
B      5                 5
C      6                            6
Now I would like to go through each row and assign the integer in col2 to the column A, B or C based on the entry in col1.
How do I achieve this?
I cannot use df.withColumn() (or at least I do not know how), and the same holds for val df2 = df.map(x => x).
Looking forward to your help and thanks in advance!
Best, Ken
Create a mapping between key and target column:
val mapping = Seq(("A", "col3"), ("B", "col4"), ("C", "col5"))
Use it to generate sequence of columns:
import org.apache.spark.sql.functions.when
import spark.implicits._
val exprs = mapping.map { case (key, target) =>
when($"col1" === key, $"col2").alias(target) }
Prepend star and select:
val df = Seq(("A", 6), ("B", 5), ("C", 6)).toDF("col1", "col2")
df.select($"*" +: exprs: _*)
The result is:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A| 6| 6|null|null|
| B| 5|null| 5|null|
| C| 6|null|null| 6|
+----+----+----+----+----+
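If col3, col4 and col5 already exist in the DataFrame (as the question describes), a small variation folds the same mapping in with coalesce so the non-matching columns keep whatever they already contain. This is only a sketch, with a hypothetical input where the three columns start out null:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
import spark.implicits._

// hypothetical input where col3, col4 and col5 already exist (all null here)
val dfWithCols = Seq(("A", 6), ("B", 5), ("C", 6))
  .toDF("col1", "col2")
  .withColumn("col3", lit(null).cast("int"))
  .withColumn("col4", lit(null).cast("int"))
  .withColumn("col5", lit(null).cast("int"))

// write col2 into the matching target column; non-matching columns keep their existing value
val filled = mapping.foldLeft(dfWithCols) { case (acc, (key, target)) =>
  acc.withColumn(target, coalesce(when(col("col1") === key, col("col2")), col(target)))
}

filled.show()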
I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 1 18
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based on this answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe), I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
You should try the following:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col
def sum_(*cols):
return reduce(add, cols, lit(0))
columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))
Both will produce null if there is a missing value in the row. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
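For example, here is a small Scala sketch of the coalesce variant, reusing the columnstosum defined above (it treats nulls as 0, which may or may not be what you want):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, lit}

// wrap each column in coalesce(_, 0) so a single null does not null out the whole sum
def sumIgnoringNulls(cols: Column*): Column =
  cols.map(c => coalesce(c, lit(0))).foldLeft(lit(0))(_ + _)

df.select(sumIgnoringNulls(columnstosum: _*).as("sums"))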
I assume you have a dataframe df. Then you can sum up all columns except your ID column. This is helpful when you have many columns and you don't want to list all of their names manually, as in the answers above. This post has the same answer.
import org.apache.spark.sql.functions.col

val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
Here's an elegant solution using Python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will inspire something similar in Scala Spark ... anyone?
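For what it's worth, a rough Scala counterpart of the same one-liner would look like this (a sketch, assuming the first column is the ID as in the Python version):
import org.apache.spark.sql.functions.col

// sum every column except the first (the ID), mirroring OldDF.columns[1:]
val newDF = df.withColumn("sums", df.columns.drop(1).map(col(_)).reduce(_ + _))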