Spark dataframe get column value into a string variable - scala

I am trying to extract a column value into a variable so that I can use it somewhere else in the code. I am trying the following:
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
How do I get the actual value?

The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
.select("name")
.collectAsList() // returns a java.util.List[Row]
Then, for each row, you can get the name as a String with:
val name = row.getString(0)
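Putting it together, a minimal sketch that guards against an empty result (headOption avoids an exception when no row matches):
val maybeName: Option[String] = test
  .filter(test("id").equalTo("200"))
  .select("name")
  .collect()            // Array[Row] on the driver
  .headOption           // None if nothing matched
  .map(_.getString(0))  // the "name" cell as a String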

val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
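If you need the value with its type rather than as Any, Row also has typed getters; a small sketch, assuming export_time is a timestamp column:
val maxExportTime = maxDate.getTimestamp(0)
// or: maxDate.getAs[java.sql.Timestamp]("export_time")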

With this snippet, you can extract all the values of a column into a single string.
Modify the snippet with where clauses to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")
/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_df: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The values of the entire column are now stored in col_val_str:
col_val_str: String = 5,10
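For example, adding a where clause to pick only the rows where B equals 2 (a sketch based on the same df):
val filtered_str = df.where($"B" === 2).select($"A").collect().map(x => x.get(0)).mkString(",")
// filtered_str: String = 5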

Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.
+-----+-------+
| id | name |
+-----+-------+
| 100 | Alex |
| 200 | Bidan |
| 300 | Cary |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first, then the name column is selected and finally stored into the name variable.
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
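Note that mkString("") silently yields an empty string when no row matches the given id; an alternative sketch that fails fast instead (first() throws an exception if no row matches):
val name = df.filter($"id" === "100").select("name").first().getString(0)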
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value when you pass it the dataframe and the desired column name (this is a Spark DataFrame, not a Pandas DataFrame). Before passing the dataframe to this function, a filter is applied to drop the other records.
def GetValueFromDataframe(_df, columnName):
    for row in _df.rdd.collect():
        return row[columnName].strip()

name = GetValueFromDataframe(df.filter(df.id == "100"), "name")
There might be a simpler approach than this in Python 3.x; the code shown above was tested with Python 2.7.
Note :
You are likely to encounter an out-of-memory error (driver memory) since we use the collect function, so it is always recommended to apply transformations (like filter, where, etc.) before you call collect. If you still hit a driver out-of-memory issue, you can pass --conf spark.driver.maxResultSize=0 as a command-line argument to remove the limit on the total size of results collected to the driver.
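For example, the setting can be passed on submission (the script name here is just a placeholder):
spark-submit --conf spark.driver.maxResultSize=0 your_job.py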

For anyone interested, below is a way to turn a column into an Array; in this case we just take the first value.
val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)
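An equivalent sketch that stays in the typed Dataset API instead of dropping to the RDD (assumes spark.implicits._ is in scope):
import spark.implicits._
val names = test.filter(test("id").equalTo("200")).select("name").as[String].collect()
val name = names(0)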

s is the string of concatenated column values.
.collect() brings the rows back to the driver as an Array[Row]; temp is that array of Rows.
x(n-1) retrieves the n-th column value of row x, which is of type Any by default, so it needs to be cast to String before being appended to the existing string.
s =""
// say the n-th column is the target column
val temp = test.collect() // converts Rows to array of list
temp.foreach{x =>
s += (x(n-1).asInstanceOf[String])
}
println(s)
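A more idiomatic sketch of the same concatenation without the mutable variable (assuming the target column is a String column):
val s = test.collect().map(_.getString(n - 1)).mkString
println(s)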

Related

How to pass date values from dataframe to query in Spark /Scala

I am reading the data from the Store table, which is in Snowflake. I want to pass the date from the dataframe maxdatefromtbl to my query in Spark SQL to filter records.
This condition (s"CREATED_DATE!='$maxdatefromtbl'") is not working as expected
var retail = spark.read.format("snowflake").options(options).option("query","Select MAX(CREATED_DATE) as CREATED_DATE from RSTORE").load()
val maxdatefromtbl = retail.select("CREATED_DATE").toString
var retailnew = spark.read.format("snowflake").options(options).option("query","Select * from RSTORE").load()
var finaldataresult = retailnew.filter(s"CREATED_DATE!='$maxdatefromtbl'")
Select a single value from the retail dataframe to use in the filter.
val maxdatefromtbl = retail.select("CREATED_DATE").collect().head.getString(0)
var finaldataresult = retailnew.filter(col("CREATED_DATE") =!= maxdatefromtbl)
The type of retail.select("CREATED_DATE") is DataFrame, and DataFrame.toString returns the schema rather than the value of the single row you have. Please see the following example from a Spark shell.
scala> val s = Seq(1, 2, 3).toDF()
scala> s.select("value").toString
res0: String = [value: int]
In the first line of the code snippet above, collect() returns the rows of the dataframe, a single row in your case, as an array; head takes the first element of that array, and .getString(0) reads the value of the cell at index 0 as a String. Please see the DataFrame and Row documentation pages for more information.
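Continuing in the shell, a sketch of the collect-based extraction on a single-row dataframe of strings (the output shown is illustrative):
scala> val d = Seq("2017-01-01").toDF("CREATED_DATE")
scala> d.select("CREATED_DATE").collect().head.getString(0)
res1: String = 2017-01-01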

check data size spark dataframes

I have the following question :
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error :
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types I don't have any idea about how to implement it as we are using dataframes. Any idea about a function verifying the data format ?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but if there are multiple double quotes you can read the file as a textFile, remove them, and convert to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn with the length function and adapt as needed.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|martialSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
Hope this helps!
You are using a dataframe, so when you use the map method you are processing a Row in your lambda; line is a Row.
Row.toString returns a string representation of the Row, which in your case contains two struct fields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getString or getAs[String].
Usually when you work with DataFrames, you work in column logic as in SQL, using select, where, etc., or the SQL syntax directly.
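For example, a sketch of the length check using map over Rows and getString (assuming both columns were read as strings, with job at index 0 and marital at index 1):
import spark.implicits._
val sizes = data.map(row => (row.getString(0).length, row.getString(1).length))
sizes.show(false)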

How to update multiple columns of Dataframe from given set of maps in Scala?

I have below dataframe
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
or
+-----+-----+-----------+----------+
|fname|lname|designation| company|
+-----+-----+-----------+----------+
|manuj|kumar| CEO| Info|
|Alice| Beb| Miniger| gogle|
| Ram|Kumar| Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns which need to be updated, so my requirement is to update all the columns of the dataframe (df) which are in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
Output must be like
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar| CEO|Info Ltd|
|Alice| Bob| Manager| Google|
| Ram|Kumar| Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.
Since DataFrames are immutable, in this example the replacement can be processed progressively, column by column, by creating a new DataFrame containing an intermediate column with the replaced values, then renaming this column to the initial name and finally overwriting the original DataFrame.
To achieve all this, several steps will be necessary.
First, we'll need a udf that returns a replacement value if it occurs in the provided map:
def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that expects a DataFrame, a column name and its replacements map. This function produces a dataframe with a temporary column, containing replaced values, drops the original column, renames the temporary one to the original name and finally returns the produced DataFrame:
def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map("fname" -> fnameMap,
"lname" -> lnameMap,
"designation" -> designationMap,
"company" -> companyMap)
Finally, we can call foldLeft on this map in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(sourceDf) { case (alreadyReplaced, toReplace) =>
  replaceColumnValues(alreadyReplaced, toReplace._1, toReplace._2)
}
replacedDf now contains the expected result; here sourceDf stands for the original df defined in the question.
To make the lookup dynamic at this level, you'll probably need to change the way you map your values to make them dynamically searchable. I would make maps of maps, with keys being the names of the columns, as expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map("fname"->fnameMap,
"lname" -> lnameMap,
"designation" -> designationMap,
"company" -> companyMap)
This may make sense as the maps are relatively small, but you may need to consider using broadcast variables.
You can then dynamically look up based on field names.
[If you've seen that my Scala code is bad, it's because it is. So here's a Java version for you to translate.]
List<String> allColumns = Arrays.asList(dataFrame.columns());
df
    .map(row ->
        // this rewrites the row (that's a warning)
        RowFactory.create(
            allColumns.stream()
                .map(dfColumn -> {
                    if (!colList.contains(dfColumn)) {
                        // column not requested for mapping, use old value
                        return row.get(allColumns.indexOf(dfColumn));
                    } else {
                        Object colValue =
                            row.get(allColumns.indexOf(dfColumn));
                        // in case of [2], you'd have to call:
                        // row.get(colListToDFIndex.get(dfColumn))
                        // Modified value
                        return allMaps.get(dfColumn)
                            // Assuming strings, you may need to cast
                            .getOrDefault(colValue, colValue);
                    }
                })
                .collect(Collectors.toList())
                .toArray()
        )
    );
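A rough Scala sketch of the same row-rewriting idea, assuming allMaps from above and colList materialized into a Set of column names; RowEncoder is used here to supply the Encoder[Row] that Dataset.map needs (newer Spark versions may offer Encoders.row(df.schema) instead):
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val allColumns = df.columns
val colSet = colList.toSet                        // colList is an Iterator in the question
implicit val rowEncoder = RowEncoder(df.schema)   // schema is unchanged, only values are replaced

val result = df.map { row =>
  Row.fromSeq(allColumns.toSeq.map { c =>
    val v = row.get(allColumns.indexOf(c))
    if (colSet.contains(c))
      // assuming String-typed columns, as in the Java version
      allMaps(c).getOrElse(v.asInstanceOf[String], v.asInstanceOf[String])
    else
      v
  })
}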

Compare 2 dataframes and filter results based on date column in spark

I have 2 dataframes in spark as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id,name,age,audit_dt
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that. I am able to compare audit_dt with a literal date using the lit function, but I am not able to compare it with a column of another dataframe.
I am able to compare with a literal date using the lit function as mentioned below:
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
Max Date in test dataframe is -> 2017-04-26
Data in test1 Dataframe ->
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id=3, since only that row satisfies the greater-than criterion against the max date.
I have already tried below mentioned option but it is not working.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest as way to compare it with column of dataframe test?
Thanks
You can use a non-equi join, if both columns "test_dt" and "audit_dt" are of date type.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use in filter.
val dtValue = test.select("test_dt").first().getString(0)
OR
val dtValue = test.first().getString(0)
Now apply filter
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))

Dynamically select column content based on other column from the same row

I am using Spark 1.6.1. Let's say my data frame looks like:
+------------+-----+----+
|categoryName|catA |catB|
+------------+-----+----+
| catA |0.25 |0.75|
| catB |0.5 |0.5 |
+------------+-----+----+
where categoryName is of String type and the cat* columns are Double. I would like to add a column that will contain the value from the column whose name is in the categoryName column:
+------------+-----+----+-------+
|categoryName|catA |catB| score |
+------------+-----+----+-------+
| catA       |0.25 |0.75| 0.25  | ('score' takes its value from column 'catA')
| catB       |0.5  |0.5 | 0.5   | ('score' takes its value from column 'catB')
+------------+-----+----+-------+
I need such extraction for some later calculations. Any ideas?
Important: I don't know names of category columns. Solution needs to be dynamic.
Spark 2.0:
You can do this (for any number of category columns) by creating a temporary column which holds a map of categoryName -> categoryValue, and then selecting from it:
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// create a map of category -> value, and then select from that map using categoryName:
input
  .withColumn("asMap", map(catCols.flatMap(c => Seq(lit(c), col(c))): _*))
  .withColumn("score", $"asMap".apply($"categoryName"))
  .drop("asMap")
Spark 1.6: Similar idea, but using an array and a UDF to select from it:
import scala.collection.mutable

// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// UDF to select from the array by the index of colName in catCols
val getByColName = udf[Double, String, mutable.WrappedArray[Double]] {
  case (colName, colValues) =>
    val index = catCols.zipWithIndex.find(_._1 == colName).map(_._2)
    index.map(colValues.apply).getOrElse(0.0)
}
// create an array of category values and select from it using the UDF:
input
  .withColumn("asArray", array(catCols.map(col): _*))
  .withColumn("score", getByColName($"categoryName", $"asArray"))
  .drop("asArray")
You have several options:
If you are using Scala you can use the Dataset API, in which case you would simply use map to do the calculation.
You can move from the dataframe to an RDD and use a map.
You can create a UDF which receives all relevant columns as input and does the calculation inside.
You can use a chain of when/otherwise clauses to do the search, e.g. when($"categoryName" === "catA", $"catA").otherwise($"catB"); a dynamic sketch follows below.
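A sketch of the when/otherwise approach built dynamically over the category columns with foldLeft (assuming the input DataFrame and the catCols sequence from the answer above; the null fallback is an assumption for rows whose categoryName matches no column):
import org.apache.spark.sql.functions.{col, lit, when}

val catCols = input.columns.filterNot(_ == "categoryName")
// chain one when(...) per category column, falling back to null if nothing matches
val scoreCol = catCols.foldLeft(lit(null).cast("double")) { (acc, c) =>
  when(col("categoryName") === c, col(c)).otherwise(acc)
}
val withScore = input.withColumn("score", scoreCol)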