I have the following DataFrame df:
id | type | count
-------------------
1 | A | 2
2 | B | 4
I want to pass each row of this df as the input of the function saveObj.
df.foreach(row => {
  val list = List("id" -> row.get(0), "type" -> row.get(1))
  saveObj(list)
})
Inside saveObj I want to access list values as follows: list("id"), list("type").
How can I avoid using the indices of the columns, i.e. row.get(0) or row.get(1)?
You can use getAs, which expects a column name. By first creating a list of the column names you're interested in, you can map them to the desired list of tuples:
// can also use df.columns.toList to get ALL columns
val columns = List("id", "type")
df.foreach(row => {
  saveObj(columns.map(name => name -> row.getAs[Any](name)))
})
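Note that a plain List of tuples won't support lookups like list("id") inside saveObj, since List.apply takes an integer index. If key-based access is what you need, here is a minimal sketch assuming saveObj can accept a Map[String, Any] instead of a List:

val columns = List("id", "type")
df.foreach { row =>
  // toMap turns the (name, value) pairs into a Map, so saveObj can do obj("id"), obj("type")
  val obj: Map[String, Any] = columns.map(name => name -> row.getAs[Any](name)).toMap
  saveObj(obj)
}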
Alternatively, you can pattern match on the Row directly - but this still requires knowing the order of the columns in the Row and repeating the column names:
import org.apache.spark.sql.Row

df.foreach(_ match {
  case Row(id: Any, typ: Any, _) => saveObj(List("id" -> id, "type" -> typ))
})
Below is my dataframe:
val myDF= spark.sql("select company, comp_id from my_db.my_table")
myDF: org.apache.spark.sql.DataFrame = [company: string, comp_id: string]
And the data looks like
+----------+---------+
| company |comp_id |
+----------+---------+
|macys | 101 |
|jcpenny | 102 |
|kohls | 103 |
|star bucks| 104 |
|macy's | 105 |
+----------+---------+
I'm trying to create a Map collection object (like below) in Scala from the above dataframe.
Map("macys" -> "101", "jcpenny" -> "102" ..., "macy's" -> "105")
Questions:
1) Will the sequence of the dataframe records match the sequence of the content in the original file sitting under the table?
2) If I do a collect() on the dataframe, will the sequence of the array being created match the sequence of the content in the original file?
Explanation: When I do df.collect().map(t => t(0) -> t(1)).toMap, the resulting map collection object doesn't preserve the insertion order, which is the default behaviour of a Scala Map:
res01: scala.collection.immutable.Map[Any,Any] = Map(kohls -> 103, jcpenny -> 102 ...)
3) So, how can I convert the dataframe into one of Scala's map collections that actually preserves the insertion order/record sequence?
Explanation: LinkedHashMap is one of the Scala map collection types that ensures insertion order, so I'm trying to find a way to convert the dataframe into a LinkedHashMap object.
You can use LinkedHashMap. From its Scaladoc page:
"This class implements mutable maps using a hashtable. The iterator and all traversal methods of this class visit elements in the order they were inserted."
But DataFrames do not guarantee that the order will always be the same.
import scala.collection.mutable.LinkedHashMap

val myMap = LinkedHashMap[String, String]()
myDF.collect().foreach(t => myMap += (t(0).toString -> t(1).toString))
When you print myMap:
res01: scala.collection.mutable.LinkedHashMap[String,String] = Map(macys -> 101, ..)
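If you'd rather avoid the mutable accumulation, here is a minimal sketch, assuming the same myDF with both columns read as strings, that builds the LinkedHashMap in a single expression:

import scala.collection.mutable.LinkedHashMap

// LinkedHashMap preserves the insertion order of the tuples passed to it;
// the row order itself is still whatever collect() happens to return.
val orderedMap = LinkedHashMap(
  myDF.collect().map(r => r.getString(0) -> r.getString(1)): _*
)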
I have the below dataframe:
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
which looks like:
+-----+-----+-----------+----------+
|fname|lname|designation| company|
+-----+-----+-----------+----------+
|manuj|kumar| CEO| Info|
|Alice| Beb| Miniger| gogle|
| Ram|Kumar| Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns which need to be updated, so my requirement is to update all the columns of the dataframe (df) which are in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
The output must be like:
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar| CEO|Info Ltd|
|Alice| Bob| Manager| Google|
| Ram|Kumar| Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.
Since DataFrames are immutable, the replacement can be processed progressively, column by column: create a new DataFrame containing an intermediate column with the replaced values, drop the original column, rename the intermediate column to the original name, and carry the result forward.
To achieve all this, several steps will be necessary.
First, we'll need a UDF that returns the replacement value if the cell value occurs in the provided map, and the cell value itself otherwise:
import org.apache.spark.sql.functions.udf

def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that expects a DataFrame, a column name and its replacements map. This function produces a dataframe with a temporary column containing the replaced values, drops the original column, renames the temporary one to the original name and finally returns the produced DataFrame:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map("fname" -> fnameMap,
"lname" -> lnameMap,
"designation" -> designationMap,
"company" -> companyMap)
Finally, we can call foldLeft on this map (with the question's df as the initial value, here called sourceDf) in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(sourceDf) { case (alreadyReplaced, (colName, replacements)) =>
  replaceColumnValues(alreadyReplaced, colName, replacements)
}
replacedDf now contains the expected result.
To make the lookup dynamic at this level, you'll probably need to change the way you map your values to make them dynamically searchable. I would make a map of maps, with the keys being the names of the columns, as expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map("fname"->fnameMap,
"lname" -> lnameMap,
"designation" -> designationMap,
"company" -> companyMap)
This may make sense as the maps are relatively small, but you may need to consider using broadcast variables.
You can then dynamically look up based on field names.
[If you've seen that my Scala code is bad, it's because it is. So here's a Java version for you to translate.]
List<String> allColumns = Arrays.asList(dataFrame.columns());

df.map(row ->
    // this rewrites the row (that's a warning)
    RowFactory.create(
        allColumns.stream()
            .map(dfColumn -> {
                if (!colList.contains(dfColumn)) {
                    // column not requested for mapping, use old value
                    return row.get(allColumns.indexOf(dfColumn));
                } else {
                    Object colValue = row.get(allColumns.indexOf(dfColumn));
                    // in case of [2], you'd have to call:
                    // row.get(colListToDFIndex.get(dfColumn))
                    // Modified value
                    return allMaps.get(dfColumn)
                        // Assuming strings, you may need to cast
                        .getOrDefault(colValue, colValue);
                }
            })
            .collect(Collectors.toList())
            .toArray()
    )
);
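For reference, here is a rough Scala sketch of the same single-pass idea (my own sketch, not a tested implementation). It assumes the question's df, the allMaps lookup from above, a SparkSession named spark, and that every mapped column holds strings; going through the RDD avoids needing an Encoder:

import org.apache.spark.sql.Row

val allColumns = df.columns
val colsToMap = allMaps.keySet

// rebuild each row, replacing values only for the columns that have a map
val remappedRdd = df.rdd.map { row =>
  Row.fromSeq(allColumns.indices.map { i =>
    val c = allColumns(i)
    val v = row.get(i)
    if (colsToMap.contains(c)) allMaps(c).getOrElse(v.asInstanceOf[String], v.asInstanceOf[String])
    else v
  })
}

// reuse the original schema, since only cell values changed
val remappedDf = spark.createDataFrame(remappedRdd, df.schema)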
I am using Spark 1.6.1. Let's say my data frame looks like:
+------------+-----+----+
|categoryName|catA |catB|
+------------+-----+----+
| catA |0.25 |0.75|
| catB |0.5 |0.5 |
+------------+-----+----+
where categoryName has String type and the cat* columns are Double. I would like to add a column that will contain the value from the column whose name is in the categoryName column:
+------------+-----+----+-------+
|categoryName|catA |catB| score |
+------------+-----+----+-------+
| catA |0.25 |0.75| 0.25 | ('score' has the value from the column named 'catA')
| catB |0.5 |0.5 | 0.5 | ('score' has the value from the column named 'catB')
+------------+-----+----+-------+
I need such extraction for some later calculations. Any ideas?
Important: I don't know names of category columns. Solution needs to be dynamic.
Spark 2.0:
You can do this (for any number of category columns) by creating a temporary column which holds a map of categoryName -> categoryValue, and then selecting from it:
import org.apache.spark.sql.functions.{col, lit, map}
import spark.implicits._ // assumes a SparkSession named spark

// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")

// create a map of category -> value, and then select from that map using categoryName:
input
  .withColumn("asMap", map(catCols.flatMap(c => Seq(lit(c), col(c))): _*))
  .withColumn("score", $"asMap".apply($"categoryName"))
  .drop("asMap")
Spark 1.6: Similar idea, but using an array and a UDF to select from it:
import scala.collection.mutable
import org.apache.spark.sql.functions.{array, col, udf}

// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")

// UDF to select from the array by the index of colName in catCols
val getByColName = udf[Double, String, mutable.WrappedArray[Double]] {
  case (colName, colValues) =>
    val index = catCols.zipWithIndex.find(_._1 == colName).map(_._2)
    index.map(colValues.apply).getOrElse(0.0)
}

// create an array of category values and select from it using the UDF:
input
  .withColumn("asArray", array(catCols.map(col): _*))
  .withColumn("score", getByColName($"categoryName", $"asArray"))
  .drop("asArray")
You have several options:
If you are using Scala you can use the Dataset API, in which case you would simply use map to do the calculation.
You can move from the DataFrame to an RDD and use a map.
You can create a UDF which receives all relevant columns as input and does the calculation inside.
You can use a bunch of when/otherwise clauses to do the search, e.g. when($"categoryName" === "catA", $"catA").otherwise($"catB"); see the sketch below.
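Here is a minimal sketch of that last when/otherwise option, built dynamically over the category columns and assuming the same input DataFrame as above; a when with no match yields null, and coalesce picks the first non-null value:

import org.apache.spark.sql.functions.{coalesce, col, when}

val catCols = input.columns.filterNot(_ == "categoryName")
val scoreExpr = coalesce(catCols.map(c => when(col("categoryName") === c, col(c))): _*)

val withScore = input.withColumn("score", scoreExpr)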
I have a DataFrame from which I want to remove duplicates based on a key; the catch is that among the records sharing the same key, I need to select based on some columns, not just any record.
For example, my DF looks like:
+------+-------+-----+
|animal|country|color|
+------+-------+-----+
| Cat|america|white|
| dog|america|brown|
| dog| canada|white|
| dog| canada|black|
| Cat| canada|black|
| bear| canada|white|
+------+-------+-----+
Now I want to remove duplicates based on the column animal, preferring the records which have country 'america'.
My desired output should be:
+------+-------+-----+
|animal|country|color|
+------+-------+-----+
| Cat|america|white|
| dog|america|brown|
| bear| canada|white|
+------+-------+-----+
Since there is no reduceByKey in the DataFrame API, I convert this to a key-value pair RDD and then do a reduceByKey. I'm stuck on the function which will do this preference-based selection amongst the duplicates.
I'd prefer the sample code in Scala.
Provided that for every animal kind (dog, cat, ...) there is at least one entry for the country "america", and that you don't mind losing duplicate matching animals within america, you can use reduceByKey:
val animals = sc.parallelize(
  ("cat", "america", "white") :: ("dog", "america", "brown") ::
  ("dog", "canada", "white") :: ("dog", "canada", "black") ::
  ("cat", "canada", "black") :: ("bear", "canada", "white") :: Nil)

val animalsKV = animals.map { case (k, a, b) => k -> (a, b) }

animalsKV.reduceByKey {
  case (a @ ("america", _), _) => a
  case (_, b) => b
}
In case you might have animals with no entries in "america", the code above will take one of the duplicates (the last one reduced). You can improve it by providing a result that maintains the duplicates in those cases, e.g.:
animalsKV.combineByKey(
  // First-met entries are wrapped in a map from their country to their color
  Map(_),
  (m: Map[String, String], entry: (String, String)) =>
    if (entry._1 == "america") Map(entry) // If an animal in "america" is found, it should be the answer value
    else m + entry,                       // Otherwise, we keep track of the duplicates
  (a: Map[String, String], b: Map[String, String]) => // When combining maps...
    if (a contains "america") a           // If one of them contains "america"
    else if (b contains "america") b      // ... then we keep that map
    else a ++ b                           // Otherwise, we accumulate the duplicates
)
That code can be modified to keep track of duplicated "american" animals too.
I believe you can do what you want with DataFrames in Spark versions >= 1.4 using window functions (at least I think that's what they're called).
However, using RDDs:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val input: RDD[(String, String, Row)] = ???

val keyedByAnimal: RDD[(String, (String, Row))] =
  input.map { case (animal, country, other) => (animal, (country, other)) }

val result: RDD[(String, (String, Row))] = keyedByAnimal.reduceByKey { (x, y) =>
  if (x._1 == "america") x else y
}
The above gives you a single distinct value for each animal value. The choice of which value is non-deterministic; all that can be said is that if there exists a value for the animal with "america", one of those will be chosen.
Regarding your comment:
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???
val animalCol: String = ???
val countryCol: String = ???
val otherCols = df.columns.filter(c => c != animalCol && c != countryCol)

val rdd: RDD[(String, String, Row)] =
  df.select(animalCol, countryCol +: otherCols: _*).rdd
    .map(r => (r.getString(0), r.getString(1), r))
The select reorders the columns so the getString methods pull out the expected values.
Honestly though, look into Window Aggregations. I don't know much about them as I don't use Dataframes or Spark beyond 1.3.
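For completeness, here is a minimal window-function sketch (Spark 1.6+ syntax, assuming the example df with columns animal, country, color) that keeps one row per animal, preferring "america":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// rank rows within each animal so that "america" sorts first, then keep the top row
val w = Window.partitionBy("animal")
  .orderBy(when(col("country") === "america", 0).otherwise(1))

val deduped = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")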
I am trying to extract a column value into a variable so that I can use the value somewhere else in the code. I am trying the following:
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
How do I get the value?
The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
  .select("name")
  .collectAsList() // returns a java.util.List[Row]
Then for a row you could get the name as a String by:
val name = row.getString(0)
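For example, to take the first matching row from that list (a sketch, assuming the filter matched at least one row):

// names is a java.util.List[Row]
val name = names.get(0).getString(0)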
val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
With this snippet, you can extract all the values in a column into a string.
Modify the snippet with where clauses to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_row = df.select($"A").collect()
val col_val_str = col_val_row.map(x => x.get(0)).mkString(",")

/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_row: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The values of the entire column are stored in col_val_str:
col_val_str: String = 5,10
Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.
+-----+-------+
| id | name |
+-----+-------+
| 100 | Alex |
| 200 | Bidan |
| 300 | Cary |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first, then the name column is selected and finally stored into the name variable:
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value when you pass the dataframe and the desired column name to it (this is a Spark DataFrame, not a Pandas DataFrame). Before passing the dataframe to this function, filter is applied to filter out the other records.
def GetValueFromDataframe(_df, columnName):
    for row in _df.rdd.collect():
        return row[columnName].strip()

name = GetValueFromDataframe(df.filter(df.id == "100"), "name")
There might be a simpler approach than this using the 3.x version of Python. The code which I showed above was tested with the 2.7 version.
Note: It is likely that you will encounter an out-of-memory error (driver memory) since we use the collect function. Hence it is always recommended to apply transformations (like filter, where, etc.) before you call collect. If you still encounter a driver out-of-memory issue, you could pass --conf spark.driver.maxResultSize=0 as a command line argument to remove the limit on the size of results collected to the driver.
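For example (the application file name here is just a placeholder):

spark-submit --conf spark.driver.maxResultSize=0 your_app.py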
For anyone interested, below is a way to turn a column into an Array; for the case below we just take the first value.
val names = test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x => x.mkString).collect()
val name = names(0)
s is the string of column values.
.collect() converts the rows to an array of Rows; temp is basically an array of such Rows.
x(n-1) retrieves the n-th column value for the x-th row, which is by default of type "Any", so it needs to be converted to a String in order to be appended to the existing string.
s =""
// say the n-th column is the target column
val temp = test.collect() // converts Rows to array of list
temp.foreach{x =>
s += (x(n-1).asInstanceOf[String])
}
println(s)