I want to normalize author names by removing accents.
Input: orčpžsíáýd
Output: orcpzsiayd
The code below should let me achieve this, but I am not sure how to do it using Spark functions when my input is a DataFrame column.
import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.Column

def stringNormalizer(c: Column): String = {
  StringUtils.stripAccents(c.toString)
}
This is how I would like to call it:
val normalizedAuthor = flat_author.withColumn("NormalizedAuthor",
stringNormalizer(df_article("authors")))
I have just started learning Spark, so please let me know if there is a better way to achieve this without UDFs.
It requires a UDF:
import org.apache.spark.sql.functions.{col, udf}

val stringNormalizer = udf((s: String) => StringUtils.stripAccents(s))
df_article.select(stringNormalizer(col("authors")))
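If you prefer to keep the existing columns and just add the normalized one (as the withColumn call in the question intends), the same UDF can be used like this:

val normalizedAuthor = df_article.withColumn("NormalizedAuthor",
  stringNormalizer(col("authors")))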
Although it doesn't look as pretty, I found that it took half the amount of time to remove accents like this without a UDF:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace, upper}

def withColumnFormated(columnName: String)(df: DataFrame): DataFrame = {
  val dfWithColumnUpper = df.withColumn(columnName, upper(col(columnName)))
  val accents: Map[String, String] = Map(
    "[ÃÁÀÂÄ]" -> "A", "[ÉÈÊË]" -> "E", "[ÍÌÎÏ]" -> "I",
    "[Ñ]" -> "N", "[ÓÒÔÕÖ]" -> "O", "[ÚÙÛÜ]" -> "U",
    "[Ç]" -> "C")

  accents.foldLeft(dfWithColumnUpper) { case (tempDf, (pattern, replacement)) =>
    tempDf.withColumn(columnName, regexp_replace(col(columnName), pattern, replacement))
  }
}
And then you can apply it like this:
df_article.transform(withColumnFormated("authors"))
I have a scenario in which I iterate over a list of DataFrames, perform the same kind of transformation on each one using a for loop, and store each transformed DataFrame in a Map[String, DataFrame].
var dfMap: Map[String, DataFrame] = Map.empty

for (df <- dfList) {
  // perform some transformation on the DataFrame
  dfMap = dfMap + ("some_name" -> df)
}
This works, but only sequentially. I want to use asynchronous/parallel execution to improve performance, so that the transformations on each DataFrame run in parallel and make use of Spark's distributed processing capabilities.
Check the code below.
def logic(df: DataFrame): Map[String, DataFrame] = {
  // transform df and return a Map[String, DataFrame]
}

val dfa = // DataFrame 1
val dfb = // DataFrame 2
val dfc = // DataFrame 3

Seq(dfa, dfb, dfc)
  .par              // process the DataFrames in parallel
  .map(logic)       // invoke the logic function for every DataFrame
  .reduce(_ ++ _)   // merge into one result, e.g. Map("aaa" -> dfa, "bbb" -> dfb, "ccc" -> dfc)
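For completeness, a minimal sketch of how the names in that final Map might be produced: pair each DataFrame with its name before the parallel map. namedLogic and its placeholder transformation are hypothetical, not part of the original answer.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// hypothetical variant of logic that also receives the key to use in the result Map
def namedLogic(name: String, df: DataFrame): Map[String, DataFrame] = {
  val transformed = df.withColumn("processed", lit(true)) // placeholder transformation
  Map(name -> transformed)
}

Seq("aaa" -> dfa, "bbb" -> dfb, "ccc" -> dfc)
  .par
  .map { case (name, df) => namedLogic(name, df) }
  .reduce(_ ++ _)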
Update
def writeToMap(a: Int, i: Int) = Map(a -> i)
def doOperation(a: Int) = writeToMap(a, a + 10)

val list = Seq.range(0, 33)
list.par.map(x => doOperation(x))
val dfList: List[DataFrame] = // your DataFrame list
val dfMap: Map[String, DataFrame] = dfList.map("some_name" -> _).toMap
.map pairs each element with a name, and .toMap aggregates the result into a Map.
Note: some_name must be unique for every DataFrame; duplicate keys overwrite each other, so only the last entry for a given key survives in the Map.
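If there is no natural name for each DataFrame, an index-based key is a simple way to guarantee uniqueness (a sketch; the df_ prefix is arbitrary):

val dfMap: Map[String, DataFrame] =
  dfList.zipWithIndex.map { case (df, i) => s"df_$i" -> df }.toMap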
I am working with Spark and Scala, and there is a requirement to save a Map[String, String] to disk so that a different Spark application can read it:
(x,1),(y,2)...
To Save:
sc.parallelize(itemMap.toSeq).coalesce(1).saveAsTextFile(fileName)
I am doing a coalesce as the data is only 450 rows.
But when I read it back, I am not able to convert it to a Map[String, String]:
val myMap = sc.textFile(fileName).zipWithUniqueId().collect.toMap
the data comes as
((x,1),0),((y,2),1)...
What is the possible solution?
Thanks.
Loading a text file results in an RDD[String], so you will have to deserialize the string representations of the tuples yourself.
You can either parse the "(v1,v2)" strings as they are, or change your save operation to write the two values with a plain delimiter between them.
val d = spark.sparkContext.textFile(fileName)

val myMap = d.map { s =>
  // strip the surrounding parentheses, then split on the comma
  val parsedVals = s.substring(1, s.length - 1).split(",")
  (parsedVals(0), parsedVals(1))
}.collect.toMap
Alternatively, you can change your save operation to write the pairs with a delimiter (like a comma) and parse them that way:
sc.parallelize(itemMap.toSeq).map(kv => kv._1 + "," + kv._2).coalesce(1).saveAsTextFile(fileName)
val myMap = spark.sparkContext.textFile(fileName)
  .map(_.split(","))
  .map(d => (d(0), d(1)))
  .collect.toMap
Method "collectAsMap" exists in "PairRDDFunctions" class, means, applicable only for RDD with two values RDD[(K, V)].
If this function call is required, can be organized with code below. Dataframe is used for store in csv format ant avoid hand-made parsing
import spark.implicits._   // needed for toDF on the RDD

val originalMap = Map("x" -> 1, "y" -> 2)

// write
spark.sparkContext.parallelize(originalMap.toSeq).coalesce(1).toDF("k", "v").write.csv(path)

// read
val restoredDF = spark.read.csv(path)
val restoredMap = restoredDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()

println("restored map: " + restoredMap)
Output:
restored map: Map(y -> 2, x -> 1)
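Note that after the CSV round trip the values come back as strings. If the original values were Ints and you need them as Ints again, cast while rebuilding the map (a small sketch based on the code above):

val restoredIntMap = restoredDF.rdd
  .map(r => (r.getString(0), r.getString(1).toInt))
  .collectAsMap()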
I'm trying to create a Spark Scala UDF in order to transform MongoDB objects of the following shape:
Object:
{
  "1": 50.3,
  "8": 2.4,
  "117": 1.0
}
Into a Spark ML SparseVector.
The problem is that, in order to create a SparseVector, I need one more input parameter: its size.
And in my app I keep the Vector sizes in a separate MongoDB collection.
So, I defined the following UDF function:
val mapToSparseVectorUdf = udf {
  (myMap: Map[String, Double], size: Int) => {
    val vb: VectorBuilder[Double] = new VectorBuilder(length = -1)
    vb.use(myMap.keys.map(key => key.toInt).toArray, myMap.values.toArray, size)
    vb.toSparseVector
  }
}
And I was trying to call it like this:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), vecSize)).drop("MapColumn")
However, my IDE says "Not applicable" to that udf call.
Is there a way to make this kind of UDF that can take an extra parameter?
UDF functions require columns to be passed as arguments, and those columns are converted to and from primitive data types through serialization and deserialization. That's why UDF functions are relatively expensive.
If vecSize is an Integer constant, then you can simply use the lit built-in function:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), lit(vecSize))).drop("MapColumn")
This will do it:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

def mapToSparseVectorUdf(vectorSize: Int) = udf[Vector, Map[String, Double]](
  (myMap: Map[String, Double]) => {
    val elements = myMap.toSeq.map { case (index, value) => (index.toInt, value) }
    Vectors.sparse(vectorSize, elements)
  }
)
Usage:
val data = spark.createDataFrame(Seq(
  ("1", Map("1" -> 50.3, "8" -> 2.4)),
  ("2", Map("2" -> 23.5, "3" -> 41.2))
)).toDF("id", "MapColumn")

data.withColumn("VecColumn", mapToSparseVectorUdf(10)($"MapColumn")).show(false)
NOTE:
Consider fixing your MongoDB schema! ;) The size is a member of a SparseVector; I wouldn't separate it from its elements.
I have a DataFrame with the columns [CUSTOMER_ID, itemType, eventTimeStamp, valueType], which I convert to an RDD[(String, (String, String, Map[String, Int]))] by doing the following:
val tempFile = result.map { r =>
  val customerId = r.getAs[String]("CUSTOMER_ID")
  val itemType = r.getAs[String]("itemType")
  val eventTimeStamp = r.getAs[String]("eventTimeStamp")
  val valueType = r.getAs[Map[String, Int]]("valueType")
  (customerId, (itemType, eventTimeStamp, valueType))
}
Since my input is huge, this takes a lot of time. Is there any efficient way to convert the DataFrame to an RDD[(String, (String, String, Map[String, Int]))]?
The operation you've described is about as cheap as it's going to get: a few getAs calls and a few tuple allocations are almost free. If it is slow, that is probably unavoidable given your large data size (7T). Also note that Catalyst optimizations cannot be performed on RDDs, so including this kind of .map downstream of DataFrame operations will often prevent other Spark shortcuts.
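One practical consequence of that last point: do the column-level work in the DataFrame API first and only then drop down to the RDD, so Catalyst can still prune and push down everything up to that boundary. A sketch, only useful if the upstream DataFrame carries more columns than the four you need:

// select only the needed columns first; everything before .rdd stays visible to Catalyst
val slim = result.select("CUSTOMER_ID", "itemType", "eventTimeStamp", "valueType")

val tempFile = slim.rdd.map { r =>
  (r.getAs[String]("CUSTOMER_ID"),
    (r.getAs[String]("itemType"), r.getAs[String]("eventTimeStamp"), r.getAs[Map[String, Int]]("valueType")))
}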
I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}

val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _)

val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't like arrays, so this script errors out. I then tried defining a new partially applied function and creating the UDF from that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException, and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used on a DataFrame?
On a separate note, is there any explanation for why doing something as simple as applying a function to a DataFrame is so complicated (define the function, redefine it as a UDF, etc.)?
Most likely not the prettiest solution but you can try something like this:
def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}

df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
You could also try an array of literals:
val getCategory = udf(
  (input: String, categories: Array[String]) => categories(input.toInt))

df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note using Map instead of Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}

val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")

df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray map(lit _): _*)
myCategories(lit(colArray), col("myInput"))
See also Spark UDF with varargs
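As a footnote on the array-of-literals variants above: depending on the Spark version, an array column arrives inside a UDF as a Seq (a WrappedArray) rather than an Array, so declaring the parameter as Seq[String] is the safer choice. A sketch:

import org.apache.spark.sql.functions.{array, col, lit, udf}

// the array column is received as a Seq inside the UDF
val myCategoriesSeq = udf((input: String, categories: Seq[String]) => categories(input.toInt))

df.withColumn("newCategory", myCategoriesSeq(col("myInput"), array(myArray.map(lit(_)): _*)))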