How to replace multiple specific column values in a Spark DataFrame? - scala

I am trying to replace or update some specific column values in a DataFrame. Since a DataFrame is immutable, I am trying to transform it into a new DataFrame instead of updating or replacing values in place.
I tried dataframe.replace as explained in the Spark docs, but it gives me the error: value replace is not a member of org.apache.spark.sql.DataFrame
I tried the option below. To pass multiple values, I am passing them in an array:
val new_df= df.replace("Stringcolumn", Map(array("11","17","18","10"->"12")))
but I am getting the error:
error: overloaded method value array with alternatives
Help is really appreciated!!

To access org.apache.spark.sql.DataFrameNaFunctions such as replace, you have to go through .na on the DataFrame. So your code should look something like this:
df.na.replace("Stringcolumn", Map("10" -> "12", "11" -> "17"))
See here for the full list of DataFrameNaFunctions and how to use them.
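For the multiple-values case from the question, where several old values should all become "12", a minimal sketch could look like the following; it assumes Stringcolumn really is a string-typed column.
// Each old value is listed explicitly as a key; the key and value types of
// the Map must match the column type, hence strings here (an assumption).
val new_df = df.na.replace("Stringcolumn",
  Map("11" -> "12", "17" -> "12", "18" -> "12", "10" -> "12"))
new_df.show()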

Related

Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

I have an RDD that I want to group according to some key, but it just doesn't work. I am a Scala and Spark beginner. So I have the following RDD:
rdd: RDD[WikipediaArticle]
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Tuple2(lang, article) } else { None } })).filter(_ != None)
meinVal.collect.foreach(println) gives:
(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))
I have two questions:
Why can I not apply the groupByKey function? It is an RDD that contains a list of tuples, and the first tuple entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.
I notice that when I use an IDE and hover over "meinVal", it shows that it is RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the RDD contains just one big Object; I just don't see why that is.
Anyone? Please?
Irene
Ok, so thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatmap-map construct, but the condition in the map instruction. In my code I returned "None" if the condition was not met. Since None is not of type tuple I get an RDD[Object] and therefore I cannot use groupByKey.
To solve this I use Option and then flatten the rdd to get rid of the Option and its Nones again.
val meinVal = rdd.flatMap(article => langs.map(lang => if (article.mentionsLanguage(lang)) Some(Tuple2(lang, article)) else None).flatten)
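For context, here is a minimal self-contained sketch of the fix; WikipediaArticle, langs, and the sample data are assumptions reconstructed from the question, not the actual course code.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class WikipediaArticle(id: Int, text: String) {
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}
val langs = List("Java", "Scala")

val spark = SparkSession.builder.getOrCreate()
val rdd: RDD[WikipediaArticle] = spark.sparkContext.parallelize(Seq(
  WikipediaArticle(2, "Scala and Java run on the JVM"),
  WikipediaArticle(3, "Scala is not purely functional")))

// Every element is now a (String, WikipediaArticle) tuple, so the inferred
// type is RDD[(String, WikipediaArticle)] and groupByKey compiles.
val meinVal: RDD[(String, WikipediaArticle)] =
  rdd.flatMap(article => langs.map(lang =>
    if (article.mentionsLanguage(lang)) Some((lang, article)) else None).flatten)
val grouped = meinVal.groupByKey()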

How to convert a java.io list to a DataFrame in Scala?

I'm using this code to get the list of files in a directory, and I want to call the toDF method that works when converting lists to DataFrames. However, because this is a java.io list, it's saying it won't work.
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
When I try to do
files.toDF.show()
I get this error:
How can I get this to work? Can someone help me with the code to convert this java.io List to a regular list?
Thanks
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
The code above returns an Int, and you are trying to convert an Int value to a DataFrame, which isn't possible. If I understand correctly, you want to convert the list of .csv files to a DataFrame. Please use the code below:
val files = Option(new java.io.File("data").list).get.filter(x => x.endsWith(".csv")).toList
import spark.implicits._
files.toDF().show()
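For completeness, here is a self-contained sketch of the same idea; it assumes a SparkSession is available and guards against the "data" directory being missing.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// getOrElse avoids a NoSuchElementException when "data" does not exist
// (File.list returns null in that case, so the Option is None).
val csvFiles: List[String] = Option(new java.io.File("data").list)
  .getOrElse(Array.empty[String])
  .filter(_.endsWith(".csv"))
  .toList

csvFiles.toDF("filename").show()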

How to convert a specific function to a udf function in apache spark with scala? [duplicate]

This question already has answers here:
How to find common elements among two array columns?
(3 answers)
Closed 4 years ago.
I have a data frame in Apache Spark, created using Scala. This data frame has two columns of type Array[String]. I have written a simple function which takes those two columns and returns the number of common words (the size of the intersection, as an Int).
One example of my data frame is shown below.
[screenshot: data frame example with its columns]
The function is the following one:
def findNumberCommonWordsTitle(string1: Array[String], string2: Array[String]) = {
  val intersection = string1.intersect(string2)
  intersection.length
}
I want to convert this function to a udf. I have tried this:
val fncwt=udf(findNumberCommonWordsTitle(_:Array[String],_:Array[String]))
finalDF.select(fncwt(finalDF("title_from_words"),finalDF("title_to_words"))).show(5)
but I am getting an error as below:
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
What am I doing wrong? I think the problem is a type mismatch, but I am not sure.
After that, I want to create a new column on my data frame with the returned values of the function above.
How can I achieve that? What should I do to fix the error?
Thanks in advance
The function should be
def findNumberCommonWordsTitle(string1: Seq[String], string2: Seq[String]) ={
...
}
Reference: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#data-types
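Putting that together, here is a minimal sketch; the names finalDF, title_from_words, and title_to_words come from the question, and the rest is an assumption about how the pieces fit.
import org.apache.spark.sql.functions.{col, udf}

// Spark passes array columns to a Scala UDF as Seq[String] (a WrappedArray),
// which is why declaring Array[String] causes the ClassCastException above.
def findNumberCommonWordsTitle(string1: Seq[String], string2: Seq[String]): Int =
  string1.intersect(string2).length

val fncwt = udf(findNumberCommonWordsTitle _)

finalDF
  .withColumn("common_words", fncwt(col("title_from_words"), col("title_to_words")))
  .show(5)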

pyspark FPGrowth doesn't work with RDD

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset is coming from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with Method does not exist:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
data=data.rdd
Now I start getting some strange pickle serializer errors.
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I started looking at the types. In the example, the data is run through a flatMap, which returns a different type than my RDD.
RDD Type returned by flatmap: pyspark.rdd.PipelinedRDD
RDD Type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Thanks!
Well, my query was wrong; I changed it to use collect_set, and then
I managed to get around the type error by doing:
data = data.map(lambda row: row[0])

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like [A00001]. I was wondering if there is an appropriate way to convert a column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE].
P.S. Due to automatic conversion, you can skip the .rdd part.
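For example, if the column holds strings, a typed variant of the snippet above might look like this (YOUR_COLUMN_NAME is still a placeholder):
// Casting each Row's first field yields List[String] instead of List[Any].
val values: List[String] = dataFrame.select("YOUR_COLUMN_NAME")
  .rdd
  .map(r => r(0).asInstanceOf[String])
  .collect()
  .toList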
With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert the values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(one, two, three)
What happens now? We are collecting data to Driver with collect() and picking element zero from each record.
This may not be an excellent way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
//res10: List[Any] = List(one, two, three)
How is it better? We have distributed map transformation load among the workers rather than a single Driver.
I know rdd.map(r => r(0)) may not seem elegant to you. So, let's address that in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
//res11: List[String] = List(one, two, three)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, because of encoder issues on a DataFrame, so we end up using r => r.getString(0); this may be addressed in future versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are more effective; finally, the 3rd one is both effective and elegant (I'd think).
Databricks notebook
I know the answer given and asked for is for Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. a DataFrame containing a column named "Raw".
To get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala with Spark 2+, try this (assuming your column name is "s"; as[String] needs import spark.implicits._ for the encoder):
df.select("s").as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given any solution in Java (Real Programming Language), here it is.
You can thank me later.
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts it to a list.
Beware of using the list on a huge data set; it will decrease performance.
It is good to check the data.
Below is for Python-
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is the Java answer:
df.select("id").collectAsList();