I have a Spark application and I implemented a DataFrame extension,
def transform: DataFrame => DataFrame
so an app developer can pass custom transformations into my framework, like
builder.load(path).transform(_.filter(col("sample") === lit("")))
Now I want to track what happened during Spark execution:
Log:
val df = spark.read()
val df2 = df.filter(col("sample") === lit(""))
...
So the idea is to keep a log of actions and pretty-print it at the end, but to do this I need to somehow get at the content of the DataFrame => DataFrame function. Possibly macros can help me, but I am not sure. I don't actually need the code (though I'd appreciate it), just a direction to go in.
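To make the setup concrete, here is a simplified sketch of the extension and of the kind of function a developer passes in (the names here are illustrative, not my real framework):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Simplified sketch of the extension point (same shape as the built-in Dataset.transform).
implicit class TransformOps(df: DataFrame) {
  def transform(f: DataFrame => DataFrame): DataFrame = f(df)
}

// A developer hands me an opaque DataFrame => DataFrame value; at this point I only
// see the function object, which is why I wonder whether macros could recover its body.
val custom: DataFrame => DataFrame = _.filter(col("sample") === lit(""))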
I'm currently writing test cases for Spark with Mockito, and I'm mocking a SparkContext which gets wholeTextFiles called on it. I have something like this:
val rdd = sparkSession.sparkContext.makeRDD(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
doReturn(rdd).when(mockContext).wholeTextFiles(testPath)
However, I keep getting an error saying wholeTextFiles is supposed to return an Int:
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: ParallelCollectionRDD cannot be returned by wholeTextFiles$default$2()
wholeTextFiles$default$2() should return int
I know this isn't the case; the Spark docs say that wholeTextFiles returns an RDD object. Any tips on how I can fix this? I can't have my doReturn be of type Int, because then the rest of my function fails, since I turn the wholeTextFiles output into a DataFrame.
Resolved this with some alternative approaches. It works just fine if I do the Mockito.when() ... thenReturn pattern, but I decided instead to change the entire test case to start a local SparkSession and load some sample files into an RDD, because that's a more in-depth test of my function, imo.
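For anyone who lands here, a rough sketch of the Mockito.when(...).thenReturn(...) variant that worked before I switched approaches; sparkSession and testPath are assumed to be set up as in the question:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.mockito.Mockito.{mock, when}

val mockContext = mock(classOf[SparkContext])

// wholeTextFiles returns RDD[(path, fileContents)], so the fake mirrors that shape.
val fakeFiles: RDD[(String, String)] =
  sparkSession.sparkContext.makeRDD(Seq(("file1.txt", "some contents")))

when(mockContext.wholeTextFiles(testPath)).thenReturn(fakeFiles)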
I'm having huge problems creating a simple graph in Spark GraphX. I really don't understand what's going on, so I try everything that I find, but nothing works.
For example, I am trying to reproduce the steps from here.
The following two lines were OK:
val flightsFromTo = df_1.select($"Origin",$"Dest")
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString))
But after this I get an error:
val airportVertices: RDD[(VertexId, String)] = airportCodes.distinct().map(x => (MurmurHash.stringHash(x), x))
Error: missing Parameter type
Could you please tell me what is wrong?
And by the way, why MurmurHash? What is the purpose of it?
My guess is that you are working from a three-year-old tutorial with a recent Spark version.
The sqlContext read returns a Dataset instead of an RDD.
If you want it like the tutorial, use .rdd instead:
val airportVertices: RDD[(VertexId, String)] = airportCodes.rdd.distinct().map(x => (MurmurHash3.stringHash(x), x))
or change the type of the variable:
val airportVertices: Dataset[(Int, String)] = airportCodes.distinct().map(x => (MurmurHash3.stringHash(x), x))
You could also check out https://graphframes.github.io/ if you are interested in graphs and Spark.
Updated
To create a Graph you need vertices and edges.
To make computation easier, all vertices have to be identified by a VertexId (in essence a Long).
MurmurHash is used to create very good unique hashes. More info here: MurmurHash - what is it?
Hashing is a best practice to distribute the data without skewing, but there is no technical reason why you couldn't use an incremental counter for each vertex, as sketched below.
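For example, a small sketch of the counter alternative, using the airportCodes Dataset from the question:

import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// zipWithIndex assigns a sequential Long id to each distinct airport code.
val airportVerticesByCounter: RDD[(VertexId, String)] =
  airportCodes.rdd.distinct().zipWithIndex().map { case (code, id) => (id, code) }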
I've looked at the tutorial, and the only thing you have to change to make it work is to add .rdd:
val flightsFromTo = df_1.select($"Origin",$"Dest").rdd
val airportCodes = df_1.select($"Origin", $"Dest").flatMap(x => Iterable(x(0).toString, x(1).toString)).rdd
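And for completeness, a sketch of how the pieces could be assembled into a Graph, assuming the .rdd versions of flightsFromTo and airportVertices from above:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3

// Hash the same way on both sides so edge endpoints line up with the vertex ids.
val flightEdges: RDD[Edge[String]] = flightsFromTo.map { row =>
  val origin = row(0).toString
  val dest   = row(1).toString
  Edge(MurmurHash3.stringHash(origin).toLong, MurmurHash3.stringHash(dest).toLong, s"$origin -> $dest")
}

val graph: Graph[String, String] = Graph(airportVertices, flightEdges)
println(graph.numVertices)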
I've been trying to learn/use Scala for machine learning, and to do that I need to convert string variables to an index of dummies.
The way I've done it is with StringIndexer in Scala. Before running it I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or checking? I used printSchema to filter on the string columns only, to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you would train the model with one part of the data and then test it with another part. Hence, there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a DataFrame; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the data used for training and gives a DataFrame as output. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct DataFrame is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
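A minimal sketch of the corrected ending, assuming newDf1, cat_cols and categorical are defined exactly as in the question:

import org.apache.spark.sql.functions.col

val indexerModel = categorical.fit(newDf1)         // train the pipeline of StringIndexers
val cat_reweight = indexerModel.transform(newDf1)  // actually produce the *_index columns

cat_reweight.select(cat_cols.map(c => col(s"${c}_index")): _*).show(5)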
It works for me when running locally. However, if you still get an error, you could try to cache() after replacing the nulls and then perform an action, to make sure all the transformations are done.
Hope it helps!
I am working with Spark 2.0 and Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called on my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder); you then call toDF on the DatasetHolder.
Yes, you should import the sqlContext implicits like this:
val sqlContext = //create sqlContext
import sqlContext.implicits._
val df = rdd.toDF()
before you call toDF on your RDDs.
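In Spark 2.x the same implicits hang off the SparkSession instead of a SQLContext; a sketch of the equivalent setup, reusing the log-file path from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._   // brings rddToDatasetHolder (and hence toDF) into scope

val rdd = spark.sparkContext.textFile("/pathtologfile/logfile.txt")
val df  = rdd.toDF()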
Yes, I finally found peace of mind on this issue; it was troubling me like hell, and this post is a life saver. I was trying to generically load data from log files into a case class object, collecting it in a mutable List, with the idea of finally converting the list into a DataFrame. However, since it was mutable and Spark 2.1.1 had changed the toDF implementation, for whatever reason the list was not getting converted. I had even thought of saving the data to a file and loading it back using .read, but five minutes ago this post saved my day.
I did it the exact same way as described.
After loading the data into the mutable list, I immediately used:
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()
I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the r => r(0).asInstanceOf[YOUR_TYPE] mapping.
P.S. Due to automatic conversion you can skip the .rdd part.
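For example, a typed version of the mapping above, assuming YOUR_COLUMN_NAME holds strings (keeping the .rdd for clarity):

val values: List[String] =
  dataFrame.select("YOUR_COLUMN_NAME")
    .rdd
    .map(r => r(0).asInstanceOf[String])
    .collect()
    .toList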
With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert values of a specific column to a List.
Common code for all the approaches:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens here? We are collecting the data to the Driver with collect() and picking element zero from each record.
This may not be the best way of doing it; let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
// res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than putting it all on a single Driver.
I know rdd.map(r => r(0)) does not seem elegant to you. So let's address that in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
// res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues on a DataFrame. So we end up using r => r.getString(0); this may be addressed in future versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are effective; finally, the 3rd one is both effective and elegant (I'd think).
Databricks notebook
I know the answer given and asked for is assumed to be for Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the answer above, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. given a DataFrame containing a column named "Raw",
to get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s).as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
It works perfectly.
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given any solution in Java (a real programming language), there it is.
You can thank me later!
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts it to a list.
Beware of using the list on a huge data set; it will decrease performance.
It is good for checking the data.
Below is for Python:
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is a Java answer:
df.select("id").collectAsList();