How to read a parquet file with lots of columns to a Dataset without a custom case class? - scala

I want to use datasets instead of dataframes.
I'm reading a parquet file and want to infer the types directly:
val df: Dataset[Row] = spark.read.parquet(path)
I don't want a Dataset[Row] but a typed Dataset.
I know I can do something like:
val df = spark.read.parquet(path).as[MyCaseClass]
But my data has many columns, so it would be great if I could avoid writing a case class!

Why do you want to work with a Dataset? I suspect it's not just for the schema (which you get with the resulting DataFrame anyway) but because you want a type-safe schema.
You need an Encoder for your Dataset, and to have one you need a type that represents your data and hence its schema.
Either select your columns down to a reasonable number and use as[MyCaseClass], or accept what DataFrame offers.
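For illustration, a minimal sketch of the "select a reasonable number of columns" route; the column names (id, name, score) are hypothetical and only stand in for the columns you actually need:
// A minimal sketch, assuming three hypothetical columns that match the case class fields.
case class MyCaseClass(id: Long, name: String, score: Double)

import spark.implicits._  // brings in the Encoder for the case class

val ds = spark.read.parquet(path)
  .select("id", "name", "score")  // narrow to the columns the case class covers
  .as[MyCaseClass]
If you need every column typed, there is no way around spelling out the full type, because the Encoder has to know it.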

Related

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

I am using Spark to read multiple parquet files into a single RDD, using standard wildcard path conventions. In other words, I'm doing something like this:
val myRdd = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
However, sometimes these Parquet files will have different schemas. When I'm doing my transforms on the RDD, I can try to differentiate between them in the map functions by looking for the existence (or absence) of certain columns. However, a surefire way to know which schema a given row in the RDD uses - and the way I'm asking about specifically here - is to know which file path I'm looking at.
Is there any way, on an RDD level, to tell which specific parquet file the current row came from? So imagine my code looks something like this, currently (this is a simplified example):
val mapFunction = new MapFunction[Row, (String, Row)] {
  override def call(row: Row): (String, Row) = myJob.transform(row)
}
val pairRdd = myRdd.map(mapFunction, Encoders.kryo[(String, Row)])
Within the myJob.transform( ) code, I'm decorating the result with other values, converting it to a pair RDD, and doing some other transforms as well.
I make use of the row.getAs( ... ) method to look up particular column values, and that's a really useful method. I'm wondering if there are any similar methods (e.g. row.getInputFile( ) or something like that) to get the name of the specific file that I'm currently operating on?
Since I'm passing in wildcards to read multiple parquet files into a single RDD, I don't have any insight into which file I'm operating on. If nothing else, I'd love a way to decorate the RDD rows with the input file name. Is this possible?
You can add a new column containing the file name with the input_file_name() function, as shown below:
import org.apache.spark.sql.functions._
val myDF = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet").withColumn("inputFile", input_file_name())
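A possible follow-up (a sketch, not specific to your job; the "schema-a" path fragment is just an example): once the column exists you can branch on the path at the DataFrame level, or carry it with every row when dropping down to the RDD level.
import spark.implicits._

// branch on the originating file while still at the DataFrame level
val schemaA = myDF.filter($"inputFile".contains("schema-a"))

// or keep (filePath, row) pairs when moving to the RDD level
val pairRdd = myDF.rdd.map(row => (row.getAs[String]("inputFile"), row))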

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create several different dataframes from it while reading the source only once.
For example, one dataframe has two columns, ID and TYPE, and I want to create two dataframes: one with the rows where type = A and another with the rows where type = B.
I've checked other posts on Stack Overflow, but only found the option of reading the dataframe twice.
However, I would like a solution with the best possible performance.
Kind regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to avoid this is to use cache(). In this way, the data is saved to memory after the first action, which makes subsequent actions on the data faster.
Your two dataframes can be created in this way, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/Dataframe with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, some of which expect mllib vectors and some of which expect ml vectors. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the dataset in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and a usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
This assumes df is the original dataframe and that the column holding the mllib vector is named mllibVector.
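If you would rather use MLUtils.convertVectorColumnsToML (as in the question title), a hedged alternative sketch is to copy the column first and convert only the copy, so both vector types stay in the dataframe; the names below assume the same df and mllibVector as above.
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col

val withCopy = df.withColumn("mlVector", col("mllibVector"))      // duplicate the mllib column
val df2 = MLUtils.convertVectorColumnsToML(withCopy, "mlVector")  // convert only the copy, in place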

Spark's toDS vs toDF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However, there also exists rdd.toDF. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I find that almost any operation takes me back to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something brought me back to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example:
spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer, String, Double)] { case (row, id) => ... }
  .toDS // now with a Dataset API (should I use toDF here?)
  .withColumnRenamed("_1", "id")   // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
Michael Armbrust nicely explained the shift to Datasets and DataFrames and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were converged into one, with one slight difference:
"A DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer."

How can I resolve table names to Parquet on the fly?

I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logical plan of your DataFrame with:
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical
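Building on that, here is a rough sketch of the kind of resolution rule the question is aiming at, written against the Spark 1.6-era Catalyst API; resolveToPath is a hypothetical function of your own that maps a table name to a Parquet path, and how to register the rule with the Analyzer is left out because it varies by Spark version.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// resolveToPath: your own mapping from table name to Parquet path (hypothetical)
class ResolveParquetTables(sqlContext: SQLContext, resolveToPath: String => String)
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case UnresolvedRelation(tableId, _) =>
      // replace the unresolved name with the logical plan of the Parquet data it maps to
      sqlContext.read.parquet(resolveToPath(tableId.table)).queryExecution.logical
  }
}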