Is there any way to get the schema from the parquet files being queried? - scala

I have parquet files separated into folders by date, something like:
root_folder
|_date=20210101
|_ file_A.parquet
|_date=20210102
|_ file_B.parquet
file_A has 2 columns (X, Y); file_B has 3 columns (X, Y, Z).
But when I query the date 20210102 using SparkSession, it uses the schema from the topmost folder, 20210101, so when I try to query column Z it doesn't exist.
I've tried the mergeSchema=true option, but it doesn't fit my use case because I need to treat the files with column Z differently, and I'm checking whether column Z exists using DataFrame.columns.
Is there any workaround for this? I need to get the schema only from the files I query.

If computational cost is not a concern, you can solve this problem by reading the entire dataset into Spark, filtering to the date you are looking for, and then dropping the column if it is entirely null.
This performs a pass over the data just to figure out whether the column should be dropped, which is not great. Luckily, .where and .count parallelize pretty well, so if you have enough compute it might be okay.
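import org.apache.spark.sql.functions.col

val base = spark.read
  .option("mergeSchema", true)
  .parquet("root_folder/")
  .where(col("date") === "20210101")
// Drop Z only when no row for the selected date has a non-null value in it
val df = if (base.where(col("Z").isNotNull).count == 0) base.drop("Z") else base
df.schema // Should only have X, Y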
If you want to generalize this into a function that drops all empty columns, you can compute the .isNotNull count for all columns in 1 pass.
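A minimal sketch of that generalization, assuming Spark 2.x or later is on the classpath. The names EmptyColumns and dropEmptyColumns are hypothetical and chosen only for illustration. It relies on count(column) counting only non-null values, so a single aggregation yields the non-null count for every column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Hypothetical helper object, not part of the original answer.
object EmptyColumns {
  def dropEmptyColumns(df: DataFrame): DataFrame = {
    // One aggregation over the data: count(c) ignores nulls, so each
    // value in the resulting row is the number of non-null entries in c.
    // Backticks keep column names containing dots from being parsed as nested fields.
    val nonNullCounts = df
      .select(df.columns.map(c => count(col(s"`$c`"))): _*)
      .head()

    // Columns whose non-null count is zero are entirely null.
    val emptyColumns = df.columns.zipWithIndex.collect {
      case (name, i) if nonNullCounts.getLong(i) == 0L => name
    }

    df.drop(emptyColumns: _*)
  }
}
```

With the question's layout, EmptyColumns.dropEmptyColumns(base) would be called on the merged-schema DataFrame after filtering by date. Note that on a DataFrame with no rows every count is zero, so this sketch would drop every column.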

Related

Removing duplicated columns in a dataframe

I know this seems like a really simple question, and I have scoured Google and Stack Overflow for it, but could not find exactly what I need.
I have aggregated some data from one dataframe, config, into another, config1, with the following code. The basis of the code was provided by another Stack Overflow member. Thank you, @Sunny Shukla.
exprs=map(lambda c: max(c).alias(c), config.columns)
config1=config.groupBy(["seq_id","tool_id"])\
.agg(f.count(f.lit(1)).alias('count'),
*exprs).where('count = 1').drop('count')
The config dataframe has 20 columns and the config1 dataframe has 22 columns, because I grouped by 2 columns, seq_id and tool_id, but also mapped all of the original columns to retain the original column names (I'm sure there is a more elegant way to do this).
The resulting dataframe config1 therefore has duplicated seq_id and tool_id columns. If I do config1.drop('seq_id','tool_id'), it drops 4 columns and I end up with 18 columns instead of 20.
Is there a more elegant way to do this without writing UDFs?
Thank You

Scala - How to append a column to a DataFrame preserving the original column name?

I have a basic DataFrame containing all the data, and several derivative DataFrames that I've subsequently created from the basic one by grouping, joining, etc.
Every time I want to append a column to the last DataFrame containing the most relevant data I have to do something like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date")
As you may see I have to change the original column name to new_date_
But I want the column name to remain the same.
However if I don't change the name the column gets dropped. So renaming is just a not too pretty workaround.
How can I preserve the original column name when appending the column?
As far as I know, you cannot create two columns with the same name in a DataFrame transformation. I rename the new column to the older one's name, like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date").withColumnRenamed("new_date_", "new_date")
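An alternative worth checking, assuming Spark 2.x or later: the Dataset API documents withColumn as adding a column or replacing an existing column that has the same name, so passing the original name may make both the temporary name and the drop unnecessary. The sketch below reuses the expression from the question; the names SameNameColumn and normalizeDate are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

// Hypothetical helper object, for illustration only.
object SameNameColumn {
  // Overwrites "new_date" in place: because the name already exists,
  // withColumn replaces the column instead of appending a second one.
  def normalizeDate(df: DataFrame): DataFrame =
    df.withColumn(
      "new_date",
      to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC")
        .cast(StringType)
    )
}
```

Calling SameNameColumn.normalizeDate(olderDF) should return a DataFrame with the same column names as olderDF. Do not follow it with .drop($"new_date"), since that would remove the freshly computed column.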

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this data set (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, and some of them expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append it as a new column to the data set in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but was not able to get it working. I am trying to avoid creating a new data set and then doing an inner join and dropping the columns, as the data set will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
import org.apache.spark.sql.functions.udf

val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.

Is it inefficient to manually iterate Spark SQL data frames and create column values?

In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intense calculations that require keeping moving averages and recording information as you go through each row (and updating it along the way). I've done a mock run-through with a simple Python script and it works, and I am now looking to translate it to a Scala Spark script that can be run on a larger data set.
The issue is it seems that for these to be highly efficient, using Spark SQL, it is preferred to use the built in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems to be a very thought-intensive process, so I'm wondering what the downsides will be if I just manually create the new column values by iterating through each row, keeping track of variables and inserting the column value at the end.
You can convert an RDD into a DataFrame. Then use map on the DataFrame and process each row as you wish. If you need to add a new column, you can use withColumn. However, this adds only one column at a time, and it is applied to the entire DataFrame. If you want to add more columns, then inside the map method:
a. Gather the new values based on your calculations.
b. Add these new column values to the row, as below:
val newColumns: Seq[Any] = Seq(newcol1, newcol2)
Row.fromSeq(row.toSeq ++ newColumns)
Here, row is the reference to the current row in the map method.
c. Create a schema for the new columns, as below:
val newColumnsStructType = StructType(Seq(StructField("newColName1", IntegerType), StructField("newColName2", IntegerType)))
d. Add it to the old schema:
val newSchema = StructType(mainDataFrame.schema ++ newColumnsStructType)
e. Create the new DataFrame with the new columns:
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
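Steps a to e can be assembled into one function, sketched here under the assumption of Spark 2.x or later (it uses df.sparkSession rather than sqlContext). The names AppendColumns and withComputedColumns are hypothetical, and the two "calculations" are placeholders that only stand in for the real per-row logic.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical helper object, for illustration only.
object AppendColumns {
  def withComputedColumns(df: DataFrame): DataFrame = {
    // Steps a and b: compute the new values per row and append them.
    val newRDD = df.rdd.map { row =>
      val newcol1: Int = row.length     // placeholder calculation
      val newcol2: Int = row.length * 2 // placeholder calculation
      Row.fromSeq(row.toSeq ++ Seq(newcol1, newcol2))
    }

    // Steps c and d: extend the existing schema with the new fields.
    val newSchema = StructType(df.schema.fields ++ Seq(
      StructField("newColName1", IntegerType, nullable = false),
      StructField("newColName2", IntegerType, nullable = false)
    ))

    // Step e: rebuild a DataFrame from the extended rows and schema.
    df.sparkSession.createDataFrame(newRDD, newSchema)
  }
}
```

Because map processes rows independently within each partition, state such as a running average is not carried across partitions in this sketch; logic that depends on row order would need the data ordered and partitioned accordingly.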

Append a column to Data Frame in Apache Spark 1.3

Is it possible to add a column to a Data Frame, and what would be the neatest, most efficient method?
More specifically, the column may serve as row IDs for the existing Data Frame.
In a simplified case, reading from file and not tokenizing it, I can think of something as below (in Scala), but it completes with errors (at line 3), and anyways doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question, and it seems that some other people would like to get an answer as well. Below is what I found.
The original task was to append a column with row identifiers (basically, a sequence from 1 to numRows) to any given data frame, so that row order and presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).
  zipWithIndex().
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in the Spark API are withColumn and withColumnRenamed. According to the Scala docs, the former "Returns a new DataFrame by adding a column". In my opinion, this is a somewhat confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing data frame to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or from other data frames).
As it was commented above, the workaround solution may be to use a join - this would be pretty messy, although possible - attaching the unique keys like above with zipWithIndex to both data frames or columns might work. Although efficiency is ...
It's clear that appending a column to a data frame is not easy functionality to provide in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need a value that is not related to the existing columns of the data frame.
This is similar to @NehaM's answer, but simpler.
I took help from the answer above. However, I find it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a Tuple of (Row, Long) which contains each row and corresponding index. We can use it to create new Row according to our need.
val rdd = df.rdd.zipWithIndex()
.map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure ).show
I hope this will be helpful.
You can use row_number with a Window function, as below, to get a distinct ID for each row in a DataFrame.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same purpose:
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.
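The two approaches above behave differently, which the following sketch tries to make explicit. It assumes Spark 2.x or later; the names RowIds, withConsecutiveId and withUniqueId are hypothetical. row_number over a window with no partitioning yields consecutive IDs starting at 1, but Spark has to bring all rows into a single partition to compute it. monotonically_increasing_id avoids that shuffle, but its IDs are only guaranteed to be unique and increasing, not consecutive.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Hypothetical helper object, for illustration only.
object RowIds {
  // Consecutive IDs 1..N in the order of orderColumn.
  // The unpartitioned window forces all rows through one partition.
  def withConsecutiveId(df: DataFrame, orderColumn: String): DataFrame =
    df.withColumn("ID", row_number().over(Window.orderBy(col(orderColumn))))

  // Unique, increasing IDs without a global sort; gaps between
  // partitions are expected, so these are not row numbers.
  def withUniqueId(df: DataFrame): DataFrame =
    df.withColumn("ID", monotonically_increasing_id())
}
```

For small or moderately sized data, withConsecutiveId is usually the simpler choice; for large data where only uniqueness matters, withUniqueId is likely cheaper.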