I have a schema created manually for creating a dataframe say myschema
Now my dataframe say df is created.
Now, I did some operations on df and some of the columns were dropped.
say original myschema consists of 500 columns
Now after dropping some columns, my df consist of 450 columns.
Now somewhere in my code I need the schema back but only the schema after dataframe has applied some operations(ie. having 450 columns).
Now ,
Q1. How optimum is calling df.schema and using it, Is it action or transformation?
Q2. Should I create another myschema2 by filtering out those columns from myschema which will be dropped and use that?
Quick answers:
to Q1: schema is neither an action neither a transformation, in the sense that it doesn't modify the data frame and doesn't trigger any computation.
to Q2:
if I understand well, I guess you have something like this
val myschema = StructType(someSchema)
val df = spark.createDataFrame(someData, myschema)
// do some transformation (drop, add columns etc)
val df2 = df.drop("column1", "column2").withColumn("new", $"c1" + $"c2"))
and you want get the schema of df2. if it is so you can just use
val myschema2 = df2.schema
Long Answer:
Informally speaking, DataFrame is an abstraction over distributed datasets, and as you already pointed out, there are transformations and actions defined on them.
When you makes some transformation on data frames, what happens under the hood is that spark just builds a Directed Acyclic Graph describing that transformations. When
That DAG is analyzed and used to build an execution plan to get the work done
Actions on the other hand trigger the execution of the plan, which is transforming the actual data.
Schema of a transformed data frame is derived from the schema of the initial data frame basically walking along the DAG. The impact of such derivations is _neglectable, it doesn't depend on the size of the data, it depends on how much big is the DAG, but in all practical cases, you can ignore the time required to get schema. Schema is just metadata attached to a dataframe.
So to respond to Q2: No you should not have schema2 taking track of the modification you. Just by calling df.schema Spark will do that for you
hope this clears your doubts
Related
I could not find any discussion on below topic in any forum I searched in internet. It may be because I am new to Spark and Scala and I am not asking a valid question. If there are any existing threads discussing the same or similar topic, the links will be very helpful. :)
I am working on a process which uses Spark and Scala and creates a file by reading a lot of tables and deriving a lot of fields by applying logic to the data fetched from tables. So, the structure of my code is like this:
val driver_sql = "SELECT ...";
var df_res = spark.sql(driver_sql)
var df_res = df_res.withColumn("Col1", <logic>)
var df_res = df_res.withColumn("Col2", <logic>)
var df_res = df_res.withColumn("Col3", <logic>)
.
.
.
var df_res = df_res.withColumn("Col20", <logic>)
Basically, there is a driver query which creates the "driver" dataframe. After that, separate logic (functions) is executed based on a key or keys in the driver dataframe to add new columns/fields. The "logic" part is not always a one-line code, sometimes, it is a separate function which runs another query and does some kind of join on df_res and adds a new column. Record count also changes since I use “inner” join with other tables/dataframes in some cases.
So, here are my questions:
Should I persist df_res at any point in time?
Can I persist df_res again and again after columns are added? I mean, does it add value?
If I persist df_res (disk only) every time a new column is added, is the data in the disk replaced? Or does it create a new copy/version of df_res in the disk?
Is there is a better technique to persist/cache data in a scenario like this (to avoid doing a lot of stuff in memory)?
The first thing is persisting a dataframe helps when you are going to apply iterative operations on dataframe.
What you are doing here is applying transformation operation on your dataframes. There is no need to persist these dataframes here.
For eg:- Persisting would be helpful if you are doing something like this.
val df = spark.sql("select * from ...").persist
df.count
val df1 = df.select("..").withColumn("xyz",udf(..))
df1.count
val df2 = df.select("..").withColumn("abc",udf2(..))
df2.count
Now, if you persist df here then it would be beneficial in calculating df1 and df2.
One more thing to notice here is, the reason why I did df.count is because dataframe is persisted only when an action is applied on it. From Spark docs:
"The first time it is computed in an action, it will be kept in memory on the nodes". And this answers your second question as well.
Every time you persist a new copy will be created but you should unpersist the prev one first.
I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframe and only want to read 1 time.
For example one dataframe have two columns ID and TYPE, and I want to create two dataframe one with the value of type = A and other with type value = B.
I've checked another posts on stackoverflow, but found only the option to read the dataframe 2 times.
However, I would like another solution with the best performance possible.
Kinds regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to aviod this is to use cache(). In this way, the data will be saved to memory after the first action, which will make subsequent actions using the data faster.
Your two dataframes can be created in this way, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.
I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track, what you want to do is to obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, spark can load an entire directory at once) you can then load that file(or files) all at once into spark to do your processing. I would also check to see it the webapi as any endpoints to fetch all the data you need instead of just one record at a time.
In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intense calculations that involves keeping moving averages and recording information as you go through each row (and updating it meanwhile). I've done a mock through with a simple Python script and it works, and I am currently looking to translate it to a Scala Spark script that could be run on a larger data set.
The issue is it seems that for these to be highly efficient, using Spark SQL, it is preferred to use the built in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems to be a very thought-intensive process, so I'm wondering what the downsides will be if I just manually create the new column values by iterating through each row, keeping track of variables and inserting the column value at the end.
You can convert an rdd into dataframe. Then use map on the data frame and process each row as you wish. If you need to add new column, then you can use, withColumn. However this will only allow one column to be added and it happens for the entire dataframe. If you want more columns to be added, then inside map method,
a. you can gather new values based on the calculations
b. Add these new column values to main rdd as below
val newColumns: Seq[Any] = Seq(newcol1,newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here row, is the reference of row in map method
c. Create new schema as below
val newColumnsStructType = StructType{Seq(new StructField("newcolName1",IntegerType),new StructField("newColName2", IntegerType))
d. Add to the old schema
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
Is it possible and what would be the most efficient neat method to add a column to Data Frame?
More specifically, column may serve as Row IDs for the existing Data Frame.
In a simplified case, reading from file and not tokenizing it, I can think of something as below (in Scala), but it completes with errors (at line 3), and anyways doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to DataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identificators (basically, a sequence 1 to numRows) to any given data frame, so the rows order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sqlContext.textFile(file).
zipWithIndex().
map(case(d, i)=>i.toString + delimiter + d).
map(_.split(delimiter)).
map(s=>Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in Spark API are withColumn and withColumnRenamed. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing dataframe to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or other data frames).
As it was commented above, the workaround solution may be to use a join - this would be pretty messy, although possible - attaching the unique keys like above with zipWithIndex to both data frames or columns might work. Although efficiency is ...
It's clear that appending a column to the data frame is not an easy functionality for distributed environment and there may not be very efficient, neat method for that at all. But I think that it's still very important to have this core functionality available, even with performance warnings.
not sure if it works in spark 1.3 but in spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need to use a value that is not related to existing columns of the dataframe
This is similar to #NehaM's answer but simpler
I took help from above answer. However, I find it incomplete if we want to change a DataFrame and current APIs are little different in Spark 1.6.
zipWithIndex() returns a Tuple of (Row, Long) which contains each row and corresponding index. We can use it to create new Row according to our need.
val rdd = df.rdd.zipWithIndex()
.map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure ).show
I hope this will be helpful.
You can use row_number with Window function as below to get the distinct id for each rows in a dataframe.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same as
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.