I came across the following expression, and I know what it means - department("name"). I am curious to know what it actually resolves to. Please share your inputs.
department("name") - it is used to refer to the column named "name". I hope that is correct? But what does it resolve to? It looks like an auxiliary constructor.
From https://spark.apache.org/docs/2.4.5/api/java/index.html?org/apache/spark/sql/DataFrameWriter.html,
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
department("name") is just syntactic sugar for calling apply function:
department.apply("name") which returns Column
from Spark API, Dataset object:
/**
 * Selects column based on the column name and returns it as a [[Column]].
 *
 * @note The column name can also reference to a nested column like `a.b`.
 *
 * @group untypedrel
 * @since 2.0.0
 */
def apply(colName: String): Column = col(colName)
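A quick sketch of that equivalence, assuming the people/department DataFrames from the snippet above:
val c1 = department("name")        // syntactic sugar for apply
val c2 = department.apply("name")  // the explicit call
val c3 = department.col("name")    // what apply delegates to
// All three are ordinary Column values and can be used interchangeably, e.g.:
people.join(department, people("deptId") === department("id")).select(c1).show()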
I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date.
I tried different methods, but I cannot pass the column into the function, and expr fails with an error:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column named type; based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()
In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check the Spark source code of the colRegex method, it expects the regex to be passed in the format below:
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
Backticks (`) are necessary to enclose your regex; otherwise the above patterns will not match your input.
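For example, against the df from the question (a rough sketch; the exact failure message differs between Spark versions):
df.select(df.colRegex("`.*_(in|out)`")).columns  // Array(a_in, a_out, b_in, b_out) - backticks, so treated as a regex
df.select(df.colRegex(".*_(in|out)")).columns    // no backticks, so treated as a literal column name and fails to resolve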
As a workaround, you can keep only the columns you want by computing the ones to drop, as shown below:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val columnsToDrop = df_in.columns.filter(p => p.matches(".*_(in|out)$")).toSeq // collect the leftover *_in / *_out columns
val final_df_in = df_in.drop(columnsToDrop: _*) // drop every column that matches the pattern
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
  .select(df.colRegex("`^(?!.*_(in|out)).*$`")) // regex with negative lookahead
I saw some code in the book "Spark: The Definitive Guide" that invokes drop on a dataframe with no parameter. When I use show(), I find nothing changed, so what is the meaning of it?
When I execute it, nothing changes; dfNoNull.show() is the same as dfWithDate.show().
dfWithDate.createOrReplaceTempView("dfWithDate")
// in Scala
val dfNoNull = dfWithDate.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
Does it mean it creates a new dataframe?
I know that when a dataframe is joined with itself using Hive SQL, if I just write:
val df1=spark.sql("select id,date from date")
val df2=spark.sql("select id,date from date")
val joinedDf = spark.sql("select dateid1,dateid2 from sales")
  .join(df1, df1("id") === $"dateid1").join(df2, df2("id") === $"dateid2")
Then an error occurs: Cartesian join!
because lazy evaluation considers df1 and df2 to be the same dataframe.
So here, if I write:
val df2=df1.drop()
will that prevent the error?
If not, what does the drop method with no parameter mean?
Or does it just mean removing the temp view name and creating a new one?
But when I try the code below, no exception is thrown:
val df= Seq((1,"a")).toDF("id","name")
df.createOrReplaceTempView("df1")
val df2=df.drop()
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1").show()
Or does the book mean the following?
val dfNoNull = dfWithDate.na.drop()
because it says somewhere below the code:
Grouping sets depend on null values for aggregation levels. If you do not filter-out null values, you will get incorrect results. This applies to cubes, rollups, and grouping sets.
The drop function with no parameter behaves the same as drop with a column name that doesn't exist in the DataFrame.
You can follow the code in the Spark source.
Even in the function's documentation you can see a hint of this behavior:
/**
 * Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain
 * column name.
 *
 * This method can only be used to drop top level columns. the colName string is treated
 * literally without further interpretation.
 *
 * @group untypedrel
 * @since 2.0.0
 */
So when calling the function with no parameters, a no-op occurs and nothing changes in the returned DataFrame.
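A small sketch to illustrate the difference (assuming a spark-shell session with the usual implicits; the data is made up):
val df = Seq((1, "a"), (2, null)).toDF("id", "name")
df.drop().columns          // Array(id, name) - zero arguments, nothing is dropped
df.drop("no_such_column")  // also a no-op: the column is not in the schema
df.na.drop().count()       // 1 - na.drop() removes rows containing nulls, which is likely what the book intends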
I'm trying to add a column to a dataframe which will contain the hash of another column.
I've found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
But what hash function is used by that hash()? Is it Murmur, SHA, MD5, or something else?
The value I get in this column is an integer, so the range of values here is probably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?
It is Murmur3, based on the source code:
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
If you want a Long hash, in Spark 3 there is the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
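A minimal usage sketch, assuming Spark 3.x and the df/my_column from the question:
import org.apache.spark.sql.functions.{col, xxhash64}
// xxhash64 returns the 64-bit hash as a LongType column
val withLongHash = df.withColumn("hashed_long", xxhash64(col("my_column")))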
You may want only positive numbers. In this case you can take the hash and add Int.MaxValue, as in:
import org.apache.spark.sql.types.LongType
df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()
I have been searching for a while but haven't found how to do it.
I have a dataframe that references a table, and one of the columns contains a string.
Dataframe schema: name string, lastname string, interests string
I have a list of interests like so:
val sports: List[String] = List("football", "basketball", "soccer")
I want to filter all the people in my dataframe whose interests contain one of the sports above.
val peopledata = sqlContext.sql("select * from learning.people")
I have tried to do it like this:
for (sport <- sports) peopledata.filter(peopledata("interests").contains(sport))
but I asked a pro at the company I work for, and he told me there is a better and prettier way to do it.
Execute the collect() function to get an Array[Row] of results and filter the elements of this array with sports.contains():
peopledata.collect().filter(row => sports contains row.getString(2))
2 here is the index of the interests field in your schema.
Using string interpolation will solve your problem:
val interest = sports.mkString("('", "','", "')")
val peopledata = sqlContext.sql(s"select * from learning.people where interests in $interest")
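Another possibility (my own sketch, not from the answers above) is to stay in the DataFrame API and OR together one contains condition per sport, which keeps the "interests contain one of the sports" semantics:
import org.apache.spark.sql.functions.col
// Build a single boolean Column: interests contains "football" OR "basketball" OR "soccer".
val anySport = sports.map(s => col("interests").contains(s)).reduce(_ || _)
peopledata.filter(anySport).show()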