I'm new to Databricks & Spark/Scala.
I'm currently working on a machine learning project to do sales forecasting.
I used the function dayofyear to create features.
The only problem is that it returns null values.
I tried with this CSV because I was using another one and I thought the problem might come from that file.
But apparently, I was wrong.
I read the docs about this function but the description is really short.
I tried with dayofmonth or weekofyear, same result.
Can you explain how I can fix this? What am I doing wrong?
val path = "dbfs:/databricks-datasets/asa/planes/plane-data.csv"
val df = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
display(df)
import org.apache.spark.sql.functions._
val df2 = df.withColumn("dateofyear", dayofyear(df("issue_date")))
display(df2)
Here's the result: the new dateofyear column contains only null values.
You can cast issue_date to a timestamp before using the dayofyear function, like this:
import org.apache.spark.sql.types.TimestampType

df.withColumn("issue_date", unix_timestamp($"issue_date", "MM/dd/yyyy").cast(TimestampType))
  .withColumn("dayofyear", dayofyear($"issue_date"))
Hope this helps!
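If you're on Spark 2.2 or later, to_date with an explicit format string is a slightly shorter alternative. A minimal sketch, assuming issue_date really is in MM/dd/yyyy form (dayofyear accepts date columns as well as timestamps):
import org.apache.spark.sql.functions._

// Parse the string directly into a DateType column, then extract the day of year.
val df3 = df
  .withColumn("issue_date_parsed", to_date(df("issue_date"), "MM/dd/yyyy"))
  .withColumn("dayofyear", dayofyear(col("issue_date_parsed")))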
I keep getting an error that I pass too many arguments, and I'm not sure why, as I am following the exact examples from:
command-629173529675356:9: error: too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
My code:
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Thank you for the help!
dayofweek is the function you're looking for, so something like this:
import org.apache.spark.sql.functions.dayofweek
date_format.withColumn("day_of_week", dayofweek(col("date")))
You get this error because you named your first DataFrame date_format, which is the same name as the Spark built-in function you want to use. So when you call date_format, you're referencing your DataFrame instead of the built-in function, and a Dataset's apply method only takes a single column name, hence the "too many arguments" message.
To solve this, you should either rename your first dataframe:
val df_1 = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = df_1.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Or ensure that you're calling the right date_format by importing the functions object and then calling functions.date_format when extracting the day of week:
import org.apache.spark.sql.functions
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", functions.date_format(col("date"), "EEEE"))
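For reference, the two functions return different things: date_format(col("date"), "EEEE") gives the day name as a string (e.g. "Monday"), while dayofweek gives an integer from 1 (Sunday) to 7 (Saturday). A minimal sketch showing both side by side, reusing df_filter and its pickup_datetime column from the question (dayofweek needs Spark 2.3+):
import org.apache.spark.sql.functions.{col, date_format, dayofweek, to_date}

val withDate = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = withDate
  .withColumn("day_name", date_format(col("date"), "EEEE"))  // e.g. "Monday" (string)
  .withColumn("day_of_week", dayofweek(col("date")))         // 1 = Sunday ... 7 = Saturday (integer)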
I have a datetime stored in the following format: YYYYMMDDHHMMSS (data type: Long Int).
Sample data: ingestiontime values like 20200501230000.
This temp view, ingestionView, comes from a DataFrame.
Now I want to introduce a new column newingestiontime in the dataframe which is of the format YYYY-MM-DD HH:MM:SS.
One of the ways I have tried is the following, but it didn't work either:
val res = ingestiondatetimeDf.select(col("ingestiontime"), unix_timestamp(col("newingestiontime"), "yyyyMMddHHmmss").cast(TimestampType).as("timestamp"))
Please help me here, and if there is a better way to accomplish this, I will be delighted to learn something new.
Thanks in advance.
Use from_unixtime & unix_timestamp.
Check the code below.
scala> df
.withColumn(
"newingestiontime",
from_unixtime(
unix_timestamp($"ingestiontime".cast("string"),
"yyyyMMddHHmmss")
)
)
.show(false)
+--------------+-------------------+
|ingestiontime |newingestiontime |
+--------------+-------------------+
|20200501230000|2020-05-01 23:00:00|
+--------------+-------------------+
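Note that from_unixtime returns a formatted string, not a TimestampType column. If you need an actual timestamp (for date arithmetic, for example), a small variation is to parse with to_timestamp (Spark 2.2+); a sketch, assuming the same ingestiontime column:
import org.apache.spark.sql.functions.to_timestamp

// Cast the long to a string, then parse it straight into a TimestampType column.
val res = df.withColumn("newingestiontime",
  to_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss"))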
How can I add a file_name column to a DataFrame as the data is being loaded into it? I want the file_name to show for every record in the DataFrame.
I did some research on this, and found something that seems like it should work, but it actually doesn't load any file names, only the data in the files themselves.
import org.apache.spark.sql.functions._
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/corp/ABC*.gz")
df.withColumn("file_name", input_file_name)
What is wrong with my code here? Thanks.
The input_file_name function creates a string column for the file name of the current Spark task.
import org.apache.spark.sql.functions.input_file_name
val df= spark.read
.option("delimiter", "|")
.option("header", "false")
.csv("mnt/rawdata/2019/01/01/corp/")
.withColumn("file_name", input_file_name())
I am facing an issue when I am trying to find the number of months between two dates using the months_between function. When my input date format is dd/mm/yyyy or any other date format, the function returns the correct output. However, when I pass the input date format as yyyyMMdd, I get the error below.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below:
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
I get the expected output:
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
When I pass the data below, I face the error above.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use the to_date function to convert those numbers to dates.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
val dfWithDates = df
.withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
.withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))
I saw a solution here, but when I tried it, it didn't work for me.
First I import a cars.csv file:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/usr/local/spark/cars.csv")
Which looks like the following:
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
Then I do this:
df.na.fill("e",Seq("blank"))
But the null values didn't change.
Can anyone help me?
This is basically very simple. You'll need to create a new DataFrame. I'm using the DataFrame df that you have defined earlier.
val newDf = df.na.fill("e",Seq("blank"))
DataFrames are immutable structures.
Each time you perform a transformation that you need to keep, you have to assign the transformed DataFrame to a new value.
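For example, it's newDf (not the original df) that you display to see the effect of the fill:
// Display the transformed DataFrame; df itself is untouched.
newDf.show()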
You can achieve the same thing in Java this way:
Dataset<Row> filteredData = dataset.na().fill(0);
If the column was string type,
val newdf= df.na.fill("e",Seq("blank"))
would work.
Since it's a float type (as the image shows), you need to use
val newdf= df.na.fill(0.0, Seq("blank"))
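If you're unsure which overload applies, checking the schema first tells you whether the column is a string or a numeric type. A small sketch of that check, using the column name and fill values from above:
import org.apache.spark.sql.types.StringType

df.printSchema()  // e.g. "blank: double (nullable = true)"

// Pick the fill value based on the column's actual type.
val newdf =
  if (df.schema("blank").dataType == StringType) df.na.fill("e", Seq("blank"))
  else df.na.fill(0.0, Seq("blank"))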