I am trying to reproduce in Spark SQL the same behavior as GROUP BY in plain SQL.
Here is an example of what I can do in SQL but have not managed to do with Spark SQL's native functions:
Input dataset:
val input = Seq(
("Warsaw", 2016, 2),
("Toronto", 2016, 4),
("Toronto", 2017, 1),
("Toronto", 2017, 1)).toDF("city", "year", "count")
Which results in:
+-------+----+-----+
|city |year|count|
+-------+----+-----+
|Warsaw |2016|2 |
|Toronto|2016|4 |
|Toronto|2017|1 |
|Toronto|2017|1 |
+-------+----+-----+
Then I register the DataFrame as a temporary view using:
input.createOrReplaceTempView("input")
Then, using SQL:
select city, year, count
from input
group by 1,2,3
Which gives
+-------+----+-----+
|city |year|count|
+-------+----+-----+
|Warsaw |2016|2 |
|Toronto|2016|4 |
|Toronto|2017|1 |
+-------+----+-----+
I would like to do the same with Spark SQL's native functions, and if possible NOT use dropDuplicates.
Thanks in advance for your help
You can use the window function row_number().
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val columns = input.columns.map(col(_))
input.withColumn("rn", row_number().over(Window.partitionBy(columns: _*).orderBy(columns: _*)))
  .where("rn = 1")
  .drop("rn")
  .show()
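Alternatively, if you want something closer to the SQL GROUP BY 1,2,3, a rough sketch is to group by every column with a throwaway aggregate (groupBy needs at least one aggregate; the _dup_count name below is only an illustration) and drop it afterwards:
input.groupBy(columns: _*)
  .agg(count(lit(1)).as("_dup_count"))   // dummy aggregate required by groupBy, dropped right after
  .drop("_dup_count")
  .show()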
I found some tips about converting a PySpark dataframe to R, but I need to perform the opposite task: convert an R dataframe to PySpark.
Does anyone know how to do it?
You can use the same approach as for other languages: register your dataframe as a temporary view with the createOrReplaceTempView function, and then use spark.sql from the other language to access its content.
For example, if the R side looks like the following:
%r
library(SparkR)
id <- c(rep(1, 3), rep(2, 3), 3)
desc <- c('New', 'New', 'Good', 'New', 'Good', 'Good', 'New')
df <- data.frame(id, desc)
df <- createDataFrame(df)
createOrReplaceTempView(df, "test_df")
head(df)
id desc
1 1 New
2 1 New
3 1 Good
4 2 New
5 2 Good
6 2 Good
then you can access these data from Python:
df = spark.sql("select * from test_df")
df.show()
+---+----+
| id|desc|
+---+----+
|1.0| New|
|1.0| New|
|1.0|Good|
|2.0| New|
|2.0|Good|
|2.0|Good|
|3.0| New|
+---+----+
I'm having difficulty joining these two dataframes because I am not able to modify specific column values in Spark Scala. I think I need to do some kind of transpose/join, but I can't figure it out.
Here is the first dataframe:
var sample_df = Seq(("john","morning","7am"),("john","night","10pm"),("bob","morning","8am"),("bob","night","11pm"),("phil","morning","9am"),("phil","night","10pm")).toDF("person","time_of_day","wake/sleep hour")
Here is the second dataframe:
var sample_df2 = Seq(("john","6am","11pm"),("bob","7am","2am"),("phil","8am","1am")).toDF("person","morning_earliest","night_latest")
and here is the resulting dataframe I'm looking to produce:
var resulting_df = Seq(("john","morning","7am","6am"),("john","night","10pm","11pm"),("bob","morning","8am","7am"),("bob","night","11pm","2am"),("phil","morning","9am","8am"),("phil","night","10pm","1am")).toDF("person","time_of_day","wake/sleep hour","earliest/latest")
Any help would be greatly appreciated! Thanks and have a great day!
sample_df.createOrReplaceTempView("df1")
sample_df2.createOrReplaceTempView("df2")
spark.sql("""
select person, time_of_day, `wake/sleep hour`, `earliest/latest`
from (
select person, stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)
from df2
) df
join df1
using (time_of_day, person)
""").show()
+------+-----------+---------------+---------------+
|person|time_of_day|wake/sleep hour|earliest/latest|
+------+-----------+---------------+---------------+
| john| morning| 7am| 6am|
| john| night| 10pm| 11pm|
| bob| morning| 8am| 7am|
| bob| night| 11pm| 2am|
| phil| morning| 9am| 8am|
| phil| night| 10pm| 1am|
+------+-----------+---------------+---------------+
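For reference, roughly the same unpivot can be written with the DataFrame API instead of a SQL string; this is only a sketch (selectExpr with stack, with illustrative names unpivoted and joined):
val unpivoted = sample_df2.selectExpr(
  "person",
  "stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)")
val joined = sample_df.join(unpivoted, Seq("person", "time_of_day"))
joined.show()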
import org.apache.spark.sql.functions._
import spark.implicits._

val df = sample_df
  .join(sample_df2, "person")

val resulting_df = df.withColumn("earliest/latest",
    when(col("time_of_day") === "morning", $"morning_earliest")
      .otherwise($"night_latest"))
  .drop($"morning_earliest")
  .drop($"night_latest")

resulting_df.show()
I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")), but I don't know how to extract the fields into the shape shown above.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
  F.regexp_extract($"value", r, 1).as("id"),
  F.regexp_extract($"value", r, 2).as("community")
).show()
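Note that regexp_extract returns strings; if you need numeric columns, the extracted values can be cast. A small sketch reusing the pattern r from above:
df.select(
  F.regexp_extract($"value", r, 1).cast("long").as("id"),
  F.regexp_extract($"value", r, 2).cast("int").as("community")
).show()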
A couple of regular expressions should give you the required result.
import org.apache.spark.sql.functions.{explode, regexp_extract, split}

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always in the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful
I need to create an equivalent of a business "current view" in PySpark. I have a history file and a delta file (each containing an id and a date). I need to create a final dataframe that has a single record for each id, and that record should carry the latest date.
df1=sql_context.createDataFrame([("3000", "2017-04-19"), ("5000", "2017-04-19"), ("9012", "2017-04-19")], ["id", "date"])
df2=sql_context.createDataFrame([("3000", "2017-04-18"), ("5120", "2017-04-18"), ("1012", "2017-04-18")], ["id", "date"])
df3=df2.union(df1).distinct()
+----+----------+
| id| date|
+----+----------+
|3000|2017-04-19|
|3000|2017-04-18|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+
I tried doing a union and then a distinct, but it gives me id=3000 for both dates, whereas I need only the record for id=3000 with date=2017-04-19.
Even subtract doesn't work, since it returns all the rows of either of the dataframes.
Desired output:-
+----+----------+
| id| date|
+----+----------+
|3000|2017-04-19|
|5120|2017-04-18|
|5000|2017-04-19|
|1012|2017-04-18|
|9012|2017-04-19|
+----+----------+
Hope this helps!
from pyspark.sql.functions import unix_timestamp, col, to_date, max
#sample data
df1 = sqlContext.createDataFrame([("3000", "2017-04-19"),
                                  ("5000", "2017-04-19"),
                                  ("9012", "2017-04-19")],
                                 ["id", "date"])
df2 = sqlContext.createDataFrame([("3000", "2017-04-18"),
                                  ("5120", "2017-04-18"),
                                  ("1012", "2017-04-18")],
                                 ["id", "date"])
df=df2.union(df1)
df.show()
#convert 'date' column to date type so that latest date can be fetched for an ID
df = df.\
    withColumn('date_inDateFormat', to_date(unix_timestamp(col('date'), "yyyy-MM-dd").cast("timestamp"))).\
    drop('date')
#get latest date for an ID
df = df.groupBy('id').agg(max('date_inDateFormat').alias('date'))
df.show()
Output is:
+----+----------+
| id| date|
+----+----------+
|5000|2017-04-19|
|1012|2017-04-18|
|5120|2017-04-18|
|9012|2017-04-19|
|3000|2017-04-19|
+----+----------+
Note: Please don't forget to let SO know if the answer helps you solve your problem.
I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns that are specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the comparison condition dynamically for the above scenario, because the lookup file is configurable and might contain anywhere from 1 to n column pairs.
You can use the except DataFrame method. For simplicity, I'm assuming that the columns to use are given in two lists. It's necessary that the order of both lists is correct: the columns at the same position in each list will be compared (regardless of column name). After except, use a join to get back the missing columns from the first dataframe.
import org.apache.spark.sql.functions.broadcast

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")

val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))

val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
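Since the lookup file is configurable, the two lists above can also be built at runtime instead of being hard-coded. A sketch, assuming a plain local CSV named cols_lookup.csv with the df1col,df2col header from the question (the file name and location are hypothetical):
import scala.io.Source

val pairs = Source.fromFile("cols_lookup.csv").getLines()
  .drop(1)                                    // skip the df1col,df2col header line
  .map(_.split(",").map(_.trim))
  .collect { case Array(c1, c2) => (c1, c2) }
  .toList

val df1Cols = pairs.map(_._1)
val df2Cols = pairs.map(_._2)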
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
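A rough sketch of the same normalize-the-names-then-diff idea with the DataFrame API (withColumnRenamed stands in for the SQL aliasing; df1Cols and df2Cols are the lists from the previous answer):
import org.apache.spark.sql.functions.col

// rename df2's columns to df1's names so both sides line up for the diff
val df2Renamed = df2Cols.zip(df1Cols).foldLeft(df2) {
  case (acc, (from, to)) => acc.withColumnRenamed(from, to)
}
val diff = df1.select(df1Cols.map(col): _*)
  .except(df2Renamed.select(df1Cols.map(col): _*))
// reselect to bring back the columns not used in the diff, e.g. age
val result = df1.join(diff, df1Cols)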