I'm having a problem trying to update my table. I receive a new date and a new state, and I also want to update the history of dates and states, dropping the oldest pair (date2, state2) if it is already filled. I need some kind of loop or function to do this properly instead of hard-coding every single column.
My main idea failed because I misunderstood the concept: I was trying to build an ordered list of dates to locate the right column, but I got stuck when I realised the state has to move along with the date.
I have run out of ideas and still haven't reached even a basic piece of code. I would appreciate any help.
Input
val historic = Seq(("Alice", "2022-01-02", "2", "2021-04-06", "3", "2020-01-01", "1"))
  .toDF("name", "currentDate", "currentState", "date1", "state1", "date2", "state2")
historic.show()
+-----+-----------+------------+----------+------+----------+------+
| name|currentDate|currentState| date1|state1| date2|state2|
+-----+-----------+------------+----------+------+----------+------+
|Alice| 2022-01-02| 2|2021-04-06| 3|2020-01-01| 1|
+-----+-----------+------------+----------+------+----------+------+
val newData = Seq(("Alice", "2022-02-02", "s1")).toDF("name", "date", "state")
newData.show()
+-----+----------+-----+
| name| date|state|
+-----+----------+-----+
|Alice|2022-02-02| s1|
+-----+----------+-----+
Desired output
val expected = Seq(("Alice", "2022-02-02", "s1", "2022-01-02", "2", "2021-04-06", "3"))
  .toDF("name", "currentDate", "currentState", "date1", "state1", "date2", "state2")
expected.show()
+-----+-----------+------------+----------+------+----------+------+
| name|currentDate|currentState| date1|state1| date2|state2|
+-----+-----------+------------+----------+------+----------+------+
|Alice| 2022-02-02| s1|2022-01-02| 2|2021-04-06| 3|
+-----+-----------+------------+----------+------+----------+------+
Thanks
This should do what you need:
val expected = historic.join(newData, Seq("name")).select(
'name,
'currentDate as "oldDate",
'currentState as "oldState",
'date as "newDate",
'state,
'state1 as "oldS1",
'date1 as "oldD1"
).select(
'name,
'newDate as "currentDate",
'state as "currentState",
'oldState as "state1",
'oldDate as "date1",
'oldS1 as "state2",
'oldD1 as "date2"
)
It joins the new data to the old on the name column (assuming name is unique), and then renames the columns to give you the desired structure.
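If you want to avoid hard-coding every column, as the question asks, here is a sketch of a more generic variant. It assumes the history columns follow the date1/state1, date2/state2, ... naming pattern and that maxHistory pairs are kept (maxHistory is an assumption, not something from the original table definition); every pair is filled from the previous one in a single select:
import org.apache.spark.sql.functions.col

val maxHistory = 2  // number of dateN/stateN pairs kept in the table (assumption)

// slot i is filled from slot i-1; slot 0 is the current pair
def source(i: Int): (String, String) =
  if (i == 1) ("currentDate", "currentState") else (s"date${i - 1}", s"state${i - 1}")

val shiftedCols =
  Seq(col("name"), col("date").as("currentDate"), col("state").as("currentState")) ++
    (1 to maxHistory).flatMap { i =>
      val (d, s) = source(i)
      Seq(col(d).as(s"date$i"), col(s).as(s"state$i"))
    }

val result = historic.join(newData, Seq("name")).select(shiftedCols: _*)
result.show()
Since everything goes through one select, you can change maxHistory without touching the rest of the code.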
I need to loop through a JSON file, flatten the results, and add a column to a DataFrame in each loop with the respective values. The end result will have around 2000 columns, so using withColumn to add the columns is extremely slow. Is there any other alternative to add columns to a DataFrame?
Sample Input json:
[
{
"ID": "12345",
"Timestamp": "20140101",
"Usefulness": "Yes",
"Code": [
{
"event1": "A",
"result": "1"
}
]
},
{
"ID": "1A35B",
"Timestamp": "20140102",
"Usefulness": "No",
"Code": [
{
"event1": "B",
"result": "1"
}
]
}
]
My output should be:
ID Timestamp Usefulness Code_event1 Code_result
12345 20140101 Yes A 1
1A35B 20140102 No B 1
The json file I am working on is huge and consists of many columns. So, withColumn is not feasible in my case.
EDIT:
Sample code:
# Data file
df_data = spark.read.json(file_path)
# Schema file
with open(schemapath) as fh:
    jsonschema = json.load(fh, object_pairs_hook=OrderedDict)
I am looping through the schema file, and in the loop I am accessing the data for a particular key from the data DF (df_data). I am doing this because my data file has multiple records, so I can't loop through the data JSON file itself or it would loop through every record.
def func_structs(json_file):
    for index, (k, v) in enumerate(json_file.items()):
        if isinstance(v, dict):
            srccol = k
            func_structs(v)
        elif isinstance(v, list):
            srccol = k
            func_lists(v)  # separate function to loop through list elements to find nested elements
        else:
            try:
                df_data = df_data.withColumn(srccol, df_data[srccol])
            except:
                df_data = df_data.withColumn(srccol, lit(None).cast(StringType()))

func_structs(jsonschema)
I am adding columns to the data DF (df_data) itself.
One way is to use Spark's built-in json parser to read the json into a DF:
df = (sqlContext
.read
.option("multiLine", True)
.option("mode", "PERMISSIVE")
.json('file:///mypath/file.json')) # change as necessary
The result is as follows:
+--------+-----+---------+----------+
| Code| ID|Timestamp|Usefulness|
+--------+-----+---------+----------+
|[[A, 1]]|12345| 20140101| Yes|
|[[B, 1]]|1A35B| 20140102| No|
+--------+-----+---------+----------+
The second step is then to break out the struct inside the Code column:
import pyspark.sql.functions as f

df = df.withColumn('Code_event1', f.col('Code').getItem(0).getItem('event1'))
df = df.withColumn('Code_result', f.col('Code').getItem(0).getItem('result'))
df.show()
which gives
+--------+-----+---------+----------+-----------+-----------+
| Code| ID|Timestamp|Usefulness|Code_event1|Code_result|
+--------+-----+---------+----------+-----------+-----------+
|[[A, 1]]|12345| 20140101| Yes| A| 1|
|[[B, 1]]|1A35B| 20140102| No| B| 1|
+--------+-----+---------+----------+-----------+-----------+
EDIT:
Based on comment below from #pault, here is a neater way to capture the required values (run this code after load statement):
df = df.withColumn('Code', f.explode('Code'))
df = df.select("*", "Code.*")
df.show()
I have a list with a few entries:
val list = Seq("Car", "House", "Beach")
The data looks like this:
val df = spark.sparkContext.parallelize(Seq(
  ("Pete", "He has a Car"),
  ("Mike", "The Beach is beautiful"),
  ("Steve", "Look at this House")
)).toDF("Name", "message")
What I want to accomplish is an additional column where the value is the element of the list if the element is present within the message column.
|-----|----------------------|------|
|Name |Message               |NewCol|
|-----|----------------------|------|
|Pete |He has a Car          |Car   |
|Mike |The Beach is beautiful|Beach |
|Steve|Look at this House    |House |
|-----|----------------------|------|
I tried a few things, but without any success, like
a) when($"message".isin(list:_*))
b) a UDF with list.exists(message.contains(_))
I also thought about comparing the string with a regular expression of *<listelement>* but could not get that to work either.
A join would also be a possibility (even preferred), since the list is created from a column of a DataFrame. The new column would only be used afterwards to join with the DataFrame the list originated from.
val new_df = df.join(df_listorigin, Seq("NewCol"))
I think I am overcomplicating this. Any help or ideas would be appreciated.
UDF-approach:
import org.apache.spark.sql.functions.udf

val contains = udf((m: String) => list.filter(m.contains(_)).mkString(","))
df
.withColumn("NewCol",contains($"message"))
.show()
+-----+--------------------+------+
| Name| message|NewCol|
+-----+--------------------+------+
| Pete| He has a Car| Car|
| Mike|The Beach is beau...| Beach|
|Steve| Look at this House| House|
+-----+--------------------+------+
Or with a join:
df
.join(list.toDF("NewCol"),$"message".contains($"NewCol"),"left")
.show()
+-----+--------------------+------+
| Name| message|NewCol|
+-----+--------------------+------+
| Pete| He has a Car| Car|
| Mike|The Beach is beau...| Beach|
|Steve| Look at this House| House|
+-----+--------------------+------+
I have two Spark DataFrames, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like this:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable and might contain 1 to n field pairs.
You can use the except DataFrame method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns back from the first DataFrame.
import org.apache.spark.sql.functions.broadcast

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
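If you want to build df1Cols and df2Cols from the lookup file instead of hard-coding them, here is a sketch that parses the two-column format shown above (lookupPath is a hypothetical path to that file):
import scala.io.Source

val lookupPath = "/path/to/lookup.csv"  // hypothetical location of the lookup file

// skip the "df1col,df2col" header and split each line into the two column names
val pairs = Source.fromFile(lookupPath).getLines()
  .drop(1)
  .map(_.split(",").map(_.trim))
  .collect { case Array(c1, c2) => (c1, c2) }
  .toList

val df1Cols = pairs.map(_._1)  // e.g. List("name", "empNo")
val df2Cols = pairs.map(_._2)  // e.g. List("eName", "eNo")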
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
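A sketch of that idea expressed directly on the DataFrames, reusing the names from the example above (aliasing df2's columns stands in for the text replace in the query):
import org.apache.spark.sql.functions.col

// rename df2's lookup columns to df1's names, diff, then re-join to recover extra columns like age
val df2Renamed = df2.select(df2Cols.zip(df1Cols).map { case (oldName, newName) => col(oldName).as(newName) }: _*)
val missing = df1.select(df1Cols.map(col): _*).except(df2Renamed)
val withAllCols = df1.join(missing, df1Cols)
withAllCols.show()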
Scala 2.10 here, using Spark 1.6.2. I have a similar (but not the same) question as this one; however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is also limited to adding a new column to an existing DataFrame, whereas I need to add a column as well as a value for all existing rows in the DataFrame.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that, I get the following output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced the following pseudo-code together:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get an AnalysisException from that .withColumn(...) call:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use the lit function. First you have to import it
import org.apache.spark.sql.functions.lit
and use it as shown below
jsonDF.withColumn("z", lit("red"))
Type of the column will be inferred automatically.
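If you want to make the StringType from the question explicit (optional, since a string literal is already inferred as a string column), you can cast the literal:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val withZ = jsonDF.withColumn("z", lit("red").cast(StringType))
withZ.show()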
For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks
Edit:
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
This is my DataFrame. I'm trying to change Tesla in the make column to S.
Spark 1.6.2, Java code (sorry). This will change every instance of Tesla to S for the entire DataFrame without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
    .otherwise(col("make"))
);
Edited to add #marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
Building off of the solution from #Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without using the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._
val newsdf = sdf.withColumn(
  "make",
  when(col("make") === "Tesla", "S").otherwise(col("make"))
)
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
This can be achieved in DataFrames with user-defined functions (UDFs).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
"""{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Note:
As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b's answer).
I cannot delete this answer as it has been accepted.
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
List( (2012,"Tesla","S"), (1997,"Ford","E350"), (2015,"Chevy","Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.
So you basically check column 1 for the String tesla.
If it's tesla, use the value S for make; otherwise use the current value of column 1.
Then build a Row with all the data from the original row using the (zero-based) indexes (Row(row(0), make, row(2)) in my example).
There is probably a better way to do it; I am not that familiar yet with the Spark ecosystem.
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
The signature of replace in class DataFrameNaFunctions is: def replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame
To run this you need an active Spark session and a DataFrame whose header columns have been loaded.
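For reference, a minimal self-contained sketch of na.replace; the sample DataFrame is made up here to match the replacement map above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample data with a "Name" column matching the replacement map
val df2 = Seq(("John", 30), ("Cindy", 25), ("Bob", 40)).toDF("Name", "Age")

// replace "John" with "Akshay" and "Cindy" with "Jayita" in the Name column
df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()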
import org.apache.spark.sql.functions._
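// read the CSV with the supplied schema, keep only rows where CPF is present,
// and build CARD_KEY by stripping "." and "-" from the cpf column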
val base_optin_email = spark.read.option("header", "true").option("delimiter", ",")
  .schema(schema_base_optin).csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", lit(translate(translate(col("cpf"), ".", ""), "-", "")))