I need to loop through a JSON file, flatten the results, and add a column to a DataFrame in each iteration with the respective values. The end result will have around 2000 columns, so using withColumn to add them one at a time is extremely slow. Is there any alternative way to add columns to a DataFrame?
Sample Input json:
[
{
"ID": "12345",
"Timestamp": "20140101",
"Usefulness": "Yes",
"Code": [
{
"event1": "A",
"result": "1"
}
]
},
{
"ID": "1A35B",
"Timestamp": "20140102",
"Usefulness": "No",
"Code": [
{
"event1": "B",
"result": "1"
}
]
}
]
My output should be:
ID Timestamp Usefulness Code_event1 Code_result
12345 20140101 Yes A 1
1A35B 20140102 No B 1
The json file I am working on is huge and consists of many columns. So, withColumn is not feasible in my case.
EDIT:
Sample code:
import json
from collections import OrderedDict

# Data file
df_data = spark.read.json(file_path)
# Schema file
with open(schemapath) as fh:
    jsonschema = json.load(fh, object_pairs_hook=OrderedDict)
I am looping through the schema file, and inside the loop I access the data for a particular key from the data DF (df_data). I am doing this because my data file has multiple records, so I can't loop through the data JSON file or it would loop through every record.
def func_structs(json_file):
    global df_data  # the function reassigns the module-level DataFrame
    for index,(k,v) in enumerate(json_file.items()):
        if isinstance(v, dict):
            srccol = k
            func_structs(v)
        elif isinstance(v, list):
            srccol = k
            func_lists(v) # Separate function to loop through list elements to find nested elements
        else:
            try:
                df_data = df_data.withColumn(srcColName,df_data[srcCol])
            except:
                df_data = df_data.withColumn(srcColName,lit(None).cast(StringType()))
func_structs(jsonschema)
I am adding columns to the data DF (df_data) itself.
One way is to use Spark's built-in json parser to read the json into a DF:
df = (sqlContext
.read
.option("multiLine", True)
.option("mode", "PERMISSIVE")
.json('file:///mypath/file.json')) # change as necessary
The result is as follows:
+--------+-----+---------+----------+
| Code| ID|Timestamp|Usefulness|
+--------+-----+---------+----------+
|[[A, 1]]|12345| 20140101| Yes|
|[[B, 1]]|1A35B| 20140102| No|
+--------+-----+---------+----------+
The second step is then to break out the struct inside the Code column:
from pyspark.sql import functions as f

df = df.withColumn('Code_event1', f.col('Code').getItem(0).getItem('event1'))
df = df.withColumn('Code_result', f.col('Code').getItem(0).getItem('result'))
df.show()
which gives
+--------+-----+---------+----------+-----------+-----------+
| Code| ID|Timestamp|Usefulness|Code_event1|Code_result|
+--------+-----+---------+----------+-----------+-----------+
|[[A, 1]]|12345| 20140101| Yes| A| 1|
|[[B, 1]]|1A35B| 20140102| No| B| 1|
+--------+-----+---------+----------+-----------+-----------+
EDIT:
Based on the comment below from @pault, here is a neater way to capture the required values (run this code after the load statement):
df = df.withColumn('Code', f.explode('Code'))
df.select("*", "Code.*")
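For the roughly 2000-column case in the question, one option is to work out every output column up front and pass them all to a single select, rather than calling withColumn in a loop. The following is only a minimal sketch of the path-building part in plain Python. It assumes the schema file is a nested dict whose leaves are type names and that arrays have already been exploded into structs (as in the explode step above). The helper name flatten_paths, the sample schema, and the underscore alias convention are made up for illustration.

```python
def flatten_paths(schema, prefix=""):
    """Return (dotted_path, alias) pairs for every leaf of a nested dict schema."""
    pairs = []
    for key, value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested struct: recurse and keep extending the dotted path.
            pairs.extend(flatten_paths(value, path))
        else:
            # Leaf: the alias replaces dots with underscores, e.g. Code_event1.
            pairs.append((path, path.replace(".", "_")))
    return pairs


# Hypothetical schema mirroring the sample input (Code already exploded).
schema = {
    "ID": "string",
    "Timestamp": "string",
    "Usefulness": "string",
    "Code": {"event1": "string", "result": "string"},
}
pairs = flatten_paths(schema)

# With Spark available, the pairs could feed one select, for example:
# df.select([f.col(path).alias(alias) for path, alias in pairs])
```

A single select adds one projection to the query plan, whereas each withColumn call adds another, which is the usual reason the loop gets slow as the column count grows.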
I'm currently working in Databricks and have a Delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then create the other 26 to merge the values back into the original Delta table. So the input is 28 columns and the output is 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col,lit
from functools import reduce
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true")
spark.conf.set("spark.sql.parquet.compression.codec","gzip")
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed","true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning","true")
output=spark.sql("select * from delta.`table`").cache()
SeriesAppend=[]
for i in output.collect():
    #small mapping fix
    if i['col1']=='val1':
        var0='a'
    elif i['col1']=='val2':
        var0='b'
    elif i['col1']=='val3':
        var0='c'
    elif i['col1']=='val4':
        var0='d'
    var0=set([var0])
    req_var = set(['a','b','c','d'])
    var_list=list(req_var-var0)
    #subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]
    if len(i['col2'])<500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps=json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1',lit(i['col1']))
        json_df = json_df.withColumn('col2',lit(i['col2']))
        json_df = json_df.withColumn('col3',lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass
Series_output=reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
[
("a", "cat","owner1"), # create your data here, be consistent in the types.
("b", "dog","owner2"),
("c", "fish","owner3"),
("d", "fox","owner4"),
("e", "rat","owner5"),
],
["col1", "col2", "col3"]) # add your column names here
I really just need to write the response + other column values to a delta table, so dataframes are not necessarily required, but haven't found a faster way than the above. Right now, I can run 5 inputs, which returns 15 in 25.3 seconds without the unionAll. With the inclusion of the union, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
[
("a", "cat","owner1","MI", 48003), # create your data here, be consistent in the types.
("b", "dog","owner2", "MI", 48003),
("c", "fish","owner3","MI", 48003),
("d", "fox","owner4","MI", 48003),
("e", "rat","owner5","MI", 48003),
],
["col1", "col2", "col3", "col4", "col5"]) # add your column names here
How can I make this faster in spark?
As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of calling collect and letting a single machine (the driver) run it all. Collecting everything to the driver is simply the wrong approach and does not scale.
import requests
from pyspark.sql import functions as F, types as T

# This is your main function, pure Python, and you can unit test it in any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'
    var0=set([var0])
    req_var = set(['a','b','c','d'])
    var_list = list(req_var - var0)
    #subscription info
    headers = {header} # !!! `header` must be available **inside** this function, global won't work
    body = [{
        'text': col2
    }]
    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` must be available **inside** this function, global won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        # placeholder: extract the two fields you need from the parsed response
        return (response.col4, response.col5)
    else:
        return None
# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same as output, I'm using only a tuple with 2 elements, you can make it as many items as you wish.
df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2')).show()
# Output
# +----+----+------+------------------+
# |col1|col2| col3| temp|
# +----+----+------+------------------+
# | a| cat|owner1|[foo_cat, bar_cat]|
# | b| dog|owner2|[foo_dog, bar_dog]|
# | c|fish|owner3| null|
# | d| fox|owner4| null|
# | e| rat|owner5| null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
    .withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2'))
    .withColumn('col4', F.col('temp')[0])
    .withColumn('col5', F.col('temp')[1])
    .drop('temp')
    .show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2| col3| col4| col5|
# +----+----+------+-------+-------+
# | a| cat|owner1|foo_cat|bar_cat|
# | b| dog|owner2|foo_dog|bar_dog|
# | c|fish|owner3| null| null|
# | d| fox|owner4| null| null|
# | e| rat|owner5| null| null|
# +----+----+------+-------+-------+
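One thing to watch in the function above: the if/elif chain leaves var0 unassigned when col1 matches none of the four values, so the UDF would raise for rows such as col1 = 'e' in the sample DF. A possible way around that, sketched here in plain Python, is a dictionary lookup that returns None for unmapped values. The function name remaining_codes is made up for illustration; the mapping itself is the one from the question.

```python
def remaining_codes(col1):
    """Return the sorted codes other than the one mapped from col1, or None if unmapped."""
    # The mapping lives inside the function so nothing outside it is needed.
    code_by_value = {"val1": "a", "val2": "b", "val3": "c", "val4": "d"}
    code = code_by_value.get(col1)
    if code is None:
        # Unknown input: signal "nothing to do" instead of raising.
        return None
    return sorted(set(code_by_value.values()) - {code})


known = remaining_codes("val2")
unknown = remaining_codes("e")
```

Because the helper is pure Python, it can be unit tested on its own before it is called from inside req.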
I have a .log file in ADLS which contains multiple nested JSON objects, one per line, as follows:
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read the above nested JSON objects in PySpark and convert them to a DataFrame as follows:
EventType   Timestamp   Data.Id   .....  Data.Properties.CorrId  Data.Properties.ActionId
3735091736  2019-03-19  event-c2  .....  d69b7489                d0e2c3fd
3735091737  2019-03-18  event-d2  .....  f69b7489                d0f2c3fd
3735091738  2019-03-17  event-e2  .....  g69b7489                d0d2c3fd
I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with above problem? Thanks!
1. You can read it into an RDD first. It will be read as a list of strings.
2. You need to convert each JSON string into a native Python datatype using json.loads().
3. Then you can convert the RDD into a DataFrame, and it can infer the schema directly using toDF().
4. Using the answer from "Flatten Spark Dataframe column of map/dictionary into multiple columns", you can explode the Data column into multiple columns, given that your Id column is going to be unique. Note that explode returns key and value columns for each entry in the map type.
5. You can repeat the 4th point to explode the Properties column.
Solution:
import json
from pyspark.sql import functions as F

rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', "Timestamp", F.explode('Data'))\
.groupBy('Id', 'EventType', "Timestamp").pivot('key').agg(F.first('value'))
# There is a duplicate Id column and might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data using the following code.
from pyspark.sql.functions import *
DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'),col('EventType'),col('Timestamp'),col('Data.Id'),col('Data.Level'),col('Data.MessageTemplate'),
col('Data.Properties.CorrId'),col('Data.Properties.ActionId'))\
.show()
Result:
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
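Note that the result above ends up with two columns named Id (the top-level one and Data.Id). As a way to check which columns a full flatten should produce, and to see that dotted names keep the two Id fields apart, here is a small plain-Python sketch that flattens one log line. The helper name flatten is made up for illustration; the input is the first line of the sample file.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: merge its flattened entries under the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


line = (
    '{"EventType":3735091736,"Timestamp":"2019-03-19",'
    '"Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1",'
    '"Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},'
    '"Id":"event-c20b9c7eac0808d6321106d901000000"}'
)
row = flatten(json.loads(line))
columns = list(row)
```

In the select shown earlier, the same separation could be kept by giving col('Data.Id') an alias.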
I receive multiple incoming files, and I have to compare each incoming file with the source file, then merge them: replace the old rows with the new rows and append any extra rows present in the source file. Afterwards I have to take the updated source file, compare it with the next incoming file, update it again, and so the process goes on.
So far I have created a DataFrame for each file and compared and merged them using a join. I want to save all the updates made to the source file and use the updated source file again to compare against and update from further incoming files.
val merge = df.union(dfSource.join(df, Seq( "EmployeeID" ),joinType= "left_anti").orderBy("EmployeeID") )
merge.write.mode ("append").format("text").insertInto("dfSource")
merge.show()
I tried it this way, but it doesn't update my dfSource DataFrame. Could somebody help, please?
Thanks
This is not possible this way. You need to use tables and then save to a file as the final part of the process.
I suggest you align your approach as follows. It allows parallel loading, although I suspect that is not of much real benefit.
1. Load all files in order of delivery, tagging each loaded record with a timestamp or an ordering sequence derived from the file's position in the sequence, along with the type of record. E.g. file X at, say, position 2 in the sequence gets its records loaded with seqnum = 2. You can use the DF approach on the file being processed and append to an Impala / Hive KUDU table if performing everything within the Spark domain.
2. For records in the same file, apply monotonically_increasing_id() to get ordering within the file if the same key can occur more than once in the same file. See "DataFrame-ified zipWithIndex". Or use zipWithIndex by converting to an RDD and back to a DF.
3. Then issue a select statement to take, per key, the record with the maximum timestamp, seq_num. E.g. if the current run has, say, 3 records for key=1, only one needs to be processed, presumably the one with the highest value.
4. Save as a new file.
5. Process this new file accordingly.
OR:
Bypass step 3, read in ascending order, and process the data accordingly.
A comment:
Typically I load such data with LOAD into Hive / Impala, with the partitioning key set by extracting the timestamp from the file name. That requires some Linux scripting / processing. It is a question of style and should not be a real Big Data bottleneck.
Here is a snippet with simulated input showing how some aspects can be done to allow a MAX select against a key for upserts. Add the operation (DEL, ALT, whatever you need) yourself; I think you can do this from what I have seen:
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = "key", dataType = StringType, nullable = false),
      StructField(name = "file", dataType = StringType, nullable = false),
      StructField(name = "ts", dataType = StringType, nullable = false),
      StructField(name = "val", dataType = StringType, nullable = false),
      StructField(name = "seq_val", dataType = LongType, nullable = false)
    )
  )
val newSchema = dfSchema(List("key", "file", "ts", "val", "seq_val"))
val df1 = Seq(
("A","F1", "ts1","1"),
("B","F1", "ts1","10"),
("A","F1", "ts2","2"),
("C","F2", "ts3","8"),
("A","F2", "ts3","3"),
("A","F0", "ts0","0")
).toDF("key", "file", "ts","val")
val rddWithId = df1.sort($"key", $"ts".asc).rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(rddWithId.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(index))}, newSchema)
dfZippedWithId.show
returns:
+---+----+---+---+-------+
|key|file| ts|val|seq_val|
+---+----+---+---+-------+
| A| F0|ts0| 0| 0|
| A| F1|ts1| 1| 1|
| A| F1|ts2| 2| 2|
| A| F2|ts3| 3| 3|
| B| F1|ts1| 10| 4|
| C| F2|ts3| 8| 5|
+---+----+---+---+-------+
ready for subsequent processing.
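The subsequent processing is the max-per-key select described in the steps above. As an illustration of that logic only, here is a plain-Python sketch over the rows from the output table. The helper name latest_per_key is made up for illustration; in Spark itself this would more likely be a groupBy with max or a window function.

```python
# Rows copied from the output above: (key, file, ts, val, seq_val).
rows = [
    ("A", "F0", "ts0", "0", 0),
    ("A", "F1", "ts1", "1", 1),
    ("A", "F1", "ts2", "2", 2),
    ("A", "F2", "ts3", "3", 3),
    ("B", "F1", "ts1", "10", 4),
    ("C", "F2", "ts3", "8", 5),
]


def latest_per_key(rows):
    """Keep, for each key, the row with the highest seq_val."""
    latest = {}
    for row in rows:
        key, seq_val = row[0], row[4]
        # Replace the stored row only when this one is newer.
        if key not in latest or seq_val > latest[key][4]:
            latest[key] = row
    return latest


latest = latest_per_key(rows)
```

For key A, only the row from file F2 with val 3 survives, which is the record an upsert would apply.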
Scala 2.10 here using Spark 1.6.2. I have a similar (but not the same) question as this one, however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark; and therefore I can't reproduce it or make sense of it. More importantly, that question is also just limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that I get this following as output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced the following pseudo-code together:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get an error from the .withColumn(...) call:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use the lit function. First you have to import it
import org.apache.spark.sql.functions.lit
and use it as shown below
jsonDF.withColumn("z", lit("red"))
The type of the column will be inferred automatically.
For example I want to replace all numbers equal to 0.2 in a column to 0. How can I do that in Scala? Thanks
Edit:
|year| make|model| comment |blank|
|2012|Tesla| S | No comment | |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null | null|
This is my DataFrame. I'm trying to change Tesla in the make column to S.
Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
import static org.apache.spark.sql.functions.*;

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
    .otherwise(col("make"))
);
Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
Building on the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._
val newsdf =
sdf.withColumn(
"make",
when(col("make") === "Tesla", "S").otherwise(col("make"))
);
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
This can be achieved in dataframes with user defined functions (udf).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
"""{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf {(make: String) =>
if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Note:
As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b's answer).
I cannot delete this answer as it has been accepted.
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List( (2012,"Tesla","S"), (1997,"Ford","E350"), (2015,"Chevy","Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.
So you basically check column 1 for the string tesla.
If it's tesla, use the value S for make; otherwise use the current value of column 1.
Then build a Row with all the data from the row using the (zero-based) indexes (Row(row(0),make,row(2)) in my example).
There is probably a better way to do it. I am not that familiar with the Spark umbrella yet.
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
replace in class DataFrameNaFunctions of type [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame
To run this function you must have an active Spark object and a DataFrame read with headers on.
import org.apache.spark.sql.functions._
val base_optin_email = spark.read.option("header","true").option("delimiter",",").schema(schema_base_optin).csv(file_optin_email).where("CPF IS NOT NULL").
withColumn("CARD_KEY", lit(translate( translate(col("cpf"), ".", ""),"-","")))