I'm struggling with the following in PySpark. I have a dictionary in Python that looks like this:
COUNTRY_MAP = {
    "AND": "AD", "ARE": "AE", "AFG": "AF", "ATG": "AG", "AIA": "AI", ...
}
I now want to build up a value from 3 columns, say value1, value2 and value3. The problem is that value3 needs to use the above lookup to convert the 3-letter code to a 2-letter code, and if the code does not exist, "NONE" should be used, i.e.
from pyspark.sql import functions as sf
combined = sf.trim(sf.concat(sf.col("value1"), sf.lit(":"), sf.col("value2"), sf.lit(":"),
                             sf.coalesce(sf.col("value3"), sf.lit("NONE"))))
tmp = (df.withColumn('COMBINED_FIELD', combined)
...<other stuff>
)
This gives me values like "abc:4545:AND", "def:7789:ARE" and "ghi:1122:NONE". I now need: "abc:4545:AD", "def:7789:AE" and "ghi:1122:NONE".
As a newbie in PySpark, I am really struggling to get this working. Do you know how to do it?
You can convert the dictionary into a map type column and get the values using value3 as the key:
import pyspark.sql.functions as F
COUNTRY_MAP = {"AND": "AD", "ARE": "AE", "AFG": "AF", "ATG": "AG", "AIA": "AI"}
result = df.withColumn(
    'combined_field',
    F.trim(
        F.concat_ws(
            ':',
            'value1', 'value2',
            F.coalesce(
                F.create_map(*sum([[F.lit(k), F.lit(v)] for (k, v) in COUNTRY_MAP.items()], []))[F.col('value3')],
                F.lit('NONE')
            )
        )
    )
)
result.show()
+------+------+------+--------------+
|value1|value2|value3|combined_field|
+------+------+------+--------------+
| abc| 4545| AND| abc:4545:AD|
| def| 7789| ARE| def:7789:AE|
| ghi| 1122| NONE| ghi:1122:NONE|
+------+------+------+--------------+
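As a side note, the same map column can be built with itertools.chain instead of sum(..., []), which avoids quadratic list concatenation when the dictionary is large. A minimal equivalent sketch (mapping is just a local name introduced here):

from itertools import chain
import pyspark.sql.functions as F

# Build the lookup map once; chain flattens the (key, value) pairs of the dict.
mapping = F.create_map([F.lit(x) for x in chain(*COUNTRY_MAP.items())])

result = df.withColumn(
    'combined_field',
    F.trim(F.concat_ws(':', 'value1', 'value2',
                       F.coalesce(mapping[F.col('value3')], F.lit('NONE')))))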
I am trying to aggregate a table that I have around one key value (id here) so that I can have one row per id and perform some verifications on the rows that belong to each id, in order to identify the 'result' (a type of transaction of sorts). Let's say that after aggregating, I have something like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F, types as T

sc = SparkContext()
sqlContext = SQLContext(sc)  # enables rdd.toDF()
cols = ['id', 'list1', 'list2']
data = [('zero', ['cd1', 'cd7', 'cd5', 'cd2'], ['', '', '', 'debit']), ('one', ['cd2', 'cd3', 'cd9', 'cd6'], ['credit', '', '', '']), ('two', ['cd4', 'cd3', 'cd5', 'cd1'], ['', '', '', ''])]
rdd = sc.parallelize(data)
df = rdd.toDF(cols)
>>> df.show()
+----+--------------------+--------------+
| id| list1| list2|
+----+--------------------+--------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]|
| two|[cd4, cd3, cd5, cd1]| [, , , ]|
+----+--------------------+--------------+
The question I have to answer here is: does list1 contain cd9? If so, what is the value in list2 at the position where cd2 appears in list1?
To solve it, I defined a couple of UDFs, since the array functions in PySpark 1.6 are limited:
# Indexes in x where the element equals y
enum = F.udf(lambda x, y: [i for i, e in enumerate(x) if e == y], T.ArrayType(T.IntegerType()))
# Elements of x located at the indexes listed in y
elat = F.udf(lambda x, y: [e for i, e in enumerate(x) if i in y], T.ArrayType(T.StringType()))
# Empty list, used as the fallback for 'lookup'
nulls = F.udf(lambda: [], T.ArrayType(T.IntegerType()))
Then creating a new 'lookup' column with the indexes of the elements I want to grab from the other column of lists:
df = df.withColumn(
    'lookup',
    F.when(F.array_contains(F.col('list1'), 'cd7') | F.array_contains(F.col('list1'), 'cd9'),
           enum(F.col('list1'), F.lit('cd2')))
     .otherwise(nulls()))
And finally I use this column to reach my end goal:
df = df.withColumn(
    'result',
    F.when(F.array_contains(F.col('list1'), 'cd7') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 7 - DEBIT')
     .otherwise(F.when(F.array_contains(F.col('list1'), 'cd7') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 7 - CREDIT')
     .otherwise(F.when(F.array_contains(F.col('list1'), 'cd9') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 9 - DEBIT')
     .otherwise(F.when(F.array_contains(F.col('list1'), 'cd9') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 9 - CREDIT')
     .otherwise('etc')))))
>>> df.show()
+----+--------------------+--------------+------+-------------+
| id| list1| list2|lookup| result|
+----+--------------------+--------------+------+-------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]| [3]| CD 7 - DEBIT|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]| [0]|CD 9 - CREDIT|
| two|[cd4, cd3, cd5, cd1]| [, , , ]| []| etc|
+----+--------------------+--------------+------+-------------+
But I would very much prefer it if there were a way to achieve the same thing without creating an extra column, because the actual dataframe has more columns and the lookup list may need to access different columns depending on the rule I need to check. When I tried to combine the elat and enum UDFs in one go, Spark was unable to compute one or the other.
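One possible way to avoid the intermediate column (just a sketch, using a hypothetical lookup_value helper, and not verified on PySpark 1.6) is to fold the index lookup and the value extraction into a single UDF and call it directly inside a chained when, starting again from the original df:

from pyspark.sql import functions as F, types as T

# Hypothetical helper: if `trigger` appears in list1, return the list2 element
# at the position where `target` occurs in list1, otherwise return None.
def lookup_value(list1, list2, trigger, target):
    if trigger in list1 and target in list1:
        return list2[list1.index(target)]
    return None

lookup_udf = F.udf(lookup_value, T.StringType())

df = df.withColumn(
    'result',
    F.when(lookup_udf('list1', 'list2', F.lit('cd7'), F.lit('cd2')) == 'debit', 'CD 7 - DEBIT')
     .when(lookup_udf('list1', 'list2', F.lit('cd7'), F.lit('cd2')) == 'credit', 'CD 7 - CREDIT')
     .when(lookup_udf('list1', 'list2', F.lit('cd9'), F.lit('cd2')) == 'debit', 'CD 9 - DEBIT')
     .when(lookup_udf('list1', 'list2', F.lit('cd9'), F.lit('cd2')) == 'credit', 'CD 9 - CREDIT')
     .otherwise('etc'))

The repeated UDF calls could be pulled into local variables, but the idea is the same: no lookup column is materialised on the dataframe.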
I need to loop through a JSON file, flatten the results and add a column to a dataframe in each loop with the respective values. The end result will have around ~2000 columns, so using withColumn to add the columns is extremely slow. Is there any alternative way to add columns to a dataframe?
Sample Input json:
[
    {
        "ID": "12345",
        "Timestamp": "20140101",
        "Usefulness": "Yes",
        "Code": [
            {
                "event1": "A",
                "result": "1"
            }
        ]
    },
    {
        "ID": "1A35B",
        "Timestamp": "20140102",
        "Usefulness": "No",
        "Code": [
            {
                "event1": "B",
                "result": "1"
            }
        ]
    }
]
My output should be:
ID     Timestamp  Usefulness  Code_event1  Code_result
12345  20140101   Yes         A            1
1A35B  20140102   No          B            1
The json file I am working on is huge and consists of many columns. So, withColumn is not feasible in my case.
EDIT:
Sample code:
import json
from collections import OrderedDict

# Data file
df_data = spark.read.json(file_path)

# Schema file
with open(schemapath) as fh:
    jsonschema = json.load(fh, object_pairs_hook=OrderedDict)
I am looping through the schema file, and in the loop I am accessing the data for a particular key from the data DF (df_data). I am doing this because my data file has multiple records, so I can't loop through the data JSON file itself, or it would loop through every record.
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

def func_structs(json_file):
    global df_data
    for index, (k, v) in enumerate(json_file.items()):
        if isinstance(v, dict):
            srccol = k
            func_structs(v)
        elif isinstance(v, list):
            srccol = k
            func_lists(v)  # Separate function to loop through list elements to find nested elements
        else:
            try:
                df_data = df_data.withColumn(srccol, df_data[srccol])
            except Exception:
                df_data = df_data.withColumn(srccol, lit(None).cast(StringType()))

func_structs(jsonschema)
I am adding columns to the data DF (df_data) itself.
One way is to use Spark's built-in json parser to read the json into a DF:
import pyspark.sql.functions as f

df = (sqlContext
      .read
      .option("multiLine", True)
      .option("mode", "PERMISSIVE")
      .json('file:///mypath/file.json'))  # change the path as necessary
The result is as follows:
+--------+-----+---------+----------+
| Code| ID|Timestamp|Usefulness|
+--------+-----+---------+----------+
|[[A, 1]]|12345| 20140101| Yes|
|[[B, 1]]|1A35B| 20140102| No|
+--------+-----+---------+----------+
The second step is then to break out the struct inside the Code column:
df = df.withColumn('Code_event1', f.col('Code').getItem(0).getItem('event1'))
df = df.withColumn('Code_result', f.col('Code').getItem(0).getItem('result'))
df.show()
which gives
+--------+-----+---------+----------+-----------+-----------+
| Code| ID|Timestamp|Usefulness|Code_event1|Code_result|
+--------+-----+---------+----------+-----------+-----------+
|[[A, 1]]|12345| 20140101| Yes| A| 1|
|[[B, 1]]|1A35B| 20140102| No| B| 1|
+--------+-----+---------+----------+-----------+-----------+
EDIT:
Based on the comment below from @pault, here is a neater way to capture the required values (run this code after the load statement):
df = df.withColumn('Code', f.explode('Code'))
df = df.select("*", "Code.*")
I have a dataframe like the one below.
+---+------+------+
| ID|Field1|Field2|
+---+------+------+
| 1| x| n|
| 2| a| b|
+---+------+------+
And I need the output to look like this:
+---+-------------+------+
| ID| Fields|values|
+---+-------------+------+
| 1|Field1,Field2| x,n|
| 2|Field1,Field2| a,b|
+---+-------------+------+
I am pretty new to Scala. I just need an approach to do this. I have already researched transposing on the internet, but couldn't find a solution.
Since the Fields column is going to be the same in every row, you can add it later.
In this example the case class Thing has 3 fields: id, Field1, Field2.
case class Thing( id: Int, Field1: String, Field2: String )

val sqlContext = new org.apache.spark.sql.SQLContext( sc )
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df =
  sc
    .parallelize( List( Thing( 1, "a", "b" ), Thing( 2, "x", "y" ) ) )
    .toDF( "id", "Field1", "Field2" )
Column names are returned in the same order, so we can just take the last two as the field names:
val fieldNames =
df
.columns
.takeRight( 2 )
The functions in org.apache.spark.sql.functions do all the work of combining data from the given columns:
val res =
df
.select( $"id", array( $"Field1", $"Field2" ) as "values" )
.withColumn( "Fields", lit( fieldNames ) )
res.show()
Result:
+---+------+----------------+
| id|values| Fields|
+---+------+----------------+
| 1|[a, b]|[Field1, Field2]|
| 2|[x, y]|[Field1, Field2]|
+---+------+----------------+
I got the following RDD after Kafka streaming. I want to convert it into a dataframe without defining a schema.
[
{u'Enrolment_Date': u'2008-01-01', u'Freq': 78},
{u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'},
{u'Freq': 70, u'Group': u'Recorded Data'},
{u'Enrolment_Date': u'2008-04-01', u'Freq': 96}
]
You can use an OrderedDict to convert an RDD containing key-value pairs to a dataframe. However, in your case not all keys are present in each row, so you need to populate the missing ones with None values first. See the solution below:
#Define test data
data = [{u'Enrolment_Date': u'2008-01-01', u'Freq': 78}, {u'Enrolment_Date': u'2008-02-01', u'Group': u'Recorded Data'}, {u'Freq': 70, u'Group': u'Recorded Data'}, {u'Enrolment_Date': u'2008-04-01', u'Freq': 96}]
rdd = sc.parallelize(data)
from pyspark.sql import Row
from collections import OrderedDict
#Determine all the keys in the input data
schema = rdd.flatMap(lambda x: x.keys()).distinct().collect()
#Add missing keys with a None value
rdd_complete= rdd.map(lambda r:{x:r.get(x) for x in schema})
#Use an OrderedDict to convert your data to a dataframe
#This ensures data ends up in the right column
df = rdd_complete.map(lambda r: Row(**OrderedDict(sorted(r.items())))).toDF()
df.show()
This gives as output:
+--------------+----+-------------+
|Enrolment_Date|Freq| Group|
+--------------+----+-------------+
| 2008-01-01| 78| null|
| 2008-02-01|null|Recorded Data|
| null| 70|Recorded Data|
| 2008-04-01| 96| null|
+--------------+----+-------------+
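For completeness, a shorter alternative (a sketch only: it assumes a SparkSession named spark is available, and relies on spark.read.json accepting an RDD of JSON strings, which is an older API) is to serialise each dict back to JSON and let Spark infer the schema; missing keys again end up as null:

import json

# Serialise each dict to a JSON string; spark.read.json infers the union of
# all keys across records and leaves the missing ones as null.
df = spark.read.json(rdd.map(json.dumps))
df.show()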
For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.
Edit:
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
This is my dataframe. I'm trying to change Tesla in the make column to S.
Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
.otherwise(col("make")
);
Edited to add @marshall245's otherwise to ensure non-Tesla values aren't converted to NULL.
Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._
val newsdf =
  sdf.withColumn(
    "make",
    when(col("make") === "Tesla", "S").otherwise(col("make"))
  )
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
This can be achieved in dataframes with user-defined functions (UDFs).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Note:
As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b's answer).
This answer cannot be deleted because it has been accepted.
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List( (2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt") )
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)

dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.
So you basically check column 1 for the String tesla.
If it's tesla, use the value S for make; otherwise keep the current value of column 1.
Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).
There is probably a better way to do it; I am not that familiar yet with the Spark ecosystem.
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
This uses replace in class DataFrameNaFunctions, with the signature def replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.
To run this function you must have an active Spark session and a dataframe whose column headers are set.
import org.apache.spark.sql.functions._

val base_optin_email = spark.read
  .option("header", "true").option("delimiter", ",")
  .schema(schema_base_optin).csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", lit(translate(translate(col("cpf"), ".", ""), "-", "")))