Decoding a Column and Extracting Into Several Columns using PySpark

Given: I have the following PySpark DataFrame:
s_df.show(10)
+-------------+--------------------+-------+--------------------+
| timestamp| value|quality| attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506846714421|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853046041|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853069411|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853175701|eyJGbG9vcl9OdW1iZ...| 3|[WrappedArray(0.3...|
|1506853278721|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506853285741|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506853313701|eyJWYWx1ZSI6ICJOQ...| 3|[WrappedArray(0.3...|
|1506856544461|eyJJbnNlcnRUaW1lI...| 3|[WrappedArray(0.3...|
|1506856563751|eyJJbnNlcnRUaW1lI...| 3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
only showing top 10 rows
Goal: I want to decode the value column and extract the data into a data frame which looks like this:
Counter Duration EventEndTime ... Floor_Number InsertTime Value
0 1.0 2790 1506846690991 ... NA 1507645527691 0
0 1.0 2760 1506846717181 ... NA 1507645530751 0
0 1.0 5790 1506853051831 ... NA 1509003670478 NA
0 1.0 6060 1506853075471 ... NA 1509003671231 NA
0 1.0 3480 1506853179181 ... NA 1509003671935 NA
0 1.0 2760 1506853281481 ... NA 1509004002809 NA
0 1.0 3030 1506853288771 ... NA 1509004003249 NA
0 1.0 2790 1506853316491 ... NA 1509004004038 NA
0 1.0 3510 1506856547971 ... NA 1509003922437 NA
0 1.0 3810 1506856567561 ... NA 1509003923116 NA
Difficulty: I can decode the column, but I am unable to extract the dict into columns in PySpark, so I end up doing it in pandas. I would like to do all the operations in PySpark.
My Attempt
I use the following UDF to decode the value column:
import base64
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def decode_values(x):
    return base64.b64decode(x).decode('utf-8')
udf_myFunction = udf(decode_values, StringType())
result = s_df.withColumn('value', udf_myFunction('value'))
to get the following:
result.show(10)
+-------------+--------------------+-------+--------------------+
| timestamp| value|quality| attributes|
+-------------+--------------------+-------+--------------------+
|1506846688201|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506846714421|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853046041|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853069411|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853175701|{"Floor_Number": ...| 3|[WrappedArray(0.3...|
|1506853278721|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506853285741|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506853313701|{"Value": "NA", "...| 3|[WrappedArray(0.3...|
|1506856544461|{"InsertTime": 15...| 3|[WrappedArray(0.3...|
|1506856563751|{"InsertTime": 15...| 3|[WrappedArray(0.3...|
+-------------+--------------------+-------+--------------------+
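As an aside, the same decode can also be done without a Python UDF using Spark's built-in unbase64 function (a minimal sketch; unbase64 returns binary, so it is cast back to a string):
from pyspark.sql.functions import col, unbase64

result = s_df.withColumn('value', unbase64(col('value')).cast('string'))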
The value column now holds a JSON string (it looks like a dict), so I do the following in pandas:
import json
import pandas as pd
from pandas.io.json import json_normalize
result2 = result.toPandas()
final_df = result2.value.apply(json.loads).apply(json_normalize).pipe(lambda x: pd.concat(x.values))
final_df.head()
Counter Duration EventEndTime ... Floor_Number InsertTime Value
0 1.0 2790 1506846690991 ... NA 1507645527691 0
0 1.0 2760 1506846717181 ... NA 1507645530751 0
0 1.0 5790 1506853051831 ... NA 1509003670478 NA
0 1.0 6060 1506853075471 ... NA 1509003671231 NA
0 1.0 3480 1506853179181 ... NA 1509003671935 NA
0 1.0 2760 1506853281481 ... NA 1509004002809 NA
0 1.0 3030 1506853288771 ... NA 1509004003249 NA
0 1.0 2790 1506853316491 ... NA 1509004004038 NA
0 1.0 3510 1506856547971 ... NA 1509003922437 NA
0 1.0 3810 1506856567561 ... NA 1509003923116 NA
I tried to create another UDF, but it did not work:
def convert_dict_to_columns(x):
    df_out = x.apply(json.loads).apply(json_normalize).pipe(lambda y: pd.concat(y.values))
    return df_out

# Convert the dict to columns
udf_convert_d2c = udf(convert_dict_to_columns, StringType())
final_result = result.withColumn('value', udf_convert_d2c('value'))
Any idea how I can do this in a PySpark-y way?
Here's my minimal viable code:
import pandas as pd
df = pd.DataFrame(columns=['Timestamp', 'value'])
df.loc[:, 'Timestamp'] = [1501891200, 1501891200, 1501891200, 1501891200, 1501891200, 1501891200]
df.loc[:, 'value'] = [{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201},
{"Floor_Number": "NA", "Value": 0.0, "InsertTime": 1507645527691, "EventStartTime": 1506846688201}]
s_df = spark.createDataFrame(df)
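For reference, one PySpark-only route (a sketch, assuming the keys seen in the sample above and not tested against the full data) is to define a schema for the decoded JSON and use from_json, then expand the resulting struct with parsed.*. Applied to the decoded result DataFrame from my attempt above:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Keys taken from the sample output above; extend this list for the real data.
# Everything is read as a string first because some fields mix "NA" with numbers;
# cast afterwards as needed.
value_schema = StructType([
    StructField("Floor_Number", StringType(), True),
    StructField("Value", StringType(), True),
    StructField("InsertTime", StringType(), True),
    StructField("EventStartTime", StringType(), True),
])

parsed = (result
          .withColumn("parsed", from_json(col("value"), value_schema))
          .select("timestamp", "quality", "attributes", "parsed.*"))
parsed.show(10)
Keys that are absent from a particular record come back as null, which can then be replaced with "NA" via fillna to match the pandas output.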

Related

Dividing a set of columns by its average in Pyspark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below are sample data and my current code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
My function as of now (not working):
#function to divide few columns by the column average and overwrite the column
def avg_scaling(df):
    #List of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df

new_df = avg_scaling(df)
You can make use of a Window here, partitioned on a pseudo column (a literal constant), and run the average over that window.
The code goes like this,
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(col, F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", "Col2", "Col3"])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
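Note that a window partitioned on a constant moves all rows into a single partition. On larger data, an alternative (sketched here with the same column names as above) is to compute the averages once with agg and attach them with a cross join:
from pyspark.sql import functions as F

cols_to_scale = ["Col1", "Col2", "Col3"]
avgs = df.agg(*[F.avg(c).alias(f"avg_{c}") for c in cols_to_scale])
scaled = df.crossJoin(avgs)
for c in cols_to_scale:
    scaled = scaled.withColumn(c, F.round(F.col(c) / F.col(f"avg_{c}"), 2))
scaled = scaled.drop(*[f"avg_{c}" for c in cols_to_scale])
scaled.show()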

How to parse dynamic Json with dynamic keys inside it in Scala

I am trying to parse a JSON structure which is dynamic in nature and load it into a database, but I am facing difficulty where the JSON has dynamic keys inside it. Below is my sample JSON. I have tried using the explode function, but it didn't help.
A mostly similar thing is described here: How to parse a dynamic JSON key in a Nested JSON result?
{
  "_id": {
    "planId": "5f34dab0c661d8337097afb9",
    "version": { "$numberLong": "1" },
    "period": { "name": "3Q20", "startDate": 20200629, "endDate": 20200927 },
    "line": "b443e9c0-fafc-4791-87c9-8e32339c7f3c",
    "channelId": "G7k5_-HWRIuF0-afe7q-rQ"
  },
  "unitRates": {
    "units": { "$numberLong": "0" },
    "rate": 0.0,
    "rcRate": 0.0
  },
  "demoValues": {
    "66": { "cpm": 0.0, "cpp": 0, "vpvh": 0.0, "imps": 0.0, "rcImps": 0.0, "ue": 0.0, "grps": 0.0, "demoId": "66" },
    "63": { "cpm": 0.0, "cpp": 0, "vpvh": 0.0, "imps": 0.0, "rcImps": 0.0, "ue": 0.0, "grps": 0.0, "demoId": "63" },
    "21": { "cpm": 0.0, "cpp": 0, "vpvh": 0.0, "imps": 0.0, "rcImps": 0.0, "ue": 0.0, "grps": 0.0, "demoId": "21" }
  },
  "hh-imps": 0.0
}
Below is my scala code:
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import com.google.gson.JsonObject
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructField, StructType}
import org.codehaus.jettison.json.JSONObject

object ParseDynamic_v2 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val jsonStringDs = spark.createDataset[String](
      Seq(
        ("""{"_id" : {"planId" : "5f34dab0c661d8337097afb9","version" : {"$numberLong" : "1"},"period" : {"name" : "3Q20","startDate" : 20200629,"endDate" : 20200927},"line" : "b443e9c0-fafc-4791-87c9-8e32339c7f3c","channelId" : "G7k5_-HWRIuF0-afe7q-rQ"},"unitRates" : {"units" : {"$numberLong" : "0"},"rate" : 0.0,"rcRate" : 0.0},"demoValues" : {"66" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "66"},"63" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "63"},"21" : {"cpm" : 0.0,"cpp" : 0,"vpvh" : 0.0,"imps" : 0.0,"rcImps" : 0.0,"ue" : 0.0,"grps" : 0.0,"demoId" : "21"}},"hh-imps" : 0.0}""")
      ))
    jsonStringDs.show

    val df = spark.read.json(jsonStringDs)
    df.show(false)

    val app = df.select("demoValues.*")
    app.createOrReplaceTempView("app")
    app.printSchema
    app.show(false)

    val verticaProperties: Map[String, String] = Map(
      "db" -> "dbname", // Database name
      "user" -> "user", // Database username
      "password" -> "****", // Password
      "table" -> "tablename", // Vertica table name
      "dbschema" -> "public", // schema of Vertica where the table will be residing
      "host" -> "localhost", // Host on which Vertica is currently running
      "hdfs_url" -> "hdfs://localhost:8020/user/hadoop/planheader/", // HDFS directory url in which the intermediate ORC file will persist before sending it to Vertica
      "web_hdfs_url" -> "webhdfs://localhost:50070/user/hadoop/planheader/"
    )

    val verticaDataSource = "com.vertica.spark.datasource.DefaultSource"
    // batch write
    val loadStream = df.write.format(verticaDataSource).options(verticaProperties).mode("overwrite").save()

    // streaming write
    val saveToVertica: DataFrame => Unit =
      dataFrame =>
        dataFrame.write.format(verticaDataSource).options(verticaProperties).mode("append").save()

    val checkpointLocation = "/user/hadoop/planheader/checkpoint"
    val streamingQuery = df.writeStream
      .outputMode(OutputMode.Append)
      .option("checkpointLocation", checkpointLocation)
      //.trigger(ProcessingTime("25 seconds"))
      .foreachBatch((ds, _) => saveToVertica(ds)).start()

    streamingQuery.awaitTermination()
  }
}
expected output:
Here you see what I did using Vertica:
I created a flex table, loaded it, and used Vertica's flex table function COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW() to get a view.
Turned out to be a single-row table:
-- CREATE the Flex Table
CREATE FLEX TABLE demovals();
-- copy it using the built-in JSON Parser (it creates a map container,
-- with all key-value pairs
COPY demovals FROM '/home/gessnerm/1/Vertica/supp/l.json' PARSER fjsonparser();
-- out vsql:/home/gessnerm/._vfv.sql:1: ROLLBACK 4213: Object "demovals" already exists
-- out Rows Loaded
-- out -------------
-- out 1
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 112.540 ms. All rows formatted: 112.623 ms
-- the function on the next line guesses the data types in the values
-- matching the keys, stores the guessed data types in a second table,
-- and builds a view from all found keys
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('demovals');
-- out COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW
-- out --------------------------------------------------------------------------------------------------------
-- out Please see dbadmin.demovals_keys for updated keys
-- out The view dbadmin.demovals_view is ready for querying
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 467.551 ms. All rows formatted: 467.583 ms
-- now, select from the single-row view on the flex table,
-- one row per column in the report (extended view: "\x" )
\x
SELECT * FROM dbadmin.demovals_view;
-- out -[ RECORD 1 ]---------------+-------------------------------------
-- out _id.channelid | G7k5_-HWRIuF0-afe7q-rQ
-- out _id.line | b443e9c0-fafc-4791-87c9-8e32339c7f3c
-- out _id.period.enddate | 20200927
-- out _id.period.name | 3Q20
-- out _id.period.startdate | 20200629
-- out _id.planid | 5f34dab0c661d8337097afb9
-- out _id.version.$numberlong | 1
-- out demovalues.21.cpm | 0.00
-- out demovalues.21.cpp | 0
-- out demovalues.21.demoid | 21
-- out demovalues.21.grps | 0.00
-- out demovalues.21.imps | 0.00
-- out demovalues.21.rcimps | 0.00
-- out demovalues.21.ue | 0.00
-- out demovalues.21.vpvh | 0.00
-- out demovalues.63.cpm | 0.00
-- out demovalues.63.cpp | 0
-- out demovalues.63.demoid | 63
-- out demovalues.63.grps | 0.00
-- out demovalues.63.imps | 0.00
-- out demovalues.63.rcimps | 0.00
-- out demovalues.63.ue | 0.00
-- out demovalues.63.vpvh | 0.00
-- out demovalues.66.cpm | 0.00
-- out demovalues.66.cpp | 0
-- out demovalues.66.demoid | 66
-- out demovalues.66.grps | 0.00
-- out demovalues.66.imps | 0.00
-- out demovalues.66.rcimps | 0.00
-- out demovalues.66.ue | 0.00
-- out demovalues.66.vpvh | 0.00
-- out hh-imps | 0.00
-- out unitrates.rate | 0.00
-- out unitrates.rcrate | 0.00
-- out unitrates.units.$numberlong | 0
For the children, for example:
CREATE FLEX TABLE children();
TRUNCATE TABLE children;
COPY children FROM '/home/gessnerm/1/Vertica/supp/l.json' PARSER fjsonparser(start_point='demoValues');
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('children');
\x
SELECT * FROM dbadmin.children_view;
-- out Time: First fetch (0 rows): 7.303 ms. All rows formatted: 7.308 ms
-- out Rows Loaded
-- out -------------
-- out 1
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 13.848 ms. All rows formatted: 13.876 ms
-- out COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW
-- out --------------------------------------------------------------------------------------------------------
-- out Please see dbadmin.children_keys for updated keys
-- out The view dbadmin.children_view is ready for querying
-- out (1 row)
-- out
-- out Time: First fetch (1 row): 140.381 ms. All rows formatted: 140.404 ms
-- out -[ RECORD 1 ]---
-- out 21.cpm | 0.00
-- out 21.cpp | 0
-- out 21.demoid | 21
-- out 21.grps | 0.00
-- out 21.imps | 0.00
-- out 21.rcimps | 0.00
-- out 21.ue | 0.00
-- out 21.vpvh | 0.00
-- out 63.cpm | 0.00
-- out 63.cpp | 0
-- out 63.demoid | 63
-- out 63.grps | 0.00
-- out 63.imps | 0.00
-- out 63.rcimps | 0.00
-- out 63.ue | 0.00
-- out 63.vpvh | 0.00
-- out 66.cpm | 0.00
-- out 66.cpp | 0
-- out 66.demoid | 66
-- out 66.grps | 0.00
-- out 66.imps | 0.00
-- out 66.rcimps | 0.00
-- out 66.ue | 0.00
-- out 66.vpvh | 0.00
I am not sure how efficient my code is, but it does the job.
//reading data from json file
val df1 = spark.read.json("src/main/resources/data.json")

// defining schema here.
val schema = StructType(
  StructField("planid", StringType, true) ::
  StructField("periodname", IntegerType, false) ::
  StructField("cpm", StringType, false) ::
  StructField("vpvh", StringType, false) ::
  StructField("imps", StringType, false) ::
  StructField("demoids", StringType, false) :: Nil)

var someDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val app = df1.select("demoValues.*", "_id.planId", "_id.period.name")
//this will have all the dynamic keys as columns
val arr = app.columns

for (i <- 0 to arr.length - 3) {
  println("columnname: " + arr(i))
  // traversing through keys to get the values, e.g. demoValues.63.cpm
  val cpm = "demoValues." + arr(i) + ".cpm"
  val vpvh = "demoValues." + arr(i) + ".vpvh"
  val imps = "demoValues." + arr(i) + ".imps"
  val df2 = df1.select(col("_id.planId"), col("_id.period.name"), col(cpm),
    col(vpvh), col(imps), lit(arr(i)).as("demoids"))
  df2.show(false)
  someDF = someDF.union(df2)
}
someDF.show()
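If you would rather not loop over hard-coded field names for each dynamic key, another option is to declare demoValues as a map and explode it, which yields one row per dynamic key. A rough PySpark sketch (assuming every demo entry carries the same fields; the schema below is hand-written rather than inferred):
from pyspark.sql.functions import col, explode
from pyspark.sql.types import (MapType, StringType, DoubleType,
                               StructType, StructField)

# Schema for one demo entry; demoValues itself is a map keyed by the dynamic ids.
demo_schema = StructType([
    StructField("cpm", DoubleType()), StructField("cpp", DoubleType()),
    StructField("vpvh", DoubleType()), StructField("imps", DoubleType()),
    StructField("rcImps", DoubleType()), StructField("ue", DoubleType()),
    StructField("grps", DoubleType()), StructField("demoId", StringType()),
])
schema = StructType([
    StructField("_id", StructType([
        StructField("planId", StringType()),
        StructField("period", StructType([StructField("name", StringType())])),
    ])),
    StructField("demoValues", MapType(StringType(), demo_schema)),
])

df = spark.read.schema(schema).json("src/main/resources/data.json")
flat = (df.select(col("_id.planId"), col("_id.period.name").alias("periodname"),
                  explode(col("demoValues")).alias("demoid", "demo"))
          .select("planId", "periodname", "demoid",
                  "demo.cpm", "demo.vpvh", "demo.imps"))
flat.show(truncate=False)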

I got an error in pyspark that states: TypeError: 'Column' object is not callable

What I did was try to groupby and collect_list:
Data:
id dates quantity
-- ----- -----
12 2012-03-02 1
32 2012-02-21 4
43 2012-03-02 4
5 2012-12-02 5
42 2012-12-02 7
21 2012-31-02 9
3 2012-01-02 5
2 2012-01-02 5
3 2012-01-02 7
2 2012-01-02 1
3 2012-01-02 3
21 2012-01-02 6
21 2012-03-23 5
21 2012-03-24 3
21 2012-04-25 1
21 2012-07-23 6
21 2012-01-02 8
Code:
new_df = df.groupby('id').agg(F.collect_list("dayid"),F.collect_list("quantity"))
The code seems to work fine for me; the only question I have is that you have used dayid as a column in collect_list, which is not one of your columns. The rest looks fine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sc = spark.sparkContext
dataset1 = [{'id': 12, 'dates': '2012-03-02', 'quantity': 1},
            {'id': 32, 'dates': '2012-02-21', 'quantity': 4},
            {'id': 12, 'dates': '2012-03-02', 'quantity': 1},
            {'id': 32, 'dates': '2012-02-21', 'quantity': 4}]
rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
df1.show()
+----------+---+--------+
| dates| id|quantity|
+----------+---+--------+
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
+----------+---+--------+
new_df = df1.groupby('id').agg(F.collect_list("id"), F.collect_list("quantity"))
new_df.show()
+---+----------------+----------------------+
| id|collect_list(id)|collect_list(quantity)|
+---+----------------+----------------------+
| 32| [32, 32]| [4, 4]|
| 12| [12, 12]| [1, 1]|
+---+----------------+----------------------+
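A small refinement (not part of the original answer): alias the aggregated columns so the result does not carry the generated collect_list(...) names, for example with your dates column:
new_df = df1.groupby('id').agg(
    F.collect_list('dates').alias('dates'),
    F.collect_list('quantity').alias('quantities'))
new_df.show(truncate=False)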

create data frame in spark from unparsed text string

I am using Scala and Spark to analyze some data.
Sorry, I am an absolute novice in this area.
I have data in the following format (below).
I want to create an RDD to filter, group, and transform the data.
Currently I have an RDD with a list of unparsed strings.
I have created it from rawData, a list of strings:
val rawData (this is a ListBuffer[String])
val rdd = sc.parallelize(rawData)
How can I create a data set to manipulate the data?
I want to have objects in the RDD with named fields like obj.name, obj.year and so on.
What is the right approach?
Should I create a data frame for this?
The raw data strings look like this: a list of strings with space-separated values.
Column meaning: "name", "year", "month", "tmax", "tmin", "afdays", "rainmm", "sunhours"
aberporth 1941 10 --- --- --- 106.2 ---
aberporth 1941 11 --- --- --- 92.3 ---
aberporth 1941 12 --- --- --- 86.5 ---
aberporth 1942 1 5.8 2.1 --- 114.0 58.0
aberporth 1942 2 4.2 -0.6 --- 13.8 80.3
aberporth 1942 3 9.7 3.7 --- 58.0 117.9
aberporth 1942 4 13.1 5.3 --- 42.5 200.1
aberporth 1942 5 14.0 6.9 --- 101.1 215.1
aberporth 1942 6 16.2 9.9 --- 2.3 269.3
aberporth 1942 7 17.4 11.3 12 70.2* 185.0
aberporth 1942 8 18.7 12.3 5- 78.5 141.9
aberporth 1942 9 16.4 10.7 123 146.8 129.1#
aberporth 1942 10 13.1 8.2 125 131.1 82.1l
--- means no data; I think I can put 0 in this column.
70.2*, 129.1#, 82.1l: the *, # and l characters here should be filtered out.
Please point me in the right direction.
I have found one possible solution here:
https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393
This example looks good:
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)
val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
How can I transform a list of strings into a Seq of Row?
You can read the data as a text file, replace --- with 0, and either remove the special characters or filter them out (I have replaced them in the example below).
Create a case class to represent the data
case class Data(
name: String, year: String, month: Int, tmax: Double,
tmin: Double, afdays: Int, rainmm: Double, sunhours: Double
)
Read a file
val data = spark.read.textFile("file path") // read as a text file
  .map(_.replace("---", "0").replaceAll("-|#|\\*", "")) // replace special characters
  .map(_.split("\\s+"))
  .map(x => // create a Data object for each record
    Data(x(0), x(1), x(2).toInt, x(3).toDouble, x(4).toDouble, x(5).toInt, x(6).toDouble, x(7).replace("l", "").toDouble)
  )
Now you get a Dataset[Data] which is a dataset parsed from the text.
Output:
+---------+----+-----+----+----+------+------+--------+
|name |year|month|tmax|tmin|afdays|rainmm|sunhours|
+---------+----+-----+----+----+------+------+--------+
|aberporth|1941|10 |0.0 |0.0 |0 |106.2 |0.0 |
|aberporth|1941|11 |0.0 |0.0 |0 |92.3 |0.0 |
|aberporth|1941|12 |0.0 |0.0 |0 |86.5 |0.0 |
|aberporth|1942|1 |5.8 |2.1 |0 |114.0 |58.0 |
|aberporth|1942|2 |4.2 |0.6 |0 |13.8 |80.3 |
|aberporth|1942|3 |9.7 |3.7 |0 |58.0 |117.9 |
|aberporth|1942|4 |13.1|5.3 |0 |42.5 |200.1 |
|aberporth|1942|5 |14.0|6.9 |0 |101.1 |215.1 |
|aberporth|1942|6 |16.2|9.9 |0 |2.3 |269.3 |
|aberporth|1942|7 |17.4|11.3|12 |70.2 |185.0 |
|aberporth|1942|8 |18.7|12.3|5 |78.5 |141.9 |
|aberporth|1942|9 |16.4|10.7|123 |146.8 |129.1 |
|aberporth|1942|10 |13.1|8.2 |125 |131.1 |82.1 |
+---------+----+-----+----+----+------+------+--------+
I hope this helps!

need help to compare two columns in spark scala

I have a Spark dataframe like this:
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
4 1 test3 3 0, 1, 2
5 3 test4 0 0, 1, 2
11 2 test Yes Yes, No
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
55 3 test4 0 0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")
I want to check whether attr_value exists in attr_valuelist; if it does not exist, then keep only those rows:
id1 id2 attrname attr_value attr_valuelist
4 1 test3 3 0, 1, 2
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
You can simply do the following with contains on your dataframe:
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
You should have the following output:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore case, then you can simply use the lower function as:
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
You should have:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
You can define a custom function, a user defined function (UDF) in Spark, to test whether the value of one column is contained in the value of the other column, like this:
def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
You can tweak the contains function however you want, and then you can filter your dataframe like this:
df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
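For anyone doing the same thing from PySpark, the equivalent filter without a UDF looks roughly like this (a sketch, assuming the same column names):
from pyspark.sql import functions as F

df.filter(~F.col("attr_valuelist").contains(F.col("attr_value"))).show(truncate=False)

# case-insensitive variant
df.filter(~F.lower(F.col("attr_valuelist")).contains(F.lower(F.col("attr_value")))).show(truncate=False)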