Pyspark create temp view from dataframe - pyspark

I am trying to read thorugh spark.sql a huge csv.
I created a dataframe from a CSV, the dataframe seems created correctly.
I read the schema and I can perform select and filter.
I would like to create a temp view to execute same research using sql, I am more comfortable with it but the temp view seems created on the csv header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]

Spark operations like sql() do not process anything by default. You need to add .show() or .collect() to get results.

Related

PySpark - When Otherwise - Condition should be a Column

I have a dataframe as below
root
|-- tasin: string (nullable = true)
|-- advertiser_id: decimal(38,10) (nullable = true)
|-- predicted_sp_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sp_impressions: decimal(38,10) (nullable = true)
|-- predicted_sp_clicks: decimal(38,10) (nullable = true)
|-- predicted_sdc_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sdc_impressions: decimal(38,10) (nullable = true)
|-- predicted_sdc_clicks: decimal(38,10) (nullable = true)
|-- predicted_sda_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sda_impressions: decimal(38,10) (nullable = true)
|-- predicted_sda_clicks: decimal(38,10) (nullable = true)
|-- region_id: integer (nullable = true)
|-- marketplace_id: integer (nullable = true)
|-- dataset_date: date (nullable = true)
Now I am using the below select statement. I am looking for presence of a column name and if present select the value or else fill with Null. The dataframe is stored in df variable.
scores_df1 = df.select(
col('marketplace_id'),
col('region_id'),
col('tasin'),
col('advertiser_id'),
col('predicted_sp_sold_units'),
col('predicted_sp_impressions'),
col('predicted_sp_clicks'),
col('predicted_sdc_sold_units'),
col('predicted_sdc_impressions'),
col('predicted_sdc_clicks'),
col('predicted_sda_sold_units'),
col('predicted_sda_impressions'),
col('predicted_sda_clicks'),
when('sdcr_score' in df.columns is True, col('sdcr_score')).otherwise(lit(None)).alias('sdcr_score'),
when('sdar_score' in df.columns is True, col('sdar_score')).otherwise(lit(None)).alias('sdar_score')
)
I am receiving error <class 'TypeError'>: condition should be a Column
Please advice what is wrong
the phrase 'sdcr_score' in df.columns is True is evaluated in Python before moving to spark and return True/False.
So what you are passing to spark is: when(True, ..., ...).
When is expecting the first argument to be a Column that is evaluated to a True/False statement and not a Pythonic Bool.
You can wrap the argument with lit() function which will basically pass a True/False column to all arguments of the when clause.

Not able to add data through with column using case statement pyspark

Code like as below:
#To get deal keys
schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
pyspark code as below df=lt_online.withColumn("dealkeys",lit('')).withColumn("dealAttributes",lit(''))
start=[]
start_dict={}
dealatt=["Charge_No","Status","Validity_from","Validity_to"]
dealkeys=["Charge_Type","Direction"]
for index,row in lt_online.toPandas().iterrows():
start=[]
start_dict={}
key = row['Charge_No']
for i in dealatt:
#final = row[i]
start_dict[i]=row[i]
df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key , str(start_dict)).otherwise(col('dealkeys')))
for i in dealkeys:
#key = row['Charge_No']
final = {"keyname" : i,"value" : row[i],"description":".."}
start.append(final)
#final_val= {"value" : row['Charge_Type']}
#start.append(final_val)
#df3=lt_online.withColumn("new_column",str(start))
print(key,start_dict)
df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key , str(start)).otherwise(col('dealAttributes')))
when i run DF3 dataframe dealAttributes and dealkeys old data got blank and latest record only inserted.
Please see the screenshot
Since the lt_online dataframe is large, I have selected only the required columns from it. The following is the schema of the lt_online dataframe that I have selected.
The problem arrises because you are not changing df in place, but assigning it to df_deal_att. This will update df_deal_att (also df3) only for the current row in loop (because df is not changing in the entire process). Using df_deal_att.show() inside the loop will help in understanding this.
Use the following code instead to get the desired output:
for index,row in lt_online.toPandas().iterrows():
start=[]
start_dict={}
key = row['Charge_No']
for i in dealatt:
start_dict[i]=row[i]
#ASSIGN TO df INSTEAD OF df_deal_att
df = df.withColumn('dealkeys', when(col('Charge_No') == key , str(start_dict)).otherwise(col('dealkeys')))
for i in dealkeys:
final = {"keyname" : i,"value" : row[i],"description":".."}
start.append(final)
#USE df and ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
df = df.withColumn('dealAttributes', when(col('Charge_No') == key , str(start)).otherwise(col('dealAttributes')))
Assigning the df dataframe after adding the column value based on condition to df itself (instead of using df_deal_att or df3) helps in solving the issue. The following image reflects the output achieved after using the above code.

SparkException - Chi-square test expect categorical values

I am new to pyspark and I am trying to select the best features using chisqselector.
I have a dataset of 78 features. The steps I did are the following
1. Dropped nan, applied imputer
2. Converted the string label column to int using stringindexer.
3. Applied Vector Assembler
4. Vector inderxer
5. Standard Scaler
6. Applied Chisqselector, produced error.
As mentioned in the post (SparkException: Chi-square test expect factors) I applied vector indexer, still its not working. What are the data preparation steps I should do for chisqselector. Thanks in Advance.
I am using a security dataset CICIDS2017 with 78 features and label is a string.
CODE
````
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("exp").getOrCreate()
raw_data = spark.read.csv("SCX.csv", inferSchema = True, header =
True)
raw_data.na.drop().show()
cols=raw_data.columns
cols.remove("Label")
from pyspark.ml.feature import Imputer
imputer=Imputer(inputCols=['Destination Port',
'FlowDuration',
'TotalFwdPackets',
'TotalBackwardPackets',
'TotalLengthofFwdPackets',
'TotalLengthofBwdPackets'
],outputCols=['Destination Port',
'FlowDuration',
'TotalFwdPackets',
'TotalBackwardPackets',
'TotalLengthofFwdPackets',
])
model=imputer.fit(raw_data)
raw_data1=model.transform(raw_data)
raw_data1.show(5)
#RAW DATA2 => After doing String indexer on label column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Label', outputCol='_LabelIndexed')
raw_data2 = indexer.fit(raw_data1).transform(raw_data1)
#RAW DATA3 => After applying vector assembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=cols,outputCol="features")
# Now let us use the transform method to transform our dataset
raw_data3=assembler.transform(raw_data2)
raw_data3.select("features").show(truncate=False)
#RAW DATA 4 => After applying Vector Indexer
from pyspark.ml.feature import VectorIndexer
vindexer = VectorIndexer(inputCol="features", outputCol="vindexed",
maxCategories=9999)
vindexerModel = vindexer.fit(raw_data3)
categoricalFeatures = vindexerModel.categoryMaps
print("Chose %d categorical features: %s" %
(len(categoricalFeatures), ", ".join(str(k) for k in
categoricalFeatures.keys())))
# Create new column "indexed" with categorical values transformed to
indices
raw_data4 = vindexerModel.transform(raw_data3)
raw_data4.show()
#RAW DATA 5 => After applying Standard Scaler
from pyspark.ml.feature import StandardScaler
standardscaler=StandardScaler().setInputCol("vindexed").setOutputCol
("Scaled_features")
raw_data5=standardscaler.fit(raw_data4).transform(raw_data4)
train, test = raw_data5.randomSplit([0.8, 0.2], seed=456)
# Feature selection using chisquareSelector
from pyspark.ml.feature import ChiSqSelector
chi = ChiSqSelector(featuresCol='Scaled_features',
outputCol='Selected_f',labelCol='_LabelIndexed',fpr=0.05)
train=chi.fit(train).transform(train)
#test=chi.fit(test).transform(test)
#test.select("Aspect").show(5,truncate=False)
````
But this code returns error message while fiting
Py4JJavaError: An error occurred while calling o568.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 32.0 failed 1 times, most recent failure: Lost task 2.0 in stage 32.0 (TID 69, 192.168.1.15, executor driver): org.apache.spark.SparkException: *****Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 14.*****
````
raw_data.printSchema()
````
|-- Destination Port: integer (nullable = true)
|-- FlowDuration: integer (nullable = true)
|-- TotalFwdPackets: integer (nullable = true)
|-- TotalBackwardPackets: integer (nullable = true)
|-- TotalLengthofFwdPackets: integer (nullable = true)
|-- TotalLengthofBwdPackets: integer (nullable = true)
|-- FwdPacketLengthMax: integer (nullable = true)
|-- FwdPacketLengthMin: integer (nullable = true)
|-- FwdPacketLengthMean: double (nullable = true)
|-- FwdPacketLengthStd: double (nullable = true)
|-- BwdPacketLengthMax: integer (nullable = true)
|-- BwdPacketLengthMin: integer (nullable = true)
|-- BwdPacketLengthMean: double (nullable = true)
|-- BwdPacketLengthStd: double (nullable = true)
|-- FlowBytesPersec: double (nullable = true)
|-- FlowPacketsPersec: double (nullable = true)
|-- FlowIATMean: double (nullable = true)
|-- FlowIATStd: double (nullable = true)
|-- FlowIATMax: integer (nullable = true)
|-- FlowIATMin: integer (nullable = true)
|-- FwdIATTotal: integer (nullable = true)
|-- FwdIATMean: double (nullable = true)
|-- FwdIATStd: double (nullable = true)
|-- FwdIATMax: integer (nullable = true)
|-- FwdIATMin: integer (nullable = true)
|-- BwdIATTotal: integer (nullable = true)
|-- BwdIATMean: double (nullable = true)
|-- BwdIATStd: double (nullable = true)
|-- BwdIATMax: integer (nullable = true)
|-- BwdIATMin: integer (nullable = true)
|-- FwdPSHFlags: integer (nullable = true)
|-- BwdPSHFlags: integer (nullable = true)
|-- FwdURGFlags: integer (nullable = true)
|-- BwdURGFlags: integer (nullable = true)
|-- FwdHeaderLength_1: integer (nullable = true)
|-- BwdHeaderLength: integer (nullable = true)
|-- FwdPackets/s: double (nullable = true)
|-- BwdPackets/s: double (nullable = true)
|-- MinPacketLength: integer (nullable = true)
|-- MaxPacketLength: integer (nullable = true)
|-- PacketLengthMean: double (nullable = true)
|-- PacketLengthStd: double (nullable = true)
|-- PacketLengthVariance: double (nullable = true)
|-- FINFlagCount: integer (nullable = true)
|-- SYNFlagCount: integer (nullable = true)
|-- RSTFlagCount: integer (nullable = true)
|-- PSHFlagCount: integer (nullable = true)
|-- ACKFlagCount: integer (nullable = true)
|-- URGFlagCount: integer (nullable = true)
|-- CWEFlagCount: integer (nullable = true)
|-- ECEFlagCount: integer (nullable = true)
|-- Down/UpRatio: integer (nullable = true)
|-- AveragePacketSize: double (nullable = true)
|-- AvgFwdSegmentSize: double (nullable = true)
|-- AvgBwdSegmentSize: double (nullable = true)
|-- FwdHeaderLength_2: integer (nullable = true)
|-- FwdAvgBytes/Bulk: integer (nullable = true)
|-- FwdAvgPackets/Bulk: integer (nullable = true)
|-- FwdAvgBulkRate: integer (nullable = true)
|-- BwdAvgBytes/Bulk: integer (nullable = true)
|-- BwdAvgPackets/Bulk: integer (nullable = true)
|-- BwdAvgBulkRate: integer (nullable = true)
|-- SubflowFwdPackets: integer (nullable = true)
|-- SubflowFwdBytes: integer (nullable = true)
|-- SubflowBwdPackets: integer (nullable = true)
|-- SubflowBwdBytes: integer (nullable = true)
|-- Init_Win_bytes_forward: integer (nullable = true)
|-- Init_Win_bytes_backward: integer (nullable = true)
|-- act_data_pkt_fwd: integer (nullable = true)
|-- min_seg_size_forward: integer (nullable = true)
|-- ActiveMean: double (nullable = true)
|-- ActiveStd: double (nullable = true)
|-- ActiveMax: integer (nullable = true)
|-- ActiveMin: integer (nullable = true)
|-- IdleMean: double (nullable = true)
|-- IdleStd: double (nullable = true)
|-- IdleMax: integer (nullable = true)
|-- IdleMin: integer (nullable = true)
|-- Label: string (nullable = true)
Dataset Reference - Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018
For your information, I was using all the features, and inorder to reduce the code content here I am showing only few 6 columns while imputing.

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Datset.select don't work in my case)

My question is if there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions but not Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame won't be updated automatically. The output of applying Dataset.printSchema method on the new DataFrame is still original:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is if there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define schema for newDataFrame.
See following example.
val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random
val encoderForNewDF = RowEncoder(StructType(Array(
StructField("name", StringType),
StructField("num", IntegerType),
StructField("city", StringType)
)))
val newDF = originalDF.mapPartitions { partition =>
partition.map{ row =>
val name = row.getAs[String]("name")
val city = row.getAs[String]("city")
val num = r.nextInt
Row.fromSeq(Array[Any](name, num, city))
}
} (encoderForNewDF)
newDF.printSchema()
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html

Spark 1.6: drop column in DataFrame with escaped column names

Trying to drop a column in a DataFrame, but i have column names with dots in them, which I escaped.
Before I escape, my schema looks like this:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- raw.hourOfDay: long (nullable = true)
|-- raw.minOfDay: long (nullable = true)
|-- raw.dayOfWeek: long (nullable = true)
|-- raw.sensor2: long (nullable = true)
If I try to drop a column, I get:
df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
Note that I'm not even trying to drop on the columns with dots in name.
Since I couldn't seem to do much without escaping the column names, I converted the schema to:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
but that doesn't seem to help. I still get the same error.
I tried escaping all column names, and drop using the escaped name, but that doesn't work either.
root
|-- `user_id`: long (nullable = true)
|-- `hourOfWeek`: string (nullable = true)
|-- `observed`: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
Is there another way to drop a column that would not fail on this type of data?
Alright, I seem to have found the solution after all:
df.drop(df.col("raw.hourOfWeek")) seems to work
val data = df.drop("Customers");
will work fine for normal columns
val new = df.drop(df.col("old.column"));