SparkException - Chi-square test expect categorical values - pyspark

I am new to PySpark and I am trying to select the best features using ChiSqSelector.
I have a dataset of 78 features. The steps I followed are:
1. Dropped NaN values, applied Imputer
2. Converted the string label column to int using StringIndexer
3. Applied VectorAssembler
4. Applied VectorIndexer
5. Applied StandardScaler
6. Applied ChiSqSelector, which produced the error
As mentioned in the post (SparkException: Chi-square test expect factors) I applied VectorIndexer, but it is still not working. What data preparation steps should I follow for ChiSqSelector? Thanks in advance.
I am using a security dataset, CICIDS2017, with 78 features; the label is a string.
CODE
````
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("exp").getOrCreate()
raw_data = spark.read.csv("SCX.csv", inferSchema=True, header=True)
raw_data = raw_data.na.drop()
raw_data.show()
cols = raw_data.columns
cols.remove("Label")
from pyspark.ml.feature import Imputer
imputer = Imputer(inputCols=['Destination Port',
                             'FlowDuration',
                             'TotalFwdPackets',
                             'TotalBackwardPackets',
                             'TotalLengthofFwdPackets',
                             'TotalLengthofBwdPackets'],
                  outputCols=['Destination Port',
                              'FlowDuration',
                              'TotalFwdPackets',
                              'TotalBackwardPackets',
                              'TotalLengthofFwdPackets',
                              'TotalLengthofBwdPackets'])
model = imputer.fit(raw_data)
raw_data1 = model.transform(raw_data)
raw_data1.show(5)
# RAW DATA2 => after StringIndexer on the label column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Label', outputCol='_LabelIndexed')
raw_data2 = indexer.fit(raw_data1).transform(raw_data1)
# RAW DATA3 => after VectorAssembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=cols, outputCol="features")
# Now let us use the transform method to transform our dataset
raw_data3 = assembler.transform(raw_data2)
raw_data3.select("features").show(truncate=False)
# RAW DATA4 => after VectorIndexer
from pyspark.ml.feature import VectorIndexer
vindexer = VectorIndexer(inputCol="features", outputCol="vindexed",
                         maxCategories=9999)
vindexerModel = vindexer.fit(raw_data3)
categoricalFeatures = vindexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures),
       ", ".join(str(k) for k in categoricalFeatures.keys())))
# Create new column "vindexed" with categorical values transformed to indices
raw_data4 = vindexerModel.transform(raw_data3)
raw_data4.show()
# RAW DATA5 => after StandardScaler
from pyspark.ml.feature import StandardScaler
standardscaler = StandardScaler().setInputCol("vindexed").setOutputCol("Scaled_features")
raw_data5 = standardscaler.fit(raw_data4).transform(raw_data4)
train, test = raw_data5.randomSplit([0.8, 0.2], seed=456)
# Feature selection using ChiSqSelector
from pyspark.ml.feature import ChiSqSelector
chi = ChiSqSelector(featuresCol='Scaled_features', outputCol='Selected_f',
                    labelCol='_LabelIndexed', fpr=0.05)
train = chi.fit(train).transform(train)
#test=chi.fit(test).transform(test)
#test.select("Aspect").show(5,truncate=False)
````
But this code raises an error while fitting:
Py4JJavaError: An error occurred while calling o568.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 32.0 failed 1 times, most recent failure: Lost task 2.0 in stage 32.0 (TID 69, 192.168.1.15, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 14.
````
raw_data.printSchema()
````
|-- Destination Port: integer (nullable = true)
|-- FlowDuration: integer (nullable = true)
|-- TotalFwdPackets: integer (nullable = true)
|-- TotalBackwardPackets: integer (nullable = true)
|-- TotalLengthofFwdPackets: integer (nullable = true)
|-- TotalLengthofBwdPackets: integer (nullable = true)
|-- FwdPacketLengthMax: integer (nullable = true)
|-- FwdPacketLengthMin: integer (nullable = true)
|-- FwdPacketLengthMean: double (nullable = true)
|-- FwdPacketLengthStd: double (nullable = true)
|-- BwdPacketLengthMax: integer (nullable = true)
|-- BwdPacketLengthMin: integer (nullable = true)
|-- BwdPacketLengthMean: double (nullable = true)
|-- BwdPacketLengthStd: double (nullable = true)
|-- FlowBytesPersec: double (nullable = true)
|-- FlowPacketsPersec: double (nullable = true)
|-- FlowIATMean: double (nullable = true)
|-- FlowIATStd: double (nullable = true)
|-- FlowIATMax: integer (nullable = true)
|-- FlowIATMin: integer (nullable = true)
|-- FwdIATTotal: integer (nullable = true)
|-- FwdIATMean: double (nullable = true)
|-- FwdIATStd: double (nullable = true)
|-- FwdIATMax: integer (nullable = true)
|-- FwdIATMin: integer (nullable = true)
|-- BwdIATTotal: integer (nullable = true)
|-- BwdIATMean: double (nullable = true)
|-- BwdIATStd: double (nullable = true)
|-- BwdIATMax: integer (nullable = true)
|-- BwdIATMin: integer (nullable = true)
|-- FwdPSHFlags: integer (nullable = true)
|-- BwdPSHFlags: integer (nullable = true)
|-- FwdURGFlags: integer (nullable = true)
|-- BwdURGFlags: integer (nullable = true)
|-- FwdHeaderLength_1: integer (nullable = true)
|-- BwdHeaderLength: integer (nullable = true)
|-- FwdPackets/s: double (nullable = true)
|-- BwdPackets/s: double (nullable = true)
|-- MinPacketLength: integer (nullable = true)
|-- MaxPacketLength: integer (nullable = true)
|-- PacketLengthMean: double (nullable = true)
|-- PacketLengthStd: double (nullable = true)
|-- PacketLengthVariance: double (nullable = true)
|-- FINFlagCount: integer (nullable = true)
|-- SYNFlagCount: integer (nullable = true)
|-- RSTFlagCount: integer (nullable = true)
|-- PSHFlagCount: integer (nullable = true)
|-- ACKFlagCount: integer (nullable = true)
|-- URGFlagCount: integer (nullable = true)
|-- CWEFlagCount: integer (nullable = true)
|-- ECEFlagCount: integer (nullable = true)
|-- Down/UpRatio: integer (nullable = true)
|-- AveragePacketSize: double (nullable = true)
|-- AvgFwdSegmentSize: double (nullable = true)
|-- AvgBwdSegmentSize: double (nullable = true)
|-- FwdHeaderLength_2: integer (nullable = true)
|-- FwdAvgBytes/Bulk: integer (nullable = true)
|-- FwdAvgPackets/Bulk: integer (nullable = true)
|-- FwdAvgBulkRate: integer (nullable = true)
|-- BwdAvgBytes/Bulk: integer (nullable = true)
|-- BwdAvgPackets/Bulk: integer (nullable = true)
|-- BwdAvgBulkRate: integer (nullable = true)
|-- SubflowFwdPackets: integer (nullable = true)
|-- SubflowFwdBytes: integer (nullable = true)
|-- SubflowBwdPackets: integer (nullable = true)
|-- SubflowBwdBytes: integer (nullable = true)
|-- Init_Win_bytes_forward: integer (nullable = true)
|-- Init_Win_bytes_backward: integer (nullable = true)
|-- act_data_pkt_fwd: integer (nullable = true)
|-- min_seg_size_forward: integer (nullable = true)
|-- ActiveMean: double (nullable = true)
|-- ActiveStd: double (nullable = true)
|-- ActiveMax: integer (nullable = true)
|-- ActiveMin: integer (nullable = true)
|-- IdleMean: double (nullable = true)
|-- IdleStd: double (nullable = true)
|-- IdleMax: integer (nullable = true)
|-- IdleMin: integer (nullable = true)
|-- Label: string (nullable = true)
Dataset Reference - Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018
For your information, I was using all the features; to reduce the amount of code shown here, I am imputing only 6 columns in the snippet above.

Related

Not able to add data through with column using case statement pyspark

Code like as below:
#To get deal keys
schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
The pyspark code is as below:
````
df = lt_online.withColumn("dealkeys", lit('')).withColumn("dealAttributes", lit(''))
start = []
start_dict = {}
dealatt = ["Charge_No", "Status", "Validity_from", "Validity_to"]
dealkeys = ["Charge_Type", "Direction"]
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        #final = row[i]
        start_dict[i] = row[i]
    df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        #key = row['Charge_No']
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    #final_val = {"value": row['Charge_Type']}
    #start.append(final_val)
    #df3 = lt_online.withColumn("new_column", str(start))
    print(key, start_dict)
    df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
````
When I inspect the df3 dataframe, the old data in dealAttributes and dealkeys has gone blank and only the latest record is inserted.
Please see the screenshot
Since the lt_online dataframe is large, I have selected only the required columns from it. The following is the schema of the lt_online dataframe that I have selected.
The problem arises because you are not updating df in place, but assigning the result to df_deal_att. This updates df_deal_att (and hence df3) only for the current row of the loop, because df never changes during the entire process. Using df_deal_att.show() inside the loop will help in understanding this.
Use the following code instead to get the desired output:
````
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        start_dict[i] = row[i]
    # ASSIGN TO df INSTEAD OF df_deal_att
    df = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    # USE df AND ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
    df = df.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
````
Assigning the result back to df itself after adding the conditional column value (instead of using df_deal_att or df3) solves the issue. The following image reflects the output achieved with the above code.

Pyspark create temp view from dataframe

I am trying to query a huge CSV through spark.sql.
I created a dataframe from the CSV, and it seems to be created correctly:
I can read the schema and perform select and filter operations.
I would like to create a temp view to run the same queries in SQL, which I am more comfortable with, but the temp view seems to be created from the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark operations like sql() are lazy and do not process anything by default. You need to add an action such as .show() or .collect() to get results.

How to get correlation matrix for Scala dataframe

I have a Scala dataframe with numeric data:
df2_num.printSchema
root
|-- ot2_total_sum: decimal(38,18) (nullable = true)
|-- s42_3: decimal(38,0) (nullable = true)
|-- s109_5: decimal(38,0) (nullable = true)
|-- is_individual: decimal(38,0) (nullable = true)
|-- s118_5: decimal(38,0) (nullable = true)
|-- s46_3: decimal(38,0) (nullable = true)
|-- ot1_nds_10: decimal(38,18) (nullable = true)
|-- s45_3: decimal(38,0) (nullable = true)
|-- s10_3: decimal(38,0) (nullable = true)
|-- nb: decimal(38,0) (nullable = true)
|-- s80_5: decimal(38,0) (nullable = true)
|-- ot2_nds_10: decimal(38,18) (nullable = true)
|-- pr: decimal(38,0) (nullable = true)
|-- IP: integer (nullable = true)
|-- s70_5: decimal(38,0) (nullable = true)
|-- ot1_sum_without_nds: decimal(38,18) (nullable = true)
|-- s109_3: decimal(38,0) (nullable = true)
|-- s60_3: decimal(38,0) (nullable = true)
|-- s190_3: decimal(38,0) (nullable = true)
|-- ot3_total_sum: decimal(38,18) (nullable = true)
|-- s130_3: decimal(38,0) (nullable = true)
|-- region: integer (nullable = true)
|-- s170_3: decimal(38,0) (nullable = true)
|-- s20_3: decimal(38,0) (nullable = true)
|-- s90_5: decimal(38,0) (nullable = true)
|-- ot2_nds_20: decimal(38,18) (nullable = true)
|-- s70_3: decimal(38,0) (nullable = true)
|-- ot1_nds_0: decimal(38,18) (nullable = true)
|-- s200_3: decimal(38,0) (nullable = true)
|-- ot2_sum_without_nds: decimal(38,18) (nullable = true)
|-- ot1_nds_20: decimal(38,18) (nullable = true)
|-- s120_3: decimal(38,0) (nullable = true)
|-- s150_3: decimal(38,0) (nullable = true)
|-- s40_3: decimal(38,0) (nullable = true)
|-- s10_5: decimal(38,0) (nullable = true)
|-- nalog: decimal(38,0) (nullable = true)
|-- ot1_total_sum: decimal(38,18) (nullable = true)
I need to get correlation matrix for all columns of this dataframe.
I've tried to use org.apache.spark.mllib.stat.Statistics.corr. It requires RDD data, so I've converted my dataframe to an RDD:
val df2_num_rdd = df2_num.rdd
Then I try to use Statistics.corr and get an error:
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
<console>:82: error: overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double])scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double])scala.Double <and>
(X: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],method: String)org.apache.spark.mllib.linalg.Matrix
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], String)
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
So how do I need to prepare my data for Statistics.corr?
Assuming you're running a relatively recent version of Spark, I suggest using org.apache.spark.ml.stat.Correlation.corr instead.
First, you have to assemble the columns for which you want to compute correlation, and then you can get correlations as a dataframe. From here, you can fetch the first row and transform it to whatever suits your needs.
Here is an example :
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
val assembled: DataFrame = new VectorAssembler()
.setInputCols(df2_num.columns)
.setOutputCol("correlations")
.transform(df2_num)
val correlations: DataFrame =
Correlation.corr(assembled, column = "correlations", method = "pearson")
Here are some useful links for guides related to this approach :
Spark MLlib Guide : Correlation
Spark MLlib Guide : VectorAssembler
.getAs[DenseMatrix] in correlations.first.getAs[DenseMatrix] is throwing an error.
@H.Leger - How would you convert the final result to a proper matrix of this format?

| Column | c1   | c2   | c3   |
|--------|------|------|------|
| c1     | 1    | 0.97 | 0.92 |
| c2     | 0.97 | 1    | 0.94 |
| c3     | 0.92 | 0.94 | 1    |

Cannot save spark dataframe as CSV

I am attempting to save a Spark DataFrame as a CSV. I have looked at numerous posts and guides and, for some reason, am still running into an issue. The code I am using to do this is
endframe.coalesce(1).
write.
mode("append").
csv("file:///home/X/Code/output/output.csv")
I have also tried including .format("com.databricks.spark.csv"), as well as changing .csv() to .save(), and strangely none of these work. The most unusual part is that running this code creates an empty folder called "output.csv" in the output directory.
The error message that spark gives is
Job aborted due to stage failure:
Task 0 in stage 281.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 281.0 (TID 22683, X.x.local, executor 4): org.apache.spark.SparkException:
Task failed while writing rows.
I have verified that the dataframe schema is properly initialized. When I use .format, I do not import com.databricks.spark.csv, but I do not think that is the problem. Any advice would be appreciated.
The schema is as follows:
|-- kwh: double (nullable = true)
|-- qh_end: double (nullable = true)
|-- cdh70: double (nullable = true)
|-- norm_hbu: double (nullable = true)
|-- precool_counterprecoolevent_id6: double (nullable = true)
|-- precool_counterprecoolevent_id7: double (nullable = true)
|-- precool_counterprecoolevent_id8: double (nullable = true)
|-- precool_counterprecoolevent_id9: double (nullable = true)
|-- event_id10event_counterevent: double (nullable = true)
|-- event_id2event_counterevent: double (nullable = true)
|-- event_id3event_counterevent: double (nullable = true)
|-- event_id4event_counterevent: double (nullable = true)
|-- event_id5event_counterevent: double (nullable = true)
|-- event_id6event_counterevent: double (nullable = true)
|-- event_id7event_counterevent: double (nullable = true)
|-- event_id8event_counterevent: double (nullable = true)
|-- event_id9event_counterevent: double (nullable = true)
|-- event_idTestevent_counterevent: double (nullable = true)
|-- event_id10snapback_countersnapback: double (nullable = true)
|-- event_id2snapback_countersnapback: double (nullable = true)
|-- event_id3snapback_countersnapback: double (nullable = true)
|-- event_id4snapback_countersnapback: double (nullable = true)
|-- event_id5snapback_countersnapback: double (nullable = true)
|-- event_id6snapback_countersnapback: double (nullable = true)
|-- event_id7snapback_countersnapback: double (nullable = true)
|-- event_id8snapback_countersnapback: double (nullable = true)
|-- event_id9snapback_countersnapback: double (nullable = true)
|-- event_idTestsnapback_countersnapback: double (nullable = true)

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Datset.select don't work in my case)

My question is if there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions but not Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame won't be updated automatically. The output of applying Dataset.printSchema method on the new DataFrame is still original:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is if there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
````
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
} (encoderForNewDF)

newDF.printSchema()
````
newDF.printSchema()
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html