Cannot save Spark DataFrame as CSV - Scala

I am attempting to save a Spark DataFrame as a CSV. I have looked up numerous posts and guides and am, for some reason, still running into an issue. The code I am using to do this is:
endframe
  .coalesce(1)
  .write
  .mode("append")
  .csv("file:///home/X/Code/output/output.csv")
I have also tried including .format("com.databricks.spark.csv"), as well as changing the .csv() to a .save(), and strangely none of these work. The most unusual part is that running this code creates an empty folder called "output.csv" in the output directory.
The error message that Spark gives is:
Job aborted due to stage failure:
Task 0 in stage 281.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 281.0 (TID 22683, X.x.local, executor 4): org.apache.spark.SparkException:
Task failed while writing rows.
I have verified that the DataFrame schema is properly initialized. However, when I use .format("com.databricks.spark.csv"), I do not import com.databricks.spark.csv; I do not think that is the problem, though. Any advice on this would be appreciated.
The schema is as follows:
|-- kwh: double (nullable = true)
|-- qh_end: double (nullable = true)
|-- cdh70: double (nullable = true)
|-- norm_hbu: double (nullable = true)
|-- precool_counterprecoolevent_id6: double (nullable = true)
|-- precool_counterprecoolevent_id7: double (nullable = true)
|-- precool_counterprecoolevent_id8: double (nullable = true)
|-- precool_counterprecoolevent_id9: double (nullable = true)
|-- event_id10event_counterevent: double (nullable = true)
|-- event_id2event_counterevent: double (nullable = true)
|-- event_id3event_counterevent: double (nullable = true)
|-- event_id4event_counterevent: double (nullable = true)
|-- event_id5event_counterevent: double (nullable = true)
|-- event_id6event_counterevent: double (nullable = true)
|-- event_id7event_counterevent: double (nullable = true)
|-- event_id8event_counterevent: double (nullable = true)
|-- event_id9event_counterevent: double (nullable = true)
|-- event_idTestevent_counterevent: double (nullable = true)
|-- event_id10snapback_countersnapback: double (nullable = true)
|-- event_id2snapback_countersnapback: double (nullable = true)
|-- event_id3snapback_countersnapback: double (nullable = true)
|-- event_id4snapback_countersnapback: double (nullable = true)
|-- event_id5snapback_countersnapback: double (nullable = true)
|-- event_id6snapback_countersnapback: double (nullable = true)
|-- event_id7snapback_countersnapback: double (nullable = true)
|-- event_id8snapback_countersnapback: double (nullable = true)
|-- event_id9snapback_countersnapback: double (nullable = true)
|-- event_idTestsnapback_countersnapback: double (nullable = true)
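For completeness, a minimal sketch of the .format variant described above (the header option is illustrative; no Scala import is needed for .format("com.databricks.spark.csv"), since the source is referenced by its string name). Note also that Spark's CSV writer always creates a directory at the target path containing part files, which is why a folder named "output.csv" appears; it is empty here because the write tasks themselves failed.
endframe
  .coalesce(1)                                    // collapse to a single output part file
  .write
  .format("com.databricks.spark.csv")             // or simply .csv(...) on Spark 2.x+
  .mode("append")
  .option("header", "true")                       // optional: write column names
  .save("file:///home/X/Code/output/output.csv")  // Spark writes a directory of part-* files at this path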

Related

Not able to add data through withColumn using a case statement in PySpark

Code is as below:
#To get deal keys
schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
PySpark code as below:
from pyspark.sql.functions import col, lit, when

df = lt_online.withColumn("dealkeys", lit('')).withColumn("dealAttributes", lit(''))

start = []
start_dict = {}
dealatt = ["Charge_No", "Status", "Validity_from", "Validity_to"]
dealkeys = ["Charge_Type", "Direction"]

for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        #final = row[i]
        start_dict[i] = row[i]
    df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        #key = row['Charge_No']
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    #final_val = {"value": row['Charge_Type']}
    #start.append(final_val)
    #df3 = lt_online.withColumn("new_column", str(start))
    print(key, start_dict)
    df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
When I run the df3 DataFrame, the old data in dealAttributes and dealkeys goes blank and only the latest record is inserted. Please see the screenshot.
Since the lt_online dataframe is large, I have selected only the required columns from it. The schema shown above is the schema of the lt_online dataframe that I selected.
The problem arises because you are not changing df in place, but assigning the result to df_deal_att. This updates df_deal_att (and therefore df3) only for the current row in the loop, because df never changes in the entire process. Using df_deal_att.show() inside the loop will help in understanding this.
Use the following code instead to get the desired output:
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        start_dict[i] = row[i]
    # ASSIGN TO df INSTEAD OF df_deal_att
    df = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    # USE df AND ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
    df = df.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
Assigning the conditionally updated DataFrame back to df itself (instead of using df_deal_att or df3) solves the issue. The following image reflects the output achieved after using the above code.

PySpark: create temp view from a DataFrame

I am trying to query a huge CSV through spark.sql.
I created a DataFrame from the CSV, and the DataFrame seems to have been created correctly.
I read the schema and I can perform select and filter.
I would like to create a temp view so I can run the same queries in SQL, which I am more comfortable with, but the temp view seems to be created from the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark operations like sql() are lazily evaluated and do not process anything by default; spark.sql() just returns a DataFrame, which is why you only see DataFrame[count(1): bigint]. You need to add an action such as .show() or .collect() to get the results.

How to get a correlation matrix for a Scala DataFrame

I have a Scala DataFrame with numeric data:
df2_num.printSchema
root
|-- ot2_total_sum: decimal(38,18) (nullable = true)
|-- s42_3: decimal(38,0) (nullable = true)
|-- s109_5: decimal(38,0) (nullable = true)
|-- is_individual: decimal(38,0) (nullable = true)
|-- s118_5: decimal(38,0) (nullable = true)
|-- s46_3: decimal(38,0) (nullable = true)
|-- ot1_nds_10: decimal(38,18) (nullable = true)
|-- s45_3: decimal(38,0) (nullable = true)
|-- s10_3: decimal(38,0) (nullable = true)
|-- nb: decimal(38,0) (nullable = true)
|-- s80_5: decimal(38,0) (nullable = true)
|-- ot2_nds_10: decimal(38,18) (nullable = true)
|-- pr: decimal(38,0) (nullable = true)
|-- IP: integer (nullable = true)
|-- s70_5: decimal(38,0) (nullable = true)
|-- ot1_sum_without_nds: decimal(38,18) (nullable = true)
|-- s109_3: decimal(38,0) (nullable = true)
|-- s60_3: decimal(38,0) (nullable = true)
|-- s190_3: decimal(38,0) (nullable = true)
|-- ot3_total_sum: decimal(38,18) (nullable = true)
|-- s130_3: decimal(38,0) (nullable = true)
|-- region: integer (nullable = true)
|-- s170_3: decimal(38,0) (nullable = true)
|-- s20_3: decimal(38,0) (nullable = true)
|-- s90_5: decimal(38,0) (nullable = true)
|-- ot2_nds_20: decimal(38,18) (nullable = true)
|-- s70_3: decimal(38,0) (nullable = true)
|-- ot1_nds_0: decimal(38,18) (nullable = true)
|-- s200_3: decimal(38,0) (nullable = true)
|-- ot2_sum_without_nds: decimal(38,18) (nullable = true)
|-- ot1_nds_20: decimal(38,18) (nullable = true)
|-- s120_3: decimal(38,0) (nullable = true)
|-- s150_3: decimal(38,0) (nullable = true)
|-- s40_3: decimal(38,0) (nullable = true)
|-- s10_5: decimal(38,0) (nullable = true)
|-- nalog: decimal(38,0) (nullable = true)
|-- ot1_total_sum: decimal(38,18) (nullable = true)
I need to get a correlation matrix for all columns of this dataframe.
I've tried to use org.apache.spark.mllib.stat.Statistics.corr. It requires RDD data, so I've converted my dataframe to an RDD:
val df2_num_rdd = df2_num.rdd
Then I try to use Statistics.corr and get an error:
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
<console>:82: error: overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double])scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double])scala.Double <and>
(X: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],method: String)org.apache.spark.mllib.linalg.Matrix
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], String)
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
So how do I need to prepare my data for Statistics.corr?
Assuming you're running a relatively recent version of Spark, I suggest using org.apache.spark.ml.stat.Correlation.corr instead.
First, you have to assemble the columns for which you want to compute correlation, and then you can get correlations as a dataframe. From here, you can fetch the first row and transform it to whatever suits your needs.
Here is an example:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.DataFrame

val assembled: DataFrame = new VectorAssembler()
  .setInputCols(df2_num.columns)
  .setOutputCol("correlations")
  .transform(df2_num)

val correlations: DataFrame =
  Correlation.corr(assembled, column = "correlations", method = "pearson")
Here are some useful links for guides related to this approach:
Spark MLlib Guide: Correlation
Spark MLlib Guide: VectorAssembler
The .getAs[DenseMatrix] in correlations.first.getAs[DenseMatrix] is throwing an error.
@H.Leger - How would you convert the final result to a proper matrix of this format?
Column   c1     c2     c3
c1       1      0.97   0.92
c2       0.97   1      0.94
c3       0.92   0.94   1
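In reply to the comments above, a hedged sketch (not part of the original answer) of one way to turn the result into a labelled matrix like the table shown. It assumes the correlations and df2_num values from the answer and reads the matrix back via the org.apache.spark.ml.linalg.Matrix supertype; one common cause of the getAs[DenseMatrix] error mentioned in the comments is accidentally importing DenseMatrix from the older org.apache.spark.mllib.linalg package instead.
import org.apache.spark.ml.linalg.Matrix

// The result is a single-row DataFrame whose only column holds the correlation matrix.
val corrMatrix: Matrix = correlations.first().getAs[Matrix](0)

// Label rows and columns with the original column names and print a simple text matrix.
val names = df2_num.columns
println(("Column" +: names.toSeq).mkString("\t"))
names.zipWithIndex.foreach { case (rowName, i) =>
  val rowValues = names.indices.map(j => f"${corrMatrix(i, j)}%.2f").mkString("\t")
  println(s"$rowName\t$rowValues")
}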

SparkException - Chi-square test expect categorical values

I am new to PySpark and I am trying to select the best features using ChiSqSelector.
I have a dataset of 78 features. The steps I did are the following:
1. Dropped NaNs, applied Imputer
2. Converted the string label column to int using StringIndexer
3. Applied VectorAssembler
4. VectorIndexer
5. StandardScaler
6. Applied ChiSqSelector, which produced the error
As mentioned in the post (SparkException: Chi-square test expect factors), I applied VectorIndexer, but it is still not working. What data preparation steps should I do for ChiSqSelector? Thanks in advance.
I am using a security dataset, CICIDS2017, with 78 features, and the label is a string.
CODE
````
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exp").getOrCreate()

raw_data = spark.read.csv("SCX.csv", inferSchema=True, header=True)
raw_data.na.drop().show()

cols = raw_data.columns
cols.remove("Label")

from pyspark.ml.feature import Imputer
imputer = Imputer(inputCols=['Destination Port',
                             'FlowDuration',
                             'TotalFwdPackets',
                             'TotalBackwardPackets',
                             'TotalLengthofFwdPackets',
                             'TotalLengthofBwdPackets'],
                  outputCols=['Destination Port',
                              'FlowDuration',
                              'TotalFwdPackets',
                              'TotalBackwardPackets',
                              'TotalLengthofFwdPackets',
                              'TotalLengthofBwdPackets'])
model = imputer.fit(raw_data)
raw_data1 = model.transform(raw_data)
raw_data1.show(5)

# RAW DATA2 => after applying StringIndexer on the label column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Label', outputCol='_LabelIndexed')
raw_data2 = indexer.fit(raw_data1).transform(raw_data1)

# RAW DATA3 => after applying VectorAssembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=cols, outputCol="features")
# Now let us use the transform method to transform our dataset
raw_data3 = assembler.transform(raw_data2)
raw_data3.select("features").show(truncate=False)

# RAW DATA4 => after applying VectorIndexer
from pyspark.ml.feature import VectorIndexer
vindexer = VectorIndexer(inputCol="features", outputCol="vindexed", maxCategories=9999)
vindexerModel = vindexer.fit(raw_data3)
categoricalFeatures = vindexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures), ", ".join(str(k) for k in categoricalFeatures.keys())))
# Create new column "vindexed" with categorical values transformed to indices
raw_data4 = vindexerModel.transform(raw_data3)
raw_data4.show()

# RAW DATA5 => after applying StandardScaler
from pyspark.ml.feature import StandardScaler
standardscaler = StandardScaler().setInputCol("vindexed").setOutputCol("Scaled_features")
raw_data5 = standardscaler.fit(raw_data4).transform(raw_data4)

train, test = raw_data5.randomSplit([0.8, 0.2], seed=456)

# Feature selection using ChiSqSelector
from pyspark.ml.feature import ChiSqSelector
chi = ChiSqSelector(featuresCol='Scaled_features', outputCol='Selected_f',
                    labelCol='_LabelIndexed', fpr=0.05)
train = chi.fit(train).transform(train)
#test = chi.fit(test).transform(test)
#test.select("Aspect").show(5, truncate=False)
````
But this code returns an error message while fitting:
Py4JJavaError: An error occurred while calling o568.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 32.0 failed 1 times, most recent failure: Lost task 2.0 in stage 32.0 (TID 69, 192.168.1.15, executor driver): org.apache.spark.SparkException: *****Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 14.*****
````
raw_data.printSchema()
````
|-- Destination Port: integer (nullable = true)
|-- FlowDuration: integer (nullable = true)
|-- TotalFwdPackets: integer (nullable = true)
|-- TotalBackwardPackets: integer (nullable = true)
|-- TotalLengthofFwdPackets: integer (nullable = true)
|-- TotalLengthofBwdPackets: integer (nullable = true)
|-- FwdPacketLengthMax: integer (nullable = true)
|-- FwdPacketLengthMin: integer (nullable = true)
|-- FwdPacketLengthMean: double (nullable = true)
|-- FwdPacketLengthStd: double (nullable = true)
|-- BwdPacketLengthMax: integer (nullable = true)
|-- BwdPacketLengthMin: integer (nullable = true)
|-- BwdPacketLengthMean: double (nullable = true)
|-- BwdPacketLengthStd: double (nullable = true)
|-- FlowBytesPersec: double (nullable = true)
|-- FlowPacketsPersec: double (nullable = true)
|-- FlowIATMean: double (nullable = true)
|-- FlowIATStd: double (nullable = true)
|-- FlowIATMax: integer (nullable = true)
|-- FlowIATMin: integer (nullable = true)
|-- FwdIATTotal: integer (nullable = true)
|-- FwdIATMean: double (nullable = true)
|-- FwdIATStd: double (nullable = true)
|-- FwdIATMax: integer (nullable = true)
|-- FwdIATMin: integer (nullable = true)
|-- BwdIATTotal: integer (nullable = true)
|-- BwdIATMean: double (nullable = true)
|-- BwdIATStd: double (nullable = true)
|-- BwdIATMax: integer (nullable = true)
|-- BwdIATMin: integer (nullable = true)
|-- FwdPSHFlags: integer (nullable = true)
|-- BwdPSHFlags: integer (nullable = true)
|-- FwdURGFlags: integer (nullable = true)
|-- BwdURGFlags: integer (nullable = true)
|-- FwdHeaderLength_1: integer (nullable = true)
|-- BwdHeaderLength: integer (nullable = true)
|-- FwdPackets/s: double (nullable = true)
|-- BwdPackets/s: double (nullable = true)
|-- MinPacketLength: integer (nullable = true)
|-- MaxPacketLength: integer (nullable = true)
|-- PacketLengthMean: double (nullable = true)
|-- PacketLengthStd: double (nullable = true)
|-- PacketLengthVariance: double (nullable = true)
|-- FINFlagCount: integer (nullable = true)
|-- SYNFlagCount: integer (nullable = true)
|-- RSTFlagCount: integer (nullable = true)
|-- PSHFlagCount: integer (nullable = true)
|-- ACKFlagCount: integer (nullable = true)
|-- URGFlagCount: integer (nullable = true)
|-- CWEFlagCount: integer (nullable = true)
|-- ECEFlagCount: integer (nullable = true)
|-- Down/UpRatio: integer (nullable = true)
|-- AveragePacketSize: double (nullable = true)
|-- AvgFwdSegmentSize: double (nullable = true)
|-- AvgBwdSegmentSize: double (nullable = true)
|-- FwdHeaderLength_2: integer (nullable = true)
|-- FwdAvgBytes/Bulk: integer (nullable = true)
|-- FwdAvgPackets/Bulk: integer (nullable = true)
|-- FwdAvgBulkRate: integer (nullable = true)
|-- BwdAvgBytes/Bulk: integer (nullable = true)
|-- BwdAvgPackets/Bulk: integer (nullable = true)
|-- BwdAvgBulkRate: integer (nullable = true)
|-- SubflowFwdPackets: integer (nullable = true)
|-- SubflowFwdBytes: integer (nullable = true)
|-- SubflowBwdPackets: integer (nullable = true)
|-- SubflowBwdBytes: integer (nullable = true)
|-- Init_Win_bytes_forward: integer (nullable = true)
|-- Init_Win_bytes_backward: integer (nullable = true)
|-- act_data_pkt_fwd: integer (nullable = true)
|-- min_seg_size_forward: integer (nullable = true)
|-- ActiveMean: double (nullable = true)
|-- ActiveStd: double (nullable = true)
|-- ActiveMax: integer (nullable = true)
|-- ActiveMin: integer (nullable = true)
|-- IdleMean: double (nullable = true)
|-- IdleStd: double (nullable = true)
|-- IdleMax: integer (nullable = true)
|-- IdleMin: integer (nullable = true)
|-- Label: string (nullable = true)
Dataset Reference - Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018
For your information, I was using all the features; in order to reduce the code content here, I am showing only 6 columns in the imputing step.

Spark 1.6: drop column in DataFrame with escaped column names

I am trying to drop a column in a DataFrame, but I have column names with dots in them, which I escaped.
Before I escape, my schema looks like this:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- raw.hourOfDay: long (nullable = true)
|-- raw.minOfDay: long (nullable = true)
|-- raw.dayOfWeek: long (nullable = true)
|-- raw.sensor2: long (nullable = true)
If I try to drop a column, I get:
df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
Note that I'm not even trying to drop one of the columns with dots in the name.
Since I couldn't seem to do much without escaping the column names, I converted the schema to:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
but that doesn't seem to help. I still get the same error.
I tried escaping all column names and dropping using the escaped name, but that doesn't work either.
root
|-- `user_id`: long (nullable = true)
|-- `hourOfWeek`: string (nullable = true)
|-- `observed`: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
Is there another way to drop a column that would not fail on this type of data?
Alright, I seem to have found the solution after all:
df.drop(df.col("raw.hourOfWeek")) seems to work
val data = df.drop("Customers")
will work fine for normal columns, while
val newDf = df.drop(df.col("old.column"))
works for columns with dots in the name.
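Building on the df.col pattern above, a hedged sketch (not from the original answers) that drops every dot-containing column in one pass; it assumes the same df and relies on col() resolving the literal column name rather than a struct field:
import org.apache.spark.sql.DataFrame

// Drop all columns whose names contain a dot, using df.col so the analyzer
// does not treat the dot as a struct-field accessor.
def dropDottedColumns(df: DataFrame): DataFrame =
  df.columns.filter(_.contains(".")).foldLeft(df)((acc, name) => acc.drop(acc.col(name)))

// Usage with the schema from the question would remove raw.hourOfDay, raw.minOfDay,
// raw.dayOfWeek and raw.sensor2:
// val trimmed = dropDottedColumns(df)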