PySpark Py4JJavaError when trying to show a DataFrame

Hello, I am getting this error when I try to display my dataframe after adding a column. I was not getting an error when I showed the original tokenized dataframe and do not understand where this error is coming from and what it means.
The image is in the link, thank you

Related

Dataframe display function in pyspark on databricks platform

I am new to Databricks and was studying DataFrames in PySpark:
df = spark.read.parquet(salesPath)
display(df)
Above is my code. I don't understand what the up arrows actually do.
And why is this beautiful df.display not included in the Apache PySpark documentation?
The arrows sort the displayed portion of the DataFrame. Note that the display function shows at most 1000 records and won't load the whole dataset.
The display function isn't included in the PySpark documentation because it's specific to Databricks. Similar functions exist in Jupyter that you can use with PySpark, but they are not part of PySpark. (You can use the df.show() function to display the DataFrame as a text table; it is part of PySpark's DataFrame API.)

Pyspark Cannot resolve column name when Column does exist

I had some Pyspark code that was working with a sample csv BLOB and then I decided to point it to a bigger dataset. This line:
df= df.withColumn("TransactionDate", df["TransactionDate"].cast(TimestampType()))
is now throwing this error:
AnalysisException: u'Cannot resolve column name "TransactionDate" among ("TransactionDate","Country ...
Clearly TransactionDate exists as a column in the dataset so why is it suddenly not working?
Ah OK, I figured it out. If you get this issue, check your delimiter. In my new dataset it was "," whereas in my smaller sample it was "|".
df = spark.read.format(file_type).options(header='true', quote='"', delimiter=",",ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
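To see why a wrong delimiter produces this confusing message, here is a minimal sketch using Python's standard csv module rather than Spark (the sample row is made up for illustration). When the delimiter does not occur in the file, the whole header line is parsed as a single column name, so a lookup for "TransactionDate" fails even though that text is visibly present in the column list. Spark's CSV reader is expected to behave analogously.

```python
import csv
import io

# Hypothetical two-column sample; only the header names come from the error message.
data = "TransactionDate,Country\n2020-01-01,US\n"

# Reading with the old "|" delimiter: the header is never split.
wrong_header = next(csv.reader(io.StringIO(data), delimiter="|"))

# Reading with the delimiter the file actually uses.
right_header = next(csv.reader(io.StringIO(data), delimiter=","))

print(wrong_header)  # one column whose name is the entire header line
print(right_header)  # two separate column names
```

With the wrong delimiter there is exactly one column, and its name contains "TransactionDate" only as a substring, which is why the name cannot be resolved.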

Pyspark dataframe spliting columns gives empty result

I am trying to split a column and assign a new column name to the split result.
But it gives an empty column. Please find the expression below.
df.selectExpr("variable_name","split(variable_name, '.')[2] as r").show(100,False)
I am supposed to get ZZZZ as the values of the result column, but it gives an empty column instead.
I tried using '\\\\.' to escape the special character and it is working now.
Here is the code:
df.selectExpr("variable_name","split(variable_name, '\\\\.')[2] as r").show(100,False)
Thanks!!
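For anyone wondering why the unescaped version returns an empty column: split treats its second argument as a regular expression, and '.' matches any character, so every character becomes a separator and all the pieces are empty strings. A minimal sketch with Python's re module shows the same effect (the sample value "XXXX.YYYY.ZZZZ" is an assumption; only ZZZZ appears in the question). It also shows what the four backslashes turn into before Spark sees them, assuming Spark's default string-literal escaping.

```python
import re

value = "XXXX.YYYY.ZZZZ"  # hypothetical sample value

# '.' as a regex matches every character, so only empty pieces remain.
unescaped_parts = re.split(".", value)

# An escaped dot matches only the literal '.' character.
escaped_parts = re.split(r"\.", value)

# In a normal (non-raw) Python string, '\\\\.' is two backslashes plus a dot.
# Spark's SQL parser then unescapes that once more, leaving the regex \.
sql_literal = '\\\\.'

print(unescaped_parts[2] == "")
print(escaped_parts[2])
print(len(sql_literal))
```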

How to find widths of a Flat File using read_fwf() in Pandas?

I have downloaded some data from the Mainframe (.DATA format) and I need to parse it to create a PySpark DataFrame and perform some operations on it. Before doing that, I created a sample file and read it using read_fwf() feature of Pandas.
I was able to read the file and create the DataFrame, but I encountered some problems, such as:
Padding of "0" in the first column of some of the rows
Repeating headers while reading the data
These are issues I can handle. The key challenge I am facing is identifying the widths of the columns. I currently have 65 columns, and in order to create a PySpark DataFrame I need to know the widths of these columns. Can read_fwf() tell me what widths it is using for each column?
And is there a read_fwf()-like function in PySpark? Or would we have to write MapReduce code for it?
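As a rough illustration of how the widths could be recovered, here is a sketch that infers column spans from the character positions that are blank in every sampled row. This is similar in spirit to what read_fwf() does with colspecs='infer', but it is not pandas' own code; the helper name infer_colspecs and the sample rows are hypothetical. Once known, the widths could be passed to read_fwf() explicitly via its widths parameter, or used to slice lines of a text file in PySpark.

```python
def infer_colspecs(lines, delimiter=" "):
    """Return (start, end) spans that are non-blank in at least one line."""
    width = max(len(line) for line in lines)
    occupied = [False] * width
    for line in lines:
        for i, ch in enumerate(line):
            if ch != delimiter:
                occupied[i] = True

    specs = []
    start = None
    # A trailing False closes a span that runs to the end of the line.
    for i, flag in enumerate(occupied + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            specs.append((start, i))
            start = None
    return specs


# Hypothetical fixed-width sample, not taken from the mainframe file.
sample = [
    "id   name    amount",
    "001  alice    12.50",
    "002  bob       3.00",
]

colspecs = infer_colspecs(sample)
widths = [end - start for start, end in colspecs]
print(colspecs)
print(widths)
```

Note that this only works when columns are separated by at least one position that is blank in every sampled row; adjacent fields with no gap between them need the layout definition from the mainframe side.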

'Error in Parameters' for Tableau?

Just force-closed my Tableau workbook full of metrics, and when I opened again, I received the error message:
'Error in Parameters for command 'get-quantative-color' bad value: fn
After this, I received error messages that all of my calculated fields and even some basic date fields 'do not exist,' and I am basically left with a blank workbook.
Happy to provide more detail if needed -- anyone have any suggestions? Have been working on the workbook for a while and would rather not have to recreate from scratch :)