PySpark: Cannot resolve column name when the column does exist

I had some PySpark code that was working with a sample CSV blob, and then I pointed it at a bigger dataset. This line:
df= df.withColumn("TransactionDate", df["TransactionDate"].cast(TimestampType()))
is now throwing this error:
AnalysisException: u'Cannot resolve column name "TransactionDate" among ("TransactionDate","Country ...
Clearly TransactionDate exists as a column in the dataset, so why is it suddenly not working?

Ah, OK, I figured it out. If you get this issue, check your delimiter. In my new dataset it was ",", whereas in my smaller sample it was "|":
df = spark.read.format(file_type).options(header='true', quote='"', delimiter=",", ignoreLeadingWhiteSpace='true', inferSchema='true').load(file_location)

Related

ADF Unpivot Dynamically With New Column

There is an Excel worksheet where I want to unpivot all the columns after "Currency Code" into rows. The number of columns that need to be unpivoted might vary, and new columns might be added after "NetIUSD". Is there a way to dynamically unpivot this worksheet with unknown columns?
It worked when I projected all the fields, defined the datatype of all the numerical fields as "double", and set the unpivot column datatype to "double" as well. However, additional columns might be added to the source file, and I won't be able to define their datatypes ahead of time; in that case, if a new column has a datatype other than "double", it throws an error that the new column is not of the same unpivot datatype.
I tried to repro this in Dataflow with sample input details.
Take the unpivot transformation and in unpivot settings do the following.
Ungroup by: Code, Currency_code
Unpivot column: Currency
Unpivoted Columns: Column arrangement: Normal
Column name: Amount
Type: string
Data Preview
All columns other than mentioned in ungroup by can be dynamically unpivoted even if you add additional fields.
I can confirm Aswin's answer. I had the same issue: a failing dataflow with dynamically added new columns. The cause was the datatype of the unpivoted columns; changing it to string made everything go smoothly.
The imported projection does not affect this case: I've tried both an imported and a manually coded projection, and both work with the "string" datatype.

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns which are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but they are displayed in a table format instead of actually giving me the numeric value of the total null count.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended, as it doesn't drop any columns:
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one I am currently trying but takes ages to execute
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of showing it in table format, use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contain all the information from the table.

CSV to ORC conversion in Azure error: found more columns than expected column count

I am pushing pipe (|) delimited CSV files from one storage account in Azure to another storage account using the ORC file format, but it throws an error:
Error found when processing 'Csv/Tsv Format Text' source 'time.csv' with row number 122277 found more columns than expected column count
How do I solve this error?
Based on the error, it indicates that your columns violate the 3rd rule below, which is mentioned in this link:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
You need to check whether row number 122277 in the source is split by the | delimiter into a different number of fields, so that it can't be mapped to the sink columns.
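Outside of ADF, a quick stdlib sketch can locate the offending rows by comparing each row's field count against the header; a stray unquoted "|" inside a field is a common cause. The sample string here is hypothetical, standing in for time.csv:

```python
import csv
import io

# hypothetical sample standing in for time.csv; the second data row
# has an extra field, mimicking the problem at row 122277
sample = "a|b|c\n1|2|3\n1|2|3|4\n"

reader = csv.reader(io.StringIO(sample), delimiter="|")
header = next(reader)

# collect (line number, row) pairs whose field count differs from the header
bad_rows = [
    (lineno, row)
    for lineno, row in enumerate(reader, start=2)
    if len(row) != len(header)
]
print(bad_rows)
```

Running the same check over the real file (with `open(path, newline="")` instead of `io.StringIO`) would pinpoint row 122277's actual content.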

Splitting a column data as per delimiter

I have a Spark (1.4) dataframe where the data in a column looks like "1-2-3-4-5-6-7-8-9-10-11-12". I want to split the data into multiple columns. Please note that the number of fields can vary from 1 to 12; it's not fixed.
P.S. we are using Scala API.
Edit:
Editing the original question. I have a delimited string as below:
"ABC-DEF-PQR-XYZ"
From this string I need to create delimited strings in separate columns, as below. Please note that this string is in a column of a DataFrame.
Original column: ABC-DEF-PQR-XYZ
New col1 : ABC
New col2 : ABC-DEF
New col3 : ABC-DEF-PQR
New col4 : ABC-DEF-PQR-XYZ
Please note that there can be up to 12 such new columns that need to be derived from the original field. Also, the string in the original column might vary, i.e. sometimes 1 part, sometimes 2, but at most 12.
Hope I have articulated the problem statement clearly.
Thanks!
You can use explode and pivot. Here is some sample data:
from pyspark.sql import functions as f
df = sc.parallelize([["1-2-3-4-5-6-7-8-9-10-11-12"], ["1-2-3-4"], ["1-2-3-4-5-6-7-8-9-10"]]).toDF(schema=["col"])
Now add a unique id to rows so that we can keep track of which row the data belongs to:
df = df.withColumn("id", f.monotonically_increasing_id())
Then split the columns by delimiter - and then explode to get a long-form dataset:
df = df.withColumn("col_split", f.explode(f.split("col", "-")))
Finally pivot on id to get back to wide form:
(df.groupby("id")
    .pivot("col_split")
    .agg(f.max("col_split"))
    .drop("id")
    .show())

PySpark dataframe: splitting a column gives empty result

I am trying to split a column and assign a new column name to the split result, but it gives an empty column. Here is the expression:
df.selectExpr("variable_name","split(variable_name, '.')[2] as r").show(100,False)
I am supposed to get ZZZZ as the values of result column r, but the column comes back empty.
I tried using '\\\\.' to escape the special character, and it works now.
Here is the code:
df.selectExpr("variable_name","split(variable_name, '\\\\.')[2] as r").show(100,False)
Thanks!!