PySpark dataframe: splitting a column gives empty result

I am trying to split a column and assign a new column name to the split result.
But it gives an empty column. Please find the expression below:
df.selectExpr("variable_name","split(variable_name, '.')[2] as r").show(100,False)
I am supposed to get ZZZZ as the values of the result column r, but it gives empty values instead.

I tried using '\\\\.' to escape the special character and it is working now.
Here is the code:
df.selectExpr("variable_name","split(variable_name, '\\\\.')[2] as r").show(100,False)
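The escape is needed because the second argument of split is treated as a regular expression: a bare '.' matches every character, so every element of the resulting array is empty, while '\\\\.' splits on a literal dot. A minimal sketch of the difference, using a made-up value shaped like XXXX.YYYY.ZZZZ:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-row dataframe just for illustration
df = spark.createDataFrame([("XXXX.YYYY.ZZZZ",)], ["variable_name"])

# Unescaped dot is a regex wildcard, so every piece of the split is an empty string
df.selectExpr("split(variable_name, '.')[2] as r").show(truncate=False)

# Escaped dot splits on the literal '.', so element [2] is ZZZZ
df.selectExpr("split(variable_name, '\\\\.')[2] as r").show(truncate=False)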
Thanks!!


How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but they are displayed in a table format instead of actually giving me the numeric value of the total null count.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following approaches.
This one does not work as intended, as it doesn't drop any columns (as expected):
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one is what I am currently trying, but it takes ages to execute:
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of showing it in a table, you need to use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contains all the information from the table.
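To pull the plain integer out of that Row, index into it, and to avoid launching one Spark job per column you can count the nulls of every column in a single aggregation. A minimal sketch along those lines, reusing the isnan/isNull expression from above (isnan only matters for numeric columns, so drop it for non-numeric ones):
from pyspark.sql.functions import col, count, isnan, when

# Null/NaN counts for every column in one pass; the result is a single Row
null_counts = data.select(
    [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]
).collect()[0]

total_rows = data.count()

# null_counts[c] is a plain integer, so this comparison behaves as intended
cols_to_drop = [c for c in data.columns if null_counts[c] == total_rows]
data = data.drop(*cols_to_drop)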

datenum and matrix column string conversion

I want to convert the second column of table T using datenum.
The elements of this column are '09:30:31.848', '15:35:31.325', etc. When I use datenum('09:30:31.848','HH:MM:SS.FFF') everything works, but when I want to apply datenum to the whole column it doesn't work. I tried this command datenum(T(:,2),'HH:MM:SS.FFF') and I receive this error message:
"The input to DATENUM was not an array of character vectors"
Here is a snapshot of T.
Thank you
You are not calling the data from the table, but rather a slice of the table (so it stays a table). Refer to the data in the table using T.colName:
times_string = ['09:30:31.848'; '15:35:31.325'];
T = table(times_string)
times_num = datenum(T.times_string, 'HH:MM:SS.FFF')
Alternatively, you can slice the table using curly braces to extract the data (if you want to use the column number instead of name):
times_num = datenum(T{:,2}, 'HH:MM:SS.FFF')

Pyspark Cannot resolve column name when Column does exist

I had some PySpark code that was working with a sample CSV blob, and then I decided to point it to a bigger dataset. This line:
df= df.withColumn("TransactionDate", df["TransactionDate"].cast(TimestampType()))
is now throwing this error:
AnalysisException: u'Cannot resolve column name "TransactionDate" among ("TransactionDate","Country ...
Clearly TransactionDate exists as a column in the dataset so why is it suddenly not working?
Ah ok, I figured it out. If you get this issue, check your delimiter. In my new dataset it was "," whereas in my smaller sample it was "|":
df = spark.read.format(file_type).options(header='true', quote='"', delimiter=",",ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
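If you are not sure whether the delimiter is the problem, a quick check is to look at what Spark actually parsed as the column names; with the wrong delimiter the whole header tends to collapse into one long column name, which is why the column you reference can show up inside the name Spark reports yet still fail to resolve. A small check, assuming df is the dataframe you just loaded:
# Inspect the parsed column names; a single huge entry points to a delimiter problem
print(df.columns)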

Transform CSV Column Values into Single Row

My data in CSV is like this:
[image: Actual Data]
And I want to convert this data into:
[image: Expected Data]
(hivetablename.hivecolumnname = dbtablename.dbtablecolumn)
by joining the multiple row values into a single row value like the above.
Please note that 'AND' is a literal between the conditions to be built, and it should appear up to the second-to-last record.
For the last record, only the condition appears (xx=yy).
I would like the result in Scala Spark.
Many thanks in advance!
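One common way to build such a string is to construct each row's condition with concat and then fold all the rows together with concat_ws over collect_list, which puts the ' AND ' literal only between conditions and never after the last one. The column names below are assumptions based on the expected pattern, and the sketch is PySpark; concat, concat_ws and collect_list exist unchanged in the Scala API:
from pyspark.sql.functions import col, collect_list, concat, concat_ws, lit

# Assumed column names, matching the hivetablename.hivecolumnname = dbtablename.dbtablecolumn pattern
conditions = df.select(
    concat(
        col("hivetablename"), lit("."), col("hivecolumnname"),
        lit(" = "),
        col("dbtablename"), lit("."), col("dbtablecolumn"),
    ).alias("cond")
)

# concat_ws only inserts ' AND ' between elements, so the last condition has no trailing AND
# (note: collect_list does not guarantee row order)
joined = conditions.agg(concat_ws(" AND ", collect_list("cond")).alias("join_condition"))
joined.show(truncate=False)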

Map a text file to key/value pair in order to group them in pyspark

I would like to create a Spark dataframe in PySpark from a text file that has varying numbers of rows and columns, and map it to a key/value pair, where the key is the first 4 characters from the first column of the text file. I want to do that in order to remove the redundant rows and to be able to group them later by the key value. I know how to do that in pandas, but I am still confused about where to start in PySpark.
My input is a text file that has the following:
1234567,micheal,male,usa
891011,sara,femal,germany
I want to be able to group every row by the first six characters in the first column
Create a new column that contains only the first six characters of the first column, and then group by that:
from pyspark.sql.functions import col
df2 = df.withColumn("key", col("first_col").substr(1, 6))
df2.groupBy("key").agg(...)
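For completeness, a minimal end-to-end sketch under a few assumptions: an existing spark session, a comma-separated text file with no header, a made-up path, and placeholder column names:
from pyspark.sql.functions import col

# Assumed comma-separated text file with no header; column names are placeholders
df = (spark.read
      .option("header", "false")
      .csv("people.txt")
      .toDF("id", "name", "gender", "country"))

# Key = the first six characters of the first column
keyed = df.withColumn("key", col("id").substr(1, 6))

# Drop the redundant (duplicate) rows, then group by the key
keyed.dropDuplicates().groupBy("key").count().show()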