Change a string field's format to a date - Scala

I am trying to convert a field of type string to a date and also change the date format. I have not been successful: everything comes back as null.
The field:
+-------------------------+
|financial_statements_date|
+-------------------------+
| 06-sep-12|
| 26-jul-12|
| 02-sep-11|
| 02-dic-09|
| 24-jun-15|
| 19-oct-15|
| 02-sep-13|
| 17-feb-09|
| 24-ago-10|
| 10-ago-16|
| 12-jul-16|
| 27-jul-20|
| 31-dic-02|
| 02-abr-08|
| 17-sep-19|
+-------------------------+
Result:
+--------------------+
|gf_company_size_date|
+--------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+--------------------+
My code:
df.select(
  to_date(col("financial_statements_date"), "YYYY-MM-DD").as("gf_company_size_date")
)

Your date format is incorrect and should have three M's in it. Also, looking at the sample data, the format appears to be day, month, year rather than year, month, day. So the format should be:
dd-MMM-yy
Re-running with the new format on the first 3 records, they are now parsed as:
+-------------------------+
|financial_statements_date|
+-------------------------+
| 06-sep-12|
| 26-jul-12|
| 02-sep-11|
+-------------------------+
+--------------------+
|gf_company_size_date|
+--------------------+
| 2012-09-06|
| 2012-07-26|
| 2011-09-02|
+--------------------+
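For reference, a minimal sketch of the corrected select, using the column names from the question (the locale remark in the comment is my own assumption, not something verified above):
import org.apache.spark.sql.functions.{col, to_date}

// Same select as in the question, with the corrected pattern.
// Note: some sample values use Spanish month abbreviations (dic, ago, abr),
// which "MMM" may not parse under the default locale; that caveat goes beyond
// the three records re-run above.
df.select(
  to_date(col("financial_statements_date"), "dd-MMM-yy").as("gf_company_size_date")
)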
Related:
https://stackoverflow.com/a/8907693/864369

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
| Customer | Month | Sales |
|----------|-------|-------|
| A        | 3     | 40    |
| A        | 2     | 50    |
| B        | 1     | 20    |
I need it in the format below:
| Customer | Month 1 | Month 2 | Month 3 |
|----------|---------|---------|---------|
| A        | 0       | 50      | 40      |
| B        | 20      | 0       | 0       |
Can you please help me out to solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate values from the original DF.
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS')
...          .agg(F.sum('Sales'))
...          .fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
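As an optional refinement (a sketch, not part of the original answer): if the month labels are known up front, passing them to pivot avoids the extra pass over the data that Spark otherwise needs to discover the distinct values.
>>> # Sketch: explicit pivot values, assuming the three month labels are known in advance.
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS', ['Month 1', 'Month 2', 'Month 3'])
...          .agg(F.sum('Sales'))
...          .fillna(0))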

Reading a tsv file in pyspark

I want to read a TSV file, but it has no header, so I am creating my own schema and then trying to read the TSV file. However, after applying the schema it shows all column values as null. Below is my code and the result.
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
schema = StructType([StructField("id_code", IntegerType()),StructField("description", StringType())])
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",schema=schema)
df.show();
+-------+-----------+
|id_code|description|
+-------+-----------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
+-------+-----------+
If I read it simply without applying any schema:
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",sep="/t")
df.show()
+-----------------+
| _c0|
+-----------------+
| 0 Not Specified |
| 1 Modem |
| 2 LAN/Wifi |
| 3 Unknown |
| 4 Mobile Carrier|
+-----------------+
It is not coming out properly. Can anyone please help me with this? My sample file is a .tsv file and it has the records below.
0 Specified
1 Modemwifi
2 LAN/Wifi
3 Unknown
4 Mobile user
Add the sep option; if the file is really tab-separated, this will work.
df = spark.read.option("inferSchema", "true").option("sep", "\t").csv("test.tsv")
df.show()
+---+-----------+
|_c0| _c1|
+---+-----------+
| 0| Specified|
| 1| Modemwifi|
| 2| LAN/Wifi|
| 3| Unknown|
| 4|Mobile user|
+---+-----------+
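And here is a sketch that combines the schema from the question with the tab separator (path and schema copied from the question; it assumes the file really is tab-separated):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Same schema as in the question, plus sep="\t" so the two columns split correctly.
schema = StructType([StructField("id_code", IntegerType()),
                     StructField("description", StringType())])
df = spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv", schema=schema, sep="\t")
df.show()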

scala filtering out rows in a joined df based on 2 columns with same values - best way

I'm comparing 2 dataframes.
I chose to compare them column by column, so I created 2 smaller dataframes from the parent dataframes based on the join columns and the comparison columns.
Created 1st dataframe:
val df1_subset = df1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 66|
| blake | lively| 66|
| eva| green| 44|
| brad| pitt| 99|
| jason| momoa| 34|
| george | clooney| 67|
| ed| sheeran| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| null| null| |
+----------+---------+-------------+
Created 2nd Dataframe:
val df1_1_subset = df1_1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 34|
| brad| pitt| 78|
| eva| green| 56|
| tom | cruise| 99|
| jason| momoa| 34|
| george | clooney| 67|
| george | clooney| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| kyle| jenner| 56|
| celena| gomez| 2|
+----------+---------+-------------+
Then I joined the 2 subsets with a full outer join to get the following:
val df_subset_joined = df1_subset.join(df1_1_subset, joinColsArray, "full_outer")
Joined Subset
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| 45|
+----------+---------+-------------+-------------+
Then I tried to filter out the rows that are the same in both comparison columns (loyalty_score in this example) by using column positions:
df_subset_joined.filter(_c2 != _c3).show
But that didn't work. I'm getting the following error:
Error:(174, 33) not found: value _c2
df_subset_joined.filter(_c2 != _c3).show
What is the most efficient way for me to get a joined dataframe where I only see the rows that do not match in the comparison columns? I would like to keep this dynamic, so hard-coding column names is not an option.
Thank you for helping me understand this.
You need to work with aliases and make use of the null-safe comparison operator (https://spark.apache.org/docs/latest/api/sql/index.html#_9); see also https://stackoverflow.com/a/54067477/1138523
val df_subset_joined = df1_subset.as("a").join(df1_1_subset.as("b"), joinColsArray, "full_outer")
df_subset_joined.filter(!($"a.loyality_score" <=> $"b.loyality_score")).show
EDIT: for dynamic column names, you can use string interpolation
import org.apache.spark.sql.functions.col
val xxx : String = ???
df_subset_joined.filter(!(col(s"a.$xxx") <=> col(s"b.$xxx"))).show
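If there are several comparison columns, the same idea extends by OR-ing the per-column mismatch conditions (a sketch; compareCols is a hypothetical sequence holding the comparison column names):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical list of comparison columns; only loyalty_score in this example.
val compareCols: Seq[String] = Seq("loyalty_score")
// Keep a row if at least one comparison column differs (null-safe comparison).
val anyMismatch: Column = compareCols
  .map(c => !(col(s"a.$c") <=> col(s"b.$c")))
  .reduce(_ || _)
df_subset_joined.filter(anyMismatch).show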

Forward-fill missing data in PySpark not working

I have a simple dataset as shown below.
| id | name     | country   | languages |
|----|----------|-----------|-----------|
| 1  | Bob      | USA       | Spanish   |
| 2  | Angelina | France    | null      |
| 3  | Carl     | Brazil    | null      |
| 4  | John     | Australia | English   |
| 5  | Anne     | Nepal     | null      |
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing is happening. The column that is supposed to have its null values filled, temp_filled_spark, remains unchanged, i.e. a copy of the original languages column.
import sys
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?
We can create a window that treats the entire dataframe as one partition:
>>> import sys
>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as F
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
Hope this helps!
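As a side note (an alternative sketch, not from the original answer): newer Spark versions expose Window.unboundedPreceding, which reads a bit clearer than -sys.maxsize; the ordering column here is assumed to be id.
>>> # Same forward fill using the named boundaries instead of -sys.maxsize.
>>> w2 = (Window.partitionBy(F.lit(1))
...             .orderBy('id')
...             .rowsBetween(Window.unboundedPreceding, Window.currentRow))
>>> df1.select("*", F.last('languages', True).over(w2).alias('newcol')).show()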

How to fill missing values in a DataFrame?

After querying a MySQL db and building the corresponding data frame, I am left with this:
mydata.show
+--+------+------+------+------+------+------+
|id| sport| var1| var2| var3| var4| var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234| | | | |
| 2|soccer| null| null| null| null| null|
| 3|soccer|330101| | | | |
| 4|soccer| null| null| null| null| null|
| 5|soccer| null| null| null| null| null|
| 6|soccer| null| null| null| null| null|
| 7|soccer| null| null| null| null| null|
| 8|soccer|330024|330401| | | |
| 9|soccer|330055|330106| | | |
|10|soccer| null| null| null| null| null|
|11|soccer|390027| | | | |
|12|soccer| null| null| null| null| null|
|13|soccer|330101| | | | |
|14|soccer|330059| | | | |
|15|soccer| null| null| null| null| null|
|16|soccer|140242|140281| | | |
|17|soccer|330214| | | | |
|18|soccer| | | | | |
|19|soccer|330055|330196| | | |
|20|soccer|210022| | | | |
+--+------+------+------+------+------+------+
Every var column is a:
string (nullable = true)
So I'd like to change all the empty values to null, so as to be able to treat empty cells and cells with null as equal, possibly without leaving the data frame for an RDD...
My approach would be to create a list of expressions. In Scala this can be done using a map; in Python you'd use a list comprehension.
After that, you should unpack that list inside a df.select instruction, as in the examples below.
Inside the expression, empty strings are replaced with a null value.
Scala:
import org.apache.spark.sql.functions.{col, when}
val exprs = df.columns.map(x => when(col(x) === "", null).otherwise(col(x)).as(x))
df.select(exprs: _*).show()
Python:
# Creation of a dummy dataframe:
from pyspark.sql.functions import col, when
df = sc.parallelize([("", "19911201", 1, 1, 20.0),
                     ("", "19911201", 2, 1, 20.0),
                     ("hola", "19911201", 2, 1, 20.0),
                     (None, "20111201", 3, 1, 20.0)]).toDF()
df.show()
exprs = [when(col(x) == '', None).otherwise(col(x)).alias(x)
         for x in df.columns]
df.select(*exprs).show()
E.g:
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
| |19911201| 1| 1|20.0|
| |19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
|null|19911201| 1| 1|20.0|
|null|19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
One option would be to do the opposite - replace nulls with empty values (I personally hate nulls...), for which you can use the coalesce function:
import org.apache.spark.sql.functions._
val result = input.withColumn("myCol", coalesce(input("myCol"), lit("")))
To do that for multiple columns:
val cols = Seq("var1", "var2", "var3", "var4", "var5")
val result = cols.foldLeft(input) { case (df, colName) => df.withColumn(colName, coalesce(df(colName), lit(""))) }
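As an aside (a sketch, assuming the listed columns are all string-typed), the built-in na.fill expresses the same null-to-empty replacement a bit more compactly:
// Sketch: replace nulls with "" in the listed string columns only.
val cols = Seq("var1", "var2", "var3", "var4", "var5")
val result = input.na.fill("", cols)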