Filtering rows that cause a datatype parsing issue in Spark - Scala

I have a Spark DataFrame with a column Salary as shown below:
|Salary|
|"100"|
|"200"|
|"abc"|
The default datatype is string. I want to convert it to Integer while removing the rows that cause the parsing issue.
Desired Output
|Salary|
|100|
|200|
Can someone please show me the code for filtering out the rows that would cause a datatype parsing issue?
Thanks in advance.

You can filter the desired field with a regex and then cast the column:
import org.apache.spark.sql.types._
// Keep only rows whose Salary consists solely of digits, then cast the column to Integer.
df.filter(row => row.getAs[String]("Salary").matches("""\d+"""))
  .withColumn("Salary", $"Salary".cast(IntegerType))
You can also do it with Try if you don't like regex:
import scala.util._
// Keep only rows whose Salary parses as an Int, then cast the column to Integer.
df.filter(row => Try(row.getAs[String]("Salary").toInt).isSuccess)
  .withColumn("Salary", $"Salary".cast(IntegerType))

Related

How to find the length of an array of JSON objects in PySpark/Scala?

I have one column in a DataFrame with the format '[{jsonobject},{jsonobject}]'. Here the length will be 2.
I have to find the length of this array and store it in another column.
I've only worked with pySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
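Since the question also mentions Scala, a rough Scala equivalent of the same idea might look like this (a sketch, assuming as above that the column is named input):
import org.apache.spark.sql.functions.{col, from_json, size}
import org.apache.spark.sql.types.{ArrayType, MapType, StringType}
// Parse the JSON string into an array of maps, then take its size.
val jsonSchema = ArrayType(MapType(StringType, StringType))
df.select(size(from_json(col("input"), jsonSchema)).alias("num_objects"))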

Cast a big number into a human-readable format

I'm working with databricks on a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them to IntegerType() returns null values, and casting them to double makes them unreadable.
How can I convert them into a human-readable format while still being able to .sort() them? Do I need to create two separate columns?
To make the column sortable, you could convert it to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])

(
    df
    # DecimalType(30, 0) allows up to 30 digits of precision and no fractional part.
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column                 |decimal_column                |
+------------------------------+------------------------------+
|90                            |90                            |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.

Combining two columns, casting to timestamp, and selecting from a df causes no error, but casting one column to timestamp and selecting causes an error

Description
When I try to select a column that is cast with unix_timestamp and then to timestamp from a dataframe, I get a Spark AnalysisException error. See the link below.
However, when I combine two columns, cast the combination with unix_timestamp and then to the timestamp type, and then select from the df, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder()
  .appName("myapp").master("local").getOrCreate()

case class Person(id: Int, date: String, time: String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1, "9/16/13", "11:11:11")).toDF()
//solution.show()

// column modification
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol, lit(" "), timecol), "MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
An AnalysisException: can't find unix_timestamp(concat(....)) in mydf
Actual:
+-----------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+-----------------------------------------------------------------+
|                                             2013-09-16 00:00:...|
+-----------------------------------------------------------------+
These do not seem to be disparate cases. In the erroneous case, you had a new dataframe with a changed column name. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the select_df dataframe has a changed column name: date has become something like cast(unix_timestamp(mydf("date"), "MM/dd/yy")) as Timestamp, so year($"date") can no longer resolve it.
In the case mentioned above, by contrast, you are just defining a new column when you write:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
You then use this column to select from your original dataframe, so it gives the expected results.
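As a side note (not part of the original answer), one way to make the erroneous case work is to alias the derived column back to date so that the follow-up year($"date") can still resolve it. A minimal sketch, reusing mydf and the imports from the snippet above:
// Alias the derived column so it keeps the name "date".
val aliased_df = mydf.select(unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).as("date"))
aliased_df.select(year($"date")).show()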
Hope this makes things clearer.

Reverse the contents of a field within a dataframe using Scala

I'm using scala.
I have a dataframe with millions of rows and multiple fields. One of the fields is a string field containing things like this:
"Snow_KC Bingfamilies Conference_610507"
How do I reverse the contents of just this field for all the rows in the dataframe?
Thanks.
Doing a quick search on the Scaladoc, I found this reverse function which does exactly that.
import org.apache.spark.sql.{functions => sqlfun}
import spark.implicits._ // for the $"..." column syntax (spark is your SparkSession)

val df1 = ...
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))
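For illustration, a minimal self-contained sketch (the column name text and the one-row sample are just placeholders; it assumes a SparkSession named spark is in scope):
import org.apache.spark.sql.{functions => sqlfun}
import spark.implicits._
val sample = Seq("Snow_KC Bingfamilies Conference_610507").toDF("text")
// reverse flips the whole string character by character:
// "Snow_KC Bingfamilies Conference_610507" -> "705016_ecnerefnoC seilimafgniB CK_wonS"
sample.withColumn("text", sqlfun.reverse($"text")).show(false)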

How can I pretty print a data frame in Hue/Notebook/Scala/Spark?

I am using Spark 2.1 and Scala 2.11 in a HUE 3.12 notebook. I have a dataframe that I can print like this:
df.select("account_id", "auto_pilot").show(2, false)
And the output looks like this:
+--------------------+----------+
|account_id          |auto_pilot|
+--------------------+----------+
|00000000000000000000|null      |
|00000000000000000002|null      |
+--------------------+----------+
only showing top 2 rows
Is there a way of getting the data frame to show as pretty tables (like when I query from Impala or pyspark)?
Impala example of same query:
You can use the magic function %table; however, this function only works for Datasets, not DataFrames. One option is to convert the DataFrame to a Dataset before printing.
import spark.implicits._
import org.apache.spark.sql.Dataset

case class Account(account_id: String, auto_pilot: String)
// Convert the selected columns to a typed Dataset (no collect() is needed before %table).
val accountDS: Dataset[Account] = df.select("account_id", "auto_pilot").as[Account]
%table accountDS
Right now this is the only solution I can think of. Better solutions are always welcome; I will update this answer as soon as I find a more elegant one.
From http://gethue.com/bay-area-bike-share-data-analysis-with-spark-notebook-part-2/
This is what I did:
df = sqlContext.sql("select * from my_table")
result = df.limit(5).collect()
%table result