Fetch more than 20 rows and display the full value of a column in spark-shell (Scala)

I am using CassandraSQLContext from spark-shell to query data from Cassandra. I want to know two things: first, how to fetch more than 20 rows using CassandraSQLContext, and second, how to display the full value of a column. As you can see below, by default string values are truncated and dots are appended.
Code :
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KeySpace")
val maxDF = csc.sql("SQL_QUERY" )
maxDF.show
Output:
+--------------------+--------------------+-----------------+--------------------+
| id| Col2| Col3| Col4|
+--------------------+--------------------+-----------------+--------------------+
|8wzloRMrGpf8Q3bbk...| Value1| X| K1|
|AxRfoHDjV1Fk18OqS...| Value2| Y| K2|
|FpMVRlaHsEOcHyDgy...| Value3| Z| K3|
|HERt8eFLRtKkiZndy...| Value4| U| K4|
|nWOcbbbm8ZOjUSNfY...| Value5| V| K5|

If you want to print the whole value of a column in Scala, set the truncate argument of the show method to false:
maxDF.show(false)
If you also want to show more than 20 rows:
// example showing 30 rows of
// maxDF untruncated
maxDF.show(30, false)
For PySpark, pass the argument by name:
maxDF.show(truncate=False)

Alternatively, you can use take. You won't get a nice tabular form; instead the rows are returned as a Scala array of Row objects.
maxDF.take(50)

Related

Flatmap on Spark Dataframe in Scala

I have a DataFrame. I need to create one or more rows from each row in the DataFrame, and I am hoping flatMap could help me solve the problem. The one or more rows would be created by applying logic to two columns of the row.
Example Input dataframe
+--------------------+
| Name|Float1|Float2|
+--------------------+
| Java| 2.3| 0.2|
|Python| 3.2| 0.5|
| Scala| 4.3| 0.8|
+--------------------+
Logic:
If |Float1 + Float2| = |Float1|, then one row is created.
e.g. 2.3 + 0.2 = |2.5| = 2
|2.3| = 2
If |Float1 + Float2| > |Float1|, then two rows are created.
e.g. 4.3 + 0.8 = |5.1| = 5
|4.3| = 4
Can we solve this problem using flatmap or any other transformation in spark?
Create a UDF that takes the two columns and returns a list.
Once you have the list, use the explode function on that column, which will give you one row per list element.
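The UDF-plus-explode idea above could look roughly like the following sketch. It assumes that |x| in the question means the integer part of x, and that each generated row should carry one of the integer values involved; the names expand, expandUdf and the Value column are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Java", 2.3, 0.2), ("Python", 3.2, 0.5), ("Scala", 4.3, 0.8))
  .toDF("Name", "Float1", "Float2")

// Assumption: |x| means the integer part of x. One value is returned when the
// integer part is unchanged, two values when it grows. Any other case is
// treated like "unchanged" here, which the question does not specify.
def expand(f1: Double, f2: Double): Seq[Long] = {
  val before = math.floor(f1).toLong
  val after = math.floor(f1 + f2).toLong
  if (after > before) Seq(before, after) else Seq(before)
}

// Wrap the plain function in a UDF so it can be applied to columns.
val expandUdf = udf((f1: Double, f2: Double) => expand(f1, f2))

// explode turns each element of the returned list into its own row.
val result = df.withColumn("Value", explode(expandUdf(col("Float1"), col("Float2"))))
result.show()
```

With the sample input, Java and Python each keep a single row and Scala is expanded into two rows.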

Filtering on a dataframe based on columns defined in a list

I have a dataframe -
df
+----------+----+----+-------+-------+
| WEEK|DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-02| 14|NULL| -5| 60|
|2016-04-30| 14| FR| 90| 4|
+----------+----+----+-------+-------+
I have defined a list as targetList
List(T1_diff, T2_diff)
I want to keep only the rows of the dataframe where T1_diff and T2_diff are both greater than 3. In this scenario the output should contain only the second row, as the first row has -5 as T1_diff. targetList can contain more columns; currently it has T1_diff and T2_diff, but if another column called T3_diff is added, it should be handled automatically.
What is the best way to achieve this?
Suppose you have the following List of columns, each of which must have a value greater than 3.
val lst = List("T1_diff", "T2_diff")
Then you can build a String from these column names and pass that String to the where function.
val condition = lst.map(c => s"$c>3").mkString(" AND ")
df.where(condition).show(false)
For the above DataFrame it will output only the second row.
+----------+----+----+-------+-------+
|Week |Dim1|Dim2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-30|14 |FR |90 |4 |
+----------+----+----+-------+-------+
If you have another column say T3_diff you can add it to the List and it will get added to the filter condition.
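To make the generated condition concrete, here is a small sketch in plain Scala (no Spark needed) that prints the string that would be passed to where. The helper name greaterThanAll is hypothetical, and the backticks around column names are an addition so that names containing spaces would still parse.

```scala
// Hypothetical helper: build "`c1` > t AND `c2` > t AND ..." for the given columns.
def greaterThanAll(columns: Seq[String], threshold: Int): String =
  columns.map(c => s"`$c` > $threshold").mkString(" AND ")

val condition = greaterThanAll(List("T1_diff", "T2_diff", "T3_diff"), 3)
println(condition)
// prints: `T1_diff` > 3 AND `T2_diff` > 3 AND `T3_diff` > 3
```

The resulting string can be handed to df.where(condition) exactly as in the answer. The same pattern also works with Column expressions instead of strings, for example lst.map(c => col(c) > 3).reduce(_ && _).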

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark DataFrames, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The idea is how to build a where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.
You can use the except DataFrame method. For simplicity, I'm assuming that the columns to use are in two lists. The order of both lists must be consistent: columns at the same position in the two lists are compared (regardless of column name). After except, use join to get the remaining columns from the first DataFrame.
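import org.apache.spark.sql.functions.broadcast
// In spark-shell, spark.implicits._ (needed for toDF) is already imported.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
// Rows of df1 (restricted to the lookup columns) that have no counterpart in df2
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))
// Join back to df1 to recover the remaining columns (e.g. age)
val df = df1.join(broadcast(tempDf), df1Cols)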
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, with something like "Changing a SQL column title via query". You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like
"How to obtain the difference between two DataFrames?"
If you need more columns that aren't used in the diff (e.g. age), you can reselect the data based on your diff results. This may not be the optimal way of doing it, but it would probably work.
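Since the question asks how to build the condition itself dynamically, another option (not one of the answers above, so treat it as a sketch) is to fold the column pairs from the lookup file into a single join expression and use a left anti join. The pairs value below stands in for whatever the parsed lookup file yields; reading the file is left out.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("shankar", "12121", 28), ("ramesh", "1212", 29), ("suresh", "1111", 30), ("aarush", "0707", 15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"), ("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

// Assumption: the lookup file has already been parsed into (df1col, df2col) pairs.
val pairs = Seq("name" -> "eName", "empNo" -> "eNo")

// Builds df1.name === df2.eName && df1.empNo === df2.eNo for any number of pairs.
val condition = pairs.map { case (l, r) => df1(l) === df2(r) }.reduce(_ && _)

// A left anti join keeps only the df1 rows that have no match in df2.
val nonMatching = df1.join(df2, condition, "left_anti")
nonMatching.show()
```

Because the anti join returns only df1's columns, no second join is needed to recover age.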

Change Behaviour of Built-in Spark Sql Functions

Is there any way to prevent spark sql functions from nulling values?
For example I have the following dataframe
df.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| 01/1996| G| 103|
|101 Dalmatians (A...| 1961| G| 79|
|101 Dalmations II...| 2003| G| 70|
I want to apply Spark SQL's date_format function to the Year Published column.
val sql = """date_format(`Year Published`, 'MM/yyyy')"""
val df2 = df.withColumn("Year Published", expr(sql))
df2.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| null| G| 103|
|101 Dalmatians (A...| 01/1961| G| 79|
|101 Dalmations II...| 01/2003| G| 70|
The first row of the Year Published column has been nulled as the original value was in a different date format than the other dates.
This behaviour is not unique to date_format; for example, format_number will null non-numeric types.
With my dataset I expect different date formats and dirty data with unparseable values. I have a use case where if the value of a cell cannot be formatted then I want to return the current value as opposed to null.
Is there a way to make spark use the original value in df instead of null if the function for df2 cannot be applied correctly?
What I've tried
I've looked at wrapping Expressions in org.apache.spark.sql.catalyst.expressions but could not see a way to replace the existing functions.
The only working solution I could find is creating my own date_format and registering it as a udf but this isn't practical for all functions. I'm looking for a solution that will never return null if the input to a function is non-null or an automated way to wrap all existing spark functions.
You could probably use the coalesce function for your purposes:
coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)
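To avoid repeating the coalesce for every function, the fallback can be wrapped in a small helper. This is only a sketch: orOriginal is a hypothetical name, it assumes the wrapped function returns a string (as date_format and format_number do), and it assumes non-ANSI mode, where an unparseable value yields null rather than an error. The sample values come from the table in the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, date_format}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("coalesce-sketch")
  // Assumption: non-ANSI mode, where a failed parse yields null as in the question.
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

val df = Seq("01/1996", "1961", "2003").toDF("Year Published")

// Hypothetical helper: apply `f` to the named column and keep the original
// value (cast to string) wherever the result is null.
def orOriginal(name: String)(f: Column => Column): Column =
  coalesce(f(col(name)), col(name).cast("string"))

val df2 = df.withColumn(
  "Year Published",
  orOriginal("Year Published")(c => date_format(c, "MM/yyyy"))
)
df2.show()
```

Rows that date_format cannot handle keep their original text, so the column no longer contains nulls introduced by the function. Each built-in function still has to be wrapped explicitly; this does not change the behaviour of the functions globally.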

How to randomly select rows from one dataframe using information from another dataframe

The following I am attempting in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to a date. I would like to randomly select a certain number of entries for each dateCountDF.month from another DataFrame entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate and then place all the results into a new DataFrame. See below for example data.
I'm not at all sure how to approach this problem from a Spark SQL or Spark MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a DataFrame and then refer to the other DataFrame within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
val count:Int = x(1).toString().toInt
val result = entitiesDF.take(count)
return result
})
DataFrames
**dateCountDF**
| Date | Count |
+----------+----------------+
|2016-08-31| 4|
|2015-12-31| 1|
|2016-09-30| 5|
|2016-04-30| 5|
|2015-11-30| 3|
|2016-05-31| 7|
|2016-11-30| 2|
|2016-07-31| 5|
|2016-12-31| 9|
|2014-06-30| 4|
+----------+----------------+
only showing top 10 rows
**entitiesDF**
| ID | FirstDate | LastDate |
+----------+-----------------+----------+
| 296| 2014-09-01|2015-07-31|
| 125| 2015-10-01|2016-12-31|
| 124| 2014-08-01|2015-03-31|
| 447| 2017-02-01|2017-01-01|
| 307| 2015-01-01|2015-04-30|
| 574| 2016-01-01|2017-01-31|
| 613| 2016-04-01|2017-02-01|
| 169| 2009-08-23|2016-11-30|
| 205| 2017-02-01|2017-02-01|
| 433| 2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and for each row I want to select a random number of entities in entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate
To select random rows, you can do something like this in PySpark:
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    # (assumes `col` holds the consecutive ids 1..colmax)
    vals = random.sample(range(1, colmax + 1), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the number of random rows you want, store them in a DataFrame, and then append that data to another DataFrame; for this you can use union (unionAll before Spark 2.0).
You can also refer to this answer.
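For a Scala version that avoids looping over one DataFrame while touching another, one possible approach (not taken from the answer above) is to join the two DataFrames on the date range, number the candidates of each date in random order, and keep the first Count of them. The sketch assumes an entity qualifies for a date when FirstDate < Date <= LastDate, using the column names from the tables shown in the question, and that dates are ISO-formatted strings so that string comparison orders them correctly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

val spark = SparkSession.builder().master("local[*]").appName("sample-sketch").getOrCreate()
import spark.implicits._

// A few rows taken from the tables in the question.
val dateCountDF = Seq(("2016-08-31", 4), ("2015-12-31", 1)).toDF("Date", "Count")
val entitiesDF = Seq(
  (296, "2014-09-01", "2015-07-31"),
  (125, "2015-10-01", "2016-12-31"),
  (574, "2016-01-01", "2017-01-31"),
  (613, "2016-04-01", "2017-02-01"),
  (169, "2009-08-23", "2016-11-30")
).toDF("ID", "FirstDate", "LastDate")

// Assumption: an entity is a candidate for a date when FirstDate < Date <= LastDate.
val candidates = dateCountDF.join(
  entitiesDF,
  col("FirstDate") < col("Date") && col("Date") <= col("LastDate")
)

// Number the candidates of each date in random order, then keep the first `Count`.
val randomOrder = Window.partitionBy("Date").orderBy(rand())
val sampled = candidates
  .withColumn("rn", row_number().over(randomOrder))
  .filter(col("rn") <= col("Count"))
  .drop("rn")

sampled.show()
```

The range join is not an equi-join, so it is likely to be expensive when both inputs are large; it is most reasonable when dateCountDF is small. A date with fewer candidates than Count simply returns all of its candidates.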