Get column names, distinct values and their occurrences into a text file with Spark Scala

I am new to Spark Scala and would like to execute the following tasks:
Get all column names, their distinct values and the number of occurrences of each value from a table
Write the result to a text file, e.g. in the following format:
Column Name | Value | Occurrences
Col1        | Test  | 12
Col2        | 123   | 15
I am using Spark 1.6, not Spark 2.0.
Thanks a lot in advance for any help.
Cheers,
Matthias

Hope this will help you.
Let me explain with an example.
I have a file, users.txt, with the following content:
1 abc#test.com EN US
2 xyz#test2.com EN GB
3 srt#test3.com FR FR
Code:
import sqlContext.implicits._

case class User(ID: Int, email: String, lang: String, country: String)

// Read the file, split each line on spaces and map it to the case class
val fileRDD = sc.textFile("users.txt")
val userRDD = fileRDD.map(_.split(" ")).map(u => User(u(0).toInt, u(1), u(2), u(3)))

// Convert the RDD to a DataFrame and register it as a temp table (Spark 1.6)
val userDF = userRDD.toDF()
userDF.registerTempTable("user_table")
sqlContext.sql("select * from user_table").show()
+---+-------------+----+-------+
| ID| email|lang|country|
+---+-------------+----+-------+
| 1| abc#test.com| EN| US|
| 2|xyz#test2.com| EN| GB|
| 3|srt#test3.com| FR| FR|
+---+-------------+----+-------+
var emailCount=sqlContext.sql("select 'email' as col,email as value, count(email) as occur from user_table group by email")
var langCount=sqlContext.sql("select 'lang' as col,lang as value, count(lang) as occur from user_table group by lang")
emailCount.unionAll(langCount).show()
+-----+-------------+-----+
| col| value|occur|
+-----+-------------+-----+
|email|srt#test3.com| 1|
|email|xyz#test2.com| 1|
|email| abc#test.com| 1|
| lang| EN| 2|
| lang| FR| 1|
+-----+-------------+-----+
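To cover the original ask (every column, written to a single text file), the same idea can be generalized by looping over the dataframe's columns instead of writing one query per column. Here is a minimal sketch using the Spark 1.6 DataFrame API, assuming the userDF dataframe built above; the pipe separator and the output path are illustrative:
import org.apache.spark.sql.functions.{col, count, lit}

// Build one "column_name | value | occurrences" dataframe per column and union them
val perColumnCounts = userDF.columns.map { c =>
  userDF.groupBy(col(c))
    .agg(count(col(c)).as("occurrences"))
    .select(lit(c).as("column_name"), col(c).cast("string").as("value"), col("occurrences"))
}

// unionAll is the Spark 1.6 name; in Spark 2.x it is union
val allCounts = perColumnCounts.reduce(_ unionAll _)

// Write as pipe-separated text, e.g. "Col1|Test|12"
allCounts.rdd
  .map(row => row.mkString("|"))
  .coalesce(1)
  .saveAsTextFile("column_value_counts")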

Related

Counting distinct values for a given column partitioned by a window function, without using approx_count_distinct()

I have the following dataframe:
val df1 = Seq(("Roger","Rabbit", "ABC123"), ("Roger","Rabit", "ABC123"),("Roger","Rabbit", "ABC123"), ("Trevor","Philips","XYZ987"), ("Trevor","Philips","XYZ987")).toDF("first_name", "last_name", "record")
+----------+---------+------+
|first_name|last_name|record|
+----------+---------+------+
|Roger |Rabbit |ABC123|
|Roger |Rabit |ABC123|
|Roger |Rabbit |ABC123|
|Trevor |Philips |XYZ987|
|Trevor |Philips |XYZ987|
+----------+---------+------+
I want to group the records in this dataframe by the column record, and then look for anomalies in the fields first_name and last_name, which should remain constant for all records with the same record value.
The best approach I found so far is using approx_count_distinct:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{approx_count_distinct, concat}

val wind_person = Window.partitionBy("record")
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
  .withColumn("anomaly", approx_count_distinct($"unique_fields") over wind_person)
  .show(false)
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabit |ABC123|RogerRabit |2 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
+----------+---------+------+-------------+-------+
An anomaly is detected when the anomaly column is greater than 1.
The problem is that approx_count_distinct gives just an approximation, and I am not sure how confident we can be that it will always return an accurate count.
Some extra information:
The Dataframe may contain over 500M records
The Dataframe is previously repartitioned based on record column
For each distinct value of record, there will be no more than 15 rows
Is it safe to use approx_count_distinct in this scenario with 100% accuracy, or are there better window functions in Spark to achieve this?
You can use collect_set of unique_fields over the window wind_person and take its size, which is equivalent to the exact distinct count of that field:
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
.withColumn("anomaly", size(collect_set($"unique_fields").over(wind_person)))
.show
//+----------+---------+------+-------------+-------+
//|first_name|last_name|record|unique_fields|anomaly|
//+----------+---------+------+-------------+-------+
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Roger |Rabit |ABC123|RogerRabit |2 |
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//+----------+---------+------+-------------+-------+
You can get the exact count distinct over a Window using two dense_rank operations: for every row, the ascending and descending dense_rank values of unique_fields add up to the number of distinct values plus one, so their sum minus 1 gives the exact distinct count:
val df2 = df1.withColumn(
  "unique_fields",
  concat($"first_name", $"last_name")
).withColumn(
  "anomaly",
  dense_rank().over(Window.partitionBy("record").orderBy("unique_fields")) +
    dense_rank().over(Window.partitionBy("record").orderBy(desc("unique_fields"))) - 1
)
df2.show
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
| Roger| Rabit|ABC123| RogerRabit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
+----------+---------+------+-------------+-------+
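Either way, once the exact count is in place, the records that actually contain inconsistent names can be pulled out with a simple filter. A small follow-up sketch, assuming the df2 dataframe from above (the output shown is what the sample data would produce):
// Keep only the record values whose first_name/last_name combination is not constant
df2.filter($"anomaly" > 1)
  .select("record")
  .distinct()
  .show()
// +------+
// |record|
// +------+
// |ABC123|
// +------+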

How to check whether a whole column in a PySpark dataframe contains a value using expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo code below:
df=df.withColumn("Result", expr(if any the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Here is an application of expr(). To use expr(), just look up the corresponding SQL syntax; most of the time it will work with expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+

Use PySpark Dataframe column in another spark sql query

I have a situation where I'm trying to query a table and use the result (a dataframe) from that query as the IN clause of another query.
From the first query I have the dataframe below:
+-----------------+
|key |
+-----------------+
| 10000000000004|
| 10000000000003|
| 10000000000008|
| 10000000000009|
| 10000000000007|
| 10000000000006|
| 10000000000010|
| 10000000000002|
+-----------------+
And now I want to run a query like the one below using the values of that dataframe dynamically instead of hard coding the values:
spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()
I tried the following, however it didn't work:
df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()
Can somebody help me?
You should use an (inner) join between the two dataframes to get the countries you would like; the format(a) attempt fails because collect() returns a list of Row objects, so the formatted string is not a valid SQL value list. See my example:
# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]
# Create a list of Ids
numbers = [(1,), (2,)]
# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])
The data frames look like the following:
df_countries:
+-----------+---+
|CountryName| Id|
+-----------+---+
|Netherlands| 1|
| France| 2|
| Germany| 3|
| Belgium| 4|
+-----------+---+
df_numbers:
+---+
| Id|
+---+
| 1|
| 2|
+---+
You can join them as follows:
df_countries.join(df_numbers, on='Id', how='inner').show()
Resulting in:
+---+-----------+
| Id|CountryName|
+---+-----------+
| 1|Netherlands|
| 2| France|
+---+-----------+
Hope that clears things up!

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
The Reasons dataframe has only 9 records and the IDs dataframe has 40k records; I want to assign a reason randomly to each and every ID.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you have just a few reasons (as the OP asks), you can probably broadcast them or embed them in a UDF and easily solve this problem.
The general idea is to create a sequential index for the reasons, generate random values from 0 to N-1 (where N is the number of reasons) on the IDs dataset, and then join the two tables on these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
import org.apache.spark.sql.functions.rand
val d1WithRand = d1.select($"id", (rand * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
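As mentioned above, when the reasons table is small it can be broadcast so that the large IDs dataframe is not shuffled for the join. A sketch of that variant, assuming d1WithRand and d2Indexed from above; the renamed column is just for readability:
import org.apache.spark.sql.functions.broadcast

// Broadcast the small indexed reasons table to every executor,
// then join and clean up the helper columns
val resBroadcast = d1WithRand
  .join(broadcast(d2Indexed), d1WithRand("rnd") === d2Indexed("_2"))
  .drop("_2")
  .drop("rnd")
  .withColumnRenamed("_1", "reason")
resBroadcast.show()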
pyspark random join itself
# Shuffle with a random sort key, then zip the two RDDs (coalesced to one partition each so the element counts line up)
from random import uniform
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

dfB = dataB.withColumn(
    "index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over the sorted window with a high degree of parallelism. F.rand() is another highly parallel function, so adding the index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: it can be very expensive to convert the dataframe to an RDD, call zipWithIndex, and convert back to a dataframe.
monotonically_increasing_id: it needs to be combined with row_number, which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge two dataframes, removing duplicates by comparing columns?
I have two dataframes with the same column names:
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merge the two dataframes and display only unique rows by applying two conditions:
1. For the same name, duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
The final output will be:
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I used the following approach.
//Take union of 2 dataframe
val df =a.unionAll(b)
//group and take sum
val grouped =df.groupBy("name").agg($"name",sum("duration"))
//join
val j=df.join(grouped,"name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
+------+----------+--------+
How can I now remove duplicates by comparing dates?
Would it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is awkward. Is there a better way to do this kind of data processing?
You can do max(date) in the groupBy(); there is no need to join the grouped result back with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
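Putting it together, a minimal end-to-end sketch for Spark 1.4+, assuming the a and b dataframes from the question, with the aggregated columns renamed to match the desired output:
import org.apache.spark.sql.functions.{max, sum}

// Union both dataframes, aggregate per name, and rename the generated columns
val result = a.unionAll(b)
  .groupBy("name")
  .agg(sum("duration").as("duration"), max("date").as("date"))
  .select("name", "date", "duration")

result.show()
// Expected result (row order may differ):
// +------+----------+--------+
// |  name|      date|duration|
// +------+----------+--------+
// |   bob|2015-01-13|       7|
// | alice|2015-04-23|      10|
// |alice2|2015-04-13|      10|
// +------+----------+--------+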