Execute spark sql query on dataframe column values - pyspark

I'm trying to get the size of each table in my database.
I first listed all my tables in a dataframe using this command:
df = spark.sql("show tables in db")
This is my current dataframe:
+---------+
| tabs |
+---------+
|db.tab1 |
|db.tab2 |
|db.tab3 |
|db.tab4 |
|db.tab5 |
+---------+
Then, for each table, I want to get some information such as the row count and the last modification date.
To explain further, what I want to do is something like this (it does not work):
df1 = df.withColumn("count", spark.sql('select count(*) from {0}'.format(df.tabs)))
This is the desired result :
+---------+------+
| tabs | count|
+---------+------+
|db.tab1 | 122 |
|db.tab2 | 156 |
|db.tab3 | 235 |
|db.tab4 | 11 |
|db.tab5 | 98 |
+---------+------+

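You can try something like the following:
get the count for each table, then union the results.
from functools import reduce

# collect() returns Row objects, so pull the table name out of each row
tables = [row.tabs for row in df.collect()]
count_dfs = [
    # the table name is quoted so it is selected as a string literal
    spark.sql(f"select '{table}' as tabs, count(*) as count from {table}")
    for table in tables
]
# union the one-row dataframes into a single result
result_df = reduce(lambda union_dfs, count_df: union_dfs.union(count_df), count_dfs)
result_df.show()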
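As for why the original attempt fails: df.tabs is a Column expression rather than a string, and spark.sql returns a DataFrame, which withColumn cannot take as a column. If you would rather issue a single query than one per table, you could build one UNION ALL statement instead. A minimal sketch follows. It only constructs the SQL text, so it runs without Spark; build_count_query is a hypothetical helper name, and the table names are assumed to be trusted identifiers coming from "show tables".

```python
def build_count_query(tables):
    """Return one SQL statement that counts the rows of every table.

    `tables` is a list of fully qualified table names (assumed trusted).
    """
    if not tables:
        raise ValueError("at least one table name is required")
    selects = [
        # the name appears twice: as a string literal and as the FROM target
        f"select '{table}' as tabs, count(*) as count from {table}"
        for table in tables
    ]
    return "\nunion all\n".join(selects)


query = build_count_query(["db.tab1", "db.tab2"])
print(query)
```

The resulting string could then be passed to spark.sql(query) to get all counts in one dataframe.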

Related

How to get the minimum of three column values in PostgreSQL

The usual function to get the minimum value of a column is min(column), but what I want is the minimum value within each row, based on the values of 3 columns. For example, using the following base table:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 2 | 1 | 3 |
| 10 | 0 | 1 |
| 13 | 12 | 2 |
+------+------+------+
I want to query it as:
+-----------+
| min_value |
+-----------+
| 1 |
| 0 |
| 2 |
+-----------+
I found the following solution, but it was written for another SQL dialect, not PostgreSQL, and I cannot get it to work in PostgreSQL:
select
(
select min(minCol)
from (values (t.col1), (t.col2), (t.col3)) as minCol(minCol)
) as minCol
from t
I could write something using a CASE expression, but I would like to write a query like the one above for PostgreSQL. Is this possible?
You can use least() (and greatest() for the maximum):
select least(col1, col2, col3) as min_value
from the_table
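One detail worth knowing: PostgreSQL's least() ignores NULL arguments and returns NULL only when every argument is NULL. The plain-Python sketch below mimics that behaviour on the example rows. The function is only an illustration of the semantics, not PostgreSQL code.

```python
def least(*values):
    """Mimic PostgreSQL least(): skip NULLs (None), return None only if all are NULL."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


rows = [(2, 1, 3), (10, 0, 1), (13, 12, 2)]  # the example table from the question
min_values = [least(*row) for row in rows]
print(min_values)  # [1, 0, 2]
```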

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big dataframe in Spark 2.3, like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each part in a separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so on.
I need to save every resulting dataframe as a separate csv file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where every part of the data is selected in a loop and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
val dfi = df.where($"col_key" === lit(k))
SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the problem with this solution is that selecting the data for each key causes a full pass through the whole dataframe, which takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing data (so using RDDs is not desirable, if possible).
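You need to partition by the column when writing and save as csv. Each distinct value of col_key is written to its own subdirectory (for example col_key=AA), which may contain more than one part file; the key itself is stored in the directory name rather than inside the files.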
yourDF
.write
.partitionBy("col_key")
.csv("/path/to/save")
Why don't you try this?
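To illustrate the single-pass idea the question asks about, here is a minimal plain-Python sketch that routes each record to a per-key csv file while reading the data only once. This is not how Spark implements partitionBy; split_rows_by_key and the <key>.csv file layout are hypothetical choices made for the example.

```python
import csv
import os
import tempfile


def split_rows_by_key(rows, header, key_index, out_dir):
    """Write each row to out_dir/<key>.csv, iterating over `rows` only once."""
    files, writers = {}, {}
    try:
        for row in rows:
            key = row[key_index]
            if key not in writers:
                # first time this key is seen: open its file and write the header
                f = open(os.path.join(out_dir, f"{key}.csv"), "w", newline="")
                files[key] = f
                writers[key] = csv.writer(f)
                writers[key].writerow(header)
            writers[key].writerow(row)
    finally:
        for f in files.values():
            f.close()
    return sorted(writers)


rows = [("AA", 1, 2), ("AB", 2, 1), ("AA", 2, 3),
        ("AC", 1, 2), ("AA", 3, 2), ("AC", 5, 3)]
out_dir = tempfile.mkdtemp()
keys = split_rows_by_key(rows, ["col_key", "col1", "col2"], 0, out_dir)
print(keys)  # ['AA', 'AB', 'AC']
```

With only 20-30 keys, keeping one open file per key is cheap, which is why a single pass is enough here.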

Identifying recurring values in a column over a Window (Scala)

I have a data frame with two columns: "ID" and "Amount", each row representing a transaction of a particular ID and the transacted amount. My example uses the following DF:
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),
(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
I want to create a new column identifying whether said amount is a recurring value, i.e. occurs in any other transaction for the same ID, or not.
I have found a way to do this more generally, i.e. across the entire column "Amount", not taking into account the ID, using the following function:
def recurring_amounts(df: DataFrame, col: String) : DataFrame = {
var df_to_arr = df.select(col).rdd.map(r => r(0).asInstanceOf[Double]).collect()
var arr_to_map = df_to_arr.groupBy(identity).mapValues(_.size)
var map_to_df = arr_to_map.toSeq.toDF(col, "Count")
var df_reformat = map_to_df.withColumn("Amount", $"Amount".cast(DoubleType))
var df_out = df.join(df_reformat, Seq("Amount"))
return df_out
}
val df_output = recurring_amounts(df, "Amount")
This returns:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 3 |
| 1 | 120 | 3 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 3 |
+---+------+-----+
which I can then use to create my desired binary variable to indicate whether the amount is recurring or not (yes if > 1, no otherwise).
However, my problem is illustrated in this example by the value 120, which is recurring for ID 1 but not for ID 2. My desired output therefore is:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 2 |
| 1 | 120 | 2 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 1 |
+---+------+-----+
I've been trying to think of a way to apply a function using .over(Window.partitionBy("ID")), but I'm not sure how to go about it. Any hints would be much appreciated.
If you are comfortable with SQL, you can write a SQL query against your DataFrame. First, register the DataFrame as a temporary table in the Spark session; after that you can run SQL on top of that table. Note that spark is the Spark session variable.
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
df.createOrReplaceTempView("transactions")
spark.sql("select *,count(*) over(partition by ID,Amount) as Count from transactions").show()
Please let me know if you have any questions.
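To check what the window expression returns without a Spark session, the same query can be run against an in-memory SQLite database. This is only a stand-in for Spark SQL, and it assumes the bundled SQLite is version 3.25 or newer, since older versions lack window functions. An order by is added so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table transactions (ID integer, Amount integer)")
conn.executemany(
    "insert into transactions values (?, ?)",
    [(1, 120), (1, 120), (2, 40), (2, 50), (1, 30), (2, 120)],
)
# count rows that share both ID and Amount, keeping every original row
rows = conn.execute(
    "select ID, Amount, count(*) over (partition by ID, Amount) as Count "
    "from transactions order by ID, Amount"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Amount 120 gets a count of 2 for ID 1 and 1 for ID 2, which matches the desired output. In the DataFrame API, the same idea should be expressible as a count over Window.partitionBy("ID", "Amount").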

Spark: Iterating Through Dataframe with Operation

I have a dataframe and I want to iterate through every row of it. Some columns have three leading quotation marks, which indicates that their content was accidentally chopped off and should all be part of one column. Therefore, I need to loop through all the rows in the dataframe, and if a column has those leading characters, it needs to be joined to its proper column.
The following works for a single line and gives the correct result:
val t = df.first.toSeq.toArray.toBuffer
while(t(5).toString.startsWith("\"\"\"")){
t(4) = t(4).toString.concat(t(5).toString)
t.remove(5)
}
However, when I try to go through the whole dataframe it errors out:
df.foreach(z =>
val t = z.toSeq.toArray.toBuffer
while(t(5).toString.startsWith("\"\"\"")){
t(4) = t(4).toString.concat(t(5).toString)
t.remove(5)
}
)
This errors out with this error message: <console>:2: error: illegal start of simple expression.
How do I correct this to make it work correctly? Why is this not correct?
Thanks!
Edit - Example Data (there are other columns in front):
+---+--------+----------+----------+---------+
|id | col4 | col5 | col6 | col7 |
+---+--------+----------+----------+---------+
| 1 | {blah} | service | null | null |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service | null |
+---+--------+----------+----------+---------+
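On the error itself: in Scala, a function literal whose body contains several statements has to be wrapped in braces, e.g. df.foreach { z => ... }, which is why the parenthesised version does not compile. Note also that foreach returns nothing, so the repaired buffer would be discarded; a map that returns the fixed row is probably what is wanted. The per-row logic is sketched below in plain Python. repair_row is a hypothetical helper, and the first four values of the sample row are placeholders for the "other columns in front".

```python
def repair_row(values, target=4, marker='"""'):
    """Merge chopped-off cells back into values[target].

    While the cell right after `target` starts with `marker`, append it to
    the target cell and remove it, shifting the remaining cells left.
    """
    fixed = list(values)  # copy so the input row is left untouched
    while (
        target + 1 < len(fixed)
        and fixed[target + 1] is not None  # null cells end the merge
        and str(fixed[target + 1]).startswith(marker)
    ):
        fixed[target] = str(fixed[target]) + str(fixed[target + 1])
        del fixed[target + 1]
    return fixed


# row with id 2 from the example data; the first four values are placeholders
row = ["a", "b", "c", 2, "{ blah", '""" blah', '"""blah}', "service"]
print(repair_row(row))
```

Unlike the original loop, this version also stops at null cells and at the end of the row instead of raising an error.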

PySpark: How can I join one more column to a dataFrame?

I'm working on a dataframe with two initial columns, id and colA.
+---+-----+
|id |colA |
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |
+---+-----+
I need to add one more column, colB, to that dataFrame. I know that colB fits perfectly at the end of the dataFrame; I just need some way to join it all together.
+-----+
|colB |
+-----+
| 8 |
| 7 |
| 0 |
| 6 |
+-----+
As a result, I need to obtain a new dataFrame like the one below:
+---+-----+-----+
|id |colA |colB |
+---+-----+-----+
| 1 | 5 | 8 |
| 2 | 9 | 7 |
| 3 | 3 | 0 |
| 4 | 1 | 6 |
+---+-----+-----+
This is the pyspark code to obtain the first DataFrame:
l=[(1,5),(2,9), (3,3), (4,1)]
names=["id","colA"]
db=sqlContext.createDataFrame(l,names)
db.show()
How can I do it? Could anyone help me, please? Thanks
I've done it! I solved it by adding a temporary column with the row indices to both dataframes, joining on it, and then dropping it.
Code:
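from pyspark.sql import Row
from pyspark.sql.window import Window
# row_number replaces rowNumber, the name used before Spark 1.6
from pyspark.sql.functions import monotonically_increasing_id, row_number
# row_number needs an ordered window; ordering by monotonically_increasing_id
# follows the current row order (this assumes both dataframes preserve it)
w = Window.orderBy(monotonically_increasing_id())
l=[(1,5),(2,9), (3,3), (4,1)]
names=["id","colA"]
db=sqlContext.createDataFrame(l,names)
db.show()
# values of colB, taken from the question
l=[8,7,0,6]
rdd = sc.parallelize(l).map(lambda x: Row(x))
test_df = rdd.toDF()
test_df2 = test_df.selectExpr("_1 as colB")
dbB = test_df2.select("colB")
# temporary index column, used only as the join key
db = db.withColumn("columnindex", row_number().over(w))
dbB = dbB.withColumn("columnindex", row_number().over(w))
testdf_out = db.join(dbB, db.columnindex == dbB.columnindex, 'inner').drop(db.columnindex).drop(dbB.columnindex)
testdf_out.show()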