Count of rows over a window as a total - pyspark

In PySpark I am trying to compute a count of all rows in a DataFrame.
In Hive, I can do it with:
count(1) OVER () as biggest_id
However, in PySpark I am unsure how to express it. Here is what I tried:
df_new = (
df.withColumn('biggest_id', F.count(F.lit(1)).over())
)
Usually the over argument needs a window specification, but I haven't figured out how to build one that covers the whole DataFrame.

Try this. Passing None is not allowed, but an empty partitionBy() gives a window over the entire DataFrame:
.over(Window.partitionBy())
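A minimal sketch of how this looks in context, assuming the usual pyspark.sql imports and an existing DataFrame df:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty partitionBy() puts every row in one partition, so the count is the total row count
w = Window.partitionBy()
df_new = df.withColumn('biggest_id', F.count(F.lit(1)).over(w))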

Related

Pyspark: correlated column is not allowed in predicate

I have a table with three columns: EVENT, TIME, and PRICE. For each event I would like to aggregate over all previous events; for simplicity, we'll assume the aggregate is a mean.
What I would like to do is the following:
SELECT (
    SELECT COUNT(*), MEAN(ti.PRICE)
    FROM table_1 ti
    WHERE ti.EVENT = to.EVENT AND ti.TIME < to.TIME
), EVENT
FROM table_1 to
However, if I run this in a PySpark environment with spark.sql(query), I get the error correlated column is not allowed in predicate.
Now, I wonder how I can either change the query so it runs without errors, or use native PySpark functions (F.filter, ...) to achieve the same result.
I have read other Stack Overflow answers, but they did not help.
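For reference, a correlated aggregate like this can usually be rewritten as a window function over the preceding rows. A minimal sketch, assuming a DataFrame df with columns EVENT, TIME, and PRICE (the column names come from the question; everything else is my assumption):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# All rows of the same EVENT that come strictly before the current row
# (assumes TIME is unique within each EVENT; ties would need extra handling)
w = (Window.partitionBy('EVENT')
           .orderBy('TIME')
           .rowsBetween(Window.unboundedPreceding, -1))

df_agg = (df
          .withColumn('prev_count', F.count(F.lit(1)).over(w))
          .withColumn('prev_mean', F.avg('PRICE').over(w)))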

In pyspark, how to select n rows of DataFrame without scan the whole table

I'm using PySpark and want to show the user a preview of a very large table (10 million rows, for example). The user only needs to see about 5000 rows (first/last/random, any 5000 rows are fine), so what is the fastest way to get n rows from the table? I have tried limit and sample, but these functions still scan the whole table; the time complexity is O(N), which takes a lot of time.
spark.sql('select * from some_table').limit(N)
Can someone help me?
spark.sql('select * from some_table limit 10')
Since you are making a SQL call from Python, this is by far the easiest solution, and it's fast. I don't think it scans the whole table when you use a SQL call. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?
As an alternative, assuming you have a Python DataFrame handle, df_some_table, it gets trickier because the .head() and .show() functions return something other than a DataFrame, but they work well enough for peeking at the data.
df_some_table.head(N)
df_some_table.show(N)
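If you do need the preview back as a DataFrame, here is a small sketch of one way to do it; the variable names and the SparkSession handle spark are my assumptions:
# head(N) returns a list of Row objects rather than a DataFrame,
# so rebuild a small DataFrame from that list if you need one
preview_rows = df_some_table.head(5000)
preview_df = spark.createDataFrame(preview_rows, schema=df_some_table.schema)
preview_df.show(20)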

Pyspark window functions (lag and row_number) generate inconsistent results

I have been fighting an issue with window functions in pyspark for a few weeks now.
I have the following query to detect changes for a given key field:
rowcount = sqlContext.sql(f"""
with temp as (
select key, timestamp, originalcol, lag(originalcol,1) over (partition by key order by timestamp) as lg
from snapshots
where originalcol is not null
)
select count(1) from (
select *
from temp
where lg is not null
and lg != originalcol
)
""")
Data types are as follows:
key: string (not null)
timestamp: timestamp (unique, not null)
originalcol: timestamp
The snapshots table contains over a million records. This query produces a different row count after each execution: 27952, 27930, etc., while the expected count is 27942. It is only approximately correct, with a deviation of around 10 records, but that is not acceptable: running the same query twice on the same inputs should produce the same results.
I have a similar problem with row_number() over the same window and then filtering for row_number = 1, but I suspect the issues are related.
I tried the query in an AWS Glue job as both PySpark and Athena SQL, and the inconsistencies are similar.
Any clue about what I am doing wrong here?
Spark is pretty picky about some silly things...
The condition lg != originalcol does not match NULL values, so the first row of each window partition is always filtered out (the first value returned by LAG is always NULL).
The same thing happens when you use NULL in an IN statement.
Another example where NULL filters rows out:
where test in (NULL, 1)
After a bit of research, I discovered that the timestamp column is not unique. Even though SQL Server manages to produce the same execution plan and results every time, PySpark and Presto resolve the ambiguous ORDER BY in the window function differently on each run and therefore produce different results after each execution. If anything can be learned from this experience, it is to double-check that the partition and order-by keys in a window function determine a unique row order.
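As an illustration of the fix (my own sketch, not part of the original answer), adding a tie-breaking column to the ORDER BY makes the window ordering deterministic. Here row_id stands for a hypothetical column, or combination of columns, that is unique within each key:
rowcount = sqlContext.sql("""
    with temp as (
        select key, timestamp, originalcol,
               -- row_id is a hypothetical unique column used only to break ties
               lag(originalcol, 1) over (partition by key order by timestamp, row_id) as lg
        from snapshots
        where originalcol is not null
    )
    select count(1)
    from temp
    where lg is not null
      and lg != originalcol
""")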

Pyspark delete row from PostgreSQL

How can PySpark remove rows from PostgreSQL by executing a query such as DELETE FROM my_table WHERE day = 3?
Spark SQL provides an API only for inserting/overwriting records, so a library like psycopg2 could do the job, but it needs to be explicitly compiled on the remote machine, which is not doable for me. Any other suggestions?
Dataframes in Apache Spark are immutable. You can filter out the rows you don't want.
See the documentation.
A simple example could be:
df = spark.read.jdbc("conn-url", "mytable")
df.createOrReplaceTempView("mytable")
df2 = spark.sql("SELECT * FROM mytable WHERE day != 3")
df2.collect()
The only solution that works so far is to install psycopg2 on the Spark master node and run the queries the way regular Python code would. Adding that library via --py-files didn't work out for me.
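A minimal sketch of that psycopg2 approach, executed on the driver; the connection details are placeholders of my own:
import psycopg2

# Plain psycopg2 delete executed on the driver; Spark itself is not involved
conn = psycopg2.connect(
    host="my-postgres-host",   # placeholder
    dbname="mydb",             # placeholder
    user="myuser",             # placeholder
    password="secret",         # placeholder
)
try:
    with conn, conn.cursor() as cur:
        cur.execute("DELETE FROM my_table WHERE day = %s", (3,))
finally:
    conn.close()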

IF/Case statement when using SparkSQL with Cassandra

I am trying to transform data while selecting it from Cassandra into Spark, using Scala.
When selecting the data, I would like to transform it so that the counts are placed into specific count_* columns based on the value.
I am unable to find an IF/CASE statement to use with Spark SQL. Any ideas?
val results = csc.sql("""
  SELECT trip_sell_key, trip_veh_key, idle_stop_date, COUNT(*),
         SUM(CASE WHEN idle_stop_duration >= 0
                   AND idle_stop_duration < 5 THEN 1 ELSE 0 END)
  FROM veh_trip
""")
I'm not even sure your SQL is valid for Spark SQL; I don't remember whether Spark SQL supports CASE ... ELSE statements.
Another point is that COUNT(*) and SUM(...) are aggregation functions, and when they are combined with non-aggregated columns such as trip_sell_key they require a GROUP BY clause, which is missing from your statement.
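For what it's worth, Spark SQL does support CASE WHEN expressions. A sketch of what the corrected query might look like with an explicit GROUP BY (shown as a PySpark spark.sql call for consistency with the rest of this page; the column names are from the question, while the grouping choice and aliases are my assumptions):
results = spark.sql("""
    SELECT trip_sell_key, trip_veh_key, idle_stop_date,
           COUNT(*) AS total_stops,
           SUM(CASE WHEN idle_stop_duration >= 0
                     AND idle_stop_duration < 5
                    THEN 1 ELSE 0 END) AS count_0_5
    FROM veh_trip
    GROUP BY trip_sell_key, trip_veh_key, idle_stop_date
""")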