How to set pivotMaxValues in PySpark?

I am trying to pivot a column that has more than 10000 distinct values. The default limit in Spark for the maximum number of distinct pivot values is 10000, and I am receiving this error:
The pivot column COLUMN_NUM_2 has more than 10000 distinct values, this could indicate an error. If this was intended, set spark.sql.pivotMaxValues to at least the number of distinct values of the pivot column
How do I set this in PySpark?

You have to set this parameter in the Spark interpreter configuration.
I am working with Zeppelin notebooks on an AWS EMR cluster; I had the same error message as you, and it worked after I added the parameter to the interpreter settings.
Hope this helps...
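If you are creating the SparkSession yourself rather than going through an interpreter, the same property can be set in code. A minimal sketch (the property name comes from the error message above; 20000 is just an example value, pick one at least as large as your distinct count):

```python
from pyspark.sql import SparkSession

# Set the limit at session creation time...
spark = (SparkSession.builder
         .appName("pivot-example")
         .config("spark.sql.pivotMaxValues", "20000")  # raise the 10000 default
         .getOrCreate())

# ...or on an already-running session, before executing the pivot
spark.conf.set("spark.sql.pivotMaxValues", "20000")
```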

Related

Using CQL syntax query with DataFrame in PySpark

We're trying to use PySpark with the Cassandra connector to migrate Cassandra data between clusters.
We need to be able to use CQL-specific syntax in our queries to limit the data we migrate. Having the full partition key (consisting of two fields) still selects too much data, so we need a more restrictive filtering condition.
The selection criteria work fine through the CQLSH tool, where we can use a CQL statement like:
select * from ns.table where pk1 = 'val1' and pk2 = 'val2' and timestamp > maxTimeuuid("abc") and timestamp < minTimeuuid("xyz")
// pk1, pk2 - partition key fields, timestamp - next field in primary key
but the problem is that minTimeuuid() and maxTimeuuid() are CQL-specific functions, and it seems that PySpark DataFrames do not allow filtering based on Cassandra-specific functions. Filtering only works when it resembles regular SQL syntax (field = value style comparisons).
We tried filtering the DataFrame with a .filter() condition and with the SQLContext sql() function; either one raises errors about unknown functions (minTimeuuid() or maxTimeuuid()).
We similarly tried the token('..') function, which Spark also flags as an unknown function.
Are there any examples of how CQL-specific functions can be used as filtering conditions with PySpark DataFrames (or RDDs, if not DataFrames)?
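For context, the DataFrame route being described looks roughly like the sketch below (a sketch, not a solution from the thread; it assumes the Spark Cassandra connector package is on the classpath, and reuses the ns/table/pk names from the question). Only plain comparison predicates like these are understood by Spark SQL and eligible for pushdown; CQL-only functions such as minTimeuuid(), maxTimeuuid(), and token() are not, so any timeuuid bounds would have to be precomputed on the client side.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cassandra-migration-sketch").getOrCreate()

# Load ns.table through the Spark Cassandra connector's DataFrame source.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="ns", table="table")
      .load())

# Plain field = value comparisons work and can be pushed down to Cassandra;
# CQL-specific functions cannot be used here.
filtered = df.filter((col("pk1") == "val1") & (col("pk2") == "val2"))
```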

PySpark window functions (lag and row_number) generate inconsistent results

I have been fighting an issue with window functions in pyspark for a few weeks now.
I have the following query to detect changes for a given key field:
rowcount = sqlContext.sql(f"""
with temp as (
select key, timestamp, originalcol, lag(originalcol,1) over (partition by key order by timestamp) as lg
from snapshots
where originalcol is not null
)
select count(1) from (
select *
from temp
where lg is not null
and lg != originalcol
)
""")
Data types are as follows:
key: string (not null)
timestamp: timestamp (unique, not null)
originalcol: timestamp
The snapshots table contains over a million records. This query produces a different row count on each execution: 27952, 27930, etc., while the expected count is 27942. It is only approximately correct, with a deviation of around 10 records, which is not acceptable: running the same query twice on the same input should produce the same result.
I have a similar problem with row_number() over the same window, then filtering for row_number = 1, so I guess the issues are related.
I tried the query in an AWS Glue job as both PySpark and Athena SQL, and the inconsistencies are similar.
Any clue about what I am doing wrong here?
Spark is pretty picky about some subtle things...
lg != originalcol doesn't match NULL values, so the first row from each window partition is always filtered out (the first value returned by LAG is always NULL).
The same thing happens when you compare against NULL in an IN list.
Another example where NULL filters rows out:
where test in (Null, 1)
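The NULL behaviour described above is standard SQL three-valued logic, not something Spark-specific, so it can be reproduced with any engine. A small illustration using SQLite (the table and values are made up for the demo):

```python
import sqlite3

# Tiny in-memory table: one NULL plus the values 1 and 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (test INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (1,), (2,)])

# NULL != 1 evaluates to NULL (not true), so the NULL row is filtered out
# along with the matching row; only (2,) survives.
rows = conn.execute("SELECT test FROM t WHERE test != 1").fetchall()
print(rows)  # [(2,)]

# IN behaves the same way: NULL never matches anything via =, so only (1,)
# survives; neither the NULL row nor the 2 row comes back.
rows_in = conn.execute("SELECT test FROM t WHERE test IN (NULL, 1)").fetchall()
print(rows_in)  # [(1,)]
```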
After a bit of research, I discovered that the timestamp column is not unique. Even though SQL Server manages to produce the same execution plan and results every time, PySpark and Presto break the ties in the window's order by clause differently on each execution and so produce different results each run. If anything can be learned from this experience, it is to double-check that the partition by and order by keys of a window function determine a unique row order.
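Given that diagnosis, the usual fix (my sketch, not from the thread) is to make the window ordering total by adding a tiebreaker to the order by; originalcol is used here, but any column or combination that makes the ordering unique works:

```python
# Same query as in the question, with a tiebreaker added to the ORDER BY so
# that rows sharing a timestamp are always processed in the same order.
rowcount = sqlContext.sql("""
    with temp as (
        select key, timestamp, originalcol,
               lag(originalcol, 1) over (
                   partition by key
                   order by timestamp, originalcol  -- tiebreaker
               ) as lg
        from snapshots
        where originalcol is not null
    )
    select count(1)
    from temp
    where lg is not null
      and lg != originalcol
""")
```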

Spark SQL: Generate a row ID column with auto-incrementing CONSECUTIVE integers

I have a Databricks notebook written in Scala, with this DataFrame generated like this:
val df = spark.sql("SELECT ColumnName FROM TableName")
I want to add another column RowID that automatically populates the rows with integers. I don't want to use the row_number() function; I need CONSECUTIVE integers starting from 1. Is there any other way?
I checked this answer, but it does not help me generate consecutive integers, and monotonically_increasing_id is not working for me. Is this function valid on Databricks? Do we need to import some modules?
Thanks!
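One general Spark technique (not from this thread) that does yield gap-free, consecutive IDs without row_number() is RDD.zipWithIndex(); monotonically_increasing_id() is valid on Databricks but only guarantees increasing, not consecutive, values. A sketch in PySpark (the same RDD API exists in Scala; `spark` is the session Databricks provides, and RowID/ColumnName/TableName come from the question):

```python
df = spark.sql("SELECT ColumnName FROM TableName")

# zipWithIndex pairs each Row with a 0-based index in RDD order;
# adding 1 makes the IDs start from 1.
with_id = (df.rdd.zipWithIndex()
           .map(lambda p: tuple(p[0]) + (p[1] + 1,))
           .toDF(df.columns + ["RowID"]))
```

Note that zipWithIndex triggers a job to compute partition sizes, so it is more expensive than monotonically_increasing_id(); that is the price of consecutiveness.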

How to rank rows in DataFrame? [duplicate]

Is there a random number generator or sequence-number function in Spark SQL?
For example:
Netezza: sequence number
MySQL: sequence number
Thanks.
A sequence arrives in Spark SQL with Spark 1.6: select monotonically_increasing_id() from table (Spark 1.6 is due to be released).
Spark SQL already has random functions, such as rand(); there is a blog post about them.
For numbering rows, Spark SQL also has the row_number() window function.
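The three functions mentioned above can be combined in a single query; a sketch (some_column is a placeholder for whatever ordering column your table has, and sqlContext/table are assumed names):

```python
# rand() gives a uniform random value per row; monotonically_increasing_id()
# (Spark >= 1.6) gives increasing but not consecutive ids; row_number()
# numbers rows over an explicit ordering.
df = sqlContext.sql("""
    select rand() as random_value,
           monotonically_increasing_id() as id,
           row_number() over (order by some_column) as rn
    from table
""")
```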

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to turn a column containing a comma-delimited string of up to 60 values into multiple rows with one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has neither CROSS APPLY nor STRING_SPLIT, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, plus another query that unions everything to get the final output into a single column. But a typical run takes over 6 hours to process the same amount of data.
I wasn't expecting this, knowing the power difference between the machines: the SQL Server I used for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm still assuming I'm not understanding something, but I have no idea what I am missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of cross joins, which I'd like to avoid (and when I asked the DBAs running the Redshift cluster about that option, the response was a flat "No, can't do that."), I'm not sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to JOIN against a numbers table; any table will do as long as it has more rows than the maximum number of values you need, and you must cast the generated numbers to int. Using the function regexp_count on the column to be split in the ON condition to count the number of fields (delimiters + 1) generates one row per field.
Then you use the split_part function on the column, with numbers.num, to extract a different part of the string for each of those rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 15 -- max number of fields
) AS numbers
    ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
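To make the mechanics of the numbers-table trick concrete, here is the same logic simulated in plain Python (the input strings are made up for the demo): each string is paired with every num from 1 up to its field count, and a split_part-style helper picks the num-th piece, yielding one output row per value.

```python
def split_part(s, delim, n):
    # Redshift-style SPLIT_PART: 1-based index, '' when n is out of range.
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

rows = ["a,b,c", "x", "1,2"]   # made-up comma_delimited_string values
numbers = range(1, 16)         # the "numbers" subquery (limit 15)

# The JOIN ... ON condition keeps one (string, num) pair per field, and
# split_part extracts the num-th piece: one output row per value.
result = [(s, n, split_part(s, ",", n))
          for s in rows
          for n in numbers
          if n <= s.count(",") + 1]
print(result)
```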