What does this error message mean in PySpark?

I get the following error message when I try to count a PySpark dataframe with ~12 million rows.
I also get an error when I try to convert the dataframe to pandas, so every inspection function produces this kind of error, like in the image. What can I do?

When you work with Spark, you have two types of functions:
Transformations: they usually return a dataframe and are evaluated lazily. For example select, where, groupBy, etc. They are almost instant to execute and just validate that the structure of the dataframe fits the operation.
Actions: they are the final step of the transformation chain. An action triggers the execution of all the previous transformations. For example: show, count, collect, toPandas.
Because of this system, if something is wrong during a transformation, the action cannot be resolved. That is the message you see. You do not know which transformation is failing; you just know that somewhere in your cluster, the data and the transformations you want to apply produce an error.
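A minimal sketch of that behaviour (the source path and column names here are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; replace with your own source.
df = spark.read.parquet("/tmp/example_data")

# Transformations: lazy, they return immediately even if something is wrong with the data.
filtered = df.where(F.col("amount") > 0).select("id", "amount")

# Actions: only here does the whole chain actually run on the cluster,
# so a corrupt file, a bad cast, or a failing UDF surfaces at this point.
print(filtered.count())
pandas_df = filtered.toPandas()
To narrow down which transformation is the culprit, you can trigger a cheap action (for example count) after each step of the chain until the error appears.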

Related

Pentaho transformation switch/case doesn't work as it is supposed to

In Pentaho, I have a transformation step.
set_useful_life has a select from the table and assigns the result to a variable.
Then switch/case checks the result and is supposed to run the SQL dwh_adhoc_useful_life step only if the variable is 2.
I am 100% sure that the variable is 1, but it still executes the SQL dwh_adhoc_useful_life step, and at this point I don't understand how to work around it.
Why am I doing this check here and not in the job? Because this transformation receives variables from the previous step, and if I do the check
in the job and execute the transformation step, I lose all the variables and this step doesn't execute for each value passed from the previous transformation step.
I think your problem is with a concept of how PDI works.
In a transformation, every step is initialized at the beginning, waiting to process the rows it receives from the stream, so the SQL dwh_adhoc_useful_life step is going to be initialized at the beginning, even if it doesn't receive any rows.
Usually that's not a problem, because a step usually expects something from the stream of rows it receives, but in your case the step really doesn't need anything from the rows it receives as input, so it generates its own stream of rows and follows up on it.
One way to fix it would be to make the client_ssr_config column work as an argument to your SQL script, so it only produces results if the correct value is passed (something along the lines of adding an AND 1=client_ssr_config to the filters in your SQL clause).

In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?

In Code Workbooks, I can use print statements, which appear in the Output section of the Code Workbook (where errors normally appear). This does not work for UDFs, and it also doesn't work in Code Authoring/Repositories.
What are ways I can debug my pyspark code, especially if I'm using UDFs?
I will explain three debugging tools for PySpark (all usable in Foundry):
Raising Exceptions
Running locally as a pandas series
Logging and specifically logging in UDFs
Raising Exceptions
The easiest, quickest way to view a variable, especially for pandas UDFs, is to raise an exception.
def my_compute_function(my_input):
    interesting_variable = some_function(my_input)  # Want to see the result of this
    raise ValueError(interesting_variable)
This is often easier than reading/writing DataFrames because:
You can easily insert a raise statement without messing with the transform's return value or other logic.
You don't need to mess around with defining a valid schema for your debug statement.
The downside is that it stops the execution of the code.
Running locally as a pandas series
If you are more experienced with pandas, you can take a small sample of the data and run your algorithm on the driver as a pandas Series, where you can debug normally.
One technique I have used is not just downsampling the data to a number of rows, but filtering the data so that it is representative of my work. For example, if I were writing an algorithm to determine flight delays, I would filter to all flights to a specific airport on a specific day. This way I'm testing holistically on the sample.
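A rough sketch of that workflow, assuming a hypothetical function my_delay_logic containing the logic you would otherwise wrap in a pandas UDF (the dataframe and column names are made up):
from pyspark.sql import functions as F

# flights_df: your Spark DataFrame; filter to a small, representative slice
# and pull it onto the driver as pandas.
sample_pdf = (
    flights_df
    .filter((F.col("destination") == "SFO") & (F.col("flight_date") == "2021-01-01"))
    .limit(1000)
    .toPandas()
)

# Run the same logic directly on a pandas Series, where print statements
# and step-through debugging work as usual.
result = my_delay_logic(sample_pdf["departure_delay_minutes"])
print(result.head())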
Logging
Code Repositories uses Python's built-in logging library. This is widely documented online and allows you to control the logging level (ERROR, WARNING, INFO) for easier filtering.
Logging output appears both in your output dataset's log files and in your build's driver logs (Dataset -> Details -> Files -> Log Files, and Builds -> Build -> Job status logs; select "Driver logs", respectively).
This lets you view the logged information in the logs after the build completes, but it doesn't work for UDFs.
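A minimal sketch of driver-side logging in a Code Repositories transform, assuming the transforms.api decorators and with made-up dataset paths:
import logging

from transforms.api import transform_df, Input, Output

logger = logging.getLogger(__name__)

@transform_df(
    Output("/examples/debug_output"),      # hypothetical output dataset
    source_df=Input("/examples/source"),   # hypothetical input dataset
)
def compute(source_df):
    # These messages end up in the driver logs and the output dataset's log files.
    logger.info("Input row count: %d", source_df.count())
    logger.warning("Filtering out rows with null ids")
    return source_df.filter(source_df["id"].isNotNull())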
Logging in UDFs
The work done by the UDF is done by the executors, not the driver, and Spark captures the logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:
import logging

from pyspark.sql import functions as F
from transforms.api import transform_df

logger = logging.getLogger(__name__)


@transform_df(
    ...
)
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df

Is there a way to add literals as columns to a Spark dataframe when reading multiple files at once if the column values depend on the filepath?

I'm trying to read a lot of avro files into a spark dataframe. They all share the same s3 filepath prefix, so initially I was running something like:
path = "s3a://bucketname/data-files"
df = spark.read.format("avro").load(path)
which was successfully identifying all the files.
The individual files are something like:
"s3a://bucketname/data-files/timestamp=20201007123000/id=update_account/0324345431234.avro"
Upon attempting to manipulate the data, the code kept erroring out with a message that one of the files was not an Avro data file. The actual error message received is: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62476 in stage 44102.0 failed 4 times, most recent failure: Lost task 62476.3 in stage 44102.0 (TID 267428, 10.96.134.227, executor 9): java.io.IOException: Not an Avro data file.
To circumvent the problem, I was able to get the explicit filepaths of the avro files I'm interested in. After putting them in a list (file_list), I was successfully able to run spark.read.format("avro").load(file_list).
The issue now is this: I'm interested in adding a number of fields to the dataframe that are part of the filepath (i.e. the timestamp and the id from the example above).
While using just the bucket and prefix filepath to find the files (approach #1), these fields were automatically appended to the resulting dataframe. With the explicit filepaths, I don't get that advantage.
I'm wondering if there's a way to include these columns while using spark to read the files.
Sequentially processing the files would look something like:
for file in file_list:
    df = spark.read.format("avro").load(file)
    id, timestamp = parse_filename(file)
    df = df.withColumn("id", lit(id))\
           .withColumn("timestamp", lit(timestamp))
but there are over 500k files and this would take an eternity.
I'm new to Spark, so any help would be much appreciated, thanks!
Two separate things to tackle here:
Specifying Files
Spark has built-in handling for reading all files of a particular type in a given path. As @Sri_Karthik suggested, try supplying a path like "s3a://bucketname/data-files/*.avro" (if that doesn't work, maybe try "s3a://bucketname/data-files/**/*.avro"; I can't remember the exact pattern-matching syntax Spark uses). This should grab only the avro files and get rid of the error caused by the non-avro files in those paths. In my opinion this is more elegant than manually fetching the file paths and explicitly specifying them.
As an aside, the reason you are seeing this is likely because folders typically get marked with metadata files like .SUCCESS or .COMPLETED to indicate they are ready for consumption.
Extracting metadata from filepaths
If you check out this Stack Overflow question, it shows how you can add the filename as a new column (both for Scala and PySpark). You could then use the regexp_extract function to parse out the desired elements from that filename string. I've never used Scala in Spark so I can't help you there, but it should be similar to the PySpark version.
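A sketch of that idea in PySpark, assuming the path layout shown in the question (the glob pattern and the regexes would need to be adjusted to your actual layout):
from pyspark.sql import functions as F

df = (
    spark.read.format("avro")
    .load("s3a://bucketname/data-files/*/*/*.avro")
    # input_file_name() records which file each row came from.
    .withColumn("filepath", F.input_file_name())
    # Pull the key=value parts out of the path.
    .withColumn("timestamp", F.regexp_extract("filepath", r"timestamp=([^/]+)", 1))
    .withColumn("id", F.regexp_extract("filepath", r"id=([^/]+)", 1))
)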
Why don't you try to read the files first by using the wholeTextFiles method and add the path name into the data itself at the beginning? Then you can filter out the file names from the data and add them as a column while creating the dataframe. I agree it's a two-step process, but it should work. To get the timestamp of a file you will need a filesystem object, which is not serializable, i.e. it can't be used in Spark's parallelized operations, so you will have to create a local collection of files and timestamps and join it somehow with the RDD you created with wholeTextFiles.

Is it possible to use composite triggers in conjunction with micro-batching with Dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500 thousand messages or five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write the contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this yields the error An exception occured while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified.
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS.
Reference: https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS

Is it necessary to use windows in Flink?

I'm attempting to transform a stream of data without using any window provided by Flink. My code looks something like this:
val stream1 = executionEnvironment.getStream
val stream2 = stream1.flatMap(someFunction)
stream2.addSink(s3_Sink)
executionEnvironment.execute()
However, upon submitting and running my job, I'm not getting any output on S3. The web UI shows 0 bytes received, 0 records received, 0 bytes sent, 0 records sent.
Another running Flink job is already using the same data source, so the data source is fine. There are no errors anywhere, but still no output. Could this be because I'm not using any window or keyed operation? I attempted to get output after assigning ascending timestamps, but still got nothing. Any idea what might not be working?
I guess this has nothing to do with a missing window. Rule of thumb: use windows when you want any kind of aggregation (folds, reduces, etc.).
Regarding your initial problem: from what you have shown so far, I can only imagine that the flatMap operator doesn't produce any output (in contrast to a map, which always has to emit a value, a flatMap might filter out everything). Maybe you can add more code so that we can have a closer look.