I am doing some experiments with Beam SQL.
I get a PCollection<Row> from the transform SampleSource and pass its output to a SqlTransform.
String sql1 = "select c1, c2, c3 from PCOLLECTION where c1 > 1";
The code below runs without any error.
POutput it = p.apply(new SampleSource()).apply(SqlTransform.query(sql1));
p.run().waitUntilFinish();
However, when I try the following lines of code, I get a runtime error.
POutput it = p.apply(new SampleSource());
it.getPipeline().apply(SqlTransform.query(sql1));
p.run().waitUntilFinish();
The error details are
Caused by: org.apache.beam.repackaged.beam_sdks_java_extensions_sql.org.apache.calcite.sql.validate.SqlValidatorException: Object 'PCOLLECTION' not found
Please provide some pointers.
It doesn't work because you're applying a SqlTransform to a pipeline, not a PCollection.
You probably want to change it along these lines:
// source probably returns a PCollection,
// would make sense to change 'it' to PCollection:
PCollection<...> it = p.apply(new SampleSource());
// then apply SqlTransform to the PCollection from the previous step,
// that is apply it directly to 'it':
it.apply(SqlTransform.query(sql1));
...
How a Beam pipeline works, from a high-level perspective:
create a pipeline;
apply an IO PTransform that reads from some source and produces a PCollection of the elements that it reads from the source;
chain-apply more PTransforms to the PCollection from the previous step to process the data (conceptually, different PCollections will be produced at each step);
repeat;
SqlTransform is a normal PTransform: it is expected to be applied to a PCollection of elements and to output another PCollection as a result. The query that you specify in SqlTransform.query() is applied to that PCollection. It expects the data to come from a magical PCOLLECTION table, which represents the PCollection you apply the SqlTransform to.
What you are doing in your example is different:
create a pipeline;
apply a source PTransform that produces a POutput, which is not necessarily a PCollection;
then you ignore the output of your source and instead take the original pipeline and apply a SqlTransform directly to it;
So what happens is that the SqlTransform in this case is applied to the 'root' of the pipeline, not to the PCollection that comes out of the source. Instead of a chain of PTransforms applied one after another, you now have two PTransforms applied to the root, independent of each other.
One more caveat is that SqlTransform expects input elements to be Rows, because SQL as a language works only on data that is represented as rows. There are two ways to achieve this:
manually convert the elements that are produced by the source to Rows by applying another ParDo between the source and SqlTransform;
use Beam's Schema framework (e.g. check out the PCollection.setSchema() and setRowSchema() methods), which allows Beam SQL to automatically convert input elements into Rows; see the sketch below.
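For example, a minimal sketch of the second approach, assuming SampleSource emits Row elements that do not yet have a schema attached (the field names and types here are only illustrative):

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Hypothetical schema matching the c1, c2, c3 columns used in the query.
Schema schema = Schema.builder()
    .addInt32Field("c1")
    .addStringField("c2")
    .addStringField("c3")
    .build();

PCollection<Row> rows = p
    .apply(new SampleSource())
    // attach the schema so Beam SQL knows how to interpret the elements
    .setRowSchema(schema);

// Inside the query, this PCollection is addressable as the PCOLLECTION table.
PCollection<Row> filtered =
    rows.apply(SqlTransform.query("select c1, c2, c3 from PCOLLECTION where c1 > 1"));

If SampleSource already sets the row schema internally, the explicit setRowSchema call is unnecessary and chaining the SqlTransform directly, as in the earlier snippet, is enough.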
Related
Suppose I create a Polars LazyFrame from a list of CSV files using pl.concat():
df = pl.concat([pl.scan_csv(file) for file in ['file1.csv', 'file2.csv']])
Is the data in the resulting dataframe guaranteed to have the exact order of the input files, or could there be a scenario where the query optimizer would mix things up?
The order is maintained. The engine may execute them in a different order, but the final result will always have the same order as the lazy computations provided by the caller.
I am using Spark with Scala, reading streaming data from Event Hubs and storing it in a Delta table. In order to apply Drools rules to the data, I need to pass it through variables... I am stuck at the point where I have to get the data from the Delta table into a variable.
It really depends on what data you need to pass to those Drools rules and what you need to return. You can either:
use a user-defined function (UDF) - you define a function that receives one or more parameters (the column values of specific rows);
use the map function of the Dataset / DataFrame class to process the whole Row. A rough sketch of both options follows below.
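Here is that sketch, shown with Spark's Java API (the Scala calls are analogous); the Delta path, the column name, and the checkRule predicate standing in for the actual Drools evaluation are all made up:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession.builder().getOrCreate();

// Read the Delta table into a DataFrame.
Dataset<Row> df = spark.read().format("delta").load("some/delta/path");

// Option 1: a UDF receives individual column values, row by row.
spark.udf().register(
    "checkRule",
    (UDF1<Double, Boolean>) value -> value != null && value > 100.0, // stand-in for the Drools call
    DataTypes.BooleanType);
Dataset<Row> flagged = df.withColumn("ruleFired", callUDF("checkRule", col("amount")));

// Option 2: map over the whole Row when a rule needs several fields at once.
Dataset<String> mapped = df.map(
    (MapFunction<Row, String>) row -> row.mkString("|"), // replace with the Drools-based processing
    Encoders.STRING());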
Delta Tables can be read into DataFrames. A variable can be assigned to point to the DataFrame.
df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
transformed_df = df.transform(first_transform).transform(second_transform)
Hope this helps point you in the right direction.
Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can have upwards of 100 parquet files, each between 50 MB and 500 MB in size.
Inputs
We are given a spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path                                                                     | ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet  | Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet  | Set("b")
Expected Behaviour
Given filesToModify, I want to take advantage of parallelism in Spark to do the following for each row:
Load the parquet file located at row.s3path
Filter so that we exclude any row whose id is in the set row.ids
Count the number of deleted/excluded rows per id in row.ids (optional)
Save the filtered data back to the same row.s3path to overwrite the file
Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import spark.implicits._
  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]
  val clean = data
    .filter(not(col("id").isInCollection(ids)))
  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)
  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However, this leads to a NullPointerException when executed within the map operation. If I execute it alone, outside of the map block, it works, but I can't understand why it doesn't work inside it (something to do with lazy evaluation?).
You get a NullPointerException because you try to retrieve your Spark session from an executor.
It is not explicit, but to perform a Spark action, your deleteIDs function needs to retrieve the active Spark session. To do so, it calls the getActiveSession method of the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
And thus a NullPointerException is thrown when your code starts using this None Spark session.
More generally, you can't create a dataset or use Spark transformations/actions inside transformations of another dataset.
So I see two solutions for your problem:
either rewrite the deleteIDs function without using Spark, modifying your parquet files with a library such as parquet4s, for instance;
or transform filesToModify into a Scala collection and use Scala's map instead of Spark's; a sketch of this option follows below.
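A minimal sketch of that second option, shown with Spark's Java API (the getS3path / getIds accessors on MyClass are hypothetical; in Scala you would use the case class fields directly):

import java.util.List;

// Bring the (presumably small) list of files down to the driver,
// then run deleteIDs as plain driver-side code, one file at a time.
List<MyClass> files = filesToModify.collectAsList();
for (MyClass file : files) {
  deleteIDs(file.getS3path(), file.getIds());
}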
s3path and ids parameters that are passed to deleteIDs are not actually strings and sets respectively. They are instead columns.
In order to operate over these values you can instead create a UDF that accepts columns instead of intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you seek to take advantage of Spark's parallelism.
You can read about UDFs in the Spark documentation.
I have an AlterRow transformation that marks each row with the appropriate CRUD operation in an ADFv2 data flow. I don't see any output variables on this activity that will give me the total inserts, updates, etc. I do, however, see methods in the expression syntax to tell me if a particular row is an IsInsert(), IsUpdate(), etc.
Would the correct way to get counts be to
Add another output from the AlterRow transformation
Add a derived column that uses the expression syntax IsInsert(), IsUpdate() to set the operation type (I, U, D)
Add an aggregate to group by this column to get total counts for each operation
When creating the aggregate, I don't see any metadata that would allow me to group by the CRUD operation type so I assume I would have to create this myself, but it seems like it should already be there since that's the purpose of the AlterRow transformation. Am I working too hard to get these counts?
Add an Aggregate transformation after your AlterRow with no group-by columns and, for each operation type, use a conditional count expression built from isInsert(), isUpdate(), and isDelete() to produce the totals.
I am working on an Apache Beam pipeline to run a SQL aggregation function. Reference: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslJoinTest.java#L159.
The example there works fine. However, when I replace the source with an actual unbounded source and do an aggregation, I see no results.
Steps in my pipeline:
Read bounded data from a source and convert to collection of rows.
Read unbounded json data from a websocket source.
Assign a timestamp to every element of the source stream via a DoFn.
Convert the unbounded JSON to an unbounded Row collection.
Apply a window on the row collection
Apply a SQL statement.
Output the result of the sql.
A normal SQL statement executes and outputs the results. However, when I use a group by in the SQL, there is no output.
SELECT
o1.detectedCount,
o1.sensor se,
o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
The results are continuous and look like the ones shown below.
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":1,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
The results don't show up at all when I change the SQL to:
SELECT
COUNT(o1.detectedCount) o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
GROUP BY o2.sensor
Is there anything I am doing wrong in this implementation? Any pointers would be really helpful.
Some suggestions come up when reading your code:
Extend the window definition to allow lateness and to emit early-arriving data:
.apply("windowing", Window.<Row>into(FixedWindows.of(Duration.standardSeconds(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(2))))
.withAllowedLateness(Duration.standardMinutes(10))
.discardingFiredPanes());
Try removing the join and check whether you get output from the window without it.
Try adding more time to the window, because sometimes it is too short for the data to be shuffled between the workers, and the joined streams aren't emitted at the same time.
outputWithTimestamp will output the rows with a different timestamp, and they can then be dropped when you don't allow lateness.
Read the docs for outputWithTimestamp; this API is a bit risky:
If the input {@link PCollection} elements have timestamps, the output
timestamp for each element must not be before the input element's
timestamp minus the value of {@link getAllowedTimestampSkew()}. If an
output timestamp is before this time, the transform will throw an
{@link IllegalArgumentException} when executed. Use {@link
withAllowedTimestampSkew(Duration)} to update the allowed skew.
CAUTION: Use of {@link #withAllowedTimestampSkew(Duration)} permits
elements to be emitted behind the watermark. These elements are
considered late, and if behind the {@link
Window#withAllowedLateness(Duration) allowed lateness} of a downstream
{@link PCollection} may be silently dropped.
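For reference, here is a minimal sketch of a timestamp-assigning DoFn like the one in step 3 of your pipeline; the eventTime field name is made up, and the skew caveat quoted above applies to it:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Instant;

// Assigns each Row the event time carried in one of its fields.
class AssignEventTime extends DoFn<Row, Row> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Row row = c.element();
    Instant eventTime = new Instant(row.getDateTime("eventTime").getMillis());
    // Elements emitted behind the watermark may be treated as late data
    // and silently dropped unless the window allows enough lateness.
    c.outputWithTimestamp(row, eventTime);
  }
}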
Finally, give every selected column a valid alias and make sure the non-aggregated columns appear in the GROUP BY, along these lines:
SELECT
COUNT(o1.detectedCount) AS detectedTotal,
o2.sensor AS sa
FROM SENSOR o1
LEFT OUTER JOIN AREA o2
ON o1.sensor = o2.sensor
GROUP BY o1.sensor, o2.sensor