I am using the apache-beam==2.16.0 Python SDK. The pipeline looks as below: it reads files from a GCS bucket, parses the rows, validates them, and saves them to BigQuery. bq_table_name is a ValueProvider argument. I am passing the table name in the format PROJECT:DATASET.TABLE, but Beam complains that there is no dataset named PROJECT:DATASET. I am using the --experiments use_beam_bq_sink flag as well.
pipeline \
    | 'Read GCS Files' >> beam.io.ReadFromText(
        file_pattern=dq_options.files_path) \
    | 'Parse Rows' >> beam.ParDo(.....) \
    | "Validate Rows" >> beam.ParDo(....) \
    | "Write to BQ" >> beam.io.WriteToBigQuery(
        pipeline_options.bq_table_name,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
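For reference, WriteToBigQuery also accepts the project, dataset and table as separate arguments rather than a single PROJECT:DATASET.TABLE string. A minimal sketch of that variant, with hypothetical names and plain strings in place of the ValueProvider, looks like this:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create Rows' >> beam.Create([{'field1': 'a', 'field2': 'b'}])  # stand-in for the parsed rows
        | 'Write to BQ' >> beam.io.WriteToBigQuery(
            table='my_table',                        # hypothetical table
            dataset='my_dataset',                    # hypothetical dataset
            project='my-project',                    # hypothetical project id
            schema='field1:STRING,field2:STRING',    # hypothetical schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )

Whether splitting the reference this way sidesteps the PROJECT:DATASET.TABLE parsing issue with a ValueProvider in 2.16.0 is something to verify against your setup.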
I want to build a simple Beam pipeline that stores the events from Kafka to S3 in Parquet format every minute.
Here's a simplified version of my pipeline:
from typing import Any

import apache_beam as beam
from apache_beam import window
from apache_beam.io.kafka import ReadFromKafka

def add_timestamp(event: Any) -> Any:
    from datetime import datetime
    from apache_beam import window
    return window.TimestampedValue(event, datetime.timestamp(event[1].timestamp))

# Actual pipeline
(
    pipeline
    | "Read from Kafka" >> ReadFromKafka(consumer_config, topics, with_metadata=False)
    | "Transformed" >> beam.Map(my_transform)
    | "Add timestamp" >> beam.Map(add_timestamp)
    | "window" >> beam.WindowInto(window.FixedWindows(60))  # 1 min
    | "writing to parquet" >> beam.io.WriteToParquet('s3://test-bucket/', pyarrow_schema)
)
However, the pipeline failed with
GroupByKey cannot be applied to an unbounded PCollection with global windowing and a default trigger
This seems to come from https://github.com/apache/beam/blob/v2.41.0/sdks/python/apache_beam/io/iobase.py#L1145-L1146, which always applies GlobalWindows and thus causes this error. I am wondering what I should do to correctly back up the events from Kafka (unbounded) to S3. Thanks!
By the way, I am running with the portable runner on Flink. The Beam version is 2.41.0 (the latest version seems to have the same code) and the Flink version is 1.14.5.
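For context, the restriction in that error applies to a GroupByKey over an unbounded PCollection that still has global windowing and the default trigger, so the general remedy is non-global windowing or an explicit trigger before the grouping. A minimal sketch of the windowing step with an explicit trigger (the values are illustrative, and the re-windowing inside WriteImpl may still re-apply GlobalWindows afterwards):

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, Repeatedly

# "events" is a hypothetical unbounded PCollection (e.g. the Kafka read above).
windowed = (
    events
    | "window" >> beam.WindowInto(
        window.FixedWindows(60),                       # 1-minute windows
        trigger=Repeatedly(AfterProcessingTime(60)),   # fire roughly once a minute
        accumulation_mode=AccumulationMode.DISCARDING)
)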
I need to change an environment variable under the container definition in the ECS task definition.
TASK_DEFINITION=$(aws ecs describe-task-definition --task-definition $TASKDEFINITION_ARN --output json)
echo $TASK_DEFINITION | jq '.taskDefinition.containerDefinitions[0] | ( .environment[] |= if
.name == "ES_PORT_9200_TCP_ADDR" then .value = "vpc-kkslke-shared-3-abcdkalssdfy.us-east-1.es.amazonaws.com"
else . end)' | jq -s . >container-definition.json
CONTAINER_DEF=$(<$container-definition.json)
aws ecs register-task-definition --family $FAMILY_NAME --container-definitions $CONTAINER_DEF
Error Message:
Error parsing parameter '--container-definitions': Invalid JSON:
Expecting property name enclosed in double quotes: line 1 column 2
(char 1) JSON received: {
One observation, not sure if it is related to a bug in VS Code or not: when I try to view the variable value in debug mode, I only get partial text as seen below, but when I echo the same variable I do see the full JSON. I am not sure whether the whole value is being passed to the container definition.
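Two things may be worth checking here, as a sketch rather than a confirmed fix: the unquoted $CONTAINER_DEF is split on whitespace by the shell before the AWS CLI sees it (which would explain the CLI receiving only the leading {), and $(<$container-definition.json) expands a $container shell variable, which is probably not what was intended. Alternatively, the AWS CLI can read JSON parameters straight from a file with the file:// prefix, which avoids the shell variable entirely:

# Quote the variable so the JSON is passed as a single argument ...
aws ecs register-task-definition --family "$FAMILY_NAME" \
    --container-definitions "$CONTAINER_DEF"

# ... or let the CLI read the JSON file directly.
aws ecs register-task-definition --family "$FAMILY_NAME" \
    --container-definitions file://container-definition.json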
Up until recently my gcloud spanner queries were nicely presented as columns across the screen, with each output line representing a single row from the query. Recently, however, for some unknown reason, the output is displayed with each row presented as column:value pairs down the screen, e.g.:
PKey: 9moVr4HmSy6GGIYJyVGu3A==
Ty: Pf
IsNode: Y
P: None
IX: X
I have tried various --format command line options but alas have had no success in generating the original row-per-line output format, i.e. with the columns presented across the screen as follows:
PKey Ty IsNode P IX <-- column names
9moVr4HmSy6GGIYJyVGu3A== Pf Y None X <--- row data
What format option should I use?
Example of gcloud query:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block "
Thanks
gcloud formats the results as a table if they are being written to a file; the usual formatting rules apply otherwise.
So one easy way to see the table in the shell is to tee it somewhere:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
| tee /dev/null
If you can't do that for some reason you can always get the same result with some --format surgery. To print the column names:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
--format 'csv[no-heading, delimiter=" "](metadata.rowType.fields.name)'
And to print the rows:
gcloud spanner databases execute-sql test-sdk-db --instance=test-instance --sql="Select * from Block " \
--format 'csv[no-heading, delimiter="\n"](rows.map().flatten(separator=" "))'
The format of gcloud spanner databases execute-sql is a result of broader changes in formatting to better support accessibility standards for screen readers.
There are two methods to receive results in the tabular format. The first is to set the accessibility/screen_reader configuration property to false:
gcloud config set accessibility/screen_reader false
The second, similar to the other suggestion about using formatting, is to use a --format=multi(...) option in your gcloud command:
gcloud spanner databases execute-sql test-sdk-db \
--instance=test-instance --sql="Select * from Block " \
--format="multi(metadata:format='value[delimiter='\t'](rowType.fields.name)', \
rows:format='value[delimiter='\t']([])')"
The caveat of this second method is that column names and values may not align due to differences in length.
I understand that dataproc workflow-templates is still in beta, but how do you pass parameters via add-job into the executable SQL? Here is a basic example:
#!/bin/bash
DATE_PARTITION=$1
echo DatePartition: $DATE_PARTITION
# sample job
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=$DATE_PARTITION"
gcloud beta dataproc workflow-templates run $WORK_FLOW
gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job
echo `date`
Here is my first-job.sql file called from the shell:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
USE mydb;
CREATE EXTERNAL TABLE if not exists data_raw (
field1 string,
field2 string
)
PARTITIONED BY (dt String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'gs://data/first-job/';
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");
In the ALTER TABLE statement, what is the correct syntax? I’ve tried what feels like over 15 variations but nothing works. If I hard code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");) the partition gets created, but unfortunately it needs to be parameterized.
BTW – The error I receive is consistently like this:
Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant
I am probably close but not sure what I am missing.
TIA,
Melissa
Update: Dataproc now has workflow template parameterization, a beta feature:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters
For your specific case, you can do the following:
Create an empty template
gcloud beta dataproc workflow-templates create my-template
Add a job with a placeholder for the value you want to parameterize
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=PLACEHOLDER"
Export the template configuration to a file
gcloud beta dataproc workflow-templates export my-template \
--destination=hive-template.yaml
Edit the file to add a parameter
jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']
Import the changes
gcloud beta dataproc workflow-templates import my-template \
--source=hive-template.yaml
Add a managed cluster or cluster selector
gcloud beta dataproc workflow-templates set-managed-cluster my-template \
--cluster-name=my-cluster \
--zone=us-central1-a
Run your template with parameters
gcloud beta dataproc workflow-templates instantiate my-template \
--parameters="DATE_PARTITION=${DATE_PARTITION}"
Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap. However for now your remove-job/add-job trick is the best way to go.
Regarding your specific question:
Values passed via --params are accessed as ${hivevar:PARAM} (see [1], and the short illustration below). Alternatively, you can set --properties, which are accessed as ${PARAM}.
The brackets around --params are not needed. If the intent is to handle spaces in parameter values, use quotation marks, e.g. --params="FOO=a b c,BAR=X".
Finally, I noticed an errant space here, DATE_PARTITION =$1, which probably results in an empty DATE_PARTITION value.
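For illustration, using the DATE_PARTITION variable and the sample date from the question, the two access forms described above look like this in the SQL file:
-- Passed with --params="DATE_PARTITION=2017-10-31" and read from the hivevar namespace:
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");
-- Passed with --properties="DATE_PARTITION=2017-10-31" and read as a plain variable:
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${DATE_PARTITION}");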
Hope this helps!
[1] How to use params/properties flag values when executing hive job on google dataproc
When using Mallet, how do I get a list of topics associated with each document? I think I need to use train-topics and --output-topic-docs, but when I do, I get an error.
I'm using Mallet (2.0.8), and I use the following bash script to do my modeling:
MALLET=/Users/emorgan/desktop/mallet/bin/mallet
INPUT=/Users/emorgan/desktop/sermons
OBJECT=./object.mallet
$MALLET import-dir --input $INPUT --output $OBJECT --keep-sequence --remove-stopwords
$MALLET train-topics --input $OBJECT --num-topics 10 --num-top-words 1 \
--num-iterations 50 \
--output-doc-topics ./topics.txt \
--output-topic-keys ./keys.txt \
--xml-topic-report ./topic.xml \
--output-topic-docs ./docs.txt
Unfortunately, ./docs.txt does not get created. Instead I get the following error:
Exception in thread "main" java.lang.ClassCastException: java.net.URI cannot be cast to java.lang.String
at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)
More specifically, I want Mallet to generate a list of documents and the associated topics assigned to them, or I want a list of topics and then the list of associated documents. How do I create such lists?
At least in Mallet 2.0.7, it is --output-doc-topics ./topics.txt that gives the desired table (the topic composition of each document). While the output format changed from 2.0.7 to 2.0.8, the main content of the file stayed the same.
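If you then want, say, the strongest topics per document, a small post-processing step over topics.txt is enough. A minimal sketch, assuming the 2.0.8 layout (tab-separated: document index, document name, then one proportion column per topic; 2.0.7 instead emits sorted topic/proportion pairs):

import csv

# Print each document's three strongest topics, highest proportion first.
with open("topics.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        if not row or row[0].startswith("#"):   # skip the header line, if present
            continue
        doc_name = row[1]
        proportions = [float(p) for p in row[2:] if p]
        ranked = sorted(enumerate(proportions), key=lambda t: t[1], reverse=True)
        top = ", ".join("topic %d (%.2f)" % (i, p) for i, p in ranked[:3])
        print("%s: %s" % (doc_name, top))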