BigQuery to PostgreSQL execution failed on Dataflow workflow timestamp - postgresql

Hi, I have this issue where I am unsure how to get a proper start date for my query. I get the following error and am unsure how to go about fixing it. Can I get help with the time conversion format, please?
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Workflow failed. Causes: S01:QueryTableStdSQL+Writing to DB/ParDo(_WriteToRelationalDBFn) failed., BigQuery execution failed., Error:
Message: No matching signature for operator >= for argument types: TIMESTAMP, INT64. Supported signature: ANY >= ANY at [1:1241]
HTTP Code: 400
My script's main query looks like:
with beam.Pipeline(options=options) as p:
    rows = p | 'QueryTableStdSQL' >> beam.io.Read(beam.io.BigQuerySource(use_standard_sql=True,
        query = 'SELECT \
            billing_account_id, \
            service.id as service_id, \
            service.description as service_description, \
            sku.id as sku_id, \
            sku.description as sku_description, \
            usage_start_time, \
            usage_end_time, \
            project.id as project_id, \
            project.name as project_description, \
            TO_JSON_STRING(project.labels) \
            as project_labels, \
            project.ancestry_numbers \
            as project_ancestry_numbers, \
            TO_JSON_STRING(labels) as labels, \
            TO_JSON_STRING(system_labels) as system_labels, \
            location.location as location_location, \
            location.country as location_country, \
            location.region as location_region, \
            location.zone as location_zone, \
            export_time, \
            cost, \
            currency, \
            currency_conversion_rate, \
            usage.amount as usage_amount, \
            usage.unit as usage_unit, \
            usage.amount_in_pricing_units as \
            usage_amount_in_pricing_units, \
            usage.pricing_unit as usage_pricing_unit, \
            TO_JSON_STRING(credits) as credits, \
            invoice.month as invoice_month, \
            cost_type, \
            FROM `pprodjectID.bill_usage.gcp_billing_export_v1_xxxxxxxx` \
            WHERE export_time >= 2020-01-01'))
    source_config = relational_db.SourceConfiguration(
The date format in the BigQuery console:
export_time: 2018-01-25 01:18:55.637 UTC
usage_start_time: 2018-01-24 21:23:10.643 UTC

You forgot to quote the timestamp as a string:
WHERE export_time >= 2020-01-01
The above is evaluated as integer arithmetic (2020 - 01 - 01 = 2018), so BigQuery ends up comparing TIMESTAMP >= INT64. You should have:
WHERE export_time >= "2020-01-01"
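As an illustration, here is a minimal sketch of how the filter could be built in the pipeline code so the literal is treated as a timestamp rather than as integer arithmetic. The column list is shortened and the table name is the placeholder from the question; wrapping the date in TIMESTAMP() is optional, since quoting alone is enough:
# Minimal sketch: quote the start date (or wrap it in TIMESTAMP()) so
# BigQuery compares TIMESTAMP against TIMESTAMP, not TIMESTAMP against INT64.
start_date = "2020-01-01"  # assumed to be a plain YYYY-MM-DD string
query = (
    'SELECT billing_account_id, export_time, cost '
    'FROM `pprodjectID.bill_usage.gcp_billing_export_v1_xxxxxxxx` '
    'WHERE export_time >= TIMESTAMP("{}")'
).format(start_date)
The same applies in the full query: as long as the interpolated value ends up inside quotes (or inside TIMESTAMP(...)), the >= comparison has matching types.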

Related

Spark Shuffle Read and Shuffle Write Increasing in Structured Streaming

I have been running Spark structured streaming with Kafka for the last 23 hours. I could see Shuffle Read and Shuffle Write increasing drastically, and finally the driver stopped due to "out of memory".
Data pushed to Kafka is 3 JSON messages per second, and the Spark streaming trigger is processingTime='30 seconds'.
spark = SparkSession \
    .builder \
    .master("spark://spark-master:7077") \
    .appName("demo") \
    .config("spark.executor.cores", 1) \
    .config("spark.cores.max", "4") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "hdfs://172.30.7.36:9000/user/hive/warehouse") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.executor.memory", '1g') \
    .config("spark.scheduler.mode", "FAIR") \
    .config("spark.driver.memory", '2g') \
    .config("spark.sql.caseSensitive", "true") \
    .config("spark.sql.shuffle.partitions", 8) \
    .enableHiveSupport() \
    .getOrCreate()

CustDf \
    .writeStream \
    .queryName("customerdatatest") \
    .format("delta") \
    .outputMode("append") \
    .trigger(processingTime='30 seconds') \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/checkpoint/bronze_customer/") \
    .toTable("bronze.customer")
I am expecting this streaming job to run continuously for at least 1 month.
Spark is transforming the JSON (flattening it) and inserting into the delta table.
Please help me on this. Have I missed any configuration?

Why is Kafka returning the same data from the topic even after it is read once?

This is my first Kafka project (with Spark streaming)
I am trying to read a Kafka topic which is getting data from an upstream source.
They are pushing data into the Kafka topic in the below way:
def kafka_ingest(df: DataFrame, kafkaconfig: dict, topic_name: str):
    jaas_config = kafkaconfig['jaas_config'] + \
        f" oauth.client.id='{kafkaconfig['client_id']}'" + \
        f" oauth.client.secret='{kafkaconfig['client_secret']}'" + \
        f" oauth.token.endpoint.uri='{kafkaconfig['endpoint_uri']}'" + \
        " oauth.max.token.expiry.seconds='30000' ;"
    df.write.format('kafka') \
        .option('kafka.bootstrap.servers', kafkaconfig['kafka_broker']) \
        .option('kafka.batch.size', kafkaconfig['kafka_batch_size']) \
        .option('retries', kafkaconfig['retries']) \
        .option('kafka.max.block.ms', kafkaconfig['kafka_max_block_ms']) \
        .option('kafka.metadata.max.age.ms', kafkaconfig['kafka_metadata_max_age_ms']) \
        .option('kafka.request.timeout.ms', kafkaconfig['kafka_request_timeout_ms']) \
        .option('kafka.linger.ms', kafkaconfig['kafka_linger_ms']) \
        .option('kafka.delivery.timeout.ms', kafkaconfig['kafka_delivery_timeout_ms']) \
        .option('acks', kafkaconfig['acks']) \
        .option('kafka.security.protocol', kafkaconfig['kafka_security_protocol']) \
        .option('kafka.sasl.jaas.config', jaas_config) \
        .option('kafka.sasl.login.callback.handler.class', kafkaconfig['kafka_sasl_login_callback_handler_class']) \
        .option('kafka.sasl.mechanism', kafkaconfig['kafka_sasl_mechanism']) \
        .option('topic', topic_name) \
        .save()
When ingesting data into Kafka, I am using the above method in a foreachBatch method where I also mention the corresponding checkpoint as given below.
def write_stream_batches(kafka_df: DataFrame, checkpoint_location: str):
    kafka_df.writeStream \
        .format('kafka') \
        .foreachBatch(join_kafka_streams_po_denorm) \
        .option('checkpointLocation', checkpoint_location) \
        .start() \
        .awaitTermination()

def join_kafka_streams_po_denorm(kafka_df: DataFrame, batch_id: int):
    final_df = kafka_df.some_transformations
    kafka_ingest(final_df, kafkaconfig, topic_name)
I am reading data from the topic as:
def extract_kafka_data(kafka_config: dict, topic_name: str, column_schema: str, checkpoint_location: str):
    schema = extract_schema(column_schema)
    jass_config = kafka_config['jaas_config'] \
        + " oauth.token.endpoint.uri=" + '"' + kafka_config['endpoint_uri'] + '"' \
        + " oauth.client.id=" + '"' + kafka_config['client_id'] + '"' \
        + " oauth.client.secret=" + '"' + kafka_config['client_secret'] + '" ;'
    stream_df = spark.readStream \
        .format('kafka') \
        .option('kafka.bootstrap.servers', kafka_config['kafka_broker']) \
        .option('subscribe', topic_name) \
        .option('kafka.security.protocol', kafka_config['kafka_security_protocol']) \
        .option('kafka.sasl.mechanism', kafka_config['kafka_sasl_mechanism']) \
        .option('kafka.sasl.jaas.config', jass_config) \
        .option('kafka.sasl.login.callback.handler.class', kafka_config['kafka_sasl_login_callback_handler_class']) \
        .option('startingOffsets', 'earliest') \
        .option('fetchOffset.retryIntervalMs', kafka_config['kafka_fetch_offset_retry_intervalms']) \
        .option('fetchOffset.numRetries', kafka_config['retries']) \
        .option('failOnDataLoss', 'False') \
        .option('checkpointLocation', checkpoint_location) \
        .load() \
        .select(from_json(col('value').cast('string'), schema).alias("json_dta")).selectExpr('json_dta.*')
    return stream_df
Every time I display data from the dataframe I see the same data returned:
Read 1:
df = extract_kafka_data(kafka_config, topic_name, column_schema, checkpoint_location)
display(df)
output:
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
I have 4 records in my topic, which I ingested using the kafka_ingest method. Now that I have read all 4 records, I am expecting no output if I read the topic again.
Read 2:
df = extract_kafka_data(kafka_config, topic_name, column_schema, checkpoint_location)
display(df)
output:
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
Once I read the data in Read 1, I shouldn't see the same data again as per the offset mechanism of the topic.
But the job is returning the same data as it did for Read 1.
Is there something wrong with the way I have set up the offset strategy and the usage of checkpointing?
Could anyone let me know what mistake I am making here?
Any help is massively appreciated.

pyspark - microbatch streaming delta table as a source to perform merge against another delta table - foreachBatch is not getting invoked

I have created a delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on Dataproc image 1.5x in Google Cloud.
Spark version 2.4.7
Delta version 0.6.0
My code looks as follows:
from delta.tables import *

spark = SparkSession.builder \
    .appName("streaming_merge") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Function to upsert `microBatchOutputDF` into Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
    (deltaTable.alias("accnt").merge(
        microBatchOutputDF.alias("updates"), \
        "accnt.acct_nbr = updates.acct_nbr") \
        .whenMatchedDelete(condition = "updates.cdc_ind='D'") \
        .whenMatchedUpdateAll(condition = "updates.cdc_ind='U'") \
        .whenNotMatchedInsertAll(condition = "updates.cdc_ind!='D'") \
        .execute()
    )

deltaTable = DeltaTable.forPath(spark, "gs:<<path_for_the_target_delta_table>>")

# Define the source extract
SourceDF = (
    spark.readStream \
        .format("delta") \
        .load("gs://<<path_for_the_source_delta_location>>")
)

# Start the query to continuously upsert into target tables in update mode
SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
This code runs without any problems, but there is no data written to the delta table. I suspect foreachBatch is not getting invoked. Does anyone know what I'm doing wrong?
After adding awaitTermination, the streaming query started working: it picked up the latest data from the source and performed the merge on the delta target table.
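For reference, a minimal sketch of what the corrected sink call might look like, reusing the mergeToDelta function and the placeholder paths from the question:
# Sketch: keep the handle returned by start() and block on it.
# With trigger(once=True), awaitTermination() makes the driver wait until the
# single micro-batch (and therefore the MERGE inside mergeToDelta) has run.
query = SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
query.awaitTermination()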

AttributeError: 'Namespace' object has no attribute 'project'

I am trying to reuse code which I copied from https://www.opsguru.io/post/solution-walkthrough-visualizing-daily-cloud-spend-on-gcp-using-gke-dataflow-bigquery-and-grafana. I am not too familiar with Python, so I am seeking help here. I am trying to copy GCP BigQuery data into Postgres.
I have made some modifications to the code and am getting an error, due to my mistake or the code.
Here is what I have
import uuid
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, GoogleCloudOptions, WorkerOptions
from beam_nuggets.io import relational_db
from apache_beam.io.gcp import bigquery
parser = argparse.ArgumentParser()
args = parser.parse_args()
project = args.project("project", help="Enter Project ID")
job_name = args.job_name + str(uuid.uuid4())
bigquery_source = args.bigquery_source
postgresql_user = args.postgresql_user
postgresql_password = args.postgresql_password
postgresql_host = args.postgresql_host
postgresql_port = args.postgresql_port
postgresql_db = args.postgresql_db
postgresql_table = args.postgresql_table
staging_location = args.staging_location
temp_location = args.temp_location
subnetwork = args.subnetwork
options = PipelineOptions(
flags=["--requirements_file", "/opt/python/requirements.txt"])
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = project
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = staging_location
google_cloud_options.temp_location = temp_location
google_cloud_options.region = "europe-west4"
worker_options = options.view_as(WorkerOptions)
worker_options.zone = "europe-west4-a"
worker_options.subnetwork = subnetwork
worker_options.max_num_workers = 20
options.view_as(StandardOptions).runner = 'DataflowRunner'
start_date = define_start_date()
with beam.Pipeline(options=options) as p:
    rows = p | 'QueryTableStdSQL' >> beam.io.Read(beam.io.BigQuerySource(
        query = 'SELECT \
            billing_account_id, \
            service.id as service_id, \
            service.description as service_description, \
            sku.id as sku_id, \
            sku.description as sku_description, \
            usage_start_time, \
            usage_end_time, \
            project.id as project_id, \
            project.name as project_description, \
            TO_JSON_STRING(project.labels) \
            as project_labels, \
            project.ancestry_numbers \
            as project_ancestry_numbers, \
            TO_JSON_STRING(labels) as labels, \
            TO_JSON_STRING(system_labels) as system_labels, \
            location.location as location_location, \
            location.country as location_country, \
            location.region as location_region, \
            location.zone as location_zone, \
            export_time, \
            cost, \
            currency, \
            currency_conversion_rate, \
            usage.amount as usage_amount, \
            usage.unit as usage_unit, \
            usage.amount_in_pricing_units as \
            usage_amount_in_pricing_units, \
            usage.pricing_unit as usage_pricing_unit, \
            TO_JSON_STRING(credits) as credits, \
            invoice.month as invoice_month, \
            cost_type \
            FROM `' + project + '.' + bigquery_source + '` \
            WHERE export_time >= "' + start_date + '"', use_standard_sql=True))
    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',
        host=postgresql_host,
        port=postgresql_port,
        username=postgresql_user,
        password=postgresql_password,
        database=postgresql_db,
        create_if_missing=True,
    )
    table_config = relational_db.TableConfiguration(
        name=postgresql_table,
        create_if_missing=True
    )
    rows | 'Writing to DB' >> relational_db.Write(
        source_config=source_config,
        table_config=table_config
    )
When I run the program I am getting the following error:
bq-to-sql.py: error: unrecognized arguments: --project xxxxx --job_name bq-to-sql-job --bigquery_source xxxxxxxx --postgresql_user xxxxx --postgresql_password xxxxx --postgresql_host xx.xx.xx.xx --postgresql_port 5432 --postgresql_db xxxx --postgresql_table xxxx --staging_location gs://xxxxx-staging --temp_location gs://xxxxx-temp --subnetwork regions/europe-west4/subnetworks/xxxx
argparse needs to be configured. Argparse works like magic, but it does need configuration. These lines are needed between parser = argparse.ArgumentParser() and args = parser.parse_args():
parser.add_argument("--project")
parser.add_argument("--job_name")
parser.add_argument("--bigquery_source")
parser.add_argument("--postgresql_user")
parser.add_argument("--postgresql_password")
parser.add_argument("--postgresql_host")
parser.add_argument("--postgresql_port")
parser.add_argument("--postgresql_db")
parser.add_argument("--postgresql_table")
parser.add_argument("--staging_location")
parser.add_argument("--temp_location")
parser.add_argument("--subnetwork")
Argparse is a useful library. I recommend adding a lot of options to these add_argument calls.
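As a hedged sketch, fuller definitions could look like the following; the description, help strings, types, and required/default values are illustrative, not taken from the original script:
import argparse
import uuid

parser = argparse.ArgumentParser(description="Copy a BigQuery billing export to PostgreSQL")
parser.add_argument("--project", required=True, help="GCP project ID")
parser.add_argument("--job_name", default="bq-to-sql-job", help="Dataflow job name prefix")
parser.add_argument("--bigquery_source", required=True, help="dataset.table of the billing export")
parser.add_argument("--postgresql_port", type=int, default=5432, help="PostgreSQL port")
args = parser.parse_args()

# Parsed values are then read as plain attributes, not called like functions:
project = args.project
job_name = args.job_name + str(uuid.uuid4())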

How to install MongoDB 4.4 on Alpine 3.11

I want to install MongoDB 4.4 on Alpine 3.11, but it appears that Alpine has removed the MongoDB package because of the SSPL license.
I have to build the image myself, but I am hitting some errors.
First, I clone the git repository:
git clone --branch v4.4 --single-branch --depth 1 https://github.com/mongodb/mongo.git /tmp/mongo
Then I install some packages:
apk add --no-cache --virtual build-pack \
boost-build=1.69.0-r1 \
boost-filesystem=1.71.0-r1 \
boost-iostreams=1.71.0-r1 \
boost-program_options=1.71.0-r1 \
boost-python3=1.71.0-r1 \
boost-system=1.71.0-r1 \
build-base=0.5-r1 \
busybox=1.31.1-r9 \
curl-dev=7.67.0-r0 \
cmake=3.15.5-r0 \
db=5.3.28-r1 \
isl=0.18-r0 \
libbz2=1.0.8-r1 \
libc-dev=0.7.2-r0 \
libgcc=9.2.0-r4 \
libpcrecpp=8.43-r0 \
libsasl=2.1.27-r5 \
libssl1.1=1.1.1d-r3 \
libstdc++=9.2.0-r4 \
linux-headers=4.19.36-r0 \
g++=9.2.0-r4 \
gcc=9.2.0-r4 \
gmp=6.1.2-r1 \
jsoncpp=1.9.2-r0 \
jsoncpp-dev=1.9.2-r0 \
mpc1=1.1.0-r1 \
mpfr4=4.0.2-r1 \
musl=1.1.24-r2 \
musl-dev=1.1.24-r2 \
openssl-dev=1.1.1d-r3 \
pcre=8.43-r0 \
pkgconf=1.6.3-r0 \
python3=3.8.2-r0 \
py3-cheetah=3.2.4-r1 \
py3-crypto=2.6.1-r5 \
py3-openssl=19.1.0-r0 \
py3-psutil=5.6.7-r0 \
py3-yaml=5.3.1-r0 \
scons=3.1.1-r0 \
snappy=1.1.7-r1 \
xz-libs=5.2.4-r0 \
yaml-cpp=0.6.3-r0 \
yaml-cpp-dev=0.6.3-r0 \
zlib=1.2.11-r3
I get this error when I run python3 buildscripts/scons.py MONGO_VERSION=4.4 --prefix=/opt/mongo mongod --disable-warnings-as-errors:
src/mongo/util/processinfo_linux.cpp:50:10: fatal error: gnu/libc-version.h: No such file or directory
...
scons: building terminated because of errors.
build/opt/mongo/util/processinfo_linux.o failed: Error 1
Any idea?
Thanks.
EDIT: I have tried version 4.2.5 and I get this error message:
In file included from src/third_party/mozjs-60/platform/x86_64/linux/build/Unified_cpp_js_src29.cpp:11:
src/third_party/mozjs-60/extract/js/src/threading/posix/Thread.cpp: In function 'void js::ThisThread::GetName(char*, size_t)':
src/third_party/mozjs-60/extract/js/src/threading/posix/Thread.cpp:210:8: error: 'pthread_getname_np' was not declared in this scope; did you mean 'pthread_setname_np'?
210 | rv = pthread_getname_np(pthread_self(), nameBuffer, len);
| ^~~~~~~~~~~~~~~~~~
| pthread_setname_np
scons: *** [build/opt/third_party/mozjs-60/platform/x86_64/linux/build/Unified_cpp_js_src29.o] Error 1
scons: building terminated because of errors.
build/opt/third_party/mozjs-60/platform/x86_64/linux/build/Unified_cpp_js_src29.o failed: Error 1
With these packages:
apk add --no-cache --virtual build-pack \
build-base=0.5-r1 \
cmake=3.15.5-r0 \
curl-dev=7.67.0-r0 \
libgcc=9.2.0-r4 \
libssl1.1=1.1.1d-r3 \
libstdc++=9.2.0-r4 \
linux-headers=4.19.36-r0 \
g++=9.2.0-r4 \
gcc=9.2.0-r4 \
openssl-dev=1.1.1d-r3 \
musl=1.1.24-r2 \
python3=3.8.2-r0 \
py3-cheetah=3.2.4-r1 \
py3-crypto=2.6.1-r5 \
py3-openssl=19.1.0-r0 \
py3-psutil=5.6.7-r0 \
py3-yaml=5.3.1-r0 \
scons=3.1.1-r0 \
libc-dev=0.7.2-r0
Just run:
apk update && apk add openrc