java.lang.NumberFormatException: For input string: "|" - scala

I have imported a table into HDFS with fields-terminated-by '|':
sqoop import \
--connect jdbc:mysql://connection \
--username \
--password \
--table products \
--as-textfile \
--target-dir /user/username/productsdemo \
--fields-terminated-by '|'
After that, I am trying to read it as an RDD using spark-shell version 1.6.2:
var productsRDD = sc.textFile("/user/username/productsdemo")
and convert it into a DataFrame:
var productsDF = productsRDD.map(product => {
  var o = product.split("|");
  products(o(0).toInt, o(1).toInt, o(2), o(3), o(4).toFloat, o(5))
}).toDF("product_id", "product_category_id", "product_name", "product_description", "product_price", "product_image")
But when I try to print the output, it throws the exception below.
java.lang.NumberFormatException: For input string: "|"
Why am I getting this error? Can anyone help me out?

split uses a regex to split the string. Since | is a special character in regex (it means OR), you need to use \\| instead of | when splitting:
var o = product.split("\\|");

Related

Spark Structured Streaming writeStream outputs no data but no error

I have a Structured Streaming job which reads messages from a Kafka topic and then saves them to DBFS. The code is as follows:
from pyspark.sql.functions import window, count

input_stream = spark.readStream \
    .format("kafka") \
    .options(**kafka_options) \
    .load() \
    .transform(create_raw_features)

# transformation over a 7-day rolling window
def transform_func(df):
    window_spec = window("event_timestamp", "7 days", "1 day")
    return df \
        .withWatermark(eventTime="event_timestamp", delayThreshold="2 days") \
        .groupBy(window_spec.alias("window"), "customer_id") \
        .agg(count("*").alias("count")) \
        .select("window.end", "customer_id", "count")

result = input_stream.transform(transform_func)

query = result \
    .writeStream \
    .format("memory") \
    .queryName("test") \
    .option("truncate", "false") \
    .start()
I can see the checkpointing is working fine. However, there is no data output.
spark.table("test").show(truncate=False)
It shows an empty table. Any clue why?
I found the issue. In the Spark documentation, the output modes section states:
Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed the late threshold specified in withWatermark() as by the modes semantics, rows can be added to the Result Table only once after they are finalized (i.e. after watermark is crossed).
Since I didn't specify the output mode explicitly, append is applied implicitly, which means the first output will occur only after the watermark threshold is passed.
To get the output per micro-batch, use output mode update or complete instead.
This works for me now:
query = result \
    .writeStream \
    .format("memory") \
    .outputMode("update") \
    .queryName("test") \
    .option("truncate", "false") \
    .start()

pyspark - micro-batch streaming a Delta table as a source to perform a merge against another Delta table - foreachBatch is not getting invoked

I have created a Delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on Dataproc image 1.5x in Google Cloud.
Spark version 2.4.7
Delta version 0.6.0
My code looks as follows:
from delta.tables import *
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("streaming_merge") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
# Function to upsert `microBatchOutputDF` into Delta table using MERGE
def mergeToDelta(microBatchOutputDF, batchId):
    (deltaTable.alias("accnt").merge(
        microBatchOutputDF.alias("updates"),
        "accnt.acct_nbr = updates.acct_nbr")
        .whenMatchedDelete(condition = "updates.cdc_ind='D'")
        .whenMatchedUpdateAll(condition = "updates.cdc_ind='U'")
        .whenNotMatchedInsertAll(condition = "updates.cdc_ind!='D'")
        .execute()
    )
deltaTable = DeltaTable.forPath(spark, "gs:<<path_for_the_target_delta_table>>")
# Define the source extract
SourceDF = (
    spark.readStream
        .format("delta")
        .load("gs://<<path_for_the_source_delta_location>>")
)
# Start the query to continuously upsert into target tables in update mode
SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
This code runs without any problems, but there is no data written to the Delta table; I suspect foreachBatch is not getting invoked. Does anyone know what I'm doing wrong?
After adding awaitTermination(), the streaming query started working: it picked up the latest data from the source and performed the merge on the Delta target table.
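For reference, a minimal sketch of that fix (reusing the placeholders from the question): keep a handle on the streaming query and block on it, otherwise with trigger(once=True) the driver can exit before the micro-batch is processed.
query = SourceDF.writeStream \
    .format("delta") \
    .outputMode("update") \
    .foreachBatch(mergeToDelta) \
    .option("checkpointLocation", "gs:<<path_for_the_checkpint_location>>") \
    .trigger(once=True) \
    .start()
# Wait for the trigger-once batch to finish before the driver shuts down
query.awaitTermination()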

AttributeError: 'Namespace' object has no attribute 'project'

I am trying to reuse code which I copied from https://www.opsguru.io/post/solution-walkthrough-visualizing-daily-cloud-spend-on-gcp-using-gke-dataflow-bigquery-and-grafana. I am not too familiar with Python, so I am seeking help here. I am trying to copy GCP BigQuery data into Postgres.
I have made some modifications to the code and am getting an error, due either to my mistake or to the code.
Here is what I have:
import uuid
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, GoogleCloudOptions, WorkerOptions
from beam_nuggets.io import relational_db
from apache_beam.io.gcp import bigquery
parser = argparse.ArgumentParser()
args = parser.parse_args()
project = args.project("project", help="Enter Project ID")
job_name = args.job_name + str(uuid.uuid4())
bigquery_source = args.bigquery_source
postgresql_user = args.postgresql_user
postgresql_password = args.postgresql_password
postgresql_host = args.postgresql_host
postgresql_port = args.postgresql_port
postgresql_db = args.postgresql_db
postgresql_table = args.postgresql_table
staging_location = args.staging_location
temp_location = args.temp_location
subnetwork = args.subnetwork
options = PipelineOptions(
    flags=["--requirements_file", "/opt/python/requirements.txt"])
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = project
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = staging_location
google_cloud_options.temp_location = temp_location
google_cloud_options.region = "europe-west4"
worker_options = options.view_as(WorkerOptions)
worker_options.zone = "europe-west4-a"
worker_options.subnetwork = subnetwork
worker_options.max_num_workers = 20
options.view_as(StandardOptions).runner = 'DataflowRunner'
start_date = define_start_date()
with beam.Pipeline(options=options) as p:
    rows = p | 'QueryTableStdSQL' >> beam.io.Read(beam.io.BigQuerySource(
        query = 'SELECT \
            billing_account_id, \
            service.id as service_id, \
            service.description as service_description, \
            sku.id as sku_id, \
            sku.description as sku_description, \
            usage_start_time, \
            usage_end_time, \
            project.id as project_id, \
            project.name as project_description, \
            TO_JSON_STRING(project.labels) \
                as project_labels, \
            project.ancestry_numbers \
                as project_ancestry_numbers, \
            TO_JSON_STRING(labels) as labels, \
            TO_JSON_STRING(system_labels) as system_labels, \
            location.location as location_location, \
            location.country as location_country, \
            location.region as location_region, \
            location.zone as location_zone, \
            export_time, \
            cost, \
            currency, \
            currency_conversion_rate, \
            usage.amount as usage_amount, \
            usage.unit as usage_unit, \
            usage.amount_in_pricing_units as \
                usage_amount_in_pricing_units, \
            usage.pricing_unit as usage_pricing_unit, \
            TO_JSON_STRING(credits) as credits, \
            invoice.month as invoice_month cost_type \
            FROM `' + project + '.' + bigquery_source + '` \
            WHERE export_time >= "' + start_date + '"', use_standard_sql=True))
    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',
        host=postgresql_host,
        port=postgresql_port,
        username=postgresql_user,
        password=postgresql_password,
        database=postgresql_db,
        create_if_missing=True,
    )
    table_config = relational_db.TableConfiguration(
        name=postgresql_table,
        create_if_missing=True
    )
    rows | 'Writing to DB' >> relational_db.Write(
        source_config=source_config,
        table_config=table_config
    )
When I run the program I am getting the following error:
bq-to-sql.py: error: unrecognized arguments: --project xxxxx --job_name bq-to-sql-job --bigquery_source xxxxxxxx
--postgresql_user xxxxx --postgresql_password xxxxx --postgresql_host xx.xx.xx.xx --postgresql_port 5432 --postgresql_db xxxx --postgresql_table xxxx
--staging_location gs://xxxxx-staging --temp_location gs://xxxxx-temp --subnetwork regions/europe-west4/subnetworks/xxxx
argparse needs to be configured. It works like magic, but it does need configuration. These lines are needed between parser = argparse.ArgumentParser() and args = parser.parse_args():
parser.add_argument("--project")
parser.add_argument("--job_name")
parser.add_argument("--bigquery_source")
parser.add_argument("--postgresql_user")
parser.add_argument("--postgresql_password")
parser.add_argument("--postgresql_host")
parser.add_argument("--postgresql_port")
parser.add_argument("--postgresql_db")
parser.add_argument("--postgresql_table")
parser.add_argument("--staging_location")
parser.add_argument("--temp_location")
parser.add_argument("--subnetwork")
argparse is a useful library. I recommend adding more options (help text, types, defaults) to these add_argument calls.
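As an illustration (the help texts, types and defaults below are my own, not from the original pipeline), the calls could look like this:
import argparse

parser = argparse.ArgumentParser(description="Copy BigQuery billing data into Postgres")
# Illustrative options only; adjust required flags, types and defaults to your pipeline
parser.add_argument("--project", required=True, help="GCP project ID")
parser.add_argument("--bigquery_source", required=True, help="BigQuery table to read from")
parser.add_argument("--postgresql_port", type=int, default=5432, help="PostgreSQL port")
args = parser.parse_args()
print(args.project, args.bigquery_source, args.postgresql_port)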

PySpark saving is not working when called from inside a foreach

I am building a pipeline that receives messages from Azure Event Hubs and saves them into Databricks Delta tables.
All my tests with static data went well; see the code below:
from pyspark.sql import SparkSession

body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n"'
tableLocation = "/delta/tables/myTableName"

spark = SparkSession.builder.appName("CSV converter").getOrCreate()
csvData = spark.sparkContext.parallelize(body.split('\n'))
df = spark.read \
    .option("header", True) \
    .option("delimiter", "|") \
    .option("quote", "\"") \
    .option("nullValue", "\\N") \
    .option("inferSchema", "true") \
    .option("mergeSchema", "true") \
    .csv(csvData)
df.write.format("delta").mode("append").save(tableLocation)
However, in my case each Event Hubs message is a CSV string, and messages may come from many sources. So each message must be processed separately, because each message may end up saved to a different Delta table.
When I try to execute this same code inside a foreach statement, it doesn't work. There are no errors shown in the logs, and I can't find any saved table.
So maybe I am doing something wrong when calling the foreach. See the code below:
def SaveData(row):
    # ... the same code as above ...

dfEventHubCSV.rdd.foreach(SaveData)
I tried to do this in a streaming context, but sadly I ran into the same problem.
What is it about the foreach that makes it behave differently?
Below is the full code I am running:
import pyspark.sql.types as t
from pyspark.sql import SparkSession, SQLContext

# row contains the fields Body and SdIds
# Body: CSV string
# SdIds: a string ID
def SaveData(row):
    # Each row's data is going to be added to a different table
    rowInfo = GetDestinationTableData(row['SdIds']).collect()
    table = rowInfo[0][4]
    schema = rowInfo[0][3]
    database = rowInfo[0][2]
    body = row['Body']
    tableLocation = "/delta/" + database + '/' + schema + '/' + table
    checkpointLocation = "/delta/" + database + '/' + schema + "/_checkpoints/" + table

    spark = SparkSession.builder.appName("CSV").getOrCreate()
    csvData = spark.sparkContext.parallelize(body.split('\n'))
    df = spark.read \
        .option("header", True) \
        .option("delimiter", "|") \
        .option("quote", "\"") \
        .option("nullValue", "\\N") \
        .option("inferSchema", "true") \
        .option("mergeSchema", "true") \
        .csv(csvData)
    df.write.format("delta").mode("append").save(tableLocation)

dfEventHubCSV.rdd.foreach(SaveData)
Well, in the end, as always, it was something very simple, but I didn't see it anywhere.
Basically, when you perform a foreach, the dataframe you want to save is built inside the loop. The worker, unlike the driver, won't automatically set up the "/dbfs/" path when saving, so if you don't manually add the "/dbfs/" prefix, it will save the data locally on the worker and you will never find the saved data.
That is why my loops weren't working.
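A minimal sketch of the fix described above, reusing the variables from SaveData; the only change is the /dbfs/ prefix on the paths so the worker writes to DBFS instead of its local filesystem:
# Prefix the locations with /dbfs/ so the worker resolves them to DBFS, not local disk
tableLocation = "/dbfs/delta/" + database + '/' + schema + '/' + table
checkpointLocation = "/dbfs/delta/" + database + '/' + schema + "/_checkpoints/" + table
df.write.format("delta").mode("append").save(tableLocation)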

Writing a PySpark DataFrame into Redshift

I have established the connection between PySpark and Redshift using the following code.
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
import psycopg2
DATABASE = "d"
USER = "user1"
PASSWORD = "1234"
HOST = "sparkvalidation.crv9zfdiseqm.us-west-2.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "public"
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
Now how can I write a pyspark dataframe to Redshift?
If you use Databricks, you could write something like this:
dataframe.write \
    .format("com.databricks.spark.redshift") \
    .option("url", connection_string) \
    .option("dbtable", "target") \
    .option("tempdir", "s3a://your_s3_tmp_bucket/tmp_data") \
    .mode("error") \
    .save()
Note that you need an S3 bucket, as is usually the case when copying data into Redshift.
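If the Databricks Redshift connector (or an S3 staging bucket) is not available, a plain JDBC write is a possible, slower alternative; this sketch assumes a Redshift JDBC driver is on the classpath and reuses the connection details defined above:
# Plain JDBC write: no S3 staging bucket needed, but rows are inserted instead of COPY-loaded
jdbc_url = "jdbc:redshift://%s:%s/%s" % (HOST, PORT, DATABASE)
dataframe.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "public.target") \
    .option("user", USER) \
    .option("password", PASSWORD) \
    .mode("error") \
    .save()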