Spark Streaming Redis read timeout with Scala

While reading a table from Redis, I get the error below.
The following code normally works well.
val readDF= spark.sparkContext.fromRedisKeyPattern(tableName,5).getHash().toDS()
It normally works for tables with fewer than 2 million rows, but when I read a big table I get this error:
18/10/11 17:08:25 ERROR Executor: Exception in task 37.0 in stage 3.0 (TID 338)
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:202)
at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
val redis = spark.sparkContext.fromRedisKeyPattern(tableName,100).getHash().toDS()
I also changed some settings on Redis, but I don't think that is the cause.
Do you know how I can solve this problem?

Related

Reading url via pyspark in Databricks notebook

I am unable to read the content of a URL via PySpark in a Databricks notebook (version 8.3, Spark 3.1.1). I have tried almost every possibility but cannot find the exact problem. Here is my code:
from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()
Here is the error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
I used "reading data from URL using spark databricks platform" as a reference example. Has anyone faced a similar problem?
This is the best I've found, from the YouTube "pyspark for everyone" playlist:
!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" >> 8028d38a.tps
As a workaround, we can read the file at that location into a pandas DataFrame and convert it into a PySpark DataFrame for further processing.
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)
If you want to skip the first row when it is invalid,

Spark read failed due to duplicate column

I am trying to read a Parquet file from S3 in Databricks, using Scala.
Below is the simple read code:
val df = spark.read.parquet(s"/mnt/$MountName/tstamp=2020_03_25")
display(df)
MountName is the DBFS mount point where the data from S3 is mounted.
But I am getting an error caused by a duplicate key in the file.
SparkException: Job aborted due to stage failure: Task 0 in stage 813.0 failed 4 times, most recent failure: Lost task 0.3 in stage 813.0 (TID 79285, 10.179.245.218, executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/Alibaba_data/tstamp=2020_03_25/ts-1585154320710.parquet.gz.
Caused by: java.lang.RuntimeException: Found duplicate field(s) "subtype": [subtype, subType] in case-insensitive mode
Now I need to overcome it, perhaps by making the read case-sensitive, by dropping the column during the read, or by any other means you can suggest.
Suggestions, please.
Try with case sensitivity enabled.
spark.sql.caseSensitive should be set to true before the read.
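To make the error message concrete, here is a minimal plain-Python sketch of the kind of collision that case-insensitive mode runs into. It is not Spark's actual implementation; the function name and the sample schema are hypothetical, and only the field names "subtype" and "subType" come from the error above.

```python
from collections import defaultdict


def find_case_insensitive_duplicates(field_names):
    """Group field names that collide when compared case-insensitively."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the lowercase keys that map to more than one original name.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical schema; only "subtype" and "subType" appear in the error message.
schema_fields = ["id", "subtype", "subType", "tstamp"]
duplicates = find_case_insensitive_duplicates(schema_fields)
print(duplicates)
```

With case sensitivity enabled the two names are treated as distinct columns, so the collision no longer arises. In a notebook the setting is usually applied with spark.conf.set("spark.sql.caseSensitive", "true") before calling spark.read.parquet.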

SPARK Join strategy in Cloud Datafusion

In Cloud Data Fusion I am using a Joiner transform to join two tables.
One of them is a large table with about 87M records, while the other is a smaller table with only ~250 records. I am using 200 partitions in the joiner.
This causes the following failure:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 50 in stage 7.0 failed 4 times, most recent failure: Lost task 50.3 in stage 7.0 (TID xxx, cluster_workerx.c.project.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 133355 ms
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.spark.SparkException: Application application_xxxxx finished with failed status
Looking more closely at the Spark UI, of the 200 tasks for the join, one task outputs nearly 80% of the 87M records and fails with the heartbeat error, while the tasks that succeed output very few records (fewer than ~10k each).
It seems that Spark performs a shuffle hash join. Is there a way in Data Fusion/CDAP to force a broadcast join, since one of my tables is very small? Or can I make some configuration changes to the cluster to make this join work?
What performance tuning can I do in the Data Fusion pipeline? I didn't find any reference to configuration or tuning in the Data Fusion documentation.
You can use org.apache.spark.sql.functions.broadcast(Dataset[T]) to mark a DataFrame/Dataset to be broadcast when it is joined. A broadcast is not always guaranteed, but for 250 records it will work. If the DataFrame with 87M rows is evenly partitioned, this should improve performance.
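The difference between the two strategies can be illustrated with a small plain-Python sketch. It is a toy model, not Spark or Data Fusion code: the function names, the integer keys, and the row counts are all made up for illustration. It shows how hash-partitioning on a skewed join key piles most rows into one partition, and how a broadcast-style join avoids moving the large table by key at all.

```python
from collections import Counter


def shuffle_partition_sizes(keys, num_partitions):
    """Count the rows each partition receives when rows are hash-partitioned by join key."""
    sizes = Counter(hash(key) % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


def broadcast_join(large_rows, small_rows):
    """Join by looking each large row up in an in-memory copy of the small table."""
    lookup = dict(small_rows)  # the "broadcast" copy: join key -> value
    return [(key, value, lookup[key]) for key, value in large_rows if key in lookup]


# Toy data (hypothetical): 80 of the 100 large rows share join key 1.
large_rows = [(1, f"row{i}") for i in range(80)] + [(k, f"row{k}") for k in range(2, 22)]
small_rows = [(1, "a"), (2, "b")]

partition_sizes = shuffle_partition_sizes([key for key, _ in large_rows], 4)
joined = broadcast_join(large_rows, small_rows)
print(partition_sizes)
print(len(joined))
```

In the toy shuffle, one of the four partitions receives every row with the hot key, which mirrors the single overloaded task in the question. In the broadcast version each large row is joined where it already sits, so the skew in the key does not concentrate work in one place. In Spark code the corresponding hint is typically written as large.join(broadcast(small), ...).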

MongoSpark E11000 error when writing to a MongoDB Replica Set

I have a Spark 2 application that uses the following command from com.mongodb.spark.MongoSpark to write a DataFrame to a three-node MongoDB replica set:
//The real command is similar to this one, depending on options
//set to the DataFrame and the DataFrameWriter object about MongoDB configurations,
//such as the writeConcern
var df: DataFrameWriter[Row] = spark.sql(sql).write
.option("uri", theUri)
.option("database", theDatabase)
.option("collection", theCollection)
.option("replaceDocument", "false")
.mode("append")
[...]
MongoSpark.save(df)
Although I am sure the source data, which comes from a Hive table, has a unique primary key, I get a duplicate key error when the Spark application runs:
2019-01-14 13:01:08 ERROR: Job aborted due to stage failure: Task 51 in stage 19.0 failed 8 times,
most recent failure: Lost task 51.7 in stage 19.0 (TID 762, mymachine, executor 21):
com.mongodb.MongoBulkWriteException: Bulk write operation error on server myserver.
Write errors: [BulkWriteError{index=0, code=11000,
message='E11000 duplicate key error collection:
ddbb.tmp_TABLE_190114125615 index: idx_unique dup key: { : "00120345678" }', details={ }}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:176)
at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:205)
[...]
I have tried setting the write concern to "3" or even "majority". Furthermore, the timeout has been set to 4/5 seconds, but this duplicate key error still appears sometimes.
I would like to know how to configure the load so that I do not get duplicate entries when writing to the replica set.
Any suggestions? Thanks in advance!

Spark job using HBase fails

Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples fail the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more
Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.