How to read new files uploaded in the last hour using a PySpark file stream?

I am trying to read the latest files (say, new files from the last hour) available in a directory and load that data. I am trying this with PySpark Structured Streaming. I have tried the maxFileAge option of the streaming file source, but it still loads all the files in the directory, regardless of the age specified in the option.
spark.readStream \
    .option("maxFileAge", "1h") \
    .schema(cust_schema) \
    .csv(upload_path) \
    .withColumn("closing_date", get_date_udf_func(input_file_name())) \
    .writeStream.format("parquet") \
    .trigger(once=True) \
    .option("checkpointLocation", checkpoint_path) \
    .option("path", write_path) \
    .start()
Above is the code I tried, but it loads all available files regardless of age. Please point out what I am doing wrong here.
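For what it's worth, the documented behaviour of maxFileAge may explain this: the age is measured against the timestamp of the latest file in the directory, not the current system time, and for the first batch all files are considered valid, so a fresh Trigger.Once query still picks up everything. If the goal is simply "load files modified in the last hour", one workaround is to list the directory yourself and read only the recent paths as a plain batch job. A rough sketch using the Hadoop FileSystem API through the JVM gateway (upload_path, cust_schema, get_date_udf_func and write_path are from the question; everything else is illustrative, and this bypasses the streaming checkpoint entirely):
import time
from pyspark.sql.functions import input_file_name

# list files in upload_path whose modification time falls within the last hour
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
cutoff_ms = (time.time() - 3600) * 1000
recent_files = [status.getPath().toString()
                for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(upload_path))
                if status.isFile() and status.getModificationTime() >= cutoff_ms]

# read only those files as a normal batch job and append to the parquet output
if recent_files:
    (spark.read.schema(cust_schema).csv(recent_files)
        .withColumn("closing_date", get_date_udf_func(input_file_name()))
        .write.mode("append").parquet(write_path))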

Related

Reading Excel(xlsx) with Pyspark does not work above a certain medium size

I have the following cluster configuration in Databricks: 64 GB, 8 cores.
The tests were carried out with this as the only notebook on the cluster; no other notebooks were running at the time.
I find that reading a simple 30 MB Excel file in Spark keeps loading and never completes. I am using the following code for this purpose:
sdf = spark.read.format("com.crealytics.spark.excel") \
    .option("header", True) \
    .option("inferSchema", "true") \
    .load(my_path)
display(sdf)
I have tried reducing the Excel file, and it works fine up to 15 MB.
As a workaround I am going to export the Excel file to CSV and read it from there, but I find it shocking that Spark can't even read 30 MB of Excel.
Or am I doing something wrong in the configuration?
You need to install these 2 libraries on your databricks cluster to read excel files. Follow these paths to install:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Now, you will be able to read your excel as follows:
sdf = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)
Can you please try the option below, as shown in the spark-excel GitHub repo?
Based on your input you can modify the number of rows; the value 20 is a sample value.
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
As mentioned above, the option does not work for .xls files.
In case the files are really big, consider the options shown in the linked issue #590.
Please validate before using any of the options specified.
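For reference, this is roughly how the option fits into the PySpark read shown above (the maxRowsInMemory value of 20 is just a sample; tune it for your file):
# maxRowsInMemory (sample value) switches spark-excel to its streaming reader for large .xlsx files
sdf = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .option("maxRowsInMemory", 20) \
    .load(filePath)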
Cheers...

spark structured streaming file source read from a certain partition onwards

I have a folder on HDFS like below containing ORC files:
/path/to/my_folder
It contains partitions:
/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190102
/path/to/my_folder/dt=20190103
...
Now I need to process the data here using streaming.
A spark.readStream.format("orc").load("/path/to/my_folder") works nicely.
However, I do not want to process the whole table, but rather only start from a certain partition onwards, similar to a Kafka offset.
How can this be implemented? I.e., how can I specify the initial state from which to read?
Spark Structured Streaming File Source Starting Offset claims that there is no such feature.
The suggestion there to use latestFirst is not desirable for my use case, as I do not aim to build an always-on streaming application, but rather to use Trigger.Once like a batch job with the nice streaming semantics of duplicate reduction and handling of late-arriving data.
If this is not available, what would be a suitable workaround?
edit
Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink and a huge processing time. This way, the warm-up stream will save the latest file timestamp to the checkpoint.
Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files.
https://stackoverflow.com/a/51399134/2587904
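A rough PySpark rendering of that two-stage trick might look like the sketch below (the schema, paths, sleep time and trigger interval are illustrative, and behaviour can differ between Spark versions, so treat it as a starting point rather than a recipe):
import time

# stage 1: warm-up stream; latestFirst + maxFilesPerTrigger=1 means only the newest
# file is processed, and its timestamp is recorded in the checkpoint
warmup = (spark.readStream
    .schema(schema)  # assumed to be defined
    .option("latestFirst", "true")
    .option("maxFilesPerTrigger", "1")
    .csv("data")
    .writeStream
    .format("noop")  # dummy sink (Spark 3.x); on older versions use foreachBatch with a no-op function
    .option("checkpointLocation", "checkpoint")
    .trigger(processingTime="1 hour")  # "huge processing time": only the first micro-batch fires
    .start())
time.sleep(30)  # crude wait for the first micro-batch to commit
warmup.stop()

# stage 2: real stream against the same checkpoint; maxFileAge=0 is meant to hide files
# older than the latest timestamp recorded by the warm-up, so only new files get processed
real = (spark.readStream
    .schema(schema)
    .option("maxFileAge", "0")
    .csv("data")
    .writeStream
    .format("parquet")
    .option("path", "output")
    .option("checkpointLocation", "checkpoint")
    .trigger(once=True)
    .start())
real.awaitTermination()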
Building from this idea, let's look at an example:
# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv
spark-shell --conf spark.sql.streaming.schemaInference=true
// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop
// cleanup the data and start from scratch.
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
.option("path", "output")
.option("checkpointLocation", "checkpoint")
val started = query.start
// in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv
// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start
// bash
echo "1,4,3" >> data/dt=20190104/4.csv
started.stop
// *****************
//bash: rm -rf output/*
Everything works as intended. The operation picks up where the checkpoint marks the last processed file.
How can a checkpoint definition be generated by hand, such that all files in dt=20190101 and dt=20190102 count as processed, no late-arriving data is tolerated for those partitions anymore, and processing continues with all the files from dt=20190103 onwards?
I see that Spark generates the following files and folders:
commits
metadata
offsets
sources
_spark-metadata
So far I only know that _spark-metadata can be ignored when setting an initial state / checkpoint.
But I have not yet figured out (from the other files) which minimal values need to be present so that processing picks up from dt=20190103 onwards.
edit 2
By now I know that:
commits/0 needs to be present
metadata needs to be present
offsets needs to be present
but can be very generic:
v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
When I tried to remove one of the already processed and committed files from sources/0/0, the query still runs, but it does not only process data newer than the already committed data: it processes any data, in particular the files I just removed from the log.
How can I change this behavior to only process data more current than the initial state?
edit 3
The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html), and also the javadocs, list getOffset:
The maximum offset (getOffset) is calculated by fetching all the files
in path excluding files that start with _ (underscore).
That sounds interesting, but so far I have not figured out how to use it to solve my problem.
Is there a simpler way to achieve the desired functionality besides creating a custom (copy) of the FileSource?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L237
maxFileAge also sounds interesting.
I have started to work on a custom file stream source, but I fail to properly instantiate it: https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1
When calling:
val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
.option("partitionState", "/path/to/data/dt=20190101")
.load("data")
The operation fails with:
InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
at java.lang.Class.newInstance(Class.java:427)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 59 more
Even though it is basically a copy of the original source, what is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196, when it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42?
Even:
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv
val df = spark.readStream
.option("maxFileAge", "2d")
.csv("data")
returns the whole dataset and fails to filter down to only the most recent k days.

Is there a spark-defaults.conf when installed with pip install pyspark

I installed pyspark with pip.
I code in Jupyter notebooks. Everything worked fine, but now I get a Java heap space error when exporting a large .csv file.
Here someone suggested editing spark-defaults.conf. The Spark documentation also says:
"Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
Am I right? How do I solve this?
Thanks!
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
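What goes into it depends on the problem; for the heap-space error from the question, a minimal spark-defaults.conf might contain, for example:
spark.driver.memory          4g
spark.driver.maxResultSize   2g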
The spark-defaults.conf file should be located in:
$SPARK_HOME/conf
If no file is present, create one (a template should be available in the same directory).
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf but just a template file was present!
Still I could look at Spark properties, either in the “Environment” tab of the Web UI http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.getOrCreate()
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed but:
only values explicitly specified through spark-defaults.conf, SparkConf, or the command line. For all other configuration properties, you can assume the default value is used.
For instance, consider the default parallelism, which is in my case:
spark._sc.defaultParallelism
8
This is the default for local mode, namely the number of cores on the local machine (see https://spark.apache.org/docs/latest/configuration.html). In my case 8 = 2 x 4 cores because of hyper-threading.
If the property spark.default.parallelism is passed when launching the app,
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
then the property is shown in the Web UI and in the list
spark.sparkContext.getConf().getAll()
Precedence of configuration settings
Spark will consider given properties in this order (spark-defaults.conf comes last):
SparkConf
flags passed to spark-submit
spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Note
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
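For example, one common pattern with pip-installed PySpark in Jupyter is to set it from Python before any SparkSession is created (the memory value is illustrative; the trailing pyspark-shell token is expected by PySpark's launcher):
import os
# must run before the SparkContext/SparkSession exists, otherwise it has no effect
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g pyspark-shell"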
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark
The spark-defaults.conf file is needed when we have to change any of the default configs for Spark.
As niuer suggested, it should be present in the $SPARK_HOME/conf/ directory, but that might not be the case for you; by default only a template config file is present there. You can just add a new spark-defaults.conf file in $SPARK_HOME/conf/.
Check your Spark path. There are configuration files under $SPARK_HOME/conf/, e.g. spark-defaults.conf.

reading compressed file in spark with scala

I am trying to read the content of a .gz file in Spark/Scala into a DataFrame/RDD using the following code:
val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println);
The .gz file is 28 MB, and when I do the spark-submit using this command:
spark-submit --class sample --master local[*] target\spark.jar
it gives me a Java heap space error in the console.
Is this the best way of reading a .gz file, and if so, how could I solve the Java heap space error?
Thanks
Disclaimer: that code and description will purely read in a small compressed text file using Spark, collect it to an array of every line, and print every line in the entire file to the console. The number of ways and reasons to do this outside of Spark far outnumber those to do it in Spark.
1) use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats)
2) Or at least use sc.textFile() instead of wholeTextFiles
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running locally, you are not network bound). Add the --driver-memory option to spark-submit (or spark-shell) to increase memory if you MUST do the collect.
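A minimal sketch of points 1 and 3 combined, shown in PySpark for consistency with the rest of this page (the Scala SparkSession API offers the same calls); gzip is decompressed transparently and nothing is collected back to the driver:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-gz").getOrCreate()
df = spark.read.text("path/to/file.gz")  # compression is handled automatically
df.show(5, truncate=False)               # inspect a few lines instead of collect()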

Getting a file exists error while importing into Hive using Sqoop

I am trying to copy the retail_db database tables into a Hive database which I already created. When I execute the following command:
sqoop import-all-tables \
--num-mappers 1 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--hive-import \
--hive-overwrite \
--create-hive-table \
--outdir java_files \
--hive-database retail_stage
My Map-reduce job stops with the following error:
ERROR tool.ImportAllTablesTool: Encountered IOException running import
job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output
directory hdfs://quickstart.cloudera:8020/user/cloudera/categories
already exists
I am trying to copy the tables to a Hive database, so why does an existing file in HDFS cause the problem? Is there a way to ignore this error or to overwrite the existing file?
This is how a Sqoop import job works:
Sqoop creates/imports the data in a temporary dir on HDFS, which is the user's home dir (in your case /user/cloudera).
Then it copies the data to its actual Hive location (i.e., /user/hive/warehouse).
This categories dir must have already existed before you ran the import statement, so delete that dir, or rename it if it is important:
hadoop fs -rmr /user/cloudera/categories
OR
hadoop fs -mv /user/cloudera/categories /user/cloudera/categories_1
and re-run sqoop command!
So, in short, importing to Hive uses HDFS as the staging place, and Sqoop deletes the staging dir /user/cloudera/categories after (successfully) copying to the actual Hive location; cleaning up staging/tmp files is the last stage of the Sqoop job, so if you try to list the tmp staging dir afterwards, you won't find it.
After a successful import, hadoop fs -ls /user/cloudera/categories will show that the dir is no longer there.
Sqoop import to Hive works in 3 steps:
Put data to HDFS
Create Hive table if not exists
Load data into Hive Table
You have not mentioned --target-dir or --warehouse-dir, so it will put the data in the HDFS home directory, which I believe is /user/cloudera/ in your case.
Now, you might have imported the MySQL table categories earlier. So the /user/cloudera/categories directory exists and you are getting this exception.
Specify a non-existing directory with --target-dir, like --target-dir /user/cloudera/mysqldata. Then Sqoop will put all the MySQL tables imported by the above command in that location.
Based on answer #1 above, I found this option. I tried it and it works.
So, just add --delete-target-dir.
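That is, the command from the question with one extra flag (a sketch; please verify that your Sqoop version supports the flag together with import-all-tables):
sqoop import-all-tables \
--num-mappers 1 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--outdir java_files \
--hive-database retail_stage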
You cannot use hive-import and hive-overwrite at the same time.
The version where I confirmed this issue is:
$ sqoop help import
--hive-overwrite Overwrite existing data in
the Hive table
$ sqoop version
Sqoop 1.4.6-cdh5.13.0
ref. https://stackoverflow.com/a/22407835/927387