I am using PySpark along with Celery in a Django app. The flow of my code is as follows:
1. Send a POST request to upload a (large) file.
2. Django handles the request and loads the file into HDFS. PySpark then reads this large file from HDFS and loads it into Cassandra.
3. The whole upload is handled by Celery (from reading the file to the Cassandra load). Celery starts the process in the background and creates a Spark context to start the upload.
4. The data gets loaded into Cassandra, but the Spark context that was created via Celery does not stop, even after calling spark.stop() when the load is complete.
project -> celery.py
import os
from celery import Celery
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')
app = Celery('project')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
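For context, the worker that picks up these tasks would be started in the usual way (assuming the standard Celery/Django layout above):
celery -A project worker --loglevel=info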
tasks.py
from django.conf import settings
from project.celery import app
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

class uploadfile():
    def __init__(self):
        self.cluster = Cluster(getattr(settings, "CASSANDRA_IP", ""))
        self.session = self.cluster.connect()

    def start_spark(self):
        self.spark = SparkSession.builder.master(settings.SPARK_MASTER)\
            .appName('Load CSV to Cassandra')\
            .config('spark.jars', self.jar_files_path)\
            .config('spark.cassandra.connection.host', getattr(settings, 'SPARK_CASSANDRA_CONNECTION_HOST', '0.0.0.0'))\
            .getOrCreate()

    def spark_stop(self):
        self.spark.stop()

    def file_upload(self):
        self.start_spark()
        df = self.spark.read.csv(file_from_hdfs)
        # do some operation on the dataframe
        # self.session.create_cassandra_table_if_does_not_exist
        df.write.format('org.apache.spark.sql.cassandra')\
            .option('table', table_name)\
            .option('keyspace', keyspace)\
            .mode('append').save()
        self.spark_stop()  # <<<-------------------- This does not close the spark context

@app.task(name="api.tasks.uploadfile")
def csv_upload():
    # handle request.FILE and upload the file to hdfs
    spark_obj = uploadfile()
    spark_obj.file_upload()
calling_task_script.py
from tasks import csv_upload
from rest_framework.views import APIView
from rest_framework.response import Response

class post_it(APIView):
    def post(self, request):
        csv_upload.delay()
        return Response('success')
I have a zip file of 1.3 GB which contains a comma-separated txt file of 6 GB. The zip file sits on Azure Data Lake Storage and, using a service principal, is mounted on DBFS, the Databricks file system.
When using normal Python code to extract the 6 GB file, I get only 1.98 GB as the extracted file.
Please suggest a way to read the txt file directly and store it as a Spark DataFrame.
I have tried using Python code, but reading directly from Python gives this error - Error tokenizing data. C error: Expected 2 fields in line 371, saw 3.
This was fixed using UTF-16-LE encoding, but after that I got ConnectException: Connection refused (Connection refused) on Databricks while trying to display df.head().
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('dbfszipath')
zdf = pd.read_csv(zfolder.open('6GBtextfile.txt'),error_bad_lines=False,encoding='UTF-16-LE')
zdf.head()
Extract code -
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('/dbfszippath')
zfolder.extractall(dbfsextractpath)
The DataFrame should contain all the data when read directly from the zip file; it should also display some of the data without hanging the Databricks cluster. I need options in Scala or PySpark.
The "connection refused" error comes from the memory settings of Databricks and Spark. You will have to increase the allotted memory to avoid this error.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

conf = SparkConf()
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")
sc = SparkContext.getOrCreate(conf=conf)  # apply the conf when the context is created
In this case, the allotted memory is 4 GB, so change it as needed.
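Note that on Databricks the driver JVM is typically already running by the time a notebook executes, so a setting like spark.driver.memory generally has to go into the cluster's Spark config (in the cluster settings) rather than be set at runtime; in spark-defaults form that would be:
spark.executor.memory 4g
spark.driver.memory 4g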
Another solution would be the following:
import zipfile
import io

def zip_extract(x):
    # x is a (path, bytes) pair produced by binaryFiles
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("somerandom.zip")
files_data = zips.map(zip_extract)
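To go from that RDD of {member_name: bytes} dicts to a DataFrame, here is a minimal sketch, assuming a single comma-separated, UTF-16-LE-encoded member (as in the question) and the spark session that Databricks provides:

def to_rows(extracted):
    # each element is one archive's {member_name: bytes} dict
    for name, data in extracted.items():
        for line in data.decode('utf-16-le').splitlines():
            yield line.split(',')

df = spark.createDataFrame(files_data.flatMap(to_rows))
df.show(5)

Keep in mind that zip_extract pulls each archive member fully into one executor's memory, so this approach suits many small zips better than a single 6 GB member.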
Let me know if this works or what the error is in this case.
We are writing a Spark Streaming application to read Kafka messages using the createStream method with a batch interval of 180 seconds.
The code runs successfully and creates files in the S3 bucket every 180 seconds, but there are no messages in the files. Below is the environment:
Spark 2.3.0
Kafka 1.0
Please go through the code and let me know if anything is wrong here:
#import dependencies
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from pyspark.sql import *
Creating context variables:
sc = SparkContext(appName="SparkStreamingwithPython").getOrCreate()
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 180)
topic = "thirdtopic"
ZkQuorum = "localhost:2181"
Connect to Kafka and create the stream:
kafkaStream = KafkaUtils.createStream(ssc, ZkQuorum, "Spark-Streaming-Consumer", {topic: 1})

def WritetoS3(rdd):
    rdd.saveAsTextFile("s3://BucketName/thirdtopic/SparkOut")

kafkaStream.foreachRDD(WritetoS3)
ssc.start()
ssc.awaitTermination()
Thanks in advance.
I have a three-node Cassandra DSE cluster and a db schema with RF=3. Now I'm creating a Scala application to be executed on DSE Spark. The Scala code is as follows:
package com.spark
import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
object sample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.executor.memory", "1g")
      .set("spark.driver.memory", "1g")
      .set("spark.driver.maxResultSize", "500M")
      .set("spark.executor.heartbeatInterval", "30s")
      .set("spark.submit.deployMode", "cluster")
    val sc = new SparkContext(conf)
    val lRDD = sc.cassandraTable("dbname", "tablename")
    lRDD.collect.foreach(println)
  }
}
I'm running the script using:
dse> bin/dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar
So now I want to execute my Spark application from one node, but the system should do the processing on all 3 nodes internally, and I want to monitor it so that I can utilize the RAM and processors of the 3 nodes collectively. How can I do that?
Also, the current script is taking a lot of time to bring back results (the table holds 1 million rows of 128 bytes each). Are there any performance tuning parameters that I'm missing?
There are a few things you probably want to change. The main thing stopping you from running on multiple machines is
.setMaster("local")
which instructs the application not to use a distributed resource manager and instead to run everything locally in the application process. With DSE you should follow the relevant documentation or start with the Spark Build Examples.
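As a sketch of the fix: properties set in code override spark-submit flags, so the key step is deleting the setMaster line entirely. With it removed, the same submission command lets dse spark-submit supply the cluster's Spark master:
dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar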
In addition, you most likely never want to set
.set("spark.driver.allowMultipleContexts", "true")
Having multiple Spark contexts in one JVM is fraught with problems and usually means things are not set up correctly.
I am testing checkpointing and write-ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, with what looks like some data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._
object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are trying to do is a driver-failure scenario. For this to work, depending on the cluster you run on, you have to follow the instructions below to monitor the driver process and relaunch it if it fails:
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to a non-zero exit code or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details (a sketch of the submit command follows this list).
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
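For the Standalone case, the submission would look roughly like this (a sketch; the jar name and master URL are placeholders, and --supervise requires cluster deploy mode):
./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --supervise --class MyDriver app.jar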
You need to configure write-ahead logs as described below; there are special instructions for S3 which you need to follow:
While using S3 (or any file system that does not support flushing) for write-ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See Spark Streaming Configuration for more details.
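In spark-defaults form (or the equivalent SparkConf calls), those two settings would be:
spark.streaming.driver.writeAheadLog.closeFileAfterWrite true
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite true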
The issue looks like a Kryo serializer problem rather than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured, and since it is not configured, a KryoException should not be possible.
When using write-ahead logs and restoring from a checkpoint directory, all the Spark config is read from there; in your example, the createContext method is not called when starting from the checkpoint.
I assume another application was tested before with the same checkpoint directory, one in which the Kryo serializer was configured, and the current application fails to restore from that checkpoint.
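If that is the cause, two plausible fixes are clearing the stale checkpoint directory before restarting, or configuring the serializer identically in both applications, e.g. (spark-defaults form):
spark.serializer org.apache.spark.serializer.KryoSerializer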
Here is how I launch the Spark job:
./bin/spark-submit \
--class MyDriver \
--master spark://master:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/SparkJob-0.0.1-SNAPSHOT.jar
The class MyDriver accesses the Spark context using:
val sc = new SparkContext(new SparkConf())
val dataFile= sc.textFile("/data/example.txt", 1)
In order to run this within a cluster, I copy the file "/data/example.txt" to all nodes in the cluster. Is there a mechanism in Spark to share this data file between nodes without copying it manually? I don't think I can use a broadcast variable in this case.
Update:
An option is to have a dedicated file server which shares the file to be processed: val dataFile = sc.textFile("http://fileserver/data/example.txt", 1)
sc.textFile("/some/file.txt") reads a file distributed in HDFS, i.e. /some/file.txt is (already) split into multiple parts which are distributed across several machines, and each worker/task reads one part of the file. This is useful because you don't need to manage the partitioning yourself.
If you have copied the file to each worker node, you can read it in every task:
val myRdd = sc.parallelize(1 to 100) // 100 elements
val fileReadEveryWhere = myRdd.map( i => read("/my/file.txt") )
and have the code of read(...) implemented somewhere, for example:
def read(path: String): String = scala.io.Source.fromFile(path).mkString
Otherwise, you can also use a broadcast variable that is sent from the driver to all workers:
val myObject = read("/my/file.txt") // obj instantiated on driver node
val bdObj = sc.broadcast(myObject)
val myRdd = sc.parallelize(1 to 100)
  .map { i =>
    // use bdObj in task i, ex:
    bdObj.value.process(i)
  }
In this case, myObject should be serializable, and it is better if it is not too big.
Also, the method read(...) runs on the driver machine, so you only need the file on the driver. But if you don't know which machine that is (e.g. if you use spark-submit), then the file should be on all machines :-\. In this case, it may be better to have access to some DB or an external file system.
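One more option, not covered above: Spark's own file-shipping mechanism, SparkContext.addFile, which copies a driver-local file to every executor. A minimal sketch in PySpark (the Scala API is analogous; the path is hypothetical):

from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
sc.addFile("/data/example.txt")  # must exist on the driver

def count_lines(_):
    # SparkFiles.get resolves the shipped file's local path on each executor
    with open(SparkFiles.get("example.txt")) as f:
        return sum(1 for _ in f)

print(sc.parallelize(range(4), 4).map(count_lines).collect())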