I am not able to import the lines below:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils
It throws the following error:
ModuleNotFoundError Traceback (most recent call last)
Input In [3], in <module>
1 from pyspark import SparkContext
2 from pyspark.streaming import StreamingContext
----> 3 from pyspark.streaming.mqtt import MQTTUtils
ModuleNotFoundError: No module named 'pyspark.streaming.mqtt'
I have downloaded the RabbitMQ jar files and saved them in the jar directory, but I still cannot import the module.
I should also make clear that I am completely new to Spark. My task is to pull data from AMQP topics using Spark Structured Streaming. How can I do this? Sample code would be very helpful, since there is very little information online for PySpark. Please share your thoughts.
Related
I have a problem reading data from Elasticsearch into a Spark cluster (I'm using a Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).
First, I tried to read it with PySpark:
%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# if 'tags' field is not dropped, pyspark cannot map scala field and throws an exception.
# If the limit is not set, pyspark will probably try to get the whole index at once
# if "a.b" is not dropped, the dot in the field name causes mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853
df = df.cache()
z.show(df)
Unfortunately, in this case I face many mapping issues. Because many fields in the dataset contain dots, I decided to try reading the data with Scala (in order to process it in PySpark later):
%spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark
import org.elasticsearch.spark.sql
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
val conf = new SparkConf()
conf.set("spark.es.mapping.date.rich", "false");
conf.set("spark.serializer", classOf[KryoSerializer].getName)
val EsReadRDD = sc.esRDD("index")
However, even with Scala I can only retrieve a small number of records, for example:
EsReadRDD.take(10).foreach(println)
For some reason, collect() does not work:
val esdf = EsReadRDD.collect() //does not work probably because data are too large
The error is:
Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have also tried converting it to a DataFrame, but I get an error:
val esdf = EsReadRDD.toDF()
java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
Do you have any idea on how to deal with it?
I have a 1.3 GB zip file containing a 6 GB comma-separated txt file. The zip file is on Azure Data Lake Storage, which is mounted on DBFS (the Databricks file system) using a service principal.
When I extract the 6 GB file with plain Python code, the extracted file is only 1.98 GB.
Please suggest a way to read the txt file directly and store it as a Spark DataFrame.
I tried reading it directly with Python, but that gives the error: Error tokenizing data. C error: Expected 2 fields in line 371, saw 3
I fixed this by using the UTF-16-LE encoding, but then I got the error ConnectException: Connection refused (Connection refused) on Databricks while trying to display df.head().
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('dbfszipath')
zdf = pd.read_csv(zfolder.open('6GBtextfile.txt'),error_bad_lines=False,encoding='UTF-16-LE')
zdf.head()
Extraction code:
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('/dbfszippath')
zfolder.extract(dbfsexrtactpath)
The DataFrame should contain all the data when read directly from the zip file, it should display some data, and it should not hang the Databricks cluster. I need options in Scala or PySpark.
The "Connection refused" error comes from the memory settings of Databricks and Spark. You will have to increase the memory allowance to avoid this error.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
conf=SparkConf()
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")
In this case, the allotted memory is 4GB so change it as needed.
Another solution would be the following:
import zipfile
import io
def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
zips = sc.binaryFiles("somerandom.zip")
files_data = zips.map(zip_extract)
Let me know if this works or what the error is in this case.
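For reference, here is a minimal sketch of reading one member of a zip archive lazily using only the Python standard library, assuming the text is UTF-16-LE encoded as mentioned in the question. The names iter_zip_csv_rows and build_demo_zip and the demo data are made up for illustration. This runs on a single machine (for example the driver) and is not distributed across the cluster, so treat it as a way to check the file contents rather than a full solution.

```python
import csv
import io
import zipfile


def iter_zip_csv_rows(zip_source, member, encoding="utf-16-le"):
    """Yield parsed CSV rows from one archive member without extracting it to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Decode the compressed byte stream incrementally instead of
            # loading the whole member into memory.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.reader(text)


def build_demo_zip():
    """Build a tiny in-memory zip (hypothetical stand-in for the real file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        payload = "id,name\r\n1,alice\r\n2,bob\r\n".encode("utf-16-le")
        archive.writestr("demo.txt", payload)
    buffer.seek(0)
    return buffer


rows = list(iter_zip_csv_rows(build_demo_zip(), "demo.txt"))
print(rows)
```

Because the rows are produced one at a time, the same generator can be consumed in batches, which avoids holding the whole 6 GB file in memory at once.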
I have a problem running the AWS example for AWS Glue ETL locally.
I followed all the steps in:
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html
and created my endpoints in AWS Glue. When I try to execute this code:
%pyspark
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# sc = SparkContext()
#glueContext = GlueContext(sc)
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
persons = glueContext.create_dynamic_frame.from_catalog(
database="sampledb",
table_name="avro_avro_files"
)
print(persons.count())
persons.printSchema()
I have this error:
File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/__init__.py", line 13, in <module>
from dynamicframe import DynamicFrame
ImportError: No module named 'dynamicframe'
I don't know how to solve this problem.
I have Zeppelin 0.7.3 configured locally.
The code shown above is supposed to produce this result:
2019-04-01 11:37:22 INFO avro-test-bo: Test log message
Count: 5
root
|-- name: string
|-- favorite_number: int
|-- favorite_color: string
Hello, I finally found the answer.
The problem was that I had created my endpoint on a private network only.
After I created a new endpoint on a public network, the error was resolved.
Thanks to everybody for the help.
Regards
Do you mean that the code was working earlier and has stopped working? Sorry, I couldn't interpret it correctly.
With reference to local development using Zeppelin, can you please confirm that the configuration is correct and that you have enabled SSH tunneling, etc.? You may need to make some configuration changes in the Zeppelin Spark interpreter settings.
Please make sure you are connected to the AWS Glue development endpoint (DEP) using SSH tunneling. Here are some references that may help you. It looks like your Zeppelin is unable to get a GlueContext (I don't see a GlueContext object being created?)
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
Please refer to this link, setting up Zeppelin on Windows, for help configuring a local Zeppelin environment.
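A quick way to check whether the SSH tunnel is actually up is to test whether the forwarded local port accepts TCP connections. This is only a sketch: port_is_open is a made-up helper, and port 9007 is an assumption about which local port the tunnel forwards, so replace it with the port from your own ssh command.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 9007 is an assumed local forwarding port; adjust it to your tunnel.
    print(port_is_open("127.0.0.1", 9007))
```

If this prints False while the ssh command is running, the tunnel is not forwarding that port and Zeppelin will not be able to reach the endpoint.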
We are writing a Spark Streaming application that reads Kafka messages using the createStream method, with a batch interval of 180 seconds.
The code runs successfully and creates files in the S3 bucket every 180 seconds, but the files contain no messages. The environment is below:
Spark 2.3.0
Kafka 1.0
Please go through the code and let me know if anything is wrong.
#import dependencies
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from pyspark.sql import *
# Create context variables
sc = SparkContext(appName="SparkStreamingwithPython").getOrCreate()
sc.setLogLevel("WARN")
ssc = StreamingContext(sc,180)
topic="thirdtopic"
ZkQuorum = "localhost:2181"
# Connect to Kafka and create the stream
kakfaStream = KafkaUtils.createStream(ssc,ZkQuorum,"Spark-Streaming-Consumer",{topic:1})
def WritetoS3(rdd):
    rdd.saveAsTextFile("s3://BucketName/thirdtopic/SparkOut")
kakfaStream.foreachRDD(WritetoS3)
ssc.start()
ssc.awaitTermination()
Thanks in Advance.
I have this function with Spark and Scala:
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.kudu.spark.kudu._
def save(df: DataFrame): Unit = {
  val kuduContext: KuduContext = new KuduContext("quickstart.cloudera:7051")
  kuduContext.createTable(
    "test_table", df.schema, Seq("anotheKey", "id", "date"),
    new CreateTableOptions()
      .setNumReplicas(1))
  kuduContext.upsertRows(df, "test_table")
}
But creating the KuduContext raises an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kudu.client.KuduClient.exportAuthenticationCredentials()[B
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:63)
at com.mypackge.myObject$.save(myObject.scala:24)
at com.mypackge.myObject$$anonfun$main$1.apply$mcV$sp(myObject.scala:59)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$.time(myObject.scala:17)
at com.mypackge.myObject$.main(myObject.scala:57)
at com.mypackge.myObject.main(myObject.scala)
Spark works without any problem. I installed the Kudu VM as described in the official docs, and I have logged in to the Impala instance from bash without a problem.
Does anyone have any idea what I am doing wrong?
The problem was that a project dependency was using an old version of kudu-client (1.2.0), while I was using kudu-spark 1.3.0 (which includes kudu-client 1.3.0). Excluding the old kudu-client in pom.xml was the solution.
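For anyone hitting the same error, the exclusion might look roughly like the fragment below. The com.example:some-library coordinates and version are placeholders for whichever dependency pulls in the old client (the original post does not name it); running mvn dependency:tree -Dincludes=org.apache.kudu should show which one it is.

```xml
<!-- Hypothetical dependency that transitively brings in the old kudu-client -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>some-library</artifactId>
  <version>1.0.0</version>
  <exclusions>
    <!-- Drop the transitive kudu-client so the one from kudu-spark is used -->
    <exclusion>
      <groupId>org.apache.kudu</groupId>
      <artifactId>kudu-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```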