How to import and use spark-dynamodb in my PySpark code - pyspark

I am trying to write data from a JSON file to DynamoDB and came across spark-dynamodb.
How do I import this package in my PySpark script?

The page you linked to (the spark-dynamodb GitHub README) reads:
Note: When running from pyspark shell, you can add the library as:
pyspark --packages com.audienceproject:spark-dynamodb_<spark-scala-version>:<version>
Does that answer your question?
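Once the package is on the classpath via --packages, writing a JSON file to DynamoDB looks roughly like the following. This is only a sketch based on the library's README: the "dynamodb" format name and the tableName option come from that README, while the file path and table name below are placeholders you would replace with your own.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-dynamodb").getOrCreate()

# Read the source JSON file (placeholder path).
df = spark.read.json("/path/to/input.json")

# Write to an existing DynamoDB table via the spark-dynamodb data source.
df.write \
    .format("dynamodb") \
    .option("tableName", "customer") \
    .save()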

Related

How to save a PySpark dataframe to a personal machine using Databricks?

I have a dataframe in the Databricks environment and I need to download it to my personal machine. The dataframe contains 10,000 rows. I tried the following:
df_test.coalesce(1).write.csv("dbfs:/FileStore/tables/df_test", header=True, mode='overwrite')
However, I'm not able to run the cell. The following error message appears:
org.apache.spark.SparkException: Job aborted.
Could someone help me solve the problem?
If you haven't resolved the error, you can try this alternative to save your PySpark dataframe to your local machine as a CSV file.
With display(dataframe):
Here I created a dataframe with 10,000 rows for reference. With display(), Databricks lets you download up to 1 million rows.
Code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Simple two-column schema.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("firstname", StringType(), True)
])

# Build 10,000 sample rows.
data2 = [(1, "Rakesh")]
for i in range(2, 10001):
    data2.append((i, "Rakesh"))

df = spark.createDataFrame(data=data2, schema=schema)
df.show(5)
display(df)
(Screenshots of the dataframe creation and the display(df) output omitted.)
By default, display() shows the first 1,000 rows of this output. To download the whole dataframe, click the down arrow and then click Download full results.
Then click Re-execute and download; you can now download the dataframe to your local machine as a CSV file.
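For a dataframe of this size there is also a simpler route, not part of the answer above, just a sketch: collect it to the driver with toPandas() and write a single CSV. This assumes the dataframe fits comfortably in driver memory; the output path is a placeholder under the DBFS FUSE mount.

# Collect the (small) dataframe to the driver and write one local CSV file.
# Assumes the data fits in driver memory; adjust the placeholder path as needed.
pdf = df.toPandas()
pdf.to_csv("/dbfs/FileStore/tables/df_test.csv", index=False)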

Is there a way to use Impala rather than Hive in PySpark?

I have queries that work in Impala but not in Hive. I am creating a simple PySpark file such as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
sconf = SparkConf()
sc = SparkContext.getOrCreate(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.sql('use db1')
...
When I run this script, its queries hit the same errors I get when I run them in the Hive editor (they work in the Impala editor). Is there a way to fix this so that I can run these queries in the script using Impala?
You can use Impala or HiveServer2 in Spark SQL via the JDBC data source. That requires installing the Impala JDBC driver and configuring the connection to Impala in your Spark application. But "you can" doesn't mean "you should", because it incurs overhead and creates extra dependencies without any particular benefit.
Typically (and that is what your current application is trying to do), Spark SQL runs against the underlying file system directly, without going through the HiveServer2 or Impala coordinators. In this scenario, Spark only (re)uses the Hive Metastore to retrieve metadata: database and table definitions.
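If you still want the JDBC route, a rough sketch follows. It assumes a SparkSession named spark, and the JDBC URL, driver class, and table name are placeholders to be replaced with values from the Impala JDBC driver you install.

# Read an Impala table through Spark's generic JDBC data source.
# URL, driver class, and table name below are placeholders.
df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:impala://impala-host:21050/db1")
    .option("driver", "com.cloudera.impala.jdbc41.Driver")
    .option("dbtable", "my_table")
    .load()
)
df.show(5)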

Unable to connect to Hive LLAP from pyspark

I am using the pyspark kernel inside JupyterHub and want to connect to Hive LLAP from Spark. I am able to create a Spark session, but when I try to execute
from pyspark_llap import HiveWarehouseSession
it shows the error "no module named pyspark_llap".
I am able to execute the same command in the python kernel, where it succeeds.
Kindly suggest what configuration is needed to import HiveWarehouseSession from pyspark_llap inside the pyspark kernel.

How to import a JSON file into MongoDB in Scala

I need to update my MongoDB database from given JSON files each time. Is there any way to import a JSON file into MongoDB with Scala? Or is it possible to execute a raw mongo command like this from a Scala environment?
mongoimport --db issue --collection customer --type json --file /home/lastvalue/part-00000.json
In Java we can do it like this, but I need to implement it in Scala. Where do I get the libraries to import these classes?
When writing Scala, you can use any Java library, including the Process library that your link refers to.
This will allow you to run the mongoimport command in a process spawned from your Scala code. If you're looking for a solution written entirely in Scala, you should use the mongo-scala-driver. Its documentation includes a complete example that mimics mongoimport's functionality.

Spark SQL build for Hive?

I have downloaded the Spark 1.3.1 release, package type "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using the Spark shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is, if I verify it in the Hue browser with
select * from src;
I get a "table not found" exception, which means the table was not created. How do I configure Hive with the Spark shell to make this work? I want to use Spark SQL, and I also need to read and write data from Hive.
I have heard that we need to copy the hive-site.xml file somewhere into the Spark directory.
Can someone please walk me through the steps for configuring Spark SQL with Hive?
Thanks
Tushar
Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables.
Also, it sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.
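As a rough illustration of "Saving to Persistent Tables" (shown here in PySpark rather than the Scala shell above, and assuming hive-site.xml has been copied into Spark's conf/ directory so Spark and Hue share the same metastore), creating a Hive table from a dataframe looks like this; table and column names are just for illustration:

from pyspark.sql import SparkSession

# Requires a Spark build with Hive support; on Spark 1.3 use HiveContext instead.
spark = (
    SparkSession.builder
    .appName("hive-example")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.createDataFrame([(1, "one"), (2, "two")], ["key", "value"])

# Persist as a managed Hive table; it should then be visible from Hue/Hive.
df.write.mode("overwrite").saveAsTable("src")

spark.sql("SELECT * FROM src").show()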