How to read .shp files in Databricks from the FileStore? - pyspark

I'm using Databricks Community Edition, and I saved a .shp file in the FileStore, but when I try to read it I get this error:
DriverError: /dbfs/FileStore/tables/World_Countries.shp: No such file or directory
This is my code:
import geopandas as gpd
gdf = gpd.read_file("/dbfs/FileStore/tables/World_Countries.shp")
I also tried
gdf = gpd.read_file("/FileStore/tables/World_Countries.shp")

You should first verify that the file path is correct and that the file actually exists in the specified location. You can list the contents of the directory with the dbutils.fs.ls command and check whether the file is present:
dbutils.fs.ls("dbfs:/FileStore/path/to/your/file.shp")
Also make sure that you have the permissions needed to access the file; in Databricks you may need to be an administrator or be granted access explicitly.
Try to read the file using the full path, including the file extension. Note that a "shapefile" data source is not part of plain Spark; it is only available if a geospatial library (for example Apache Sedona or a similar Spark shapefile reader) is installed on the cluster:
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
There are then several methods to read such files in Databricks, all of which assume such a library provides the reader:
1. Read the shapefile into a DataFrame and inspect it:
file_path = "dbfs:/FileStore/path/to/your/file.shp"
df = spark.read.format("shapefile").option("shape", file_path).load()
df.show()
2. Some libraries expose a shorthand reader:
df = spark.read.shape(file_path)
3. Convert the raw shape column into a geometry column (note that shape_to_geometry is not a standard function in pyspark.sql.functions; it would have to be supplied by the geospatial library in use):
from pyspark.sql import functions as F
from shapely.geometry import Point
geo_df = df.select("shape").withColumn("geometry", F.shape_to_geometry("shape")).drop("shape").select("geometry")
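Since the question itself uses geopandas, here is a minimal hedged sketch of that route, assuming the shapefile and its sidecar files (.shx, .dbf, .prj) were all uploaded to the same FileStore folder; the local /tmp path is illustrative. Because the /dbfs mount is not always available to local file APIs on Community Edition, copying the files to driver-local storage first is a common workaround:
import geopandas as gpd

# confirm all shapefile components are present in the FileStore directory
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))

# copy each component to driver-local disk so geopandas/fiona can open it
for ext in ["shp", "shx", "dbf", "prj"]:
    dbutils.fs.cp(f"dbfs:/FileStore/tables/World_Countries.{ext}",
                  f"file:/tmp/World_Countries.{ext}")

gdf = gpd.read_file("/tmp/World_Countries.shp")
print(gdf.head())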

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks and I'm using imports to bring functions in from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" takes as an argument a string that was read as a pyspark dataframe, like so:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)  # ps presumably being pyspark.pandas imported as ps
new_df is then passed as an argument to a function that calls the function x.
I then get an error like
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install the old_file in the cluster for this to work? If so, how would this work and will the package update if I change old_file again?
Thanks
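A hedged sketch of one common way to make such a sibling module importable on Databricks (the directory path below is illustrative, and old_file.py is assumed to sit next to new_file): append its folder to sys.path on the driver, and ship the file to the executors with addPyFile so that functions called inside Spark tasks can also resolve the import.
import sys

# assumption: old_file.py was uploaded alongside new_file, e.g. under /dbfs/FileStore/code
sys.path.append("/dbfs/FileStore/code")                            # driver-side import path
spark.sparkContext.addPyFile("/dbfs/FileStore/code/old_file.py")   # executor-side availability

from old_file import x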

Does ruamel.yaml have a function to do the process with all files in one directory?

Something like this:
data = yaml.load(Path("*.*"))
No, it does not, but you can do it in one line (assuming you have the imports and the YAML() instance):
from pathlib import Path
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = [yaml.load(p) for p in Path('.').glob('*.yaml')]
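If you also want to keep track of which file each document came from, or write them back out afterwards, a small hedged variation (the output directory name is illustrative):
from pathlib import Path
import ruamel.yaml

yaml = ruamel.yaml.YAML()
# map each file name to its parsed document
data = {p.name: yaml.load(p) for p in Path('.').glob('*.yaml')}

# round-trip every document into an output directory, preserving formatting
out = Path('out')
out.mkdir(exist_ok=True)
for name, doc in data.items():
    yaml.dump(doc, out / name)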

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark but it throws an error: No module named 'pyspark'. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env and I tried pip install pyspark but I already have it.
If you are using Zepl, it has its own specific way of importing. That makes sense: since the notebooks run in the cloud, each cell has to declare which interpreter it uses, which distinguishes their syntax from plain Python. For instance %spark.pyspark.
%spark.pyspark
from pyspark.sql import SparkSession
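As a minimal sketch of a complete paragraph, assuming the Spark interpreter is bound to the note (in Zepl/Zeppelin the pyspark interpreter usually pre-creates a session for you, so the builder call mostly just reuses it):
%spark.pyspark
from pyspark.sql import SparkSession

# reuse the session the interpreter already created, or create one if it doesn't exist yet
spark = SparkSession.builder.appName("basics").getOrCreate()
print(spark.version)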

I am able to create a .csv file using a Talend job and I want to convert the .csv to a .parquet file using the tSystem component?

I have a Talend job that creates a .csv file, and now I want to convert it to .parquet format using Talend v6.5.1. The only option I can think of is the tSystem component calling a Python script from the local directory where the .csv lands temporarily. I know I can convert this easily using pandas or pyspark, but I am not sure the same code will work through tSystem in Talend. Can you please provide suggestions or instructions?
Code:
import pandas as pd
DF = pd.read_csv("Path")
DF.to_parquet("Path")  # to_parquet is a DataFrame method that writes the file to disk
If you have an external script on your file system, you can try
"python \"myscript.py\" "
Here is a link on the Talend forum regarding this problem:
https://community.talend.com/t5/Design-and-Development/how-to-execute-a-python-script-file-with-an-argument-using/m-p/23975#M3722
I was able to resolve the problem by following the steps below:
import pandas as pd
import pyarrow as pa
import numpy as np
import sys
filename = sys.argv[1]
test = pd.read_csv(r"C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".csv")
test.to_parquet(r"C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".parquet")

How to execute arbitrary SQL from a pyspark notebook using SQLContext?

I'm trying a basic test case of reading data from dashDB into spark and then writing it back to dashDB again.
Step 1. First within the notebook, I read the data:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
dashdata = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="GOSALES.BRANCH"
).cache()
Step 2. Then from dashDB I create the target table:
DROP TABLE ****.FROM_SPARK;
CREATE TABLE ****.FROM_SPARK AS (
SELECT *
FROM GOSALES.BRANCH
) WITH NO DATA
Step 3. Finally, within the notebook I save the data to the table:
from pyspark.sql import DataFrameWriter
writer = DataFrameWriter(dashdata)
dashdata = writer.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50000/BLUDB:user=****;password=****;",
    table="****.FROM_SPARK"
)
Question: Is it possible to run the sql in step 2 from pyspark? I couldn't see how this could be done from the pyspark documentation. I don't want to use vanilla python for connecting to dashDB because of the effort involved in setting up the library.
Use ibmdbpy. See this brief demo.
With as_idadataframe() you can upload DataFrames into dashDB as a table.
Added key steps here as stackoverflow doesn't like linking to answers:
Step 1: Add a cell containing:
#!pip install --user future
#!pip install --user lazy
#!pip install --user jaydebeapi
#!pip uninstall --yes ibmdbpy
#!pip install ibmdbpy --user --no-deps
#!wget -O $HOME/.local/lib/python2.7/site-packages/ibmdbpy/db2jcc4.jar https://ibm.box.com/shared/static/lmhzyeslp1rqns04ue8dnhz2x7fb6nkc.zip
Step 2: Then, from another notebook cell:
from ibmdbpy import IdaDataBase
idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')
....
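As a hedged sketch of what the upload step mentioned above could look like once the connection exists (the table name and sample DataFrame are illustrative, and the exact ibmdbpy method signatures should be checked against its documentation):
import pandas as pd
from ibmdbpy import IdaDataBase

idadb = IdaDataBase('jdbc:db2://<dashdb server name>:50000/BLUDB:user=<dashdb user>;password=<dashdb pw>')

# push a pandas DataFrame into dashDB as a new table
local_df = pd.DataFrame({"BRANCH_CODE": [5], "CITY": ["Amsterdam"]})
ida_df = idadb.as_idadataframe(local_df, "FROM_SPARK_DEMO")

# arbitrary SQL can then be issued through the same connection
print(idadb.ida_query("SELECT COUNT(*) FROM FROM_SPARK_DEMO"))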
Yes, you can create a table in dashDB from the notebook.
Below is the code for Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import java.sql.Connection
import java.sql.DriverManager
import java.sql.SQLException
import com.ibm.db2.jcc._
import java.io._
val jdbcClassName="com.ibm.db2.jcc.DB2Driver"
val url="jdbc:db2://awh-yp-small02.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;" // enter the host/IP from the connection settings
val user="<username>"
val password="<password>"
Class.forName(jdbcClassName) // load the DB2 JDBC driver
val connection = DriverManager.getConnection(url, user, password)
val stmt = connection.createStatement()
// arbitrary DDL can be executed directly over JDBC
stmt.executeUpdate("CREATE TABLE COL12345(" +
  "month VARCHAR(82))")
stmt.close()
connection.commit()
connection.close()
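The same JDBC route can also be sketched in Python from the notebook with jaydebeapi, which the pip cells above already install; the jar path, schema, and credentials below are placeholders:
import jaydebeapi

conn = jaydebeapi.connect(
    "com.ibm.db2.jcc.DB2Driver",
    "jdbc:db2://<dashdb server name>:50000/BLUDB",
    ["<dashdb user>", "<dashdb pw>"],
    "/path/to/db2jcc4.jar",   # e.g. the jar downloaded in Step 1 above
)
curs = conn.cursor()
# the DDL from Step 2 of the question, run from Python instead of the dashDB console
curs.execute("CREATE TABLE <schema>.FROM_SPARK AS (SELECT * FROM GOSALES.BRANCH) WITH NO DATA")
curs.close()
conn.close()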