Error building pipeline in Foundry Code Repos, Code works in Preview mode but fails in Build mode - pyspark

We keep getting the following error on a Foundry Code Repo transform. It works in preview mode, but fails in build mode.
No transforms discovered in the pipeline from the requested files.
Please add the transform to the pipeline definer.
If using the Build button in Authoring, please ensure you are running the build from the file where the transform is generated.
Also note generated transforms may not be discovered using the Authoring Build button, and can be triggered
instead through the Dataset Preview app.: {filesWithoutDatasets=[transforms-python/src/name_of_transform_file.py]}
The input dataset is the result of a Data Connector REST ingest, and it has a column, which I will call jsonResponseColumn, that contains the actual JSON response from that ingest.
The code roughly looks like this:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DecimalType
from transforms.api import transform_df, Input, Output
@transform_df(
    Output("output_df_location"),
    input_df=Input("input_df_location"),
)
def compute(input_df):
    schema = create_schema()
    parsed_df = input_df
    parsed_df = parsed_df.withColumn('newField1', F.from_json(parsed_df.jsonResponseColumn, schema, {"mode": "FAILFAST"}))
    parsed_df = parsed_df.withColumn('newField2', F.explode(parsed_df.newField1.fieldInJsonResponse))
    parsed_df = parsed_df.withColumn('newField3', parsed_df.newField2.nestedFieldInJsonResponse)
    parsed_df = parsed_df.withColumn(
        'id', ...
    ).withColumn(
        'key', ...
    ).withColumn(
        ...
    )
    return parsed_df.select(
        ...
    )

def create_schema():
    # basically returns a StructType([...]) that matches the JSON response from the REST ingest
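For illustration only, a hypothetical create_schema() matching the field accesses above (an array field fieldInJsonResponse whose elements contain nestedFieldInJsonResponse) might look roughly like this; the real schema has to mirror the actual REST response:

def create_schema():
    # Hypothetical sketch -- the real field names and types must match the ingested JSON.
    return StructType([
        StructField("fieldInJsonResponse", ArrayType(StructType([
            StructField("nestedFieldInJsonResponse", StringType()),
        ]))),
    ])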

The sub-folder containing this transform file was missing from pipeline.py. I added the missing transform sub-folder and it worked.
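For context, the per-sub-folder pipeline.py looked roughly like this (the sub-folder names here are hypothetical), so every new sub-folder had to be registered by hand:

from transforms.api import Pipeline
import repo_name.transforms.subfolder_a as subfolder_a
import repo_name.transforms.subfolder_b as subfolder_b

my_pipeline = Pipeline()
my_pipeline.discover_transforms(subfolder_a)
my_pipeline.discover_transforms(subfolder_b)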
And just so other devs don't have to update this file every time they add a new sub-folder, I moved all the sub-folders to /transforms and changed pipeline.py to what I have below, and it works too!
from transforms.api import Pipeline
import repo_name.transforms as transforms
my_pipeline = Pipeline()
my_pipeline.discover_transforms(transforms)

Related

Writing Parquet file with append or overwrite, error File doesn't exist, Scala Spark

When I try to write my final DF in append or overwrite mode, I sometimes get the following error:
Caused by: java.io.FileNotFoundException: File file:/C:/Users/xxx/ScalaSparkProjects/Date=2019-11-02/part-xxxx2x.28232x.213.c000.snappy.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
And I can't understand why. This is how I am writing the DF as a parquet file:
df.write.mode("append")
  .partitionBy("Date")
  .format("parquet")
  .save("/data/testing/files")
Why could this be happening?
Based on your information, consider this scenario:
Source DataFrame example under the path /tmp/sourceDF
Target path to save under /tmp/destDF
val sourceDF = spark.read.parquet("/tmp/sourceDF")
At this point Spark only reads the parquet metadata in this folder to infer the schema. For simplicity, the schema I used is just num: Integer.
You might think that all the data is loaded at this point, but Spark works lazily until an action occurs (actions include df.show(), df.take(1), df.count()),
so the following code results in an error.
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._

// The read is lazy: only the schema is resolved here.
val sourceDF = spark.read.parquet("/tmp/sourceDF")

// Delete the source files before any action has materialized the DataFrame.
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()

// The write is the first action: Spark now tries to read files that no longer exist.
sourceDF.write.parquet("/tmp/destDF")
The result will be:
java.io.FileNotFoundException: File file:/tmp/sourceDF/part-00000-1915503b-4beb-4e14-87ef-ca8b99fc4b11-c000.snappy.parquet does not exist
In order to fix this, you have two options I can think of.
Change the order:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
val sourceDF = spark.read.parquet("/tmp/sourceDF")
sourceDF.write.mode("append").parquet("/tmp/destDF")
// Deletion happens now after writing
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
Or you can use a checkpoint, which eagerly materializes the DataFrame at that point and persists it to the checkpoint directory:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
// set checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
// Checkpoint the df: this eagerly reads it and writes it to the checkpoint directory
val sourceDF = spark.read.parquet("/tmp/sourceDF").checkpoint()
// Now you can delete before writing it out
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
sourceDF.write.mode("append").parquet("/tmp/destDF")

Creating temporary resource test files in Scala

I am currently writing tests for a function that takes file paths and loads a dataset from them. I am not able to change the function. To test it, I currently create files for each run of the test. I am worried that simply making files and then deleting them is a bad practice. Is there a better way to create temporary test files in Scala?
import java.io.{File, PrintWriter}
val testFile = new File("src/main/resources/temp.txt" )
val pw = new PrintWriter(testFile)
val testLines = List("this is a text line", "this is the next text line")
testLines.foreach(pw.write)
pw.close
// test logic here
testFile.delete()
I would generally prefer java.nio over java.io. You can create a temporary file like so:
import java.nio.file.Files

val tempFile = Files.createTempFile("temp", ".txt")
You can delete it using Files.delete. To ensure that the file is deleted even in the case of an error, you should put the delete call into a finally block.
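A minimal sketch of that pattern, assuming the function under test (called loadDataset here, a hypothetical name) just needs a path to an existing file:

import java.nio.file.{Files, Path}

val testLines = List("this is a text line", "this is the next text line")
val tempFile: Path = Files.createTempFile("temp", ".txt")
try {
  // Write the test content to the temporary file.
  Files.write(tempFile, testLines.mkString("\n").getBytes("UTF-8"))
  // Test logic here, e.g. loadDataset(tempFile.toString)
} finally {
  // Delete the file even if the test logic above throws.
  Files.delete(tempFile)
}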

AttributeError for selfloop_edges()

When executing the following:
import networkx as nx
import matplotlib.pyplot as plt
import csv

with open("nutrients.csv") as file:
    reader = csv.reader(file)
    G = nx.Graph(reader)  # initialize Graph
    print(G.nodes())  # this part works fine
    print(repr(G.edges))
    G.selfloop_edges()  # attribute in question
It's coming back with
AttributeError: 'Graph' object has no attribute 'selfloop_edges'
Does anyone know what could be the issue?
You're getting an error because this method has been moved from the base graph class into the main namespace; see the migration guide from 1.X to 2.0. So either you're looking at the 1.X docs or using code from a previous release.
You need to call this method as:
nx.selfloop_edges(G, data=True)
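For example, on a small graph with a self-loop (a quick sketch, not from the original post):

import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 1), (1, 2)])  # (1, 1) is a self-loop

# In networkx 2.x the self-loop helpers live in the main namespace:
print(list(nx.selfloop_edges(G)))  # [(1, 1)]
print(nx.number_of_selfloops(G))   # 1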

How to check a file/folder is present using pyspark without getting exception

I am trying to check whether a file is present before reading it in PySpark on Databricks, to avoid exceptions. I tried the code snippet below, but I get an exception when the file is not present.
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
    df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", ",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")
When the file is there, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"
Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by @rosefun worked for me, but it took me a lot of time to get it working. So I am giving some details about how that solution works and what you should avoid.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is the same, and it works fine to check whether a file exists in the S3 bucket path that you provide.
You will have to adjust this function depending on how you specify the path value you pass to it.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
If the path variable you define has the s3 prefix in it, then it works.
Also, the portion of the code that splits the string gets you just the bucket name, as follows:
path.split("/")[2] will give you `bucket-name`
But if you don't have the s3 prefix in the path, then you will have to change the function slightly, as shown below:
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException:.
Spark throws different errors/exceptions than regular Python in a lot of cases. It isn't doing typical Python IO operations when reading a file, so it makes sense for it to throw a different exception.
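Applied to the snippet from the question, that change might look roughly like this (a sketch; AnalysisException is imported from pyspark.sql.utils, and sqlContext is the same one used in the question):

from pyspark.sql.utils import AnalysisException

try:
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .option("delimiter", ",") \
        .options(header='true', inferschema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except AnalysisException:
    print("file not found")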
Nice to see you on Stack Overflow.
I second dijksterhuis's solution, with one exception:
AnalysisException is a very general exception in Spark, and it may result from various causes, not only from a missing file.
If you want to check whether the file exists, you'll need to bypass Spark's FS abstraction and access the storage system directly (whether it is S3, POSIX, or something else). The downside of this solution is the lack of abstraction: once you change your underlying FS, you'll need to change your code as well.
You can validate existence of a file as seen here:
import os

if os.path.isfile('/path/file.csv'):
    print("File Exists")
    my_df = spark.read.load("/path/file.csv")
    ...
else:
    print("File doesn't exist")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.
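A common pattern is to wrap that call in a small helper, since dbutils.fs.ls throws when the path is missing (a sketch, assuming it runs in a Databricks notebook where dbutils is already defined):

def file_exists(file_location):
    try:
        dbutils.fs.ls(file_location)
        return True
    except Exception as e:
        # A missing path surfaces as a java.io.FileNotFoundException in the message.
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise  # anything else is a real error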

Read Parquet into scala without Spark

I have a Parquet file which I would like to read into my Scala program without using Spark or other Big Data Technologies.
I found the projects
https://github.com/apache/parquet-mr
https://github.com/51zero/eel-sdk
but I could not find detailed enough examples to get them to work.
Parquet-MR
https://stackoverflow.com/a/35594368/4533188 mentions this, but the examples given are not complete. For example, it is not clear what path is supposed to be. It is supposed to implement InputFile; how is this supposed to be done? Also, from the post it seems to me that Parquet-MR does not directly return the parquet data as standard Scala classes.
Eel
Here I tried
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val parquetFilePath = new Path("file://home/raeg/Datatroniq/Projekte/14. Witzenmann/Teilprojekt Strom und Spannung/python_witzenmann/src/data/1.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
ParquetSource(parquetFilePath)
.toDataStream()
.collect
.foreach(row => println(row))
but I get the error
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(ParquetReaderTesting.sc:2582)
at org.apache.hadoop.fs.FileSystem.createFileSystem(ParquetReaderTesting.sc:2589)
at org.apache.hadoop.fs.FileSystem.access$200(ParquetReaderTesting.sc:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(ParquetReaderTesting.sc:2628)
at org.apache.hadoop.fs.FileSystem$Cache.get(ParquetReaderTesting.sc:2610)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:366)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:165)
at dataReading.A$A6$A$A6.hadoopFileSystem$lzycompute(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.hadoopFileSystem(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.get$$instance$$hadoopFileSystem(ParquetReaderTesting.sc:7)
at #worksheet#.#worksheet#(ParquetReaderTesting.sc:30)
in my worksheet.