I want to load XML files from a specific folder with PySpark, but I don't want to use the com.databricks.spark.xml package. Every example I find uses the com.databricks.spark.xml package.
Is there any way to read XML files without this package?
Can you use xml.etree.ElementTree as ET? If yes, write a Python function using this module and create a UDF from it, then read the XML files into PySpark as RDDs and parse them with it.
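A minimal sketch of that approach, assuming each file holds one small XML document whose record elements have id and value children (these element names, the folder path, and the schema are placeholders to adapt):

import xml.etree.ElementTree as ET

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical result schema: a list of (id, value) structs per file.
record_schema = ArrayType(StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
]))

def parse_records(xml_string):
    """Parse one XML document and return a list of (id, value) tuples."""
    root = ET.fromstring(xml_string)
    return [(rec.findtext("id"), rec.findtext("value"))
            for rec in root.iter("record")]

# wholeTextFiles yields one (path, content) pair per file in the folder,
# so each XML document stays intact instead of being split line by line.
rdd = spark.sparkContext.wholeTextFiles("hdfs:///data/xml_folder/")

# Option 1: parse directly on the RDD with the plain Python function.
parsed_rdd = rdd.mapValues(parse_records)

# Option 2: wrap the same function in a UDF and apply it to a DataFrame column.
parse_udf = udf(parse_records, record_schema)
df = (spark.createDataFrame(rdd, ["path", "xml"])
           .withColumn("records", parse_udf("xml"))
           .drop("xml"))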
How can I convert an uploaded CSV to a DataFrame in Foundry using a Code Workbook? Should I use the @transform decorator with spark.read.... (not sure of the exact syntax)?
Thanks!!
CSV is a "special format": Foundry can infer the schema of the CSV and automatically convert it to a dataset. In this example, I uploaded a CSV with hail data from NOAA.
If you hit Edit Schema on the main page, you can use a front-end tool to set certain configurations for the CSV (delimiter, column name row, column types, etc.).
Once it's a dataset, you can import the dataset in Code Workbooks and it should work out of the box.
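Once imported, a minimal sketch of what a Code Workbook Python transform might look like (the hail_data alias is a placeholder for whatever the imported dataset is called):

# Code Workbook transform: the parameter name matches the imported dataset's
# alias, and it arrives as a Spark DataFrame once the dataset has a schema.
def clean_hail(hail_data):
    return hail_data.dropDuplicates()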
If you want to parse the raw file, I would recommend using Code Authoring rather than Code Workbooks, as it's better suited for production-level code, has less overhead, and is more efficient for this type of workflow (parsing raw files). If you really want to use Code Workbooks, change the type of your input using the input helper bar or in the Inputs tab.
Once you finish iterating, please move this to a Code Authoring repository and repartition your data. File reads in a Code Workbook can substantially slow down your whole pipeline. Code Authoring now offers a preview of raw files, so it's just as fast to develop with as Code Workbooks.
Note: only imported datasets and persisted datasets can be read in as Python transform inputs; transforms that are not saved as a dataset cannot. Datasets with no schema should be read in as a transform input automatically.
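For reference, a minimal sketch of a Python transform in Code Authoring that takes the imported dataset as input (the dataset paths are placeholders):

from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/Project/folder/hail_clean"),     # placeholder output path
    hail=Input("/Project/folder/hail_csv"),   # placeholder: the uploaded CSV dataset
)
def compute(hail):
    # The input arrives as a Spark DataFrame with the schema set via Edit Schema.
    return hail.dropDuplicates()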
Is there any approach to read HDFS data into a Spark DataFrame without explicitly specifying the file type? For example:
spark.read.format("auto_detect").option("header", "true").load(inputPath)
We can achieve the above requirement by using scala.sys.process or Python subprocess(cmd) and splitting the extension of any part file. But can we achieve this without using any subprocess or sys.process call?
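For reference, a rough sketch of the subprocess-based workaround described above; the input path, the reliance on the first part file's extension, and the extension-to-format mapping are all assumptions:

import os
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
inputPath = "hdfs:///data/landing/table1"  # placeholder path

# List the files under the input path with the HDFS CLI; the last column of
# each listing line is the file path.
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", inputPath],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
part_files = [line.split()[-1] for line in listing if "/part-" in line]

# Infer the Spark format from the extension of the first part file.
ext = os.path.splitext(part_files[0])[1].lstrip(".").lower()
fmt = {"csv": "csv", "json": "json", "parquet": "parquet", "orc": "orc"}.get(ext, "text")

df = spark.read.format(fmt).option("header", "true").load(inputPath)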
I need to build a module in Scala in which the source data comes from two modules written in PySpark. Can you help me read data from PySpark into the Scala module?
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those multiple files to the source file.
I found that FileUtil provides the copyMerge function, but it doesn't allow appending two files.
Thank you for your help
You can do this with two methods:
sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
Or, as @Pushkr has proposed:
new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), ...)).coalesce(1).saveAsTextFile("path/newSource")
If you don't want to create a new source each time and would rather overwrite the same source every hour, you can use a DataFrame with save mode overwrite (see: How to overwrite the output directory in Spark).
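A PySpark sketch of that DataFrame variant (paths are placeholders, and spark is an existing SparkSession); note that Spark cannot read from and overwrite the same directory within one job, so the merged copy is written to a separate path:

# Read the current source together with the new hourly files, then write a
# single merged copy with overwrite mode so the previous merge is replaced.
merged = spark.read.text(["path/source", "path/file1", "path/file2"])

merged.coalesce(1).write.mode("overwrite").text("path/newSource")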
I used saveAsTextFile("outputPath") to save a file using Scala in Spark.
I want to read the saved file back from HDFS line by line, like the getline command in C or Java.
How can I do this?
Is it possible to read the file line by line?
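If driver-side, line-by-line access is what's needed, one possible PySpark sketch (the output path is a placeholder) reads the saved text back and iterates over it much like getline:

# saveAsTextFile writes a directory of part files; textFile reads them back.
lines = spark.sparkContext.textFile("hdfs:///outputPath")

# toLocalIterator streams the rows to the driver one partition at a time,
# so each line can be consumed individually.
for line in lines.toLocalIterator():
    print(line)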