I am trying to read some JSON files from Azure Blob Storage into a dataframe.
There is a built-in function, input_file_name(), that captures the input file name during spark.read.
Is there a similar built-in function for reading the file timestamp?
If not, how can I read the timestamp of the input file along with the data?
Does anyone have an idea or a workaround for this?
Do you mean the file's modification date? If so, you can use the metadata column file_modification_time.
This can be done as follows:
df = (spark.read.format('json').load(filepath)
      .select('*', '_metadata.file_modification_time'))
Documentation:
https://docs.databricks.com/ingestion/file-metadata-column.html#supported-metadata
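The same hidden _metadata column also exposes the file path, name, and size, so everything can be pulled in one select. A minimal sketch, assuming a file-based source on a runtime that supports the metadata column (field names as listed in the documentation above):
df = (spark.read.format('json').load(filepath)
      .select('*',
              '_metadata.file_path',                # full path of the input file
              '_metadata.file_name',                # file name only
              '_metadata.file_size',                # size in bytes
              '_metadata.file_modification_time'))  # last modification timestamp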
Related
I have a problem that I hope you can help me with.
The text file looks like this:
Report Name :
column1,column2,column3
this is row 1,this is row 2, this is row 3
I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I try to read the csv file using spark.read.csv() it thinks that the column name is "Report Name : ", which is obviously incorrect.
I know that the Pandas CSV reader has a skiprows option, but unfortunately I cannot read the file directly with Pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a Pandas dataframe via df.toPandas().
I'd like to be able to solve this with straight PySpark dataframes.
Surely someone else has encountered this issue! Help!
I have tried every variation of reading files, dropping rows, etc., but the schema has already been defined when the first dataframe was created, with one column (Report Name : ).
Not sure what to do now.
Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?
import csv

# Read the raw lines, parse them as CSV, then drop any row with fewer than two
# fields (e.g. the "Report Name :" line) as well as the header row itself.
df = (sc.textFile("test.csv")
      .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))
      .filter(lambda row: len(row) >= 2 and row[0] != 'column1')
      .toDF(['column1', 'column2', 'column3']))
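If you want to stay on the DataFrame API end to end, a sketch along these lines should also work for this file layout (it assumes the stray first line starts with "Report Name" and that no field contains an embedded comma):
from pyspark.sql import functions as F

raw = spark.read.text("test.csv")  # single column named "value", one row per line

df = (raw
      .filter(~F.col("value").startswith("Report Name"))    # drop the report title line
      .filter(F.col("value") != "column1,column2,column3")  # drop the header line
      .withColumn("parts", F.split("value", ","))
      .select(F.col("parts")[0].alias("column1"),
              F.col("parts")[1].alias("column2"),
              F.col("parts")[2].alias("column3")))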
Microsoft got back to me with an answer that worked! When using the Pandas CSV reader with a path to the source file, it requires a Blob Storage endpoint (not ADLS Gen2). I only had an endpoint with dfs in the URI, not blob. After I added the Blob Storage endpoint, the Pandas reader worked great. Thanks for looking at my thread.
I'm working in Azure Synapse Notebooks and reading file(s) into a dataframe from a well-formed folder path like so:
Given that there are many folders referenced by that wildcard, how do I capture the "State" value as a column in the resulting dataframe?
Use the input_file_name function to get the full input path, then apply regexp_extract to pull out the part that you want.
Example:
import pyspark.sql.functions as F

df = df.withColumn("filepath", F.input_file_name())
df = df.withColumn("filepath", F.regexp_extract("filepath", r"State=(.+)\.snappy\.parquet", 1))
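If the State=<value> segment is a directory rather than part of the file name, a narrower pattern that stops at the next path separator may be more robust (just a sketch; the exact layout is an assumption):
df = df.withColumn("State", F.regexp_extract(F.input_file_name(), r"State=([^/]+)", 1))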
No need to use the wildcard *.
Try: df = spark.read.load("abfss://....dfs.core.windows.net/")
Spark can read partitioned folders directly, and df should then contain the column State with its different values.
This question already has an answer here: Reading partition columns without partition column names
I have to read parquet files that are stored in the following folder structure
/yyyy/mm/dd/ (eg: 2021/01/31)
If I read the files like this, it works:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Unfortunately, the folder structure is not stored in the typical partitioned format /yyyy=2021/mm=01/dd=31/ and I don't have the luxury of converting it to that format.
I was wondering if there is a way I can provide Spark a hint as to the folder structure so that it would make "2021/01/31" available as yyyy, mm, dd in my dataframe.
I have another set of files, which are stored in the /yyyy=aaaa/mm=bb/dd=cc format and the following code works:
partitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
Things I have tried
I have specified the schema, but it just returned nulls
from pyspark.sql.types import StructType, StructField, LongType, TimestampType

customSchema = StructType([
    StructField("yyyy", LongType(), True),
    StructField("mm", LongType(), True),
    StructField("dd", LongType(), True),
    StructField("id", LongType(), True),
    StructField("a", LongType(), True),
    StructField("b", LongType(), True),
    StructField("c", TimestampType(), True)])

partitionDF = spark.read.option("mergeSchema", "true").schema(customSchema).parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
display(partitionDF)
The above returns no data. If I change the path to "abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet", then I get data, but the yyyy, mm, dd columns are empty.
Another option would be to load the folder path as a column, but I can't seem to find a way to do that.
TIA
Databricks N00B!
I suggest you load the data without the partition folders, as you mentioned:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Then add a column with the value of the input_file_name function:
import pyspark.sql.functions as F
unPartitionedDF = unPartitionedDF.withColumn('file_path', F.input_file_name())
Then you could split the values of the new file_path column into three separate columns.
df = (unPartitionedDF
      .withColumn('year', F.split(F.col('file_path'), '/').getItem(3))
      .withColumn('month', F.split(F.col('file_path'), '/').getItem(4))
      .withColumn('day', F.split(F.col('file_path'), '/').getItem(5)))
The input value of the getItem function depends on the exact folder structure you have. I hope this resolves your problem.
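If the path depth varies, an alternative (again just a sketch, assuming paths end in .../yyyy/mm/dd/<file>.parquet) is to extract the three parts with a single regular expression instead of positional splits:
date_pattern = r'/(\d{4})/(\d{2})/(\d{2})/[^/]+\.parquet$'

df = (unPartitionedDF
      .withColumn('year', F.regexp_extract('file_path', date_pattern, 1))
      .withColumn('month', F.regexp_extract('file_path', date_pattern, 2))
      .withColumn('day', F.regexp_extract('file_path', date_pattern, 3)))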
I have a scenario where I need to convert an Informatica mapping (source and target SQL Server) into PySpark code (source blob file and target Hive). In the expression transformation, one column contains the 'reg_extract' function and I need to convert this to a PySpark dataframe. My final goal is to create the same table in Hive as it is in SQL Server.
What will be the replacement for the reg_extract function in PySpark? I am using PySpark 2.
Below is the code from the Informatica Expression transformation (for one column variable field):
LTRIM(RTRIM(IIF(instr(v_DATE,'AMENDED')>0,
reg_Extract(DATE,'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'),
reg_Extract(DATE,'.*((\s0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(^0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(0[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[012])[./-][0-9]{2,4}|\s\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^(19|20)[0-9]{2}|^[0-9]{2}\s+\w+\s+[0-9]{4}|^[0-9]{6}|^(0?[1-9]|[12][0-9]|3[01])\s+\w+[.,]?\s+(19|20)[0-9]{2}|^[0-9]{1,2}[-,/]\w+[-,/][0-9]{2,4}).*'))))
In PySpark, I have loaded the source file into a dataframe and selected the required columns. After that, I am unable to proceed.
input_data=spark.read.csv(file_path,header=True)
input_data.createOrReplaceTempView("input_data")
df_test = "select ACCESSION_NUMBER, DATE, REPORTING_PERSON from input_data"
df = sqlContext.sql(df_test)
I am new to Pyspark/SparkSQL. Please help.
You can use regexp_extract:
from pyspark.sql.functions import col, regexp_extract
df = df.withColumn('New_Column_Name', regexp_extract(col('DATE'), r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*', 1))
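Since the Informatica expression also wraps the extraction in IIF(instr(...)) and LTRIM(RTRIM(...)), a fuller translation might look like the sketch below (only an outline; EXTRACTED_DATE is a hypothetical output column, and pattern_amended / pattern_default stand in for the two regexes from the expression above):
from pyspark.sql import functions as F

pattern_amended = r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'
pattern_default = r'...'  # placeholder for the second, longer regex from the expression above

# IIF(instr(DATE,'AMENDED')>0, reg_Extract(..., A), reg_Extract(..., B)) wrapped in LTRIM(RTRIM(...))
df = df.withColumn(
    'EXTRACTED_DATE',
    F.trim(
        F.when(F.instr(F.col('DATE'), 'AMENDED') > 0,
               F.regexp_extract(F.col('DATE'), pattern_amended, 1))
         .otherwise(F.regexp_extract(F.col('DATE'), pattern_default, 1))))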
Related question
I need to convert csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp, and I only have a week's worth of data. The timestamp format is:
'yyyy-MM-dd hh:mm:ss'
The output that I desire is that for every day there is a folder (or partition) where the Parquet files for that specific date are located. So there would be 7 output folders or partitions.
I only have a faint idea of how to do this; only sc.textFile comes to mind. Is there a function in Spark that can convert to Parquet? How do I implement this on S3 and HDFS?
Thanks for your help.
If you look into the Spark DataFrame API and the Spark-CSV package, this will achieve the majority of what you're trying to do: reading the CSV file into a dataframe and then writing the dataframe out as Parquet will get you most of the way there.
You'll still need to parse the timestamp and use the result to partition the data, as sketched below.
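A minimal PySpark sketch of that flow (paths and column names are assumptions; the equivalent read, to_date, and partitionBy calls exist in the Scala API as well):
from pyspark.sql import functions as F

# Read the gzipped CSVs (Spark decompresses .gz transparently).
df = (spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/input/*.csv.gz"))   # or "hdfs:///data/input/*.csv.gz"

# Derive a date column from the 'yyyy-MM-dd hh:mm:ss' timestamp string.
df = df.withColumn("event_date", F.to_date(F.col("timestamp")))

# Write one Parquet folder (partition) per day.
(df.write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3a://my-bucket/output/"))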
Old topic, but I think it is important to answer even old topics if they have not been answered correctly.
In Spark version >= 2, the CSV package is already included; before that you needed to add the Databricks CSV package to your job, e.g. "--packages com.databricks:spark-csv_2.10:1.5.0".
Example csv:
id,name,date
1,pete,2017-10-01 16:12
2,paul,2016-10-01 12:23
3,steve,2016-10-01 03:32
4,mary,2018-10-01 11:12
5,ann,2018-10-02 22:12
6,rudy,2018-10-03 11:11
7,mike,2018-10-04 10:10
First you need to create the Hive table so that the data written by Spark is compatible with the Hive schema (this might not be needed anymore in future versions).
create table:
create table part_parq_table (
id int,
name string
)
partitioned by (date string)
stored as parquet
After you've done that, you can easily read the CSV and save the dataframe to that table. The second step overwrites the column date with the date format "yyyy-MM-dd". For each value, a folder will be created with the matching rows in it.
SCALA Spark-Shell example:
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
The first two lines are Hive configurations, which are needed to create partition folders that do not already exist.
var df=spark.read.format("csv").option("header","true").load("/tmp/test.csv")
df=df.withColumn("date",substring(col("date"),0,10))
df.show(false)
df.write.format("parquet").mode("append").insertInto("part_parq_table")
After the insert is done, you can directly query the table, e.g. "select * from part_parq_table".
The folders will be created in the table folder, on a default Cloudera setup e.g. hdfs:///users/hive/warehouse/part_parq_table.
hope that helps
BR
Read the CSV (tab-delimited) file /user/hduser/wikipedia/pageviews-by-second-tsv:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
The following code uses Spark 2.0:
import org.apache.spark.sql.types._
var wikiPageViewsBySecondsSchema = StructType(Array(StructField("timestamp", StringType, true),StructField("site", StringType, true),StructField("requests", LongType, true) ))
var wikiPageViewsBySecondsDF = spark.read.schema(wikiPageViewsBySecondsSchema).option("header", "true").option("delimiter", "\t").csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
Convert the string timestamp to a timestamp type:
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.withColumn("timestampTS", $"timestamp".cast("timestamp")).drop("timestamp")
or
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.select($"timestamp".cast("timestamp"), $"site", $"requests")
Write into a Parquet file:
wikiPageViewsBySecondsDF.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")