Spark Read Multiple Parquet Files from a variable - pyspark

I have an MS SQL table which contains a list of files that are stored within an ADLS Gen2 account. All files have the same schema and structure.
I have concatenated the results of the table into a string.
mystring = ""
for index, row in files.iterrows():
    mystring += "'" + row["path"] + "',"
mystring = mystring[:-1]
print(mystring)
OUTPUT
'abfss://[file]#[container].dfs.core.windows.net/ARCHIVE/2021/08/26/003156/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/08/30/002554/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/09/02/003115/file.parquet'
I am now attempting to pass the string using
sdf = spark.read.parquet(mystring)
however, I am getting the following error:
IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 'abfss://[file]#[container].dfs.core.windows.net/ARCHIVE/2021/08/26/003156/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/08/30/002554/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/09/02/003115/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/09/24/003516/file.parquet','abfss:/[file]#[container].dfs.core.windows.net/ARCHIVE/2021/10/07/002659/file.parquet'
When I manually copy and paste the contents of mystring into read.parquet, the code executes with no errors.
Maybe I'm going down a rabbit hole, but some feedback would be much appreciated.

After reproducing this on my end, I was able to achieve it with the approach below.
paths = []
for index, row in files.iterrows():
    paths.append(row["path"])

# unpack the list so each path is passed as a separate argument
df = spark.read.parquet(*paths)
NOTE: Make sure you have the same schema in all the files.
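For context, the original attempt fails because spark.read.parquet receives one single string whose first character is a literal quote, so Spark tries to parse 'abfss://... as a URI and stops at the illegal leading ' character. If you would rather pass everything in one argument instead of unpacking the list, here is a minimal sketch (assuming files is the same pandas DataFrame with a path column as in the question) that hands the list to load, which accepts a list of paths:

# build a plain Python list of paths; no quoting or joining required
paths = [row["path"] for index, row in files.iterrows()]

# DataFrameReader.load accepts a list of paths, so the whole list can be passed at once
df = spark.read.format("parquet").load(paths)

Either way, Spark receives each path as its own string rather than one comma-separated literal.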


PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark, I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full, as some of them can be quite large.
I am new to PySpark, so I do not yet understand the basics and would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append all the filenames stored in the source:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)
# converting filenames to tuple
d = [(x,) for x in filenames]
print(d)
Read the data from multiple files and store it in a list:
# create data by reading from multiple files
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
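As a side note, the manual index counter can be avoided. A slightly tidier sketch of the same idea (still assuming the /FileStore/tables/ folder and that every file has a header row) builds the (filename, firstrow) pairs directly:

# same approach without the counter: pair each filename with its first row
data = []
for f in dbutils.fs.ls("/FileStore/tables/"):
    first = spark.read.option("header", "true").csv(f.path).limit(1).collect()
    data.append((f.name, str(first[0]) if first else None))

df = spark.createDataFrame(data, ["filename", "filedata"])
df.show(truncate=False)

limit(1) means Spark only needs the first row of each file, although it still has to open every file, so a folder with very many files will still take some time to process.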

Using a CSV's date in Power BI

I'm currently building a dashboard with Power BI and several CSVs as sources.
Those CSVs will have regular updates, and I want a visualisation showing the date of the last source update (for every CSV).
Is there a way to use the CSV file metadata in Power BI to visualise it?
Or a better way to get what I'm seeking?
Regards,
I tried the proposed solution, but I get an error which suggests the function name is not spelled correctly. Any idea?
You can use Folder.Files with appropriate filters to get to the file metadata.
It's simple enough to use a function to get the data you want:
fnFileData:
(FullFilename as text, OutputType as text) =>
let
    Filepath = Text.BeforeDelimiter(FullFilename, "\", {0, RelativePosition.FromEnd}) & "\",
    Filename = Text.AfterDelimiter(FullFilename, "\", {0, RelativePosition.FromEnd}),
    Filtered = Table.SelectRows(Folder.Files(Filepath), each [Folder Path] = Filepath and [Name] = Filename),
    Filedata = Table.TransformColumnNames(Filtered, Text.Lower),
    Output = Table.Column(Filedata, Text.Lower(OutputType)){0}
in
    Output
In this case, if you want the Date Modified, then invoke as
= fnFileData("C:\MyPath\MyFile.txt", "Date Modified")

Split data into good and bad rows and write to output files using a Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I checked the first record, the value could be seen in the spark shell:
good.first()
But when I try to write to an output file, I am seeing the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], so the write operation outputs each array's default toString representation instead of the string values.
You should convert each array of strings back to a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")

Escape hyphen when reading multiple dataframes at once in pyspark

I can't seem to find anything about this in PySpark's docs.
I am trying to read multiple parquets at once, like this:
df = sqlContext.read.option("basePath", "/some/path")\
.load("/some/path/date=[2018-01-01, 2018-01-02]")
And receive the following exception
java.util.regex.PatternSyntaxException: Illegal character range near index 11
E date=[2018-01-01,2018-01-02]
I tried replacing the hyphen with \-, but then I just receive a file not found exception.
I would appreciate any help with this.
Escaping the - is not your issue. You cannot specify multiple dates in a list like that. If you'd like to read multiple files, you have a few options.
Option 1: Use * as a wildcard:
df = sqlContext.read.option("basePath", "/some/path").load("/some/path/date=2018-01-0*")
But this will also read any other matching files, such as /some/path/date=2018-01-03 through /some/path/date=2018-01-09.
Option 2: Read each file individually and union the results
from functools import reduce  # reduce is not a builtin on Python 3

dates = ["2018-01-01", "2018-01-02"]
df = reduce(
    lambda a, b: a.union(b),
    [
        sqlContext.read.option("basePath", "/some/path").load(
            "/some/path/date={d}".format(d=d)
        ) for d in dates
    ]
)
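A further possibility is to pass the paths as a plain list; this is a minimal sketch, assuming your PySpark's DataFrameReader.load accepts a list of paths (recent versions do):

# sketch: load() also accepts a list of paths, so no wildcard or union is needed
paths = ["/some/path/date=2018-01-01", "/some/path/date=2018-01-02"]
df = sqlContext.read.option("basePath", "/some/path").load(paths)

Hadoop-style curly-brace globs such as /some/path/date={2018-01-01,2018-01-02} should also work; square brackets, by contrast, are interpreted as a character range, which is what triggers the PatternSyntaxException above.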

How to subtract or remove the contents of a context variable from a string in Talend

In Talend Open Studio, if I have a context variable which points to a directory C:/MyData, how can I subtract that from a directory string, e.g. C:/MyData/Folder/Sub/, so that I end up with /Folder/Sub/ for additional processing?
I tried storing C:/MyData/Folder/Sub/ in a variable Path and the context as a string, then using Var.Path.replace(Var.ContextAsString, "") in tMap, but that didn't affect the output at all.
Are there better ways to manipulate strings that represent directory paths using Talend tMap?
No need to declare a tMap variable.
Suppose the field containing the full path is "row1.fullpath" and the context variable is called root (containing "D:/MyData").
On the right part of the tMap, just write:
row1.fullpath.replace(context.root, "")
You can refer to the example below and port it to a tMap expression.
String s1 = "C:/MyData";
String s2 = "C:/MyData/Folder/Sub/";
String s3 = (s2.indexOf(s1) >= 0) ? s2.substring(s2.indexOf(s1) + s1.length()): s2;
System.out.println(s3);