Equivalent of collection.groupBy in scalaz-streams - scala

I have a folder that contains multiple files with names such as filetype1_ddMMyyyy_hhmm, filetype2_ddMMyyyy_hhmm.
For each day there can be multiple files with different hours, and I need to parse only the one with the highest hour. In a non-reactive-stream world the algorithm could be implemented as a groupBy on the date; what is its equivalent in scalaz-stream?
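One hedged option, sketched below: since a directory listing is finite, the plain collection.groupBy can run up front, and only the selected file per day needs to enter the stream (e.g. via Process.emitAll). The folder path and the assumption that the hhmm part is zero-padded are mine.

```scala
import java.io.File

import scalaz.stream.Process

// Assumed name pattern: <filetype>_<ddMMyyyy>_<hhmm>, with zero-padded hours so
// that a plain string comparison picks the latest one.
val dir = new File("/path/to/folder")                     // hypothetical location
val files: List[File] = Option(dir.listFiles()).map(_.toList).getOrElse(Nil)

// Plain collection.groupBy: keep, per (filetype, day), only the file with the highest hhmm.
val latestPerDay: List[File] =
  files
    .groupBy { f =>
      val parts = f.getName.split("_")
      (parts(0), parts(1))                                // key = (filetype, ddMMyyyy)
    }
    .values
    .map(_.maxBy(_.getName.split("_")(2)))
    .toList

// Lift the selected files into a scalaz-stream source; parsing happens downstream.
val source: Process[Nothing, File] = Process.emitAll(latestPerDay)
```

From there the per-file parsing can be attached with the usual scalaz-stream combinators, since only one file per day ever enters the stream.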

Related

Reading multiple folders in parallel

I have multiple part folders, each containing Parquet files (example given below). Across part folders the schema can differ (either the number of columns or the datatype of certain columns). My requirement is to read all the part folders and finally create a single df according to a predefined schema that is passed in.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure what kind of changes exist in which part folders, I read each part folder individually, compare its schema with the predefined schema, and make the necessary changes, i.e. adding/dropping columns or typecasting the column datatypes. Once done, I write the result to a temp location and then move on to the next part folder, repeating the same operation. Once all the part folders have been read, I read the temp location in one go to get the final output.
Now I want to do this operation in parallel, i.e. have parallel threads/processes (?) that read part folders concurrently, execute the logic of schema comparison and any necessary changes, and write to a temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in most of those scenarios the schema is the same across directories, so they use a wildcard in the input path to read everything and create the df; that is not going to work in my case. The problem statement in the link below is similar to mine, but in my case the number of part folders to read is arbitrary and sometimes over 1000, and there are additional operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first one transforms the existing data into the appropriate schema, and the second one reads the transformed data in a convenient way (with * wildcards). Use Airflow (or Oozie) to start one data transformer application per directory. And after all instances of the data transformer have finished successfully, run the union app.
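A rough sketch of that two-phase approach in Spark's Scala API. The target schema, folder list and temp location below are placeholders: the alignment step adds missing columns as nulls, casts existing ones and drops extras, and the Futures simply submit one Spark job per part folder concurrently from the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructType}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

val spark = SparkSession.builder.getOrCreate()

// Hypothetical predefined schema, part-folder list and temp location.
val targetSchema: StructType = new StructType()
  .add("id", LongType).add("name", StringType).add("amount", DoubleType)
val partFolders: Seq[String] = Seq("/feed=abc/date=20221220", "/feed=abc/date=20221221")
val tempRoot = "/tmp/feed_abc_aligned"

// Phase 1 helper: add missing columns as nulls, cast existing ones, drop extras.
def align(df: DataFrame): DataFrame = {
  val existing = df.columns.toSet
  val withAll = targetSchema.fields.foldLeft(df) { (acc, f) =>
    if (existing.contains(f.name)) acc.withColumn(f.name, col(f.name).cast(f.dataType))
    else acc.withColumn(f.name, lit(null).cast(f.dataType))
  }
  withAll.select(targetSchema.fieldNames.map(col): _*)   // enforce column set and order
}

// Spark job submission is thread-safe, so one Future per part folder lets the
// alignment jobs run concurrently from the driver.
val jobs = partFolders.map { path =>
  Future {
    align(spark.read.parquet(path))
      .write.mode("overwrite")
      .parquet(s"$tempRoot/${path.split('/').last}")
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)

// Phase 2: everything under tempRoot now has the same schema, so one wildcard read works.
val finalDf = spark.read.parquet(s"$tempRoot/*")
```

An orchestrator such as Airflow or Oozie, as suggested above, does the same thing at the process level; the Future-based version keeps everything inside one Spark application.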

Retention scripts to container data

I'm trying to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
Basically, a new file is added every day, using the copy activity from Azure Data Factory. In reality there are more than 3 files per day.
I want to start applying different retention policies to different files. For example, I want to delete each emails.parquet file once it is 30 days old, and I want to anonymise the customer files by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iteratively look through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I'd plan to use pyspark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for making date comparisons to decide if my data is to be not retained, I can either use the folder structures for the dates (but I don't know how to do this), or there's a "RowStartDate" in every parquet file that I can use too.
Can anybody point me in the right direction for achieving this, either by the route I'm alluding to above (a pyspark script that iterates through the container folders, loads the data into a data frame, transforms it, then overwrites the original file) or by any other means?
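One possible shape for this, sketched with Spark's Scala API (the equivalent Hadoop FileSystem and DataFrame calls are also reachable from a PySpark notebook). The container root, the 30-day cutoff and the anonymised column names are assumptions, not part of the original question.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder.getOrCreate()

val root = "abfss://container@account.dfs.core.windows.net/landing"  // hypothetical container root
val fs = new Path(root).getFileSystem(spark.sparkContext.hadoopConfiguration)
val cutoff = LocalDate.now().minusDays(30)
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

// The layout is yyyy/MM/dd/<name>.parquet, so every file sits three folder levels down.
val files: Array[FileStatus] =
  Option(fs.globStatus(new Path(s"$root/*/*/*/*.parquet"))).getOrElse(Array.empty)

files.foreach { status =>
  val path = status.getPath
  // Recover the folder date from the path, e.g. .../2022/06/30/emails.parquet
  val relative = path.toString.stripPrefix(root + "/").split("/")
  val folderDate = LocalDate.parse(relative.take(3).mkString("/"), fmt)

  path.getName match {
    case "emails.parquet" if folderDate.isBefore(cutoff) =>
      fs.delete(path, true)                               // drop email data older than 30 days

    case "customer.parquet" =>
      // Anonymise: replace sensitive columns with placeholder text, then swap the
      // rewritten copy in place of the original so downstream reads keep working.
      val tmp = new Path(path.toString + "_anonymised")
      spark.read.parquet(path.toString)
        .withColumn("email", lit("REDACTED"))             // hypothetical column names
        .withColumn("name", lit("REDACTED"))
        .write.mode("overwrite").parquet(tmp.toString)
      fs.delete(path, true)
      fs.rename(tmp, path)

    case _ => ()                                          // leave other files alone
  }
}
```

The RowStartDate column mentioned in the question would work just as well as the folder date; the sketch uses the path only because it avoids reading files that are going to be deleted anyway.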

Modifying number of output files per write-partition with spark

I have a data source that consists of a huge number of small files. I would like to save it partitioned by the column user_id to another storage location:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is I want another system to be able to delete only select users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).
How do I obtain this with pyspark?
You can use repartition on the partition column to ensure that each partition gets exactly one file:
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This will make sure one file is created for each partition, whereas coalesce can cause trouble when there is more than one partition.
Alternatively, just add coalesce with the number of files you want:
sdf.coalesce(1).write.partitionBy("user_id").json("...")

Spark.SQL – Aggregation of separated data in parallel

My task is to aggregate data by hour (and store each hour as a row in a DB).
For aggregating one hour, there is no need to know what the other hours have.
The input is JSON files. An important point is that these files are stored in separate folders, one folder per hour.
I have 2 questions:
What is the right way to aggregate in such a scenario? I'd want to "send" each hour's data to different node(s) and aggregate them separately in parallel, so that in the end I finish with a dataframe that contains only the aggregated result for each hour. I understand that simple partitioning doesn't return such a dataframe.
How could I take advantage of those separate folders? Is it worth reading each hour's data separately and then combining everything with a union (while preserving the partitioning, as here)? Does that actually save the group-by operation?
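A sketch of the second option, assuming a hypothetical /data/<date>/<hour> layout and made-up metrics: aggregate each hour's folder on its own, then union the tiny per-hour results.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit, sum}

val spark = SparkSession.builder.getOrCreate()

// Hypothetical layout: one folder of JSON files per hour, e.g. /data/2023-01-01/00 ... /data/2023-01-01/23
val hourFolders: Seq[String] = (0 until 24).map(h => f"/data/2023-01-01/$h%02d")

// Aggregate each hour on its own: each job reads only one folder, so no shuffle
// ever mixes data belonging to different hours.
val perHour: Seq[DataFrame] = hourFolders.map { path =>
  spark.read.json(path)
    .agg(count(lit(1)).as("events"), sum("amount").as("total"))  // "amount" is a made-up field
    .withColumn("hour", lit(path.split("/").last))
}

// Each per-hour result is a single row, so the union is cheap and the final
// DataFrame contains exactly one aggregated row per hour.
val result: DataFrame = perHour.reduce(_ union _)
result.show()
```

As written the per-folder jobs run one after another; wrapping each read/aggregate in a Future (or equivalent) would submit them concurrently if that is what "in parallel" needs to mean here.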

Spark: Is it possible to load an RDD from multiple files in different formats?

I have a heterogeneously formatted input of files, processed in batch mode.
I want to run a batch over a number of files. These files are of different formats, and they will have different mappings to normalize data (e.g. extract fields with different schema names or positions in the records, to a standard naming).
Given the tabular nature of the data, I'm considering using Dataframes (cannot use datasets due to the Spark version I'm bound to).
In order to apply different extraction logic to each file, does each file need to be loaded into a separate dataframe, have the extraction logic applied (a process that differs per file type and is configured in terms of, e.g., CSV/JSON/XML, position of fields to select (CSV), name of fields to select (JSON), etc.), and then have the datasets joined?
That would force me to iterate over the files, act on each dataframe separately, and join the dataframes afterwards, instead of applying the same (configurable) logic to everything at once.
I know I could do it with RDDs, i.e. load all files into an RDD, emit a PairRDD[fileId, record], and then run a map where you look up the fileId to get the configuration to apply to that record, which tells you which logic to apply.
I'd rather use Dataframes, for all of the niceties they offer over raw RDDs, in terms of performance, simplicity and parsing.
Is there a better way to use Dataframes to address this problem than the one already explained? Any suggestions or misconceptions I may have?
I'm using Scala, though it should not matter to this problem.
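One way to keep the DataFrame route, sketched against a Spark 2.x-style API with made-up paths, formats and column mappings: drive each load from a small per-file configuration, rename and cast into a standard set of columns, and union the normalized frames so the common (configurable) logic runs on a single DataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType, StringType}

val spark = SparkSession.builder.getOrCreate()

// Standard column layout every file gets normalized into (names are illustrative).
val standardColumns: Seq[(String, DataType)] =
  Seq("customer_id" -> StringType, "amount" -> DoubleType)

// Hypothetical per-file configuration: reader format/options plus a field mapping
// from the source names to the standard names.
final case class FileConf(path: String,
                          format: String,
                          options: Map[String, String],
                          mapping: Seq[(String, String)])

val confs = Seq(
  FileConf("/in/a.csv",  "csv",  Map("header" -> "true"), Seq("cust"       -> "customer_id", "amt"   -> "amount")),
  FileConf("/in/b.json", "json", Map.empty,               Seq("customerId" -> "customer_id", "value" -> "amount"))
)

// Load a file with its own reader settings, then rename, cast and reorder into the
// standard layout so every resulting DataFrame has the same schema.
def normalize(conf: FileConf): DataFrame = {
  val raw = spark.read.format(conf.format).options(conf.options).load(conf.path)
  val renamed = conf.mapping.foldLeft(raw) { case (df, (from, to)) => df.withColumnRenamed(from, to) }
  renamed.select(standardColumns.map { case (name, dt) => col(name).cast(dt).as(name) }: _*)
}

// All normalized frames share one schema, so a plain union yields a single DataFrame.
val unified: DataFrame = confs.map(normalize).reduce(_ union _)
```

This is essentially the PairRDD[fileId, record] idea expressed per dataframe: the per-file configuration replaces the fileId lookup, and everything after the union stays format-agnostic.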