Apache Druid Appending Segment without dropping or summing it - druid

I have three JSON files with the same timestamp but different values to upload to the Druid. I want to upload them separately with the same segment granularity. However, it drops the existing segment and uploads the new one.
I don't want to use appendToExisting: True bc it sums the values of the same rows. This is the situation that I don't want to happen (I may be adding the same file in the future).
Is there a way to add new data to a specific segment without dropping or summing it?

Related

Retention scripts to container data

I'm trying to do something to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
That's basically every day a new file is added, using the copy task from azure data factory. There are in reality more than 3 files per day.
I want to start applying different retention policies to different files. For example, the emails.parquet files, I want to delete the entire file after it is 30 days old. The customer files, I want to anonymise by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iteratively look through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I'd plan to use pyspark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for making date comparisons to decide if my data is to be not retained, I can either use the folder structures for the dates (but I don't know how to do this), or there's a "RowStartDate" in every parquet file that I can use too.
Can anybody help point me in the right direction of how to achieve what I wish, either by the route I'm alluding to above (pyspark script to iterate through container folders, add data to data frame, transform, then overwrite original file) or any other means.

Bulk load causing schema drift in files (adf pipeline or mapping dataflows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like this yyyy,mm,dd (2021/08/19) in the dd level I have 24 files (one for each hour)
The columns in each day file are pretty wide, anything up to 250 columns. It has increased and decreased over time, hence there being schema drift when trying to load into SQL using mapping dataflows that made the file larger.
Around 200 of those columns I require and I know what they are. I even have them in a schema template. The rest are legacy or unwanted
I'd like to retain the original files in blob as they are, but load files with those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping dataflows to pick up All files in blob so I don't have to iterate creating multiple clusters or a foreach
I'm not even sure how to handle this or whether it should be a copy activity or a mapping df
both have their benefits but I think I can only use mapping df if I need to transform parts of these files in depth.
should I be combining the months or even years into a single file then trying to read from this files so I can exclude the additional from the columns I want to take into SQL server.
ideally this is a bulk load that need some refinement when it lands.
Thank in advance
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the data flow Select transformation to choose the columns that you wish to pick from each file schema.

Modifying number of output files per write-partition with spark

I have a data source that consists of a huge amount of small files. I would like to save this partitioned by column user_id to another storage:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is I want another system to be able to delete only select users' data upon request.
This works, but, I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally simply to one (the process will run each day, so having an output file per user per day would work well).
How do I obtain this with pyspark?
You can use repartition to ensure that each partition gets one file
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This will make sure for each partition one file is created but in case of coalesce if there are more than one partition it can cause trouble.
Just add coalesce and no. of file you want.
sdf.coalesce(1).write.partitionBy("user_id").json("...")

EF Core 3.1: Update a single record

What would be the best solution for building a number sequence generator. I've to apply several different text formats to a sequence of numbers. I would like to save the current number values in a own table. When I need to draw a new number, I would like to increase the number in my table and save the new number to my table. But I have to make it independent from other (bigger) transactions, which may be running at the same time.
For better understanding my issue: I've to import a batch of sales orders. During the import I've to generate a new number sequence value, but it will happen quite in the middle of a transaction. After processing some header information I can now generate my number from the sequence. At this point I've to save this in the database, so that the next user has to draw a new number. After the number generation I've to process the order lines. Only if this is finished successful, then I'm allow to save the whole sales order in database. If errors occur, the whole sales order will be retried to be imported manually later. The used number of the sequence must not be used again.
So I need a possibility to save single records from the ChangeTracker instead of writing all modifications at once. Any idea how to deal with it?

Working with huge csv files in Tableau

I have a large csv files (1000 rows x 70,000 columns) which I want to create a union between 2 smaller csv files (since these csv files will be updated in the future). In Tableau working with such a large csv file results in very long processing time and sometimes causes Tableau to stop responding. I would like to know what are better ways of dealing with such large csv files ie. by splitting data, converting csv to other data file type, connecting to server, etc. Please let me know.
The first thing you should ensure is that you are accessing the file locally and not over a network. Sometimes it is minor, but in some cases that can cause some major slow down in Tableau reading the file.
Beyond that, your file is pretty wide should be normalized some, so that you get more row and fewer columns. Tableau will most likely read it in faster because it has fewer columns to analyze (data types, etc).
If you don't know how to normalize the CSV file, you can use a tool like: http://www.convertcsv.com/pivot-csv.htm
Once you have the file normalized and connected in Tableau, you may want to extract it inside of Tableau for improved performance and file compression.
The problem isn't the size of the csv file: it is the structure. Almost anything trying to digest a csv will expect lots of rows but not many columns. Usually columns define the type of data (eg customer number, transaction value, transaction count, date...) and the rows define instances of the data (all the values for an individual transaction).
Tableau can happily cope with hundreds (maybe even thousands) of columns and millions of rows (i've happily ingested 25 million row CSVs).
Very wide tables usually emerge because you have a "pivoted" analysis with one set of data categories along the columns and another along the rows. For effective analysis you need to undo the pivoting (or derive the data from its source unpivoted). Cycle through the complete table (you can even do this in Excel VBA despite the number of columns by reading the CSV directly line by line rather than opening the file). Convert the first row (which is probably column headings) into a new column (so each new row contains every combination of original row label and each column header plus the relevant data value from the relevant cell in the CSV file). The new table will be 3 columns wide but with all the data from the CSV (assuming the CSV was structured the way I assumed). If I've misunderstood the structure of the file, you have a much bigger problem than I thought!