Dataprep "create dataset with parameters" does not take all files - google-cloud-storage

I am struggling with having files in the way that Dataprep intends to have, when importing a parameterized dataset from Google Cloud Storage (GCS).
To be specific:
I store .csv files on GCS in a location:
/20190807/file1.csv
/20190807/file2.csv
/20190807/file3.csv
/20190808/file1.csv
/20190808/file2.csv
/20190808/file3.csv
/20190809/file1.csv
/20190809/file2.csv
/20190809/file3.csv
...
Then I create a Dataset with Parameters on this location using a *wildcard.
In addition:
Encoding: I apply detect automatic structure, and I select UTF-8 (as I store all my files with this encoding).
Columns: I also make sure that all files have the same columns.
Problem:
However, for some reason or another, depending how the file has been saved I guess, Dataprep does not take all the files when importing them. When I take those two files, I cannot identify what is different about them. Both have been saved as a type application/octet-stream and I apply a UTF-8 encoding.
As a consequence, when I export my output after the dataset is wrangled in a flow, I miss some dates (eg 20190808).
Is there therefore a tool that I can compare these two files, to see what is different about them, in order to prevent these things of happening. It is not an option to store them in different locations as I do not know in advance which files will be different.
I am really surprised about this shortcoming, and it would be great to have somehow a way to only check the columns for each files instead of also checking other "hidden" differences.

Related

Retention scripts to container data

I'm trying to do something to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
That's basically every day a new file is added, using the copy task from azure data factory. There are in reality more than 3 files per day.
I want to start applying different retention policies to different files. For example, the emails.parquet files, I want to delete the entire file after it is 30 days old. The customer files, I want to anonymise by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iteratively look through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I'd plan to use pyspark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for making date comparisons to decide if my data is to be not retained, I can either use the folder structures for the dates (but I don't know how to do this), or there's a "RowStartDate" in every parquet file that I can use too.
Can anybody help point me in the right direction of how to achieve what I wish, either by the route I'm alluding to above (pyspark script to iterate through container folders, add data to data frame, transform, then overwrite original file) or any other means.

Bulk load causing schema drift in files (adf pipeline or mapping dataflows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like this yyyy,mm,dd (2021/08/19) in the dd level I have 24 files (one for each hour)
The columns in each day file are pretty wide, anything up to 250 columns. It has increased and decreased over time, hence there being schema drift when trying to load into SQL using mapping dataflows that made the file larger.
Around 200 of those columns I require and I know what they are. I even have them in a schema template. The rest are legacy or unwanted
I'd like to retain the original files in blob as they are, but load files with those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping dataflows to pick up All files in blob so I don't have to iterate creating multiple clusters or a foreach
I'm not even sure how to handle this or whether it should be a copy activity or a mapping df
both have their benefits but I think I can only use mapping df if I need to transform parts of these files in depth.
should I be combining the months or even years into a single file then trying to read from this files so I can exclude the additional from the columns I want to take into SQL server.
ideally this is a bulk load that need some refinement when it lands.
Thank in advance
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the data flow Select transformation to choose the columns that you wish to pick from each file schema.

In OpenEdge, how do you transfer parts of the data in the database in an easy way?

I have a lot of data in 2 different databases and in many different tables I would like to move from one computer into a few others. The others has the same definition of the db:s. Note, not all the data should be transfered, only some that I define. Some tables fully, and some others just partly.
How would I move these data in the easiest way? To dump each table and load separately in many .d files - is not an easy way. Could you do something similar to the Incremental .df File that contains all that has to be changed?
Dumping (and loading) entire tables is easy. You can do it from the GUI or by command line. Look at for instance this KnowledgeBase entry about command line dump & load and this about creating scripts for dumping the entire database.
Parts of the data is another story. This is very individual and depends on your database and your application. It's hard for a generic tool to compare data and tell if a difference in data depends on changed data, added data or deleted data. Different databases has different kinds of layout, keys and indices.
There are however several built in commands that could help you:
For instance:
IMPORT and EXPORT for importing and exporting data to files, streams etc.
Basic import and export
OUTPUT TO c:\temp\foo.data.
FOR EACH foo NO-LOCK:
EXPORT foo.
END.
OUTPUT CLOSE.
INPUT FROM c:\temp\foo.data.
REPEAT:
CREATE foo.
IMPORT foo.
END.
INPUT CLOSE.
BUFFER-COPY and BUFFER-COMPARE for copying and comparing data between tables (and possibly even databases).
You could also use the built in commands for doing "dump" and then manually edit the created files.
Calling Progress Built in commands
You can call the back end that dumps data from Data Administration. That will require you to extract those .p-files from it's archives and calling them manually. This will also require you to change PROPATHS etc so it's not straightforward. You could also look into modifying the extracted files to your needs. Remember that this might break when upgrading Progress so store away your changes in separate files.
Look at this Progress KB entry:
Progress KB 15884
Best way for you depends on if this is a one time or reacurring task, size and layout of database etc.

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of anyone that sees this. I've been doing some Sawtooth survey programming at work and I've been needing to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl and the server I use does not have PHP so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export into CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), and MS Excel can open them directly, and they're still plaintext so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions or just taking substrings of each line one at a time, assuming you know the width of the fields. Your question is too general to give much better advice, sorry.
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly though the easiest thing to do if you're posting statistical data on a website is get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on a NTFS partition (Windows XP). After a year in production there is over 300000 files in a single directory and the number keeps growing. This has made accessing the parent/ancestor directories from windows explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge amount of files in a single directory: once you eliminate that, you should be fine. This isn't a NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue, is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc, create a directory structure like this:
ABC\
DEF\
ABCDEFGHI.db
EFG\
ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system were all of the small files for a certain date were appended in a TAR type of fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally, and again, this requires you to be able to calculate names, you'll find that directly accessing a file is much faster than trying to open it via explorer. For example, saying
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories..
Personally, I would develop an application that keeps the interface to that folder the same, ie all files are displayed as being individual files. Then in the application background actually takes these files and combine them into a larger files(and since the sizes are always 64k getting the data you need should be relatively easy) To get rid of the mess you have.
So you can still make it easy for them to access the files they want, but also lets you have more control how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reversing them, and creating one letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical, aspects of the data you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Reading at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Reading become c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only is deep is it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )