Is there a way to pull only metadata and a specific range from GRIB files in a cloud bucket?

There is a huge and growing abundance of weather data available in cloud buckets. Awesome! However, it is often not stored in cloud-optimized formats. I was wondering if there was a way to pull only the metadata from GRIB2 files stored on AWS, and subsequently pull only single points from those files. Same question for NetCDF4. I know the NetCDF4 libraries let you do this for files on disk, but I have no idea how to do it in the cloud.
I'm at a loss for which resources I should be looking into in order to explore this. Any help would be really appreciated.

You could parse the GRIB2 file on the fly and drop everything you don't need right away. Each GRIB2 file contains one or more GRIB2 messages, each of which has the following structure:
Section 0: Indicator Section
Section 1: Identification Section
Section 2: Local Use Section (optional)
Section 3: Grid Definition Section (can be repeated)
Section 4: Product Definition Section (can be repeated)
Section 5: Data Representation Section (can be repeated)
Section 6: Bit-Map Section (can be repeated)
Section 7: Data Section (can be repeated)
Section 8: End Section
In GRIB2, Section 0 is always 16 bytes and Section 8 is always 4 bytes. Every other section starts with the length of the section (4 bytes) followed by the section number (1 byte), so it is easy to skip quickly over the sections you don't need. You can then read only Section 1, 3 or 5, depending on what metadata you want.
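Because each section announces its own length, a crawler only needs a couple of small reads per section. Here is a minimal sketch in Python, assuming a local (already downloaded) GRIB2 file; the file name is a placeholder, error handling is left out and GRIB1 is not handled:
import struct

def iter_grib2_sections(path, keep=(1, 3, 5)):
    # Walk a GRIB2 file message by message and yield (section_number, body)
    # for the sections listed in `keep`, seeking past everything else.
    with open(path, "rb") as f:
        while True:
            indicator = f.read(16)                  # Section 0: always 16 bytes in GRIB2
            if len(indicator) < 16:
                break                               # end of file
            if indicator[:4] != b"GRIB" or indicator[7] != 2:
                raise ValueError("not a GRIB2 message")
            while True:
                head = f.read(4)
                if head == b"7777":                 # Section 8: End Section, 4 bytes
                    break
                length = struct.unpack(">I", head)[0]
                number = f.read(1)[0]               # 1-byte section number
                if number in keep:
                    yield number, f.read(length - 5)    # hand Sections 1/3/5 back for parsing
                else:
                    f.seek(length - 5, 1)           # skip the rest of this section

for number, body in iter_grib2_sections("example.grib2"):   # hypothetical file name
    print("section", number, "->", len(body) + 5, "bytes")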
There is a drawback, however. If I understand correctly, you want to do this on online resources. In that case you will still download the whole file while skipping over some or most of its parts.
If you are trying to build some kind of index of available GRIB data, this would probably be one option. Kind of a GRIB crawler.
Note that GRIB1 has a slightly different structure.
More details about GRIB2 sections: https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/

Related

How to quickly locate which sheets/dashboards contain a field?

I am creating a data dictionary and I am supposed to track the location of any used field in a workbook. For example (superstore sample data), I need to specify which sheets/dashboards have the [sub-category] field.
My dataset has hundreds of measures/dimensions/calculated fields, so it's incredibly time-consuming to click into every single sheet/dashboard just to see if a field exists in there. Is there a quicker way to do this?
One robust, but not free, approach is to use Tableau's Data Catalog, which is part of the Tableau Server Data Management Add-On.
Another option is to build your own cross-reference. You could start with Chris Gerrard's Ruby libraries described in the article http://tableaufriction.blogspot.com/2018/09/documenting-dashboards-and-their.html
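If the workbook file itself is at hand, a crude do-it-yourself cross-reference can also be scripted against the .twb XML. A rough sketch in Python; the file name, the field name and the assumption that fields appear as bracketed names inside each worksheet element are mine, not something from the article (a packaged .twbx would need to be unzipped first):
import xml.etree.ElementTree as ET

def sheets_using_field(twb_path, field_name):
    # Crude text-level check: report every worksheet whose XML mentions the field.
    hits = []
    for ws in ET.parse(twb_path).getroot().iter("worksheet"):
        if field_name in ET.tostring(ws, encoding="unicode"):
            hits.append(ws.get("name"))
    return hits

print(sheets_using_field("superstore.twb", "[Sub-Category]"))   # file/field are placeholders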

How can I insert a set of pages into all existing documents within an existing batch in Kofax Capture

We are using Kofax Capture 11.1
I have a use case where a business unit will give us 2 stacks of paper:
Stack 1: Single page cover sheets printed from accounting software with all required index values that will be processed with zonal recognition. Anywhere from 50-200 of these at a time.
Stack 2: A 50-100 page document that is the supporting documentation that covers each/all of the items in stack one.
I'm trying to reduce manual work by having users scan Stack 1 and do the separation into the 50-200 individual documents within the batch. THEN, I want them to be able to scan the second document and have it automatically inserted into or associated with each of those original documents in the batch without having to scan it for each document or manually copy/paste it.
The base idea is to have those cover sheets inserted before the actual document.
So take the page from stack one and put it before the beginning of the document in stack 2.
Continue the preparation for all other documents.
If you set up your separation correctly to use the cover sheets, then everything should be done automatically. (In most cases you can even choose whether you want the cover sheet to be exported with the actual document.)
If you are stuck on having to scan stack one in full and then stack 2, then you should look into the foldering options in Kofax Capture.
Documents will be sorted as you want, but they will still be exported as separate documents. (Alternatively, you can merge them with a custom module or a custom export.)

How to get the row count of an Excel file in Unix or Linux in DataStage

Is there any possibility I can get the count of an Excel file in Unix or Linux?
I tried creating a server routine and I am able to get the output link count, but I am unable to get the return value into a file or variable.
Please suggest.
There is a lot of information (besides the data itself) that can be extracted from an Excel file - check out the documentation.
I could imagine two options for your goal (both sketched below):
check a certain cell which would be filled/exist if the file is not empty
check the file size of an empty file, and if the file size is bigger than that, the file is not empty
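If a script on the Unix side is acceptable, both options fit in a few lines of Python with openpyxl; the file names and the "header in row 1" layout are assumptions on my part:
import os
from openpyxl import load_workbook

XLSX = "input.xlsx"                                  # hypothetical file name

# Option 1: open the workbook and look at a cell / count the rows below the header
ws = load_workbook(XLSX).active
print("empty" if ws["A2"].value is None else f"{ws.max_row - 1} data rows")

# Option 2: compare the file size against that of a known empty template
EMPTY_SIZE = os.path.getsize("empty_template.xlsx")  # an empty copy of the same template
print("has data" if os.path.getsize(XLSX) > EMPTY_SIZE else "empty")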
The best approach would be:
1. Use the 'Unstructured Data' stage from the File section in the palette to read the xlsx (Excel) file.
2. Use an Aggregator stage to take the count of rows; consider any column which will always be present whenever data comes into the file.
3. Write this count into a file.
4. Use this as per your logical requirement.
After point 2 you can use a Transformer stage to handle the count logic, whichever you find suitable.

Importing data from postgres to cytoscape

I have been trying to load some GIS data from a PostGIS database into Cytoscape 3.6. I am trying to get some inDegree and outDegree values, and I have used the SIF file format.
As long as the data is written out in the following format
source_point\tinteracts with\ttarget_point
Cytoscape is happy to read it.
I am just wondering if there is any way of including my own metric for the cost of getting between source_point and target_point.
Sure! There are several ways to read text files into Cytoscape -- SIF is just one of them. I would create a file that looks like SIF, but is actually a more complete text file:
Source\tTarget\tScore
source_point\ttarget_point\t1.1
...
And then use "File->Import Network->File", choose your source and target, and leave the score as an edge attribute. You can have as many attributes on each line as you want, and can even mix edge attributes, source node attributes, and target node attributes.
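For example, a small script could dump the edges (with your own cost metric as the Score column) straight out of Postgres into such a tab-delimited file. A sketch in Python with psycopg2; the connection string, table and column names are placeholders for your PostGIS schema:
import csv
import psycopg2

conn = psycopg2.connect("dbname=gis user=me")        # adjust the connection string
cur = conn.cursor()
cur.execute("SELECT source_point, target_point, cost FROM edges")   # hypothetical table

with open("network.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["Source", "Target", "Score"])   # header row for the import dialog
    for source, target, cost in cur:
        writer.writerow([source, target, cost])      # Score becomes an edge attribute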
-- scooter

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of Excel files where there is a line of text (and a blank row) above the header row for the table.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And end up with the data from multiple similar files
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic: you can add more files to process, but you need to know how many files in advance of building your job; and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date.
But here is how my sample works. Each Excel is set up as two inputs. For the header, the tFileInput_Excel is configured to read only the first line, while the body tFileInput_Excel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for the Machine A Excel and the Machine B Excel, then those tMaps are combined with a tUnite for the final output.
As you can see in the log row, the data is combined and includes the header info.
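For reference, the parsing step itself is small; a sketch of the same idea outside Talend, in Python with openpyxl, assuming each file follows the layout shown in the question (the file names and the regex are my own assumptions):
import re
from openpyxl import load_workbook

def rows_with_machine_info(path):
    # Assumes the layout shown above: row 1 = "This file was created on machine A on 01/02/2013",
    # row 2 = blank, row 3 = header, data from row 4 onwards.
    ws = load_workbook(path).active
    m = re.search(r"created on (machine \w+) on (\S+)", str(ws["A1"].value))
    machine, date = m.group(1), m.group(2)
    header = [c.value for c in ws[3]]                          # Task, Quantity, ErrorRate
    for row in ws.iter_rows(min_row=4, values_only=True):
        yield dict(zip(header, row), Machine=machine, Date=date)

# Append several files into one combined table (file names are placeholders)
combined = [r for path in ("machineA.xlsx", "machineD.xlsx")
            for r in rows_with_machine_info(path)]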