I want to create a global dataset of wetlands using the OSM database. As there are problems for huge datasets if I use the overpass-turbo or so, I thought I could use wget to download the planet file and filter it for only the data I'm interested in. Problem is, I don't know much about wget. So I wanted to know whether there is a way to filter the data from the planet file while downloading and unzipping it?
In general I'm looking for the least time and disk-space consuming way to get to that data. Would you have any suggestions?
wget is just a tool for downloading, it can't filter on the fly. There is probably a chance that you can pipe the data to a second tool which does filtering on the fly but I don't see any advantage here. And the disadvantage is that you can't verify the file checksum afterwards.
Download the planet and filter it afterwards using osmosis or osmfilter.
Related
I have loaded a massive set of tiles into a postgres database for use in a tile server. These were all loaded into bytea columns in PNG format.
I now find out that the tile server code needs these to be in GEOTiff format.
The command:-
gdal_translate -of GTiff -expand rgb -co COMPRESS=DEFLATE -co ZLEVEL=6
Works perfectly.
However, a massive amount of data is loaded already on a remote server. So is it possible for me to do the conversion within the database instead of retrieving each file and using gdal_translate on them individually? I understand that gdal is integrated with postgis 2.0 through the raster support which is installed on my server.
If not, any suggestions as to how to do this efficiently.
Is it possible to do in the database with an appropriate procedural language? I suppose. Additionally it is worth noting up front that gdal's support for Postgis goes rather one way.
To be honest the approach is likely to be "retrieve individual record, convert, restore using external image processing like you are doing." You might get some transaction benefit but this is likely to be offset by locks.
If you are going to go this route, you may find pl/java to be the most helpful approach since you can load any image processing library Java supports and use that.
I am, however, not convinced that this will be better than retrieve/transform/load.
I will have xml file on server. It will store information of about 600 stores. information includes name, address, opening time , coordinates. So is it ok to parse whole file into iphone then select nearest stores according to coordinates?
I am thinking about processing time and memory use
Please suggest
The way I would do this is write a web service and pass it the coordinates and download only those within a certain radius. Always try to download as little data as possible to the iPhone (especially xml data)
I just put this here
http://quatermain.tumblr.com/post/93651539/aqxmlparser-big-memory-win
A simple solution would be to group them into clusters that are somehow related, probably by location. You already have an XML on a server, so simply split them up into 3 groups of related stores of around 200, or preferably even smaller. I'm not entirely sure on why you would want to store 600 data points of that nature. I feel that if you filter/shrink on the server side you could be saving a lot of time/memory.
I have seen people storing 300-400 data points, though it is so dependent on how large your defined objects in your Core Database are, that it is probably best for you to just run some tests.
I need a suggest how to operate with large amount of data on iPhone. Let say I have xml file with ~120k text records. I need to perform search on this data. The solution i have tried is to use Core Data to store information in sorted order in caches. And then use binary search which works fast. But the problem is to build this caches. On first launch application takes about 15-25 seconds to build this caches. Maybe I need to use different approach to search the data?
Thanks in advance.
If you're using an XML file with the requirement that you can't cache, then you're not going to succeed unless you somehow carefully format your XML file to have useful data traversal properties -- but then you may as well use a binary file that's more useful unless you have some very esoteric requirements.
Really what you want is one of the typical indexing algorithms (on disk hash, B-tree, etc) from the get-go.
However...
If you have to read in and parse your XML text file, then you can skirt using a typical big and slow generic XML parser and write a fast hackish version since most of the data records you'll need to recognize are probably formatted the same way over and over. Nothing special, just find where the relevant data fields start, grab the data until it ends, move on to the next data field.
Honestly, 120k of text isn't very much-- sounds like whatever XML parser you're using is just slow. (I use this trick all the time for autogenerated XML data that just represents things like tables or simple data records -- my own parser is faster than any generic XML parser.)
This is probably the solution you actually want since you sound fairly attached to the XML file format. It won't be as error-proof as a generic XML parser if you're not careful, however it will eat that 120KB file up like nobody's business. And it's entry level CS work -- read in a file with certain specific formatting and grab the data values from it. Regexps are your friend if you have access to them.
Try storing and doing your searches in the cloud. (using a database stored on a server somewhere)
Unless you specifically need ALL of the information on the device..
As part of my work we get approx 25TB worth log files annually, currently it been saved over an NFS based filesystem. Some are archived as in zipped/tar.gz while others reside in pure text format.
I am looking for alternatives of using an NFS based system. I looked at MongoDB, CouchDB. The fact that they are document oriented database seems to make it the right fit. However the log files content needs to be changed to JSON to be store into the DB. Something I am not willing to do. I need to retain the log files content as is.
As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.
Ankur
Since you dont want queriying features, You can use apache hadoop.
I belive HDFS and HBase will be nice fit for this.
You can see lot of huge storage stories inside Hadoop powered by page
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly disrecommend using a key/value or document based store for this data (mongo, cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be linear scan. One thing problem that you will run into is retention. Most of the "NoSQL" storage systems use logical delete, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
On CouchDB you can use the _attachement API to attach the file as is to a document, the document itself could contain only metadata (like timestamp, locality and etc) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFs, but you would build the API yourself.
Also HDFS is a very nice choice.
I'm guessing this won't apply to 99.99% of anyone that sees this. I've been doing some Sawtooth survey programming at work and I've been needing to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl and the server I use does not have PHP so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export into CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), and MS Excel can open them directly, and they're still plaintext so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions or just taking substrings of each line one at a time, assuming you know the width of the fields. Your question is too general to give much better advice, sorry.
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly though the easiest thing to do if you're posting statistical data on a website is get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.