How to filter Wikidata dump for a language? - pyspark

I've downloaded the Wikidata truthy dump in RDF format (.nt.bz2 file). I want to limit the language of the dump to English only and generate this new filtered dump as a new .nt file.
I've tried using parallel grep to filter lines with '#en' text, but it consumes a lot of processing time.
Is there some much faster way to generate filtered dumps? Something like using Spark?

Maybe it is a bit late for you, but meanwhile a tool was generated to create custom dumps: https://tools.wmflabs.org/wdumps/
With this tool you can online define a language filter and then download an .nt file with only the relevant triples.

Related

Compare tables to ensure non regression in postgresql

Here is my issue: I often need to compare the same postgresql tables (or views that depend on it) between some ETL code refactoring to check for non regressions in my developments.
Let's say I have an ETL code I want to refactor, which regularly uploads data in a table. Currently, once my modifs are done, I often download my data from postgresql as a .csv file as a first step, then empty it, fill it again using my refactored code, and download the data again. Then, I compare the .csv files using for instance Python in a Jupyter Notebook.
That does not seem like the way to go at all. That notably supposes I am the only one to use that table during the operation, and so many other things I can't list them all here.
Is there a better way to go ?
It sounds to me like you have the correct approach. There's no magic to the CSV export operation: whatever tool you use runs a query and formats its resultset into the file. Any other before-and-after comparison operation would have to run the same query.
If you're doing this sort of regression test on an active database, it's probably wise to put some sort of distinctive tag on your test records, maybe prepend ETLTEST- to your customer names, so it's ETLTEST-John Bull. Then you can make your queries handle only your test records. And make sure you do something reliable for ORDER BY.
Juptyer seems a complex way to diff your csv files. Most operating systems have lightweight fast difftools.

Sorting file contents idiomatically with spring-batch

I have a CSV file with a number of fields. What is an idiomatic way to read the file, sort the file using a subset of fields, and then write another CSV as output.
Should I even attempt to do this in spring-batch? I understand that *nix-based OSes have the sort utility to do this, but I'd like to contain all my work within spring batch if possible.
The Batch Processing Strategies section of the documentation seems to suggest that might be standard utility steps to accomplish this:
In addition to the main building blocks, each application may use one or more of standard utility steps, such as:
Sort: A program that reads an input file and produces an output file where records have been re-sequenced according to a sort key field in the records. Sorts are usually performed by standard system utilities.
But I am not able to locate this. Any pointers most welcome!
Thanks very much!
Unless you really should do it inside Spring Batch I would suggest you do it with OS based commands.
But your point is correct, adding intermediary Steps to your Jobs to Sort/Filter or even clean DATA is a mainstream pattern used in Batch Processing or ETL Jobs.
Hope this helps.
I found out that there is a SystemCommandTasklet that is meant to run OS commands. This can be used to do things like sorting, finding unique items, etc.

How to show data from the RDF archive in scala flink

I am looking for an approach to load and print data from .n3 files of .tar.gz archive in scala. Or should I extract it?
If you want to download the file, it is located on
http://wiki.knoesis.org/index.php/LinkedSensorData
Could anyone describe how can I print the data on the screen from this archive using scala?
The files that you are dealing with are large. I therefore suggest you import it into an RDF store of some sort rather than try and parse it yourself. You can use GraphDB, Blazegraph, Virtuso and the list goes on. A search for RDF stores should give many other options. Then you can use SPARQL to query the RDF store (which is like SQL for relational databases).
For finding a Scala library that can access RDF data you can see this related SO question, though it does not look promising. I would suggest you look at Apache Jena, a Java library.
You may also want to look at the DBPedia Extraction Framework where they extract data from Wikipedia and store it as RDF data using Scala. It is certainly not exactly what you are trying to do, but it could give you insight into the tools they used for generating RDF from Scala and related issues.

Load and perform search on large amount of data

I need a suggest how to operate with large amount of data on iPhone. Let say I have xml file with ~120k text records. I need to perform search on this data. The solution i have tried is to use Core Data to store information in sorted order in caches. And then use binary search which works fast. But the problem is to build this caches. On first launch application takes about 15-25 seconds to build this caches. Maybe I need to use different approach to search the data?
Thanks in advance.
If you're using an XML file with the requirement that you can't cache, then you're not going to succeed unless you somehow carefully format your XML file to have useful data traversal properties -- but then you may as well use a binary file that's more useful unless you have some very esoteric requirements.
Really what you want is one of the typical indexing algorithms (on disk hash, B-tree, etc) from the get-go.
However...
If you have to read in and parse your XML text file, then you can skirt using a typical big and slow generic XML parser and write a fast hackish version since most of the data records you'll need to recognize are probably formatted the same way over and over. Nothing special, just find where the relevant data fields start, grab the data until it ends, move on to the next data field.
Honestly, 120k of text isn't very much-- sounds like whatever XML parser you're using is just slow. (I use this trick all the time for autogenerated XML data that just represents things like tables or simple data records -- my own parser is faster than any generic XML parser.)
This is probably the solution you actually want since you sound fairly attached to the XML file format. It won't be as error-proof as a generic XML parser if you're not careful, however it will eat that 120KB file up like nobody's business. And it's entry level CS work -- read in a file with certain specific formatting and grab the data values from it. Regexps are your friend if you have access to them.
Try storing and doing your searches in the cloud. (using a database stored on a server somewhere)
Unless you specifically need ALL of the information on the device..

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of anyone that sees this. I've been doing some Sawtooth survey programming at work and I've been needing to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl and the server I use does not have PHP so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export into CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), and MS Excel can open them directly, and they're still plaintext so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions or just taking substrings of each line one at a time, assuming you know the width of the fields. Your question is too general to give much better advice, sorry.
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly though the easiest thing to do if you're posting statistical data on a website is get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.