Import free text files

I have been asked to do NLP on a folder of free text documents in SAS. Normally I do this in Python or R, and I am not sure how to import the .txt files into SAS because there is no structure.
I have thought about using PROC IMPORT, but I don't know what I would use as a delimiter. How can one import free text files with no structure into SAS? I suppose once I got them in I could use '%like%'-style matching to pull out what they want.

I would strongly recommend against this. Use the right tool for the right job; in this case, it's not SAS.
OK, that being said, here are some basics you could do:
Import the text files and create n-grams, ideally 1-, 2-, and 3-word.
Use PROC FREQ to summarize n-grams.
Find a part-of-speech corpus and merge it with the 1-grams to remove useless words.
Calculate word lengths and sentence lengths to create a document complexity score.
Those are all doable in Base SAS; a rough sketch of the first two steps is below.
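For instance, a minimal Base SAS sketch of the import and 1-gram counting, assuming the .txt files sit in a single folder (the path and dataset names here are made up, and the wildcard INFILE pattern works on Windows SAS):

    * Read every .txt file in the folder, one line per record;
    data words;
        length fn $260 fname $260 line $32767 word $50;
        infile "C:\docs\*.txt" filename=fn truncover lrecl=32767;
        fname = fn;                          * keep track of the source file;
        input line $char32767.;
        i = 1;
        do while (scan(line, i, ' ') ne '');
            word = lowcase(scan(line, i, ' '));
            output;
            i = i + 1;
        end;
        keep fname word;
    run;

    * 1-gram frequencies; 2- and 3-grams can be built similarly with LAG(word);
    proc freq data=words order=freq;
        tables word / maxlevels=50;
    run;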

Related

Importing a CSV file with commas in my rows into MongoDB

I really hope this question hasn't been asked and answered before, but I can't find a single clear SO answer on it, so here goes.
I am trying to import a CSV file into a MongoDB collection. The CSV file contains word definitions, which often include commas. I want to be able to store these data points in MongoDB with their commas and to read them, commas included, in a JavaScript app.
When I import the CSV into MongoDB, it reads these commas as an indication to move on to the next field. I have tried putting double quotes around each of my observations, but to no avail: this imports as three quotation marks (""") and MongoDB still takes the comma within that observation as an indication to move on to the next field.
Please help!!
(For clarity, I am using the simplified GUI through MongoDB Compass... but happy to use the command line if there is a solution!)
Example of the CSV:
You might try the command-line tool mongoimport, which has more options than the MongoDB Compass import interface.
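For example, assuming a file like this (the names below are hypothetical), with standard double-quoted fields:

    word,definition
    "ameliorate","to make something better, or to improve it"

mongoimport understands standard CSV quoting, so the embedded comma survives:

    mongoimport --db=dictionary --collection=definitions --type=csv --headerline --file=definitions.csv

One common cause of tripled quotes like the ones you saw is re-quoting an already-quoted file (e.g. a spreadsheet re-exporting quoted text); re-saving as plain CSV from the original data usually fixes it.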

How to filter Wikidata dump for a language?

I've downloaded the Wikidata truthy dump in RDF format (.nt.bz2 file). I want to limit the language of the dump to English only and generate this new filtered dump as a new .nt file.
I've tried using parallel grep to filter lines containing '@en', but it consumes a lot of processing time.
Is there a much faster way to generate filtered dumps? Something like using Spark?
Maybe it is a bit late for you, but in the meantime a tool has been built for creating custom dumps: https://tools.wmflabs.org/wdumps/
With this tool you can define a language filter online and then download an .nt file with only the relevant triples.
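If you'd rather filter locally without Spark, a single streaming pass is one option. A minimal Python sketch (file names assumed) that keeps non-literal triples plus English-tagged literals:

    import bz2

    # In N-Triples, a double quote inside a literal is escaped as \",
    # so the sequence '"@' only appears where a language tag starts.
    with bz2.open("latest-truthy.nt.bz2", "rt", encoding="utf-8") as src, \
         open("truthy-en.nt", "w", encoding="utf-8") as dst:
        for line in src:
            # Keep lines with no language tag at all, plus @en and @en-* tags.
            if '"@' not in line or '"@en ' in line or '"@en-' in line:
                dst.write(line)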

Manipulating large csv files with Matlab

I am trying to work with a large set of numerical data stored in a CSV file. It is so big that I cannot store it in a single variable, as Matlab does not have enough memory.
I was wondering if there is some way to manipulate large CSV files in Matlab as if they were variables, i.e. I want to sort them, delete some rows, find the row and column of certain values, etc.
If that is not possible, what programming language do you recommend to do that, considering that the data is stored in a matrix form?
You can import the CSV file into a database, e.g. SQLite: https://sqlite.org/cvstrac/wiki?p=ImportingFiles
Then take one of the SQLite toolboxes for Matlab, e.g. http://go-sqlite.osuv.de/doc/
You should be able to select individual rows and columns using SQL and import them into Matlab, or use SQL features for the rest (for sorting, ORDER BY, etc.).
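As a rough sketch of the import step with the sqlite3 command-line shell (the file and table names are just examples):

    $ sqlite3 data.db
    sqlite> .mode csv
    sqlite> .import data.csv samples
    sqlite> SELECT * FROM samples ORDER BY 1 LIMIT 10;

From Matlab, the toolbox then lets you run the same SELECT statements and pull back only the rows and columns you need.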
Another option is to query CSV files directly as if they were a SQL database, using q. See https://github.com/harelba/q

Import a CSV into Matlab in multiple parts

I have a very large CSV file (870 MB) that I'm trying to import into Matlab. Some of the data are numeric and some are text. I have 16 GB of RAM and an SSD, but the import wizard script uses 37 GB and doesn't progress past 0% ("scanning file") after a couple of hours.
Is there a way to break up the import wizard script to import the first 500,000 rows, save them to variables and empty dataArray, then import the next 500,000 rows and append them to the variables, and so on until the file is complete? I'm surprised that Matlab doesn't do something like this natively.
Thank you for your help.
Have a look at the memory-mapping approach I described here. If you know the format of your file, or can deduce it from the contents, I have found this to be the fastest way to read large CSV files into Matlab. It also helps reduce memory usage.
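If memory-mapping feels like overkill, the chunked approach you describe can also be done directly with textscan, which resumes from the current file position on each call. A minimal sketch (the file name and format string are assumptions; match them to your columns):

    % Read a large CSV in 500,000-row chunks.
    fid = fopen('big.csv');
    fgetl(fid);                      % skip the header line, if there is one
    fmt = '%f %f %s';                % assumed column types
    while ~feof(fid)
        C = textscan(fid, fmt, 500000, 'Delimiter', ',');
        if isempty(C{1}), break; end
        % ... process C here, or append it to your growing variables ...
    end
    fclose(fid);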

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of the people who see this. I've been doing some Sawtooth survey programming at work, and I need to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates, and how you went about it. I only know very basic Perl, and the server I use does not have PHP, so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export to CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), MS Excel can open them directly, and they're still plain text, so you can look at them in any text editor.
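If it does export CSV, a minimal Perl sketch with the Text::CSV module from CPAN (the file name is assumed) would be:

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
    open my $fh, '<', 'export.csv' or die "export.csv: $!";
    my $header = $csv->getline($fh);            # column names
    while (my $row = $csv->getline($fh)) {
        # $row is an array ref of fields; tally whatever you need here
    }
    close $fh;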
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions, or just by taking substrings of each line one at a time, assuming you know the field widths; a sketch follows. Your question is too general to give much better advice, sorry.
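For the fixed-width case, something like this (the field positions and widths are invented for illustration):

    # Fixed-width parsing with substr; widths must match your layout.
    while (my $line = <$fh>) {
        my $resp_id = substr($line, 0, 8);    # columns 1-8
        my $q1      = substr($line, 8, 4);    # columns 9-12
        my $q2      = substr($line, 12, 4);   # columns 13-16
        # ...
    }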
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly, though, the easiest thing to do if you're posting statistical data on a website is to get the data into SPSS or SAS or another statistics package's format and post those files for download directly. Then you don't have to worry about it.