Regarding the model output options in mallet:
--output-model [FILENAME]
--output-state [FILENAME]
--output-doc-topics [FILENAME]
--output-topic-keys [FILENAME]
Is there a specification of the text file format (i.e. which column corresponds to which value) that goes beyond this general description?
The output format of these 2 files
--output-doc-topics [FILENAME]
--output-topic-keys [FILENAME]
is a CSV-style text file (tab-separated values). It is really easy to read off what is going on in these two files; what is a little unusual is that the topics are sorted by strength, which is why the topic numbers are a necessary part of the doc-topics file.
The other two files
--output-model [FILENAME]
--output-state [FILENAME]
is "Java serialization data, version 5" (output from the UNIX file command); I am not aware of a deeper documentation of the details.
Please edit if you find something useful!
--output-topic-keys: The first column is the topic ID number, corresponding to the original order in which each label first appeared in the training data. The second column is the label string. The third column is the total number of tokens assigned to that topic at the particular Gibbs sampling state where we stopped. The last column is a space-delimited list of 20 words in descending order by probability in the topic.
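For completeness, here is a minimal sketch of reading those two tab-separated files in plain Scala (no MALLET dependency). The column layout follows the descriptions above and is an assumption on my part; it can differ between MALLET versions, so adjust the offsets if yours does.

import scala.io.Source

case class TopicKey(topicId: Int, label: String, tokens: Long, topWords: Seq[String])

// --output-topic-keys: assumes the four tab-separated columns described above.
def readTopicKeys(path: String): Seq[TopicKey] =
  Source.fromFile(path).getLines().map { line =>
    val cols = line.split("\t")
    TopicKey(cols(0).toInt, cols(1), cols(2).toLong, cols(3).split(" ").toSeq)
  }.toSeq

// --output-doc-topics: assumes doc id, doc name, then (topic, proportion) pairs
// sorted by strength; some versions start the file with a "#" header line.
def readDocTopics(path: String): Map[String, Seq[(Int, Double)]] =
  Source.fromFile(path).getLines()
    .filterNot(_.startsWith("#"))
    .map { line =>
      val cols = line.split("\t")
      val pairs = cols.drop(2).grouped(2).collect { case Array(t, p) => (t.toInt, p.toDouble) }.toSeq
      cols(1) -> pairs
    }.toMap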
I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the outcome of the process.
You can do this, but it takes a bit of extra work in your recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, you can filter rows to keep only those that match whichever value between 1 and 1,000 you choose, so your output will contain only your randomized subset. Just have the last part of the recipe remove this temporary column.
So indeed there are two approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number within a range: randbetween or rand.
These methods can be used to slice an "even" portion of your data. As suggested, add a randbetween(1,1000) column, then filter on a single value x (1 ≤ x ≤ 1000), because that keeps 1/1000 of the data (a million rows out of a billion).
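As a conceptual illustration only (plain Scala, not Dataprep recipe syntax; the input file name is made up), the random-sampling idea looks like this:

import scala.io.Source
import scala.util.Random

// Draw a random integer in 1..1000 per row and keep only the rows that drew
// one particular value; this retains roughly 1/1000 of the input.
val kept = Source.fromFile("billion_rows.csv").getLines()
  .filter(_ => Random.nextInt(1000) + 1 == 1)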
Alternatively, if you just want a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows regardless of how many rows there are,
you can use the row-filtering methods instead (Top rows / Range).
P.S. Using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as in the first scenario) in one step, i.e., without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the "Search transformation" pane (accessed via Ctrl+K). By searching for "filter", you'll get most of the relevant options for your problem.
Cheers!
I am pushing pipe (|) delimited CSV files from one storage account in Azure to another storage account using the ORC file format, but it throws an error:
Error found when processing 'Csv/Tsv Format Text' source 'time.csv' with row number 122277 found more columns than expected column count
How do I solve this error?
'Csv/Tsv Format Text' source 'time.csv' with row number 122277 found more columns than expected column count
Based on the error, it indicates that your data violates the third rule in the list below, which is mentioned in this link:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
You need to check whether the source columns in row 122277 are split into extra pieces by additional | delimiters (for example, a field value that itself contains a pipe), so that the row cannot be mapped to the sink columns.
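If you can read the file locally, a minimal sketch like the following (plain Scala, reusing the time.csv name from the error message) can help locate the offending rows; it splits with limit -1 so trailing empty fields are still counted:

import scala.io.Source

// Compare each line's pipe-delimited column count against the header line.
val lines = Source.fromFile("time.csv").getLines().toVector
val expected = lines.head.split("\\|", -1).length

lines.zipWithIndex.drop(1).foreach { case (line, idx) =>
  val found = line.split("\\|", -1).length
  if (found != expected)
    println(s"Line ${idx + 1}: expected $expected columns, found $found")
}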
I am having trouble importing a CSV file. I get the following error: File "filename.csv" not recognised as an 'CSV data files' file. Reason: Attributes names are not unique! Causes: '2' '1'.
Can anyone tell me how to fix this issue? I am using Weka 3.8 on a Windows 10 64-bit laptop.
Thanks in advance.
Just make sure that every column name is unique with respect to the attribute values. This happened to me when I applied StringToWordVector and got string attributes with the same names as my column names. Just give the columns distinct names :)
WEKA will assume that the first row of data contains the names of the columns, but the version of the NSL-KDD Cup dataset that I looked at on GitHub did not have column headers. Since the first row had some repeated values, you get this error message. I will suggest two solutions.
Use the Weka-friendly ARFF file with the data that the above-noted GitHub repository provides.
Add column headers to the CSV file. What should the column headers be? They are listed in the ARFF file. :-)
This happens when the same attribute name appears in more than one column of the spreadsheet. Just rename the columns that share a name so that each one is unique. This worked for me.
I was getting the same error when I loaded a dataset into Weka. When I examined the columns of the dataset, I found a duplicated column name. When I renamed one of the two columns called 'fwd header length', the error was fixed.
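If you want to spot the duplicates before loading the file, a minimal sketch like this (plain Scala; the file name and the comma delimiter are assumptions) lists repeated header names:

import scala.io.Source

// Read the header row and report any column name that appears more than once.
val header = Source.fromFile("filename.csv").getLines().next().split(",").map(_.trim)

header.groupBy(identity).foreach { case (name, hits) =>
  if (hits.length > 1) println(s"Duplicate column name: '$name' (${hits.length} times)")
}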
I'm new to stackoverflow (regular reader, but I want to participate now). I'm also new to Scala and Spark and functional programming. Looking forward to contributing and learning on all fronts.
My question:
I am working with a variable-record-length file (multiple sections in the file) with fixed-position fields (aka fixed width, where the format is specified by column widths). For example, in myfile.txt the layout (positions starting at 1) of the sub-header-a1 to sub-footer-z1 section is: 1-5 = column 1, 5-6 = column 2, 6-20 = column 3 and 20-28 = column 4; whereas the sub-header-a2 to sub-footer-z2 section has an entirely different layout: 1-3 = column 1, 3-6 = column 2 and 6-11 = column 3.
myfile.txt example:
header
sub-header-a1
1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101
sub-footer-z1
sub-header-a2
1203400001
4302100001
sub-footer-z2
footer
Using Spark/Scala I want to select the sub-header-a1 to sub-footer-z1 section into one RDD and the other section into a second RDD for further processing (minus the sub-headers/footers). Two separate RDDs should be created from baseRDDinput.
First RDD
1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101
Second RDD
1203400001
4302100001
I have searched high and low for code examples of selecting a range from a base RDD and transforming it into another RDD. I found this, but I have an RDD of strings and I don't get the RangePartitioner part. All the other file-reading examples I found are for CSV and don't have nested sections.
Here's what I have so far:
// created a base RDD from raw file, I assumed that I need an index
val baseRDDinput = sc.textFile("myfile.txt").zipWithIndex()
// get the start and end point of my range
val (start, end) = ("sub-header-a1", "sub-footer-z1")
// get the index of start and end point
???
// iterator over index in order (index is stable based on comments https://stackoverflow.com/questions/26828815/how-to-get-element-by-index-in-spark-rdd-java) and select elements between start and end index and create RDD-1 then do the same with next section.
???
// next based on code examples from (https://stackoverflow.com/questions/8299885/how-to-split-a-string-given-a-list-of-positions-in-scala) I will parse the element and make k/v using the first column of file as the key
Any suggestions on approach and/or code would be greatly appreciated. I just need a nudge in the right direction. Thanks in advance.
UPDATE: fixed links
The approach of storing the data this way is wrong; you should have two separate files instead of putting everything into a single file.
In this article (http://0x0fff.com/spark-hdfs-integration/) I have an example of using Hadoop InputFormats to read data from HDFS in Spark. You need to use org.apache.hadoop.mapreduce.lib.input.TextInputFormat; it returns a pair of values, the first a Long representing the offset from the beginning of the file, the second the line from the file itself. This way you can find the lines with the "sub-header-a1", "sub-footer-z1", etc. values, remember their offsets, and filter on top of them to form two separate RDDs.
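For reference, here is a minimal sketch of the simpler zipWithIndex route from the question (not the TextInputFormat approach described above). It is spark-shell style, assumes sc is an available SparkContext, and assumes each marker line appears exactly once:

val indexed = sc.textFile("myfile.txt").zipWithIndex()   // RDD[(line, line index)]

// Index of the first line that equals the given marker.
def markerIndex(marker: String): Long =
  indexed.filter { case (line, _) => line.trim == marker }.first()._2

val (a1, z1) = (markerIndex("sub-header-a1"), markerIndex("sub-footer-z1"))
val (a2, z2) = (markerIndex("sub-header-a2"), markerIndex("sub-footer-z2"))

// Keep only the data lines strictly between each header/footer pair.
val firstRDD  = indexed.filter { case (_, i) => i > a1 && i < z1 }.map(_._1)
val secondRDD = indexed.filter { case (_, i) => i > a2 && i < z2 }.map(_._1)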
I've started to do some programming in ILE RPG and I'm curious about one thing: what exactly is a record format? I know that it has to be defined in physical/logical/display files, but what exactly does it do? In an old RPG book from '97 I found that "Each record format defines what is written to or read from the workstation in a single I/O operation".
In another book I found a definition saying that a record format describes the fields within a record (so, for example, their length and type, like character or decimal?).
And lastly, what exactly does it mean that "every record within a physical file must have an identical record layout"?
I'm a bit confused right now. Still not sure what a record format is :F.
Still not sure what a record format is :F
The F specification: this specification is also known as the File specification. Here we declare all the files that we will be using in the program. The files might be physical files, logical files, display files or printer files. Message files are not declared in the F specification.
what exactly does it mean that "every record within a physical file must have an identical record layout"?
Each and every record within one physical file has the same layout.
Let's make a record layout of 40 characters.
----|---10----|---20----|---30----|---40
20150130 DEBIT 00002100
20150130 CREDIT 00012315
The bar with the numbers is not part of the record layout. It's there so we can count columns.
The first field in the record layout is the date in yyyymmdd format. This takes up 8 characters, from position 1 to position 8.
The second field is 2 blank spaces, from position 9 to position 10.
The third field is the debit / credit indicator. It takes up 10 characters, from position 11 to position 20.
The fourth field is the debit / credit amount. It takes up 8 positions, from position 21 to position 28. The format is assumed to be 9(6)V99. In other words, there's an implied decimal point between positions 26 and 27.
The fifth field is more blank spaces, from position 29 to position 40.
Every record in this file has these 5 fields, all defined the same way.
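To make the implied decimal point concrete, here is a minimal sketch in Scala (not RPG; the names are hypothetical) that parses one such 40-character record:

case class Transaction(date: String, indicator: String, amount: BigDecimal)

def parseRecord(rec: String): Transaction = {
  val date      = rec.substring(0, 8)          // positions 1-8: date in yyyymmdd
  val indicator = rec.substring(10, 20).trim   // positions 11-20: debit / credit indicator
  val raw       = rec.substring(20, 28)        // positions 21-28: 9(6)V99, no decimal point stored
  Transaction(date, indicator, BigDecimal(raw) / 100)   // re-insert the implied decimal point
}

// Applied to the first sample record, this yields date 20150130, indicator DEBIT and amount 21.00.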
A "record format" is a named structure that is used for device file I/O. It contains descriptions of each column in the 'record' (or 'row'). The specific combination of data types and sizes and the number and order of columns is hashed into a value that is known as the "record format identifier".
A significant purpose is the inclusion by compilers of the "record format identifier" in compiled program objects for use when the related file is opened. The system will compare the format ID from the program object to the current format ID of the file. If the two don't match, the system will notify the program that the file definition has changed since the program was compiled. The program can then know that it is probably going to read data that doesn't match the definitions that it knows. Nearly all such programs are allowed to fail by sending a message that indicates that the format level has changed, i.e., a "level check" failed.
The handling of format IDs is rooted in the original 'native file I/O' that pre-dates facilities such as SQL. It is a part of the integration between DB2 and the various program compilers available on the system.
The 'native' database file system was developed using principles that eventually resulted in SQL. A SQL table should have rows that all hold the same series of column definitions. That's pretty much the same as saying "every record within a physical file must have an identical record layout".
Physical database files can be thought of as being SQL tables. Logical database files can be thought of as being SQL views. As such, all records in a physical file will have the same definitions, but there is some potential variation in logical files.
A record format is something you learn in the old school: you read a file (table) and update/write through a record format.
DSPFD FILE(myTable)
Then you can see everything about the file. The record format name is in there.
New or young developers believe that every record in a physical file must be identical, but in ancient times, when dinosaurs walked the earth, a single file could have several types of records, or "record formats"; so, as the name indicates, a record format is the format of a record within a file.