I am working on a text classification system and I would like to use unigrams as features. When building the ARFF file, I declared a string attribute in which I want to list all the words contained in a message, separated by commas. However, Weka tells me it "Cannot handle string attributes". I tried defining the relation in the header with StringToWordVector, but that didn't help. How should I go about this? Many thanks!
If your ARFF file format is correct, the following code can help you:
// Assumed imports: java.io.BufferedReader, java.io.FileReader,
// weka.core.Instances, weka.classifiers.meta.FilteredClassifier,
// weka.filters.Filter, weka.filters.unsupervised.attribute.StringToWordVector
// dataSource: path of your ARFF file
BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
Instances trainInsts = new Instances(trainReader);
trainInsts.setClassIndex(trainInsts.numAttributes() - 1);
// StringToWordVector converts the string attribute into numeric word-vector attributes
StringToWordVector STWfilter = new StringToWordVector();
// FilteredClassifier applies the filter itself at build time, so it is trained on the raw instances
FilteredClassifier model = new FilteredClassifier();
model.setFilter(STWfilter);
STWfilter.setInputFormat(trainInsts);
// the converted data (only needed if you want the vectorized instances directly)
trainInsts = Filter.useFilter(trainInsts, STWfilter);
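For reference, a minimal ARFF layout for this kind of data looks like the sketch below (the relation name and class labels are placeholders). The whole message goes into a single string attribute; StringToWordVector is a filter applied to the loaded data, not something declared in the header:
@relation messages
@attribute text string
@attribute class {spam, ham}
@data
'this is the first message', spam
'this is the second message', ham
Note that FilteredClassifier applies the filter itself during buildClassifier(), so after model.setClassifier(...) you can train it directly on the unfiltered instances; the explicit Filter.useFilter(...) call is only needed if you want the vectorized data yourself.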
I want to read a CSV file from a Cloud Storage bucket and write it to a BigQuery table with columns, using Dataflow in Java. How can I use the CSV file's headers as the column names when writing to BigQuery?
There are two issues to solve here:
Skipping the header when reading the data, and
Using the header to correctly populate the BigQuery table columns.
For (1), this is, as of June 2019, not implemented natively, though you could try the options listed at Skipping header rows - is it possible with Cloud DataFlow?. For (2), the easiest approach is to read the first line of your CSV in your main program and pass the list of column names in the constructor of a DoFn that converts CSV lines into TableRow objects ready to write to BigQuery.
Your final program would look something like this:
public void csvToBigquery(String csvInputPattern, String bigqueryTable) {
    final String[] columns = readAndSplitFirstLineOfFirstFile(csvInputPattern);
    Pipeline p = Pipeline.create(...);
    p
    .apply(TextIO.read().from(csvInputPattern))
    .apply(Filter.by(new MatchIfNonHeader()))
    .apply(ParDo.of(new DoFn<String, TableRow>() {
        ... // use columns here to build TableRows
    }))
    .apply(BigQueryIO.writeTableRows().to(bigqueryTable)...);
}
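For the middle step, a minimal sketch of such a DoFn is below. The class name is mine, and it assumes plain comma-separated lines with no quoting or escaping (for real CSV parsing see the next answer); the point is simply that the column names arrive through the constructor and are serialized along with the DoFn:
static class CsvLineToTableRow extends DoFn<String, TableRow> {
    private final String[] columns;
    CsvLineToTableRow(String[] columns) {
        this.columns = columns;
    }
    @ProcessElement
    public void processElement(ProcessContext c) {
        // naive split; quoted fields containing commas would break this
        String[] values = c.element().split(",", -1);
        TableRow row = new TableRow();
        for (int i = 0; i < columns.length && i < values.length; i++) {
            row.set(columns[i], values[i]);
        }
        c.output(row);
    }
}
It would then be used as .apply(ParDo.of(new CsvLineToTableRow(columns))) in the pipeline above.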
I've done a similar task and used the Apache Commons CSV library in the ParDo function to extract the data from the CSV files, then converted the records to TableRow objects for BQ.
// Inside the DoFn's @ProcessElement method; uses org.apache.commons.csv
// (CSVFormat, CSVParser, CSVRecord) and com.google.api.services.bigquery.model.TableRow
String fileData = c.element();
BufferedReader fileReader = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(fileData.getBytes("UTF-8")), "UTF-8"));
CSVParser csvParser = new CSVParser(fileReader,
        CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
Iterable<CSVRecord> csvRecords = csvParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
    // the helper below is the author's own (assumed here to return the populated row);
    // it maps each column/value pair onto a BigQuery-compatible type
    TableRow row = checkAndConvertIntoBqDataType(csvRecord.toMap());
    c.output(row);
}
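For illustration, a hypothetical minimal version of that helper might look like the following; the pass-through typing is an assumption, and real conversion rules (dates, numerics, booleans) would go where the comment indicates:
private static TableRow checkAndConvertIntoBqDataType(Map<String, String> fields) {
    TableRow row = new TableRow();
    for (Map.Entry<String, String> field : fields.entrySet()) {
        // coerce field.getValue() to the target BigQuery type here; this sketch passes strings through
        row.set(field.getKey(), field.getValue());
    }
    return row;
}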
I am trying to find the file type in order to read the file based on its type.
Input comes in different file formats, such as CSV, Excel, ORC, etc.
For example, input => "D:\\resources\\core_dataset.csv"
I am expecting output => csv
You could achieve this as follows:
import java.nio.file.Paths
val path = "/home/gmc/exists.csv"
val fileName = Paths.get(path).getFileName // Convert the path string to a Path object and get the "base name" from that path.
val extension = fileName.toString.split("\\.").last // Split the "base name" on a . and take the last element - which is the extension.
// The above produces:
extension: String = csv
I have:
1. Raw XML filled by a select query. This XML is transformed into an HL7 message.
2. One of the tags of this XML represents a CLOB column from a table in the database.
3. I mapped this data (in the Edit Transformer section) as a variable.
4. Now I am trying to convert this variable into a Base64 string and then replace it in my transformed HL7 message.
5. I tried this conversion on a destination channel, which is a JavaScript Writer.
I read and tried several conversion methods like
Packages.org.apache.commons.codec.binary.Base64.encodeBase64String();
I only got error messages like:
EvaluatorException: Can't find method org.apache.commons.codec.binary.Base64.encodeBase64String(java.lang.String);
Code piece:
var ads=$('V_REPORT_CLOB');
var encoded = Packages.org.apache.commons.codec.binary.Base64.encodeBase64String(ads.toString());
It is pretty clear that I am a newbie at this. How can I manage to do this conversion?
Here is what I use for Base64 encoding a string, with your variable substituted. (Your original call fails because encodeBase64String() expects a byte[] rather than a java.lang.String, so Rhino cannot find a matching method signature; passing ads.toString().getBytes("UTF-8") to it should also work.)
//Encode Base 64//
var ads = $('V_REPORT_CLOB');
var adsLength = ads.length;
var base64Bytes = [];
// build a byte array from the string's character codes
for (var i = 0; i < adsLength; i++) {
    base64Bytes.push(ads.charCodeAt(i));
}
// FileUtil.encode is Mirth's built-in Base64 helper
var encodedData = FileUtil.encode(base64Bytes);
Is there a way to configure Apache Tika so that it only extracts the metadata properties from the file and does not access the content of the file? We need a way to do this to avoid reading the entire content of larger files.
The code we are using to extract is as follows:
var tikaConfig = TikaConfig.getDefaultConfig();
var metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
BodyContentHandler handler = new BodyContentHandler();
using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
{
parser.parse(stream, handler, metadata, new ParseContext());
Array metadataKeys = metadata.names();
Array.Sort(metadataKeys);
}
With the above code sample, when we try to extract the metadata, the content is also being read. We need a way to avoid that.
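One approach, sketched below in Java (the snippet above looks like it may be running Tika from .NET, so treat this as illustrative): Tika still has to parse the stream to discover metadata, but replacing BodyContentHandler with a no-op SAX handler such as DefaultHandler means the body text is discarded as it is produced instead of being buffered in memory:
import java.io.File;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataOnlyExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser(TikaConfig.getDefaultConfig());
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(new File(args[0]), metadata)) {
            // DefaultHandler ignores all content events, so nothing is accumulated
            parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}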
I have successfully opened a DBF Table:
String dbfDirectoryPath = "Z:/ESRI/data/washingtonCountyDataFiles/tlg_roads";
IWorkspaceFactory workspaceFactory = new ShapefileWorkspaceFactory();
IWorkspace workspace = workspaceFactory.OpenFromFile(dbfDirectoryPath, 0);
IFeatureWorkspace featureWorkspace = workspace as IFeatureWorkspace;
String dbfTable = "tlg_roads_l.dbf";
ITable table = featureWorkspace.OpenTable(dbfTable);
Now I want to map it, which I think entails a call to mapControl.AddLayer(layer). So I need to get an ILayer from this feature workspace somehow.
It looks like I can just CreateFeatureClass and then cast to an ILayer, but there are 6 arguments for CreateFeatureClass including CLSID so I get the impression I am missing some conceptual points. Thanks for any advice : )
//IFeatureClass featureclass = tableWorkspace.CreateFeatureClass //req six args, incl. CLSID
ILayer layer = featureclass as ILayer;
mapControl.AddLayer(layer);
Apparently DBF files are not for viewing; only shapefiles are for rendering, and the DBF holds attribute data used by the shapefile. This is what I was told, anyway; feel free to enlighten me.