How to export a CSV file to a BigQuery table using Java Dataflow?

I want to read a CSV file from a Cloud Storage bucket and write it to a BigQuery table with columns, using Dataflow in Java. How can I use the CSV file's header row to populate the BigQuery table's column names while writing?

There are two issues to solve here:
(1) skipping the header when reading the data, and
(2) using the header to correctly populate the BigQuery table columns.
For (1) this is, as of June 2019, not implemented natively, though you could try the options listed at Skipping header rows - is it possible with Cloud DataFlow?. For (2) the easiest approach is to read the first line of your CSV in your main program and pass the list of column names to the constructor of a DoFn that converts CSV lines into TableRow objects ready to write to BigQuery.
Your final program would look something like
public void csvToBigquery(String csvInputPattern, String bigqueryTable) {
    // Read the header line up front so the column names can be passed to the DoFn.
    final String[] columns = readAndSplitFirstLineOfFirstFile(csvInputPattern);

    Pipeline p = Pipeline.create(...);
    p
        .apply(TextIO.read().from(csvInputPattern))
        .apply(Filter.by(new MatchIfNonHeader()))
        .apply(ParDo.of(new DoFn<String, TableRow>() {
            ... // use columns here to turn CSV lines into TableRows
        }))
        .apply(BigQueryIO.writeTableRows().to(bigqueryTable)...);
}
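For the DoFn itself, a minimal sketch of how the captured columns array might be used (the class name, the naive comma split, and the lack of type conversion are all simplifications rather than the author's actual code; the class would sit as a static nested class inside the pipeline class):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn: pairs each CSV value with the header column names captured in the main program.
static class CsvLineToTableRow extends DoFn<String, TableRow> {
    private final String[] columns;

    CsvLineToTableRow(String[] columns) {
        this.columns = columns;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Naive split; a real pipeline should use a CSV parser to handle quoting and embedded commas.
        String[] values = c.element().split(",", -1);
        TableRow row = new TableRow();
        for (int i = 0; i < columns.length && i < values.length; i++) {
            row.set(columns[i].trim(), values[i].trim());
        }
        c.output(row);
    }
}

The ParDo step above then becomes .apply(ParDo.of(new CsvLineToTableRow(columns))); since columns is a plain String[], Beam can serialize it to the workers together with the DoFn.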

I've done a similar task and used the Apache Commons CSV library in a ParDo function to extract the data from the CSV files and then convert it to TableRow objects for BigQuery.
// Inside the DoFn's @ProcessElement method: c.element() is the CSV text for one element.
String fileData = c.element();
BufferedReader fileReader = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(fileData.getBytes("UTF-8")), "UTF-8"));
CSVParser csvParser = new CSVParser(fileReader,
        CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
Iterable<CSVRecord> csvRecords = csvParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
    // checkAndConvertIntoBqDataType is the author's own helper; it is assumed here to return
    // a TableRow populated with BigQuery-compatible values for the record's column map.
    TableRow row = checkAndConvertIntoBqDataType(csvRecord.toMap());
    c.output(row);
}

Related

Spring Batch - Comma-separated values - Save in database

I have a file which contains a list of values (user IDs) separated by commas (",") as follows.
111, 222, 333, 444, 555, 777 …………
The file contains millions of such records and I wanted to save these values into a single column in a table in RDBMS.
I tried to use DelimitedLineTokenizer for parsing data.
The issue is that DelimitedLineTokenizer considers only one entry per line, and the rest of the values are ignored. The first entry ("111") is saved and the rest of the values on the same line are ignored. If there is a second line, the first element of that line is saved and the rest are ignored.
Is there a way to tokenize all the comma separated values from a single line and save all of them into DB?
The query is as follows:
INSERT INTO users (id) VALUES (:userid)
I used the following code to parse the file and save it in DB.
public FlatFileItemReader<User> reader() {
    FlatFileItemReader<User> reader = new FlatFileItemReader<User>();
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
    tokenizer.setNames(new String[] {"userid"});
    // blah…blah….blah….
    reader.setLineMapper(new DefaultLineMapper<User>() {
        {
            setLineTokenizer(tokenizer);
            setFieldSetMapper(new BeanWrapperFieldSetMapper<User>() {
                {
                    setTargetType(User.class);
                }
            });
        }
    });
    return reader;
}

@Bean
public UserItemProcessor processor() {
    return new UserItemProcessor();
}

@Bean
public Job importUserJob(JobCompletionNotificationListener listener) {
    return jobBuilderFactory.get("importUserJob").incrementer(new RunIdIncrementer()).listener(listener)
            .flow(step1()).end().build();
}

@Bean
public Step step1() {
    return stepBuilderFactory.get("step1").<User, User> chunk(5).reader(reader()).processor(processor())
            .writer(writer()).build();
}
Basically, you have two delimiters for the target object: comma and newline. So either you write a custom reader that works on both delimiters, or you pre-process your file to bring it into a standard format.
In my opinion, you are better off pre-processing your file to replace every comma with a newline character. You might retain the original file as is and create the pre-processed data in a new temporary file.
You can either do that as a separate Spring Batch step (not recommended due to the file size) or, if it's going to be a scheduled job, in your kick-off script; a rough sketch of the pre-processing follows the links below.
Replace comma with newline in java
How to break lines at a specific character in Notepad++?
Notepad++ find and replace string with a new-line
Replace comma with new line in a text file using tr in Linux
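For the pre-processing step itself, a minimal Java sketch (the file names are placeholders; it streams the input line by line so it copes with millions of values):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CommaToNewline {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths; point these at the real input file and a temporary output file.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("users.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("users-preprocessed.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String id : line.split(",")) {
                    String trimmed = id.trim();
                    if (!trimmed.isEmpty()) {
                        out.write(trimmed);   // one user ID per line for DelimitedLineTokenizer
                        out.newLine();
                    }
                }
            }
        }
    }
}

After this step each line contains exactly one userid, and the FlatFileItemReader configuration above works unchanged.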

Apache Tika - Parsing and extracting only metadata without reading content

Is there a way to configure Apache Tika so that it only extracts the metadata properties from a file and does not access the content of the file? We need a way to do this to avoid reading the entire content of larger files.
The code we are using to extract is as follows:
TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
BodyContentHandler handler = new BodyContentHandler();
try (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata)) {
    parser.parse(stream, handler, metadata, new ParseContext());
    String[] metadataKeys = metadata.names();
    Arrays.sort(metadataKeys);
}
With the above code sample, when we try to extract the metadata, the content is read as well. We need a way to avoid that.
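This question has no posted answer here, but one common pattern (a sketch, not a verified fix for every file format) is to pass a content handler that simply discards the body, such as org.xml.sax.helpers.DefaultHandler, instead of BodyContentHandler. Tika's parsers may still stream through the file, so this mainly avoids buffering the content in memory rather than avoiding I/O entirely:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.io.TikaInputStream;
import org.xml.sax.helpers.DefaultHandler;

import java.io.File;

public class MetadataOnly {
    public static Metadata extract(File file) throws Exception {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (TikaInputStream stream = TikaInputStream.get(file, metadata)) {
            // DefaultHandler ignores all SAX events, so the extracted text is never accumulated.
            parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
        }
        return metadata;
    }
}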

Change date column to integer

I have a large csv file as below:
DATE status code value value2
2014-12-13 Shipped 105732491-20091002165230 0.000803398 0.702892835
2014-12-14 Shipped 105732491-20091002165231 0.012925206 1.93748834
2014-12-15 Shipped 105732491-20091002165232 0.000191278 0.004772389
2014-12-16 Shipped 105732491-20091002165233 0.007493046 0.44883348
2014-12-17 Shipped 105732491-20091002165234 0.022015049 3.081006137
2014-12-18 Shipped 105732491-20091002165235 0.001894693 0.227268466
2014-12-19 Shipped 105732491-20091002165236 0.000312871 0.003113062
2014-12-20 Shipped 105732491-20091002165237 0.001754068 0.105016053
2014-12-21 Shipped 105732491-20091002165238 0.009773315 0.585910214
:
:
What I need to do is remove the header and change the date format to an integer, yyyyMMdd (e.g. 20141217).
I am using opencsv to read and write the file.
Is there a way I can change all the dates at once without parsing them one by one?
Below is my code to remove the header and create a new file:
void formatCsvFile(String fileToChange) throws Exception {
    CSVReader reader = new CSVReader(new FileReader(new File(fileToChange)),
            CSVParser.DEFAULT_SEPARATOR, CSVParser.NULL_CHARACTER, CSVParser.NULL_CHARACTER, 1)
    info "Read all rows at once"
    List<String[]> allRows = reader.readAll();
    CSVWriter writer = new CSVWriter(new FileWriter(fileToChange),
            CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER)
    info "Write all rows at once"
    writer.writeAll(allRows)
    writer.close()
}
Please can someone help?
Thanks
You don't need to parse the dates, but you do need to process each line in the file and convert the data on each line you want to convert. Java/Groovy doesn't have anything like awk where you can work with file data as columns, for example, the first 10 "columns" (characters usually) in every line in a file. Java/Groovy only deals with "rows" of data in a file, not "columns".
You could try something like this: (in Groovy)
reader.eachLine { String theLine ->
    int idx = theLine.indexOf(' ')
    String oldDate = theLine.substring(0, idx)
    String newDate = oldDate.replaceAll('-', '')
    String newLine = newDate + theLine.substring(idx)
    writer.writeLine(newLine)
}
Edit:
If your CSVReader class is not derived from File, then you can't use Groovy's eachLine method on it. And if the CSVReader class's readAll() method really returns a List of String arrays, then the above code could change to this:
allRows.each { String[] theLine ->
    theLine[0] = theLine[0].replaceAll('-', '')   // strip the dashes from the date column
    writer.writeNext(theLine)                     // CSVWriter writes one String[] per row
}
Ignore the first line (the header):
List<String[]> allRows = reader.readAll()[1..-1];
and replace the '-' in the dates by editing the first column of each row:
allRows = allRows.collect { row ->
    row[0] = row[0].replace('-', '')   // the date, with the dashes removed
    row                                // the rest of the columns stay as they are
}
I don't know what you mean by "all dates at once"; as far as I can see, they can only be iterated over.
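Putting the pieces together, a minimal end-to-end sketch in plain Java with opencsv (the file names are placeholders and it assumes the first row is the header):

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

import java.io.FileReader;
import java.io.FileWriter;
import java.util.List;

public class ReformatDates {
    public static void main(String[] args) throws Exception {
        // Hypothetical input/output paths.
        try (CSVReader reader = new CSVReader(new FileReader("input.csv"));
             CSVWriter writer = new CSVWriter(new FileWriter("output.csv"))) {
            List<String[]> rows = reader.readAll();
            // Skip the header row, then rewrite the date column of every remaining row.
            for (int i = 1; i < rows.size(); i++) {
                String[] row = rows.get(i);
                row[0] = row[0].replace("-", "");   // 2014-12-17 -> 20141217
                writer.writeNext(row);
            }
        }
    }
}

This still touches every row once, which is unavoidable: there is no way to rewrite all the dates without visiting each line.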

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, for example, I saw that it was very easy to read the files:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take forever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this approach.
First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and ListObjects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Creating a Scala list from it
  files.asScala
}
Now pass this list to the following piece of code (note: sc is the SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file as a string. You can then map this so the key is the file date, and then groupByKey? (A sketch of this idea follows the link below.)
http://spark.apache.org/docs/latest/programming-guide.html
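A rough sketch of that idea, written against the Java API (JavaSparkContext exposes the same wholeTextFiles call the Scala snippets use; the input path, date extraction and output format are assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DailyBundles {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("daily-bundles"));

        // Key = full file path, value = whole file content as a string.
        JavaPairRDD<String, String> files =
                sc.wholeTextFiles("s3n://mybucket/path/txt/1996/*/*/*.txt.gz");

        // Re-key each file by its YYYY/MM/DD path segments, keeping (filename, content) as the value.
        JavaPairRDD<String, Tuple2<String, String>> byDay = files.mapToPair(kv -> {
            String path = kv._1();                        // e.g. .../txt/1996/04/09/filename.txt.gz
            String[] parts = path.split("/");
            String day = parts[parts.length - 4] + "-"    // year
                       + parts[parts.length - 3] + "-"    // month
                       + parts[parts.length - 2];         // day
            return new Tuple2<>(day, new Tuple2<>(path, kv._2()));
        });

        // One group per day, holding every (filename, content) pair for that date.
        byDay.groupByKey().saveAsObjectFile("s3n://mybucket/output/daily-bundles");

        sc.stop();
    }
}

As the next answer points out, listing 20 million small objects is itself a bottleneck, so at that scale the file-list generation and a staging copy into HDFS matter more than the Spark code.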
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, building this file list will be a bottleneck. I'd recommend maintaining a file that is appended with the file path every time a file is uploaded to S3.
The same goes for output: write to HDFS and then move to S3 (although a direct copy might be equally efficient).

How to specify strings in Weka file?

I am working on a text classification system and I would like to use unigrams as features. When building the ARFF file, I declared a string attribute field inside which I want to specify all the words contained in a message, separated by commas. However, Weka is telling me that it "Cannot handle string attributes". I tried defining the relation in the header with StringToWordVector, but it didn't help. How else can I go about this? Many thanks!
If your ARFF file format is correct, then the following code can help you:
// dataSource: path of your ARFF file
BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
Instances trainInsts = new Instances(trainReader);
trainInsts.setClassIndex(trainInsts.numAttributes() - 1);

// The filter converts the string attribute into numeric word-vector attributes.
StringToWordVector STWfilter = new StringToWordVector();
FilteredClassifier model = new FilteredClassifier();
model.setFilter(STWfilter);
STWfilter.setInputFormat(trainInsts);

// The converted data
trainInsts = Filter.useFilter(trainInsts, STWfilter);
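From there, one common way to finish (a sketch; the choice of NaiveBayes is an assumption, not part of the original answer) is to attach a base classifier to the FilteredClassifier and train it. Since FilteredClassifier applies StringToWordVector itself, it can be built directly on the original string-attribute instances, which makes the explicit Filter.useFilter call optional:

import weka.classifiers.bayes.NaiveBayes;

// Hypothetical continuation: train the model; FilteredClassifier filters internally,
// so it can be given the raw (string-attribute) training instances.
model.setClassifier(new NaiveBayes());
model.buildClassifier(trainInsts);

// Classify a single instance later on, e.g. the first training instance:
double predicted = model.classifyInstance(trainInsts.instance(0));
System.out.println("Predicted class index: " + predicted);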