I am using Google Cloud Datalab for the first time to build a classifier for a Kaggle competition, but I am stuck trying to write a CSV file containing the pre-processed training data to Cloud Storage using the google.datalab.storage API.
The file contains strings with Unicode characters, which causes write_stream on a Storage object to fail with the error:
Failed to process HTTP response.
Here is the simplified code only trying to write a single string:
from google.datalab import Context
import google.datalab.storage as storage
project = Context.default().project_id
bucket_name = project
bucket_object = storage.Bucket(bucket_name)
file_object = bucket_object.object('x.txt')
test_string = 'Congratulations from me as well, use the tools well. \xc2\xa0\xc2\xb7 talk'
#test_string = 'Congratulations from me as well, use the tools well. talk'
print type(test_string)
print len(test_string)
test_string = test_string.decode('utf-8')
print type(test_string)
print len(test_string)
test_string = test_string.encode('utf-8')
print type(test_string)
print len(test_string)
try:
    file_object.write_stream(test_string, 'text/plain')
except Exception as e:
    print e
Output:
<type 'str'>
62
<type 'unicode'>
60
<type 'str'>
62
Failed to process HTTP response.
If I use the string without the Unicode characters, the Storage object is created and the string is written to the file. It makes no difference whether I try to write the decoded unicode version or the encoded str version. The content type ('text/plain' or 'application/octet-stream') also makes no difference.
I would appreciate any help or ideas on how to solve this, especially since the google.datalab.storage API is barely documented (like most things GCP).
Thanks.
I want to get the creation time of files in GCS, so I used the code below:
println(Files
  .getFileAttributeView(Paths.get("gs://datalake-dev/mu/tpu/file.0450138"), classOf[BasicFileAttributeView])
  .readAttributes.creationTime)
The problem is that the Paths.get function replaces // with /, so I get gs:/datalake-dev/mu/tpu/file.0450138 instead of gs://datalake-dev/mu/tpu/file.0450138.
Can anyone help me with this?
Thanks a lot!
I solved the problem by adding the following Java code and then calling the Java method from Scala.
import com.google.cloud.storage.*;
import java.sql.Timestamp;

public class ExtractDate {
    public static String getTime(String fileName) {
        String bucketName = "bucket-data";
        String blobName = "doc/files/" + fileName;
        // Instantiate a storage client and look up the blob
        Storage storage_client = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage_client.get(BlobId.of(bucketName, blobName));
        // getCreateTime() returns the creation time in milliseconds since the epoch
        Timestamp tmp = new Timestamp(blob.getCreateTime());
        // Return the year of the file's creation date
        return tmp.toString().substring(0, 4);
    }
}
You can use the file_get_contents function to read the contents of the file at that path. From the documentation on Reading and Writing Files:
An App Engine PHP 5 app must use the Cloud Storage stream wrapper to write files at runtime. However, if an app needs to read files, and these files are static, you can optionally read static files uploaded with your app using PHP filesystem functions such as file_get_contents.
$fileContents = file_get_contents($filePath);
where the path specified must be relative to the script accessing the file.
You must upload the file or files in an application subdirectory when you deploy your app to App Engine, and must configure the app.yaml file so your app can access those files. For complete details, see PHP 5 Application Configuration with app.yaml.
In the app.yaml configuration, notice that if you use a static file or directory handler (static_files or static_dir) you must specify application_readable set to true or your app won't be able to read the files. However, if the files are served by a script handler, this isn't necessary, because these files are readable by script handlers by default.
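As a rough illustration (the /data URL and directory name here are made up, not taken from any specific app), an app.yaml static handler that keeps the files readable by application code might look like this:

handlers:
# Hypothetical static handler: files under data/ are served at /data and,
# because application_readable is true, can also be read by PHP code with
# file_get_contents('data/example.csv').
- url: /data
  static_dir: data
  application_readable: true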
I am trying to call the language detection method of the Translate client API from PySpark for each row in a file.
I created a map method as follows, but the job seems to just freeze with no error. If I remove the call to the Translate API it executes fine. Is it possible to call Google client API methods within a PySpark map?
The mapping method that does the translation:
from google.cloud import translate

def doTranslate(data):
    translate_client = translate.Client()
    # Get the message information
    messageId = data[0]
    messageContent = data[6]
    # Detect the language of the message content
    detectedLang = translate_client.detect_language(messageContent)
    r = []
    r.append(detectedLang)
    return r
Figured it out! Your question led me in the right direction, thanks!
It turns out I was getting an exception from the call because I was going past the default quota for message sizes. I added a try/except block and determined this was the problem. Then cutting the message size down (I am just testing, so I don't want to mess with the quota) fixed the issue.
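For reference, a minimal sketch of that kind of change, assuming the same row layout as above (the 1,000-character cut-off and the returned tuple shape are illustrative, not the actual values used):

from google.cloud import translate

def doTranslate(data):
    translate_client = translate.Client()
    messageId = data[0]
    # Truncate the message so it stays under the API's request-size quota
    # (1000 characters is an arbitrary, illustrative cut-off).
    messageContent = data[6][:1000]
    try:
        detectedLang = translate_client.detect_language(messageContent)
        return [(messageId, detectedLang)]
    except Exception as e:
        # Surface the API error instead of letting the executor hang silently.
        return [(messageId, {'error': str(e)})]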
Is there a way to read an Excel file stored in a GCS bucket using Dataflow?
I would also like to know whether we can access the metadata of an object in GCS using Dataflow. If yes, then how?
CSV files are a common export format from Excel. These files can be split and read line by line, so they are ideal for Dataflow. You can use TextIO.Read to pull in each line of the file, then parse each one as a CSV line.
If you want to use a binary Excel format instead, then I believe you would need to read in the entire file and use a library to parse it. I recommend using CSV files if you can.
As for reading the GCS metadata: I don't think you can do this with TextIO, but you could call the GCS API directly to access the metadata. If you only do this for a few files at the start of your program, it will work and not be too expensive. If you need to read many files like this, you'll be adding an extra RPC per file.
Be careful not to read the same file's metadata multiple times; I suggest reading it once and then writing the metadata out to a side input. Then in one of your ParDos you can access the side input for each file.
Useful links:
ETL & Parsing CSV files in Cloud Dataflow
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Read
https://cloud.google.com/dataflow/model/par-do#side-inputs
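For what it's worth, here is a minimal sketch of the line-by-line approach written with the Beam/Dataflow Python SDK rather than the Java SDK the links above refer to (the bucket path is made up):

import csv
import apache_beam as beam

def parse_csv_line(line):
    # Parse one line of text as a CSV record; assumes no embedded newlines.
    return next(csv.reader([line]))

with beam.Pipeline() as p:
    rows = (
        p
        # Read the exported CSV from GCS line by line (the path is illustrative).
        | 'ReadLines' >> beam.io.ReadFromText('gs://my-bucket/exported.csv')
        | 'ParseCsv' >> beam.Map(parse_csv_line)
    )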
private static final int BUFFER_SIZE = 64 * 1024;

private static void printBlob(com.google.cloud.storage.Storage storage, String bucketName, String blobPath) throws IOException, InvalidFormatException {
    try (ReadChannel reader = storage.reader(bucketName, blobPath)) {
        InputStream inputStream = Channels.newInputStream(reader);
        Workbook wb = WorkbookFactory.create(inputStream);
        StringBuffer data = new StringBuffer();
        for (int i = 0; i < wb.getNumberOfSheets(); i++) {
            String fName = wb.getSheetAt(i).getSheetName();
            File outputFile = new File("D:\\excel\\" + fName + ".csv");
            FileOutputStream fos = new FileOutputStream(outputFile);
            XSSFSheet sheet = (XSSFSheet) wb.getSheetAt(i);
            Iterator<Row> rowIterator = sheet.iterator();
            data.delete(0, data.length());
            while (rowIterator.hasNext()) {
                // Get each row
                Row row = rowIterator.next();
                data.append('\n');
                // Iterate through each column of the row
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    // Check the cell format
                    switch (cell.getCellType()) {
                        case Cell.CELL_TYPE_NUMERIC:
                            data.append(cell.getNumericCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_STRING:
                            data.append(cell.getStringCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_BOOLEAN:
                            data.append(cell.getBooleanCellValue() + ",");
                            break;
                        case Cell.CELL_TYPE_BLANK:
                            data.append("" + ",");
                            break;
                        default:
                            data.append(cell + ",");
                    }
                }
            }
            // Write the sheet out as CSV and close the stream
            fos.write(data.toString().getBytes());
            fos.close();
        }
    }
}
You should be able to read the metadata of a GCS file by using the GCS API. However, you would need the filenames; you can get them by running a ParDo or another transform over a PCollection<String> that holds the filenames.
We don't have any default readers for Excel files. You can parse a CSV file by using a text input (ETL & Parsing CSV files in Cloud Dataflow).
I'm not very knowledgeable about Excel and how the file format is stored. If you want to process one file at a time, you can use a PCollection<String> of files and then use a library to parse each Excel file.
If an Excel file can be split into easily parallelizable parts, I'd suggest you take a look at this doc (https://beam.apache.org/documentation/io/authoring-overview/). (If you are still using the Dataflow SDK, it should be similar.) It may be worth splitting the file into smaller chunks before reading to get more parallelism out of your pipeline. In that case you could use IOChannelFactory to read from the file.
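As a rough sketch of the one-file-per-element approach described above (again using the Beam Python SDK, with openpyxl as the parsing library; the bucket path is illustrative):

import io
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from openpyxl import load_workbook

def read_workbook_rows(gcs_path):
    # Read the whole .xlsx file into memory, then parse it with openpyxl.
    f = FileSystems.open(gcs_path)
    workbook = load_workbook(io.BytesIO(f.read()), read_only=True)
    f.close()
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            yield (sheet.title, row)

with beam.Pipeline() as p:
    rows = (
        p
        | 'Filenames' >> beam.Create(['gs://my-bucket/report.xlsx'])  # illustrative path
        | 'ParseExcel' >> beam.FlatMap(read_workbook_rows)
    )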
I am trying to convert documents using the Bluemix Document Conversion service with a Node.js application. I am getting nothing but errors in my app, but the test document I'm using converts fine using the demo page. Below is a minimal app that demonstrates the problem (Note that, while this app is converting a PDF from disk, the "real" app can't do that, hence the Buffer object).
'use strict';
var fs = require('fs');
var DocumentConversionV1 = require('watson-developer-cloud/document-conversion/v1');
var bluemix=require('./my_bluemix');
var extend=require('util')._extend; //Node.js' built-in object extend function
var dcCredentials = extend({
url: '<url>',
version: 'v1',
username: '<username>',
password: '<password>'
}, bluemix.getServiceCreds('document_conversion')); // VCAP_SERVICES
var document_conversion = new DocumentConversionV1(dcCredentials);
var contents = fs.readFileSync('./testdoc.pdf', 'utf8');
var parms={
file: new Buffer(contents,'utf8'),
conversion_target: 'ANSWER_UNITS', // (JSON) ANSWER_UNITS, NORMALIZED_HTML, or NORMALIZED_TEXT
content_type:'application/pdf',
contentType:'application/pdf', //don't know which of these two works, seems to be inconsistent so I include both
html_to_answer_units: {selectors: [ 'h1', 'h2','h3', 'h4']},
};
console.log('First 100 chars of file:\n******************\n'+contents.substr(0,100)+'\n******************\n');
document_conversion.convert(parms, function(err,answerUnits)
{
if (!err)
console.log('Returned '+answerUnits.length);
else
console.log('Error: '+JSON.stringify(err));
});
The results from running this program against the test PDF (782K) are:
$ node test.js
[DocumentConversion] WARNING: No version_date specified. Using a (possibly old) default. e.g. watson.document_conversion({ version_date: "2015-12-15" })
[DocumentConversion] WARNING: No version_date specified. Using a (possibly old) default. e.g. watson.document_conversion({ version_date: "2015-12-15" })
First 100 chars of file:
******************
%PDF-1.5
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 105 0 R/MarkInfo<<
******************
Error: {"code":400,"error":"Could not push back 82801 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize"}
$
Can someone tell me:
1. How to get rid of the warning messages?
2. Why the document is not getting converted?
3. How do I "increase the push back buffer"?
Other documents give different errors, but I'm hoping if I can make this one work then the other errors will go away too.
1. You can get rid of the warning message by specifying a version date in your configuration. See the tests for an example.
2. If the document converts through the demo but fails to convert when using your application, it is likely an error with how the binary data is passed to the service (for example, it's getting corrupted or truncated). You can see the Node.js source code for the demo here; it may help you figure out the mistake or give you a different approach to loading/sending the file.
3. That is an error from one of the underlying libraries used by the service. Unfortunately, it's not something that a caller can adjust at this point.
This is the first time I am integrating an email service with Liftweb.
I want to send email with attachments (like documents, images, PDFs).
My code looks like this:
case class CSVFile(bytes: Array[Byte], filename: String = "file.csv",
  mime: String = "text/csv; charset=utf8; header=present")
val attach = CSVFile(fileupload.mkString.getBytes("utf8"))
val body = <p>Please research the enclosed.</p>
val msg = XHTMLPlusImages(body,
PlusImageHolder(attach.filename, attach.mime, attach.bytes))
Mailer.sendMail(
  From("vyz#gmail.com"),
  Subject(subject(0)),
  To(to(0)),
  msg)
This code is taken from the Lift Cookbook, but it's not working the way I need.
It works, but only the attached file name comes through (file.csv) with no data in it (I uploaded the file gsy.docx).
Best Regards
GSY
You don't specify what type fileupload is, but assuming it is of type net.liftweb.http.FileParamHolder, the issue is that you can't just call mkString and expect it to have any data, since there is no data in the object, only a fileStream method for retrieving it (either from disk or memory).
The easiest way to accomplish what you want would be to use a ByteArrayOutputStream and copy the data to it. I haven't tested it, but the code below should solve your issue. For brevity, it uses Apache Commons IO to copy the streams, but you could just as easily do it natively.
import java.io.ByteArrayOutputStream
import org.apache.commons.io.IOUtils

val data = {
  val os = new ByteArrayOutputStream()
  // Copy the uploaded file's stream into an in-memory buffer
  IOUtils.copy(fileupload.fileStream, os)
  os.toByteArray
}
val attach = CSVFile(data)
BTW, you say you are uploading a Word (DOCX) file and expecting it to automatically be CSV when the extension is changed? You will just get a DOCX file with a csv extension unless you actually do some conversion.