Right now, we are using
import java.nio.charset.CodingErrorAction

val codec = scala.io.Codec.UTF8
codec.onMalformedInput(CodingErrorAction.REPLACE)
to simply replace bad characters. Is there a way, in Scala or Java, to know that a replacement occurred? We would like to inform our customers that the data they sent us was malformed.
Thanks
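For reference, one way to detect this (a sketch, not the asker's code) is to decode strictly with CodingErrorAction.REPORT first, catch the MalformedInputException that strict decoding raises, and only then fall back to REPLACE while flagging the input as malformed. The file name below is illustrative:

import java.nio.charset.{CodingErrorAction, MalformedInputException, StandardCharsets}
import scala.io.{Codec, Source}

// Try strict decoding first so malformed bytes raise an exception;
// on failure, re-read with REPLACE and remember that the data was bad.
val (text, wasMalformed) =
  try {
    val strict = Codec(StandardCharsets.UTF_8).onMalformedInput(CodingErrorAction.REPORT)
    (Source.fromFile("data.txt")(strict).mkString, false)
  } catch {
    case _: MalformedInputException =>
      val lenient = Codec(StandardCharsets.UTF_8).onMalformedInput(CodingErrorAction.REPLACE)
      (Source.fromFile("data.txt")(lenient).mkString, true)
  }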
This is my code, and I got a comment saying "can I use the predefined string format?". What is the predefined string format, and how can I do that?
val uri = new URI(baseUrl + "?provider=" + provider + "&format=" + format)
This is a clean "Scala" way to do this:
val uri = new URI(s"$baseUrl?provider=$provider&format=$format")
A decent editor like IntelliJ IDEA will highlight this so that it is clear which parts are code and which parts are plain text.
This is not a proper way of constructing a URI, and neither is using a string interpolator such as s"..." as suggested in the other answer or by your coworker. The reason is that it will break as soon as provider or format contains “weird” characters such as # or &. That can lead to all kinds of bugs, including security vulnerabilities.
Unfortunately, Scala doesn't come with an easy built-in way to construct URI query strings. You should use a URI abstraction from a library such as Akka HTTP or http4s.
For instance, with http4s you can write
val uri = uri"https://stackoverflow.com".withQueryParam("provider", provider)
That will take care of all the necessary escape sequences and the like.
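If pulling in a library is not an option, the JDK's java.net.URLEncoder can at least escape each query component before interpolating. A minimal sketch (the enc helper is ours; note that URLEncoder applies form encoding, so spaces become +, which is acceptable in a query string):

import java.net.{URI, URLEncoder}

// Hypothetical helper: percent-encode a single query-string component.
def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

val uri = new URI(s"$baseUrl?provider=${enc(provider)}&format=${enc(format)}")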
In Hadoop it is easy to use .replace(), for example:
String valArray = value.toString().replace("\\N", "");
But it doesn't work in Spark. I wrote Scala in the Spark shell like this:
val outFile = inFile.map(x => x.replace("\\N", ""))
So, how do I deal with this?
For some reason your x is an Array[String]. How did you get it like that? You can call .toString.replace on it if you like, but that will probably not get you what you want (and it would give the wrong output in Java anyway). You probably want another layer of map: inFile.map(x => x.map(_.replace("\\N", "")))
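For concreteness, a minimal spark-shell sketch, assuming inFile was built by splitting each line into fields (the path and delimiter here are illustrative):

// Assumed setup: each record becomes an Array[String] of tab-separated fields.
val inFile = sc.textFile("hdfs:///tmp/input.txt").map(_.split("\t"))

// Apply the replacement to every field of every record.
val outFile = inFile.map(fields => fields.map(_.replace("\\N", "")))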
I'm using the native parser combinator library in Scala, and I'd like to use it to parse a number of large files. I have my combinators set up, but the file that I'm trying to parse is too large to be read into memory all at once. I'd like to be able to stream from an input file through my parser and read it back to disk so that I don't need to store it all in memory at once. My current system looks something like this:
import scala.io.Source

val f = Source.fromFile("myfile")
parser.parse(parser.document.+, f.reader).get.map(_.writeToFile)
f.close()
This reads the whole file in as it parses, which I'd like to avoid.
There is no easy or built-in way to accomplish this using Scala's parser combinators, which provide a facility for implementing parsing expression grammars.
Operators such as ||| (longest match) are largely incompatible with a stream-parsing model, as they require extensive backtracking capabilities. To accomplish what you are trying to do, you would need to reformulate your grammar so that no backtracking is ever required. This is generally much harder than it sounds.
As mentioned by others, your best bet would be to look into a preliminary phase where you chunk your input (e.g. by line) so that you can handle a portion of the stream at a time.
One easy way of doing it is to grab an Iterator from the Source object and then walk through the lines like so:
import scala.io.Source

val source = Source.fromFile("myFile")
val lines = source.getLines()
for (line <- lines) {
  // Do magic with the line value
}
source.close() // Close the file
But you will need to be able to use the lines one by one in your parser, of course.
Source: https://groups.google.com/forum/#!topic/scala-user/LPzpXo3sUVE
You might try the StreamReader class that is part of the parsing package.
You would use it something like:
import scala.io.Source.fromFile
import scala.util.parsing.input.StreamReader

val f = StreamReader(fromFile("myfile", "UTF-8").reader())
parseAll(parser, f)
Longest match, as one poster above mentioned, combined with the regex parsers' use of source.subSequence(0, source.length), means even StreamReader doesn't help.
The best kludgy answer I have is to use getLines as others have mentioned, and to chunk, as the accepted answer mentions. My particular input required me to chunk two lines at a time. You could build an iterator out of the chunks you build to make it slightly less ugly.
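A sketch of that kludge, assuming parser is a RegexParsers instance whose document production can handle one two-line chunk (the names come from the question; the chunk size from my input):

import scala.io.Source

val source = Source.fromFile("myfile")
try {
  source.getLines().grouped(2).foreach { chunk =>
    parser.parseAll(parser.document, chunk.mkString("\n")) match {
      case parser.Success(result, _) => result.writeToFile
      case failure                   => sys.error(failure.toString)
    }
  }
} finally source.close()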
I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in utf-8 or in unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the PyMongo driver raise a bson.errors.InvalidStringData, or do I have to try to convert each field to find it?
(If by chance a binary object happens to be a valid Unicode or UTF-8 string, it will be stored as a string, and that's OK.)
PyMongo has two BSON implementations, one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode, but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533; it'll be fixed soon.
(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData if a string throws a UnicodeError when it is encoded as UTF-8. The same function is used for both keys and values that are strings.
In other words, at this point in code, the driver does not know if it is dealing with a key or value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand, it does not come out of the driver.
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
... db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
... print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them in utf-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Something like this:
try:
    # This code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))
Suppose I have a txt file named "input.txt" and I want to use Scala to read it in. The dimensions of the file are not known in advance.
So, how do I construct such an Array[Array[Float]]? What I want is a simple and neat way, rather than Java-style code that iterates over lines and parses each number. I think functional programming should be quite good at this, but I haven't been able to come up with anything so far.
Best Regards
If your input is correct, you can do it this way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
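As a side note, on Scala 2.13+ the same thing can be written resource-safely with scala.util.Using; a sketch assuming the same space-separated input.txt:

import scala.io.Source
import scala.util.Using

// Using.resource closes the file even if a line fails to parse.
val data: Array[Array[Float]] =
  Using.resource(Source.fromFile("input.txt")) { source =>
    source.getLines().map(_.split(" ").map(_.toFloat)).toArray
  }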
Update: for additional information about using Source check this thread