When I add a new record to a database using a form, I can also upload an image. The two are not linked: the record goes into the database and the image goes into a folder on my desktop. So that I know which image belongs to which record, I want to put the filename in a column. How do I approach this?
I'm using Play Framework 2.4, Scala, an H2 database, and Anorm for my project.
In your HTML form you need to have an input tag of type file, something like:
<input type="file" name="picture">
And in your Scala controller method, where you handle the form submission, something like:
def save = Action(parse.multipartFormData) { request =>
  request.body.file("picture").map { picture =>
    val filename = picture.filename
    println(filename)
    Ok("saved")
  }.getOrElse {
    BadRequest("Missing file")
  }
}
You could retrieve the absolute path of a file on the classpath like so:
scala> val filePath = getClass.getResource("myImage.png")
filePath: java.net.URL = file:/home/robert/myImage.png
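To actually link the two, a minimal sketch along these lines could work. The table name item, its columns, and the images folder path are assumptions for illustration, not from the question:

import java.io.File

import anorm._
import play.api.Play.current
import play.api.db.DB
import play.api.mvc._

object Items extends Controller {

  def save = Action(parse.multipartFormData) { request =>
    request.body.file("picture").map { picture =>
      val filename = picture.filename
      // move the temporary upload into the folder where the other images live
      picture.ref.moveTo(new File(s"/path/to/images/$filename"), replace = true)
      // store the record together with the filename, so the two stay linked
      DB.withConnection { implicit c =>
        SQL("insert into item(name, picture_filename) values ({name}, {file})")
          .on("name" -> "some item", "file" -> filename)
          .executeInsert()
      }
      Ok("saved")
    }.getOrElse(BadRequest("Missing file"))
  }
}

Later, to show the image again, read the filename column back and serve the file from that folder.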
I have an XML document that has mixed content, and I am using a custom schema in a DataFrame to parse it. I am having an issue where the schema only picks up the text for "Measure".
The XML looks like this:
<QData>
  <Measure> some text here
    <Answer>Answer1</Answer>
    <Question>Question1</Question>
  </Measure>
  <Measure> some text here
    <Answer>Answer1</Answer>
    <Question>Question1</Question>
  </Measure>
</QData>
My schema is as follows:
def getCustomSchema(): StructType = StructType(Array(
  StructField("QData",
    StructType(Array(
      StructField("Measure",
        StructType(Array(
          StructField("Answer", StringType, true),
          StructField("Question", StringType, true)
        )), true)
    )), true)
))
When I try to access the data in Measure, I only get "some text here", and it fails when I try to get info from Answer. I am also getting only one Measure.
EDIT: This is how I am trying to access the data
val result = sc.read.format("com.databricks.spark.xml")
  .option("attributePrefix", "attr_")
  .schema(getCustomSchema)
  .load(filename.toString)

val qDfTemp = result.mapPartitions { partition =>
  val mapper = new QDMapper()
  partition.map(row => mapper(row)).flatMap(list => list)
}.toDF()
case class QDMapper() {
  def apply(row: Row): List[QData] = {
    val qDList = new ListBuffer[QData]()
    val qualData = row.getAs[Row]("QData") // when I print this as a list I get the first Measure text and that is it
    val measure = qualData.getAs[Row]("Measure") // this fails
    // ...
    qDList.toList
  }
}
}
You can use the row tag as the root tag and access the other elements:
df_schema = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='<xml_tag_name>') \
    .load(schema_path)
See https://github.com/harshaltaware/Pyspark/blob/main/Spark-data-parsing/xmlparsing.py for a brief example.
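The same idea as a rough Scala sketch: with rowTag set to Measure, each <Measure> becomes one row, and (assuming spark-xml's default valueTag) the mixed text inside the element lands in a _VALUE column next to Answer and Question:

import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ??? // your existing context

val measures = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Measure")        // one row per <Measure> element
  .option("attributePrefix", "attr_")
  .load(filename.toString)

// Answer, Question and the element's own text are now ordinary columns
measures.select("Answer", "Question", "_VALUE").show()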
I want to add a simple button to my app that, on click, calls an Action that creates a CSV file from two lists I have and downloads it to the user's computer.
This is my Action:
def createAndDownloadFile = Action {
  val file = new File("newFile.csv")
  val writer = CSVWriter.open(file)
  writer.writeAll(List(listOfHeaders, listOfValues))
  writer.close()

  Ok.sendFile(file, inline = false, _ => file.getName)
}
but this is not working for me; the file does not get downloaded by the browser...
I'm expecting to see the file get downloaded by the browser; I thought Ok.sendFile would do the trick.
Thanks!
You can use Enumerators and streams for that. It should work like this:
val enum = Enumerator.fromFile(...)
val source = akka.stream.scaladsl.Source
  .fromPublisher(play.api.libs.streams.Streams.enumeratorToPublisher(enum))
  .map(akka.util.ByteString(_)) // Enumerator.fromFile emits Array[Byte]; the HTTP entity needs ByteString

Result(
  header = ResponseHeader(OK, Map(CONTENT_DISPOSITION -> "attachment; filename=whatever.csv.gz")),
  body = HttpEntity.Streamed(source.via(Compression.gzip), None, None)
)
This will actually pipe the download through gzip. Just remove the .via(Compression.gzip) part if that is not needed.
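Alternatively, if you do not need gzip at all, a plain streamed result built with Akka's FileIO is another option. A minimal sketch, assuming Play 2.5+ and reusing CSVWriter, listOfHeaders and listOfValues from the question:

import java.io.File

import akka.stream.scaladsl.FileIO
import play.api.http.HttpEntity
import play.api.mvc._

def createAndDownloadFile = Action {
  val file = new File("newFile.csv")
  val writer = CSVWriter.open(file)
  writer.writeAll(List(listOfHeaders, listOfValues))
  writer.close()

  // stream the file back with an explicit attachment header so the browser downloads it
  val source = FileIO.fromPath(file.toPath)
  Result(
    header = ResponseHeader(200, Map("Content-Disposition" -> s"attachment; filename=${file.getName}")),
    body = HttpEntity.Streamed(source, Some(file.length()), Some("text/csv"))
  )
}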
I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell for example, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take forever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this approach.
First, get a Buffer/List of the S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and list-objects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // adding s3:// to the paths and adding them to the list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // removing the base directory name
  files.remove(0)

  // converting to a Scala collection
  files.asScala
}
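A hypothetical call (the bucket name and prefix are placeholders, not from the question):

// hypothetical values; substitute your own bucket and prefix
val files = listFiles("mybucket", "path/txt/")
files.foreach(println)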
Now pass this list to the following piece of code (note: sc here is the SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single partition:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file as a string. You can then map this so the key is the file date, and then groupByKey.
http://spark.apache.org/docs/latest/programming-guide.html
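For what it's worth, a rough sketch of that idea, assuming the day can be taken from the .../YYYY/MM/DD/... portion of each path:

import org.apache.spark.SparkContext

val sc: SparkContext = ??? // your existing SparkContext

// (path, content) pairs; the key is the full file name
val docs = sc.wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")

// group by the YYYY/MM/DD part of the path
val byDay = docs
  .map { case (path, content) =>
    // .../txt/1996/04/09/filename.txt.gz -> "1996/04/09"
    val day = path.split("/").takeRight(4).dropRight(1).mkString("/")
    (day, (path, content))
  }
  .groupByKey()

// e.g. inspect how many files ended up in each daily bundle
byDay.mapValues(_.size).take(10).foreach(println)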
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz"), as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend maintaining a file that gets appended with the file path every time a file is uploaded to S3.
The same goes for output: put it into HDFS first and then move it to S3 (although a direct copy might be equally efficient).
I'm trying to store user-uploaded images in my application, which is written in Scala with Play Framework 2.2.x.
I've deployed my app on Heroku.
Heroku does not allow me to save files to the file system,
so I've tried to store the files in the database.
Here is the code that I use to store the image:
def updateImage(id: Long, image: Array[Byte]) = {
  val selected = getById(id)
  DB.withConnection { implicit c =>
    SQL("update subcategory set image={image} where id = {id}")
      .on('id -> id, 'image -> image)
      .executeUpdate()
  }
  selected
}
And here is the code that I use to retrieve my image:
def getImageById(id: Long): Array[Byte] = DB.withConnection { implicit c =>
  val all = SQL("select image from subcategory where id = {id}").on('id -> id)().map {
    case Row(image: Array[Byte])       => image
    case Row(Some(image: Array[Byte])) => image
    case Row(image: java.sql.Blob)     => image.getBytes(0, image.length().toInt)
  }
  all.head
}
The problem is: when I use the H2 database and a blob column, I get a "Match Error" exception.
When I use PostgreSQL and a bytea column, I get no error, but when I retrieve the image it's in hex format and some of the bytes at the beginning of the array are missing.
According to the PostgreSQL documentation, bytea stores the length of the array in the four bytes at the beginning of the array. These are stripped when you read the row, so that's why they seem to be "missing" when you compare the data in Scala with the data in the DB.
You will have to set the response's content type to the appropriate value if you want the web browser to display the image correctly; otherwise it does not know it is receiving image data. The Ok.sendFile helper does this for you. Otherwise you will have to do it by hand:
def getPicture = Action {
  SimpleResult(
    header = ResponseHeader(200),
    body = Enumerator(pictureByteArray)
  ).as(pictureContentType)
}
In the example above, pictureByteArray is the Array[Byte] containing the picture data from your database, and pictureContentType is a string with the appropriate content type (for example, image/jpeg).
This is all quite well explained in the Play documentation.
My problem is that I can't get the saved picture back from GridFS, even though it's there; I've verified it, and the console shows me the picture, its name, and its size.
Here is the code:
conn = Connection()
The class that saves to the database:
class Profile(tornado.web.RequestHandler):
    def post(self):
        self.db = conn["essog"]
        avat = self.request.files['avatar'][0]["body"]
        avctype = self.request.files['avatar'][0]["content_type"]
        nomfich = self.request.files['avatar'][0]["filename"]
        # ..operation using PIL to decide whether to save the picture or not
        self.fs = GridFS(self.db)
        avatar_id = self.fs.put(avat, content_type=avctype, filename=nomfich)  # change the name later to avoid deleting by the same name, so generate a different name...
        .....
        user = {..., "avatar": avatar_id}
        self.db.users.insert(user)
        self.db.users.save(user)
The class that reads from the database:
class Profil(tornado.web.RequestHandler):
    def get(self):
        self.db = conn["essog"]
        self.fs = GridFS(self.db)
        avatar_id = self.db.users.find_one()["avatar"]
        ...
        avatar = self.fs.get(avatar_id).read()
        self.render("profile.html", ..., avatar=avatar)
and in the view (profile.html)
<img src="{{avatar}}" />
but nothing is displayed!
Unless you want to use a base64 data URI as the source of the image, you should use a URL and create a separate handler that returns the image data from that URL. If you are using nginx, you might be interested in the nginx-gridfs module for better performance.
The src attribute of an img tag does not (typically) contain the image data itself, but rather the URL of the image. I think you're confusing two separate requests and responses:
HTML page that contains an <img src="..." /> tag:
class Profil(tornado.web.RequestHandler):
    def get(self):
        # avatar_id looked up as in your existing code
        self.render('profile.html',
                    avatar=self.reverse_url('avatar', avatar_id))
The image itself (which needs a separate handler):
class Avatar(tornado.web.RequestHandler):
    def get(self, avatar_id):
        avatar = self.fs.get(avatar_id)
        self.set_header('Content-Type', avatar.content_type)
        self.finish(avatar.read())