Display URL Image Using Pyspark - pyspark

I have a dataframe with a column of URL links, and I want each of the images displayed.
I tried the following solution for local files, but it didn't work for URL links:
Spark using PySpark read images
If anyone knows how to accomplish this for a PySpark dataframe using a URL link, please share.
Example of url jpg:
https://steemitimages.com/DQmWSoXZPHH2XEuVRUbPqiPLf6niA2xfvFXYZ2FYPYhMQ4X/1%20(3).jpg

Loading images only works for local paths or HDFS-like paths.
You have to download the image to local disk first and then load it from there.
import os
import urllib.request
# path to your image source directory
sample_img_dir = "/tmp/images"
os.makedirs(sample_img_dir, exist_ok=True)  # urlretrieve does not create the directory
urllib.request.urlretrieve('https://steemitimages.com/DQmWSoXZPHH2XEuVRUbPqiPLf6niA2xfvFXYZ2FYPYhMQ4X/1%20(3).jpg', sample_img_dir + '/image1.jpg')
# Read image data using the image data source
image_df = spark.read.format("image")\
    .option("dropInvalid", True)\
    .load(sample_img_dir)
image_df.select("image.origin", "image.width", "image.height").show(truncate=False)
+-----------------------------+-----+------+
|origin                       |width|height|
+-----------------------------+-----+------+
|file:///tmp/images/image1.jpg|300  |311   |
|file:///tmp/images/image2.jpg|199  |313   |
|file:///tmp/images/image3.jpg|300  |200   |
|file:///tmp/images/image4.jpg|300  |296   |
+-----------------------------+-----+------+
Reference:
Introducing Built-in Image Data Source in Apache Spark 2.4
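Since the question is about a whole column of URLs, the same download-then-load idea can be applied in a loop. The following is a minimal sketch, not a definitive implementation: the helper names `local_name` and `download_all` are hypothetical, the column name "url" is an assumption, and it assumes the list of URLs is small enough to collect on the driver and that /tmp/images is readable by Spark (for example in local mode).

```python
import os
import urllib.parse
import urllib.request


def local_name(url, index):
    """Build a collision-free local file name from a URL (hypothetical helper)."""
    path = urllib.parse.urlparse(url).path
    base = os.path.basename(urllib.parse.unquote(path)) or "image"
    # Prefix with the index so two URLs ending in the same name do not clash.
    return f"{index:05d}_{base.replace(' ', '_')}"


def download_all(urls, target_dir):
    """Download every URL into target_dir and return the local paths."""
    os.makedirs(target_dir, exist_ok=True)
    local_paths = []
    for index, url in enumerate(urls):
        destination = os.path.join(target_dir, local_name(url, index))
        urllib.request.urlretrieve(url, destination)
        local_paths.append(destination)
    return local_paths


# Usage with a DataFrame `df` that has a string column named "url" (assumed name):
# urls = [row["url"] for row in df.select("url").collect()]
# download_all(urls, "/tmp/images")
# image_df = spark.read.format("image").option("dropInvalid", True).load("/tmp/images")
```

The index prefix is a design choice to keep file names unique; any other scheme that avoids collisions would work as well.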

Related

How does one read multiple DICOM and PNG files at once using pydicom.read_file() and cv2.imread()?

I am currently working on a fully convolutional network (FCN) for renal segmentation in MR images. I have 40 images and their ground truth labels, and I am attempting to load all of the images for pre-processing purposes.
Using Google Colab, with the latest versions of pydicom and pip installed, for this project. Currently have the Google Drive mounted to the Colab program and the code below shows the correct pathways to the images and their masks in the pydicom.read_file() and cv2.imread() calls, respectively.
However, when I use the "/../IMG*.dcm" or "/../IMG*.png" file paths (which should be legal?), I receive a "FileNotFoundError" as listed below. But, when I specify a specific .dcm or .png image, the pydicom.read_file() and cv2.imread() calls function quite normally.
Any suggestions on how to resolve this issue? I am struggling a lot with loading the data and pre-processing but have the model architecture ready to go once these preliminary hurdles are overcome.
#import data as data
import cv2
import pydicom
import numpy as np
images = pydicom.read_file("/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm")
numpyArray = images.pixel_array
masks = cv2.imread("/content/drive/My Drive/CHOAS_Kidney_Labels/Ground_Truth_Training/T1DUAL/IMG*.png")
-----> FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm'
pydicom.read_file does not support wildcards. You have to iterate over the files yourself, something like (untested):
import glob
import pydicom
pixel_data = []
paths = glob.glob("/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm")
for path in paths:
    dataset = pydicom.dcmread(path)
    pixel_data.append(dataset.pixel_array)
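The same applies to cv2.imread, which does not expand wildcards either (it returns None when it cannot read a path). For segmentation, each image also has to be matched with its mask. A minimal sketch of that pairing step follows, assuming (this is not confirmed by the question) that an image and its mask share the same file stem, for example IMG-0004.dcm and IMG-0004.png. The function name `paired_paths` is hypothetical.

```python
import glob
import os


def paired_paths(image_dir, mask_dir, image_ext=".dcm", mask_ext=".png"):
    """Return (image_path, mask_path) pairs whose file stems match."""
    # Index the masks by stem, e.g. "IMG-0004" -> ".../IMG-0004.png".
    masks = {
        os.path.splitext(os.path.basename(p))[0]: p
        for p in glob.glob(os.path.join(mask_dir, "IMG*" + mask_ext))
    }
    pairs = []
    for image_path in sorted(glob.glob(os.path.join(image_dir, "IMG*" + image_ext))):
        stem = os.path.splitext(os.path.basename(image_path))[0]
        if stem in masks:  # skip images without a matching mask
            pairs.append((image_path, masks[stem]))
    return pairs


# Usage (requires pydicom and opencv-python):
# for dcm_path, png_path in paired_paths(image_dir, mask_dir):
#     image = pydicom.dcmread(dcm_path).pixel_array
#     mask = cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
```

Sorting the image paths keeps the order stable between runs, because glob.glob does not guarantee any ordering.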

How to download a file (csv.gz) from a url using Python 3.7

As with others who have posted in the past, I cannot figure out how to download a csv.gz file from a URL in Python 3.7. I see posts, but their solutions only give me a 2kb file.
I am a 100% newbie using Python. What follows is the code for one file that I am trying to obtain. I can't even do that. The final goal would be to request all files that start with 2019* using python. Please try the code below to save the file. As others stated, the file is just a name without the true content - Ref: Downloading a csv.gz file from url in Python
import requests
url = 'https://public.bitmex.com/?prefix=data/trade/20191026.csv.gz'
r = requests.get(url, allow_redirects=True)
open('20191026.csv.gz', 'wb').write(r.content)
Yields:
Out[40]:
1245
I've tried "wget" and urllib.request along with "urlretrieve" also.
I wish I could add a screenshot or attach a file. The file created is 2kb and not even a csv.gz file. But the true file that I can download from a web browser is 78mb. The file is 20191026.csv.gz not that it matters as they all do the same thing. The location is https://public.bitmex.com/?prefix=data/trade/
Again, a way to obtain all the files using a filter such as 2019*csv.gz would be fantastic.
You are trying to download the files from https://public.bitmex.com/?prefix=data/trade/.
To achieve your final goal of downloading all the files starting with 2019, you have to proceed in 3 steps:
1) Read the content of https://public.bitmex.com/?prefix=data/trade/.
2) Convert the content into a list, and from that list filter the file names that start with 2019.
3) For each name in the resulting list, download the csv.gz file using the example you are referring to.
Hope this approach will help you
Happy coding.
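The three steps can be sketched roughly as follows. The 1245 bytes written in the question suggest that the `?prefix=` URL returned a small listing page rather than the archive itself, so the listing and the files probably live at different URLs. This is only a sketch under assumptions: `names_starting_with` and `download` are hypothetical helpers, LISTING_URL and FILE_BASE_URL are placeholders that have not been verified, and if the listing page is rendered by JavaScript the raw response may not contain the file names at all.

```python
import re
import shutil
import urllib.request


def names_starting_with(listing_text, prefix):
    """Extract unique YYYYMMDD.csv.gz names from listing text and keep those with the prefix."""
    names = set(re.findall(r"\d{8}\.csv\.gz", listing_text))
    return sorted(name for name in names if name.startswith(prefix))


def download(file_url, destination):
    """Stream one file to disk without holding it all in memory."""
    with urllib.request.urlopen(file_url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (both URLs are placeholders for illustration, not verified endpoints):
# listing_text = urllib.request.urlopen(LISTING_URL).read().decode("utf-8")
# for name in names_starting_with(listing_text, "2019"):
#     download(FILE_BASE_URL + name, name)
```

Streaming with shutil.copyfileobj is chosen here because the real files are reported to be around 78mb each.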

How to load a spark-nlp pre-trained model from disk

From the spark-nlp GitHub page I downloaded a .zip file containing a pre-trained NerCrfModel. The zip contains three folders: embeddings, fields, and metadata.
How do I load that into a Scala NerCrfModel so that I can use it? Do I have to drop it into HDFS or the host where I launch my Spark Shell? How do I reference it?
You just need to provide the path to the directory that contains the folders you mentioned:
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
val path = "path/to/unzipped/file/folder"
val model = NerCrfModel.read.load(path)
// use your model
model.setInputCols(someCol)
model.transform(yourData) // yourData must contain 'someCol'
As far as I remember, you can place the folder on the local FS or a distributed FS. Hope this helps other users as well!
best,
Alberto.
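If it is unclear which directory to pass as the path, one option is to unzip the archive and look for the directory that directly contains the three folders named in the question. The following Python sketch does only that file-system step and does not use spark-nlp. The function name `extract_model` is hypothetical, and the assumption is that the folders are called exactly embeddings, fields, and metadata.

```python
import os
import zipfile

# Folder names taken from the question.
EXPECTED_FOLDERS = {"embeddings", "fields", "metadata"}


def extract_model(zip_path, target_dir):
    """Unzip the archive and return the directory that holds the model folders."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    # Walk top-down so the shallowest matching directory is returned.
    for root, dirs, _files in os.walk(target_dir):
        if EXPECTED_FOLDERS.issubset(dirs):
            return root
    raise FileNotFoundError(
        f"no directory under {target_dir} contains {sorted(EXPECTED_FOLDERS)}"
    )
```

The returned directory is the value that would be passed as `path` in the Scala snippet above.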

Connecting cleansing components to tFileList - Talend

What is the best way to apply logic to objects during an iteration of tFileList?
The issue is that if I use a tFileList to get a list of files, I am not able to use tJavaRow or jMap to create the new filename for the file. Basically, if I have zip files named by year (2010, 2011, 2012, etc.) and each zip file contains files with the same names (f1.csv, f2.csv, f3.csv), I want to iterate through the compressed files, uncompress them, and rename the files to
f1_2010.csv, f2_2010.csv, f3_2010.csv ... f1_2012.csv, etc.
Thanks!
Iterate links provide a way to execute components based on events or facts, while main links transfer data between components.
With something like this you should be able to resolve your problem:
tFileList_1 --iterate--> tFileUnarchive_1
|
onComponentOK
|
tFileList_2 -- iterate --> tFileCopy_1
|
onComponentOK
|
tFileArchive_1
Use ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) in your tFileUnarchive to get the ZIP path.
In tFileCopy, use ((String)globalMap.get("tFileList_2_CURRENT_FILEPATH")) to get the path of the file, and configure the component to rename it.
For the name modification, you can add a tJava on the "onComponentOK" links and store a value with globalMap.put("year",((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).substring(x,x)) or more complicated code, then use these variables in your other components' parameters.
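The naming logic itself is plain string handling. As an illustration only, here it is as a Python sketch rather than Talend code (in the job, the equivalent would be written in Java inside the tJava component). It assumes the ZIP file name contains a four-digit year, and the helper names `year_from_zip` and `renamed` are hypothetical.

```python
import os
import re


def year_from_zip(zip_path):
    """Return the first four-digit year found in the archive's file name."""
    match = re.search(r"(19|20)\d{2}", os.path.basename(zip_path))
    if match is None:
        raise ValueError(f"no year found in {zip_path!r}")
    return match.group(0)


def renamed(file_name, year):
    """Insert the year before the extension, e.g. f1.csv becomes f1_2010.csv."""
    stem, ext = os.path.splitext(file_name)
    return f"{stem}_{year}{ext}"
```

Searching for the year with a pattern avoids hard-coding substring offsets, which break as soon as the directory or file name changes length.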

Apache Sling index page/directory listing

In order to gain a basic understanding of Apache Sling, I'm trying to build a simple blogging application using it.
I defined my own node type, blog:post, that is used for single posts.
This is the structure of the content repository so far:
/
|
|-content
|  |
|  |-blog
|     |
|     |-some-blogpost (jcr:primaryType=blog:post)
|     |-another-blogpost (jcr:primaryType=blog:post)
|-apps
   |
   |-blog (jcr:primaryType=sling:Folder)
      |
      |-post
         |
         |-html.jsp
I can refer to a specific blog post by opening http://example.com/blog/some-blogpost.html
Now suppose I wanted to have an overview of the most recent posts available at http://example.com/blog.
How do I have to name the necessary script and where do I have to put it?
Kind regards,
Markus
Rather than creating a separate JCR node type for each content type, I'd use the sling:resourceType property. So you could create a new blog post with jcr:primaryType=nt:unstructured and add the property sling:resourceType=blog/post to it.
Moving on to your question: you could create a new component /apps/blog/recentPosts (with a script like /apps/blog/recentPosts/html.jsp) and then set the sling:resourceType=blog/recentPosts property on the /content/blog node to tell Sling which script should be used to render this piece of content.