Python Web Scraping Scripts
#1

There are a number of well-known membership sites that let you upload documents and download PDFs for free, but many ebooks on them are available to premium members only. I found a number of good ebooks on one such membership site, and I had come across an article online that explained how to use Python to download ebooks from sites like it. I started a one-month free trial of the membership site. At first I couldn't work out how to download the ebooks, but after some googling and reading I put together the following script, which did the job and let me download a number of ebooks from that site. Below is the Python script:

from time import sleep
import os

from PIL import Image
import img2pdf
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Point Selenium at the Firefox geckodriver executable
browser = webdriver.Firefox(
   executable_path=r'D:\Shared\Python\geckodriver-v0.24.0-win64\geckodriver.exe')

url = "https://www.xxxxx.com/login"
browser.get(url)
browser.maximize_window()

# Log in to the site; the locators for the username/password fields
# are site-specific, so fill in the login form here for your own site
usr = 'xxxxxxxxxxxx'
pwd = 'xxxxxxxxxxx'
sleep(5)

# Scroll through the ebook and save a screenshot of every page
newpath = "D:/Shared/Downloads/Ebooks1/"
num_pages = 100  # set this to the number of pages in the ebook
body = browser.find_element_by_css_selector('body')
for count in range(1, num_pages + 1):
   browser.save_screenshot(
       newpath + "{}.jpg".format(str(count).zfill(3)))
   body.send_keys(Keys.PAGE_DOWN)
   print(count)
   sleep(2)

browser.quit()

# Crop each screenshot to the page area and drop the alpha channel:
# neither JPEG nor img2pdf accepts images with transparency
images = [newpath + i
          for i in os.listdir(newpath)
          if i.endswith(".jpg")]
for image in images:
   imageObject = Image.open(image)
   cropped = imageObject.crop((220, 65, 1700, 815))
   cropped.convert("RGB").save(image)

# Merge the cropped screenshots into a single PDF, sorted so the
# zero-padded filenames come out in page order
pdfPath = newpath + "ebook1.pdf"
with open(pdfPath, "wb") as f:
   f.write(img2pdf.convert(
       sorted(newpath + i
              for i in os.listdir(newpath)
              if i.endswith(".jpg"))))


In the above Python script, we first import Pillow, the maintained fork of PIL (the Python Imaging Library). We also need Selenium, and to drive Firefox with Selenium we have to install geckodriver. After taking screenshots of the web pages, I had to remove the alpha channel, which only stores transparency: when cropping and re-saving the screenshots I found the images had to be plain RGB, since neither JPEG nor img2pdf accepts an alpha channel. Finally, I converted the cropped images into a single PDF.
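
A side note: the executable_path argument used above is the Selenium 3 way of pointing at geckodriver. In Selenium 4 and later the driver path is passed through a Service object instead. Here is a minimal sketch, assuming Selenium 4 and the same geckodriver location as in the script above:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4+: the geckodriver location goes through a Service object
# (the path below is only an example; point it at your own geckodriver binary)
service = Service(r'D:\Shared\Python\geckodriver-v0.24.0-win64\geckodriver.exe')
browser = webdriver.Firefox(service=service)

browser.get("https://www.xxxxx.com/login")
print(browser.title)
browser.quit()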

Subscribe to My YouTube Channel:
https://www.youtube.com/channel/UCUE7VPo...F_BCoxFXIw

Join Our Million Dollar Trading Challenge:
https://www.doubledoji.com/million-dolla...challenge/