Highest scored 'web-scraping' questions

697 votes

21 answers

1.2m views

How to find elements by class

I'm having trouble parsing HTML elements with the "class" attribute using BeautifulSoup. The code looks like this: soup = BeautifulSoup(sdata) mydivs = soup.findAll('div') for div in mydivs: if (div["...

Neo

13.9k

asked Feb 18, 2011 at 11:58

378 votes

3 answers

84k views

Headless Browser and scraping - solutions [closed]

I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping. BROWSER TESTING / SCRAPING: Selenium - polyglot flagship in browser ...

Community wiki

44 revs, 16 users 51%
Inoperable

351 votes

26 answers

164k views

How do I prevent site scraping? [closed]

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches ...

pixel

3,659

asked Jul 1, 2010 at 20:49

315 votes

27 answers

590k views

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org [duplicate]

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def ...

Catherine4j

3,198

asked May 8, 2018 at 14:32

295 votes

17 answers

510k views

How can I scrape a page with dynamic content (created by JavaScript) in Python?

I'm trying to develop a simple web scraper. I want to extract plain text without HTML markup. My code works on plain (static) HTML, but not when content is generated by JavaScript embedded in the page....

mocopera

3,003

asked Nov 8, 2011 at 11:13

293 votes

7 answers

163k views

How can I pass variable into an evaluate function?

I'm trying to pass a variable into a page.evaluate() function in Puppeteer, but when I use the following very simplified example, the variable evalVar is undefined. I can't find any examples to build ...

Cat Burston

3,073

asked Sep 7, 2017 at 5:17

265 votes

6 answers

680k views

How can I get the Google cache age of any URL or web page? [closed]

In my project I need the Google cache age to be added as important information. I tried to search sources for the Google cache age, that is, the number of days since Google last re-indexed the page ...

Tokendra Kumar Sahu

3,534

asked Dec 30, 2010 at 6:06

211 votes

3 answers

221k views

How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation. Now, I want to separate both tasks. I want to use a light ...

Amit

35k

asked Jan 30, 2010 at 16:52

210 votes

18 answers

359k views

How to save an image locally using Python whose URL address I already know?

I know the URL of an image on the Internet, e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the Google logo. Now, how can I download this image using Python without ...

Pankaj Vatsa

2,671

asked Nov 27, 2011 at 14:46

200 votes

9 answers

411k views

How can I use Python's Requests to fake a browser visit, a.k.a. generate a User Agent? [duplicate]

I want to get the content from this website. If I use a browser like Firefox or Chrome, I could get the real website page I want, but if I use the Python Requests package (or wget command) to get it, ...

user1726366

2,436

asked Dec 26, 2014 at 3:29

196 votes

10 answers

226k views

Web scraping with Python [closed]

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What modules are used? Is there a tutorial available?

eozzy

69.2k

asked Jan 17, 2010 at 16:06

193 votes

16 answers

361k views

Retrieve links from a web page using Python and BeautifulSoup [closed]

How can I retrieve the links of a web page and copy the URL address of the links using Python?

NepUS

1,999

asked Jul 3, 2009 at 18:29
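
For the link-retrieval question above, a minimal sketch using only Python's standard-library `html.parser` instead of BeautifulSoup (a swapped-in technique, chosen so the example has no third-party dependency). The `LinkCollector` class, the `extract_links` helper, and the sample HTML are hypothetical names and data used for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all anchors in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample markup; a real page would be fetched first.
sample_html = (
    '<p><a href="https://example.com/a">A</a> '
    '<a href="/b">B</a> <a name="x">no href</a></p>'
)
print(extract_links(sample_html))
```

Relative links such as `/b` are returned as written; resolving them against the page URL (for example with `urllib.parse.urljoin`) would be a separate step.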

190 votes

14 answers

351k views

How do I avoid HTTP error 403 when web scraping with Python?

When I try this code to scrape a web page: #import requests import urllib.request from bs4 import BeautifulSoup #from urllib import urlopen import re webpage = urllib.request.urlopen('http://www....

Josh

3,331

asked May 18, 2013 at 17:47
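
A frequently suggested cause of a 403 in this situation is that the server rejects the default Python user agent. The excerpt above does not confirm this, so treat it as an assumption. Under that assumption, a minimal sketch of attaching a browser-like `User-Agent` header with `urllib.request.Request` might look like this; the `build_request` helper, the example URL, and the header value are illustrative, not taken from the question.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# example.com is a placeholder URL, not the site from the question.
req = build_request("https://example.com/")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually fetch the page, pass the request to urlopen:
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```

Whether this resolves a given 403 depends on the site; some servers check other headers or block automated access entirely.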

167 votes

10 answers

345k views

Can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape a URL and I had the following code to find the td tag whose class is 'empformbody': import urllib import urllib2 from BeautifulSoup import BeautifulSoup url = &...

Shiva Krishna Bavandla

26.9k

asked Jul 13, 2012 at 6:55

163 votes

11 answers

196k views

How to scrape only visible webpage text with BeautifulSoup?

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even ...

user233864

1,847

asked Dec 20, 2009 at 17:55
