51,777 questions
697 votes, 21 answers, 1.2m views
How to find elements by class
I'm having trouble parsing HTML elements with a "class" attribute using BeautifulSoup. The code looks like this:
soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    if (div["...
378 votes, 3 answers, 84k views
Headless Browser and scraping - solutions [closed]
I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping.
BROWSER TESTING / SCRAPING:
Selenium - polyglot flagship in browser ...
351 votes, 26 answers, 164k views
How do I prevent site scraping? [closed]
I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches ...
315 votes, 27 answers, 590k views
Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org [duplicate]
I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def ...
295 votes, 17 answers, 510k views
How can I scrape a page with dynamic content (created by JavaScript) in Python?
I'm trying to develop a simple web scraper. I want to extract plain text without HTML markup. My code works on plain (static) HTML, but not when content is generated by JavaScript embedded in the page....
293 votes, 7 answers, 163k views
How can I pass variable into an evaluate function?
I'm trying to pass a variable into a page.evaluate() function in Puppeteer, but when I use the following very simplified example, the variable evalVar is undefined.
I can't find any examples to build ...
265 votes, 6 answers, 680k views
How can I get the Google cache age of any URL or web page? [closed]
In my project I need the Google cache age to be added as important information. I tried to search sources for the Google cache age, that is, the number of days since Google last re-indexed the page ...
211 votes, 3 answers, 221k views
How can I efficiently parse HTML with Java?
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both tasks.
I want to use a light ...
210 votes, 18 answers, 359k views
How to save an image locally using Python whose URL address I already know?
I know the URL of an image on the Internet.
e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google.
Now, how can I download this image using Python without ...
200 votes, 9 answers, 411k views
How can I use Python's Requests to fake a browser visit, a.k.a. generate a User-Agent? [duplicate]
I want to get the content from this website.
If I use a browser like Firefox or Chrome, I can get the real web page I want, but if I use the Python Requests package (or the wget command) to get it, ...
196 votes, 10 answers, 226k views
Web scraping with Python [closed]
I'd like to grab daily sunrise/sunset times from a website. Is it possible to scrape web content with Python? What modules are used? Is there a tutorial available?
193 votes, 16 answers, 361k views
Retrieve links from a web page using Python and BeautifulSoup [closed]
How can I retrieve the links on a web page and copy their URLs using Python?
190 votes, 14 answers, 351k views
How do I avoid HTTP error 403 when web scraping with Python?
When I try this code to scrape a web page:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www....
167 votes, 10 answers, 345k views
Can we use XPath with BeautifulSoup?
I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = &...
163 votes, 11 answers, 196k views
How to scrape only visible webpage text with BeautifulSoup?
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even ...