dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/1111115.0 (X11; Linux x86_64) AppleWebKit/111111153 "
    "(KHTML, like Gecko) test Chrome/15.0.871111111111111"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
print(driver.page_source)
When I ran it, the error message "The owner of this website (egg.5ch.net) has banned your access based on your browser's signature (3c2034d893226ea5-ua31)." was printed, and I could not access the content.
However, the following error was output: "SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?"
(pyenv) [test@SAKURA_VPS scraping]$ pip3.6 install beautifulsoup
Collecting beautifulsoup
Using cached BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-y_sslt56/beautifulsoup/setup.py", line 22
    print "Unit tests have failed!"
                                  ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-y_sslt56/beautifulsoup/
(pyenv) [test@SAKURA_VPS scraping]$ sudo pip3.6 install beautifulsoup
Collecting beautifulsoup
Using cached BeautifulSoup-3.2.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-ed58db15/beautifulsoup/setup.py", line 22
    print "Unit tests have failed!"
                                  ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-ed58db15/beautifulsoup/
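The failure above comes from the package's setup.py using the Python 2 print statement, which Python 3 rejects. The `beautifulsoup` package on PyPI is the old BeautifulSoup 3 series; the Python 3 compatible release is published as `beautifulsoup4` and imported as `bs4`. As a small illustration, the following sketch reproduces the same kind of SyntaxError without running pip. The helper name `is_valid_python3` is made up for this example.

```python
def is_valid_python3(source):
    """Return True if the source text compiles under the running Python 3 interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        # Python 2 style statements such as `print "..."` end up here.
        return False


# The line from setup.py that pip tripped over (Python 2 print statement).
python2_line = 'print "Unit tests have failed!"'
# The Python 3 form of the same line (print is a function).
python3_line = 'print("Unit tests have failed!")'
```

Checking both strings with `is_valid_python3` shows that only the second form is accepted, which is why installing the Python 2 era package fails under pip3.6.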
(pyenv) [test@SAKURA_VPS scraping]$ python beautifulsoup.py
Traceback (most recent call last):
  File "beautifulsoup.py", line 5, in <module>
    html = urlopen("http://egg.5ch.net/test/read.cgi/bizplus/1511266895/")
  File "/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib64/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
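The traceback ends in `urllib.error.HTTPError` because `urlopen` raises an exception when the server answers with an error status such as 403. A minimal sketch of catching it, so the script reports the status instead of crashing, might look like this. `fetch_status` and its `opener` parameter are hypothetical names introduced here (the parameter only exists so the function can be tried without network access).

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status, reason) for url without raising on HTTP error statuses."""
    try:
        with opener(url) as res:
            return res.status, res.reason
    except urllib.error.HTTPError as err:
        # The server did answer, but with an error status such as 403 Forbidden.
        return err.code, err.reason
```

Calling `fetch_status("http://egg.5ch.net/test/read.cgi/bizplus/1511266895/")` would be expected to return the 403 status seen above rather than a traceback, as long as the site still rejects the default urllib User-Agent.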
[BeautifulSoup] Program with a User Agent
As shown below, the program sets request headers such as the User-Agent and the Referer in detail.
from urllib.request import urlopen
import urllib.request
import requests
from bs4 import BeautifulSoup
import time
url = 'https://www.yahoo.co.jp'
# Headers can be specified in fine detail, as below.
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    # The standard HTTP header name is spelled "Referer", not "Referrer".
    "Referer": "https://www.google.co.jp",
    "Accept-Language": "ja;q=1.0",
}
req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)
time.sleep(3)
data = res.read()
time.sleep(3)
decoded_data = data.decode('utf-8')
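To check that the headers are really attached before sending anything, and to avoid a decode error on pages that are not UTF-8, a sketch along these lines may help. `build_request` and `decode_body` are hypothetical helper names for this example, and the fallback-to-UTF-8 behaviour is an assumption of the sketch, not something the original program does.

```python
import urllib.request


def build_request(url, user_agent, referer=None):
    """Build a urllib Request carrying browser-like headers (no network access)."""
    headers = {"User-Agent": user_agent, "Accept-Language": "ja;q=1.0"}
    if referer:
        # The standard header name is spelled "Referer".
        headers["Referer"] = referer
    return urllib.request.Request(url=url, headers=headers)


def decode_body(data, declared_charset=None):
    """Decode response bytes with the declared charset, falling back to UTF-8."""
    try:
        return data.decode(declared_charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: keep going with replacements.
        return data.decode("utf-8", errors="replace")
```

Note that urllib stores header names capitalized, so the User-Agent is read back with `req.get_header("User-agent")`. With a real response object, the declared charset can usually be obtained from `res.headers.get_content_charset()` and passed to `decode_body`.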
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
url = 'https://yahoo.co.jp/'
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25")
driver = webdriver.PhantomJS(desired_capabilities=dcap)
time.sleep(3)
driver.get(url)