[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 3]

November 23, 2017 / February 11, 2019

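This post continues the web scraping series.

I have worked in the field as an engineer for many years, but even within HTTP alone there are still many areas I know little about, and I keep being reminded how deep the subject is.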

Other posts in this series:

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 1]
https://go-journey.club/archives/6643

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 2]
https://go-journey.club/archives/6659

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 3] (this post)
https://go-journey.club/archives/6681

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (find operations) [Part 4]
https://go-journey.club/archives/6696

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (taking the URL as an argument) [Part 5]
https://go-journey.club/archives/6706

Access blocked: "banned your access based on your browser's signature"

As before, I use the virtual environment created in the posts above.

Switch to the virtual environment (activate it).

[test@SAKURA_VPS pyenv]$ source pyenv/bin/activate
(pyenv) [test@SAKURA_VPS pyenv]$   ← the virtual environment is now active

To exit the virtual environment:

(pyenv) [test@SAKURA_VPS pyenv]$ deactivate
[test@SAKURA_VPS pyenv]$
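
The prompt prefix is the usual sign that the environment is active, but a script can also check for itself. The following is a minimal sketch, assuming the environment was created with venv or virtualenv; the function name in_virtualenv is my own and does not come from the original setup.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # venv (Python 3.3+) keeps the original installation in sys.base_prefix,
    # while older virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running this before and after `source pyenv/bin/activate` should show the value change, provided `python` resolves to the interpreter inside the environment.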

I tried the following program.

It accesses 5channel (it seems 2channel was recently renamed to 5channel).

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# url = 'https://yahoo.co.jp/'
# Try accessing 5channel.
url = 'http://egg.5ch.net/test/read.cgi/bizplus/1511266895/'

# Start from PhantomJS's default capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/1111115.0 (X11; Linux x86_64) AppleWebKit/111111153 "
    "(KHTML, like Gecko) test Chrome/15.0.871111111111111"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)

# Print the HTML of the loaded page, then shut the browser down.
print(driver.page_source)
driver.quit()

The result: the error message "The owner of this website (egg.5ch.net) has banned your access based on your browser's signature (3c2034d893226ea5-ua31)." was output, and I could not access the content.

(pyenv) [test@SAKURA_VPS scraping]$ python test_selenium.py
<html class="js" lang="en-US" style="visibility: visible; opacity: 1;">

Access denied | egg.5ch.net used Cloudflare to restrict access
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">
<meta name="robots" content="noindex, nofollow">
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1">
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection">

(omitted)

<h2 class="cf-subheadline" data-translate="error_desc">Access denied</h2>

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?

The owner of this website (egg.5ch.net) has banned your access based on your browser's signature (3c2034d893226ea5-ua31).

    <span class="cf-footer-item">Cloudflare Ray ID: 3c2034d893226ea5
    <span class="cf-footer-separator">~
    <span class="cf-footer-item"><span data-translate="your_ip">Your IP: 160.16.217.115
    <span class="cf-footer-separator">~
    <span class="cf-footer-item"><span data-translate="performance_security_by">Performance & security by <a data-orig-proto="https" data-orig-ref="www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank" href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer">Cloudflare

(pyenv) [test@SAKURA_VPS scraping]$
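
Because the block page comes back as ordinary HTML, a script that only prints page_source will not notice that it was refused. The following is a rough sketch of a check based on phrases visible in the output above. The function name and the choice of marker strings are my own assumptions, and Cloudflare may change the wording of this page at any time.

```python
def looks_like_cloudflare_block(page_source):
    """Heuristically detect the Cloudflare "Access denied" page."""
    text = page_source.lower()
    # All three phrases appear in the block page shown in the output above.
    markers = ("access denied", "cloudflare", "banned your access")
    return all(marker in text for marker in markers)


# Shortened sample assembled from the output above.
SAMPLE_BLOCK_PAGE = (
    '<h2 class="cf-subheadline">Access denied</h2>'
    "The owner of this website (egg.5ch.net) has banned your access "
    "based on your browser's signature (3c2034d893226ea5-ua31)."
    '<span class="cf-footer-item">Cloudflare Ray ID: 3c2034d893226ea5'
)

if __name__ == "__main__":
    print(looks_like_cloudflare_block(SAMPLE_BLOCK_PAGE))
```

With Selenium, the same function could be called as `looks_like_cloudflare_block(driver.page_source)` right after `driver.get(url)`.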

Looking into this, I found the following prohibitions.

https://developer.5ch.net/

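Access by web scraping is prohibited.

According to that page, accessing the site without going through the API violates laws such as Japan's Act on Prohibition of Unauthorized Computer Access.

So rather than digging deeper into the technical side, it seems best to simply accept that web scraping is prohibited.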
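
A related habit is to check a site's robots.txt before scraping it. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs, and the user agent name "my-scraper" are all hypothetical and are not 5channel's actual rules. A permissive robots.txt also does not override a site's terms of use, which still have to be read separately.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /test/
"""

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_ROBOTS, "my-scraper", "http://example.com/test/read.cgi/x/1/"))
    print(is_allowed(EXAMPLE_ROBOTS, "my-scraper", "http://example.com/index.html"))
```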

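What is Cloudflare?

Cloudflare is the name of a company.

Like Akamai, it provides a CDN (content delivery network) service.

It sits between users and a website, serving content quickly on behalf of the web server, and it offers various services such as fine-grained reverse proxy settings.

It is a U.S. company.

Cloudflare was blocking the web scraping

In other words, Cloudflare sat between ordinary users and 5channel's web servers and rejected the program doing the web scraping.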
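
One way to tell whether a response passed through Cloudflare is to look at its headers: responses served via Cloudflare normally carry a "Server: cloudflare" header and a "CF-RAY" header (the Ray ID also appears in the error page footer). The following is a small sketch, where the function name is my own and the sample header values are made up for illustration.

```python
def served_via_cloudflare(headers):
    """Return True if the response headers carry Cloudflare's usual markers."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    server = normalized.get("server", "").lower()
    return "cloudflare" in server or "cf-ray" in normalized


if __name__ == "__main__":
    print(served_via_cloudflare({"Server": "cloudflare"}))
    print(served_via_cloudflare({"Server": "nginx"}))
```

With urllib, the headers of a response are available as `res.headers`, and for a failed request as the `headers` attribute of the caught HTTPError. Both support `.items()`, so they can be passed to this function directly.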

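With more detailed investigation, the scraping program could probably get through by spoofing its user agent, but that is prohibited.

As for API access, my guess is that someone developing a dedicated 5channel browser obtains permission from Loki Technology, Inc., receives a browser ID, and only browsers configured with that ID can access the API.

It makes sense: without some kind of block like this, scraping programs from all over Japan would probably come hammering the site.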

What happens with BeautifulSoup instead of Selenium WebDriver?

Alongside Selenium, BeautifulSoup is a well-known tool for web scraping in Python.

I checked what happens with BeautifulSoup.

First, install BeautifulSoup in the current environment.

However, the error "SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?" was output.

(pyenv) [test@SAKURA_VPS scraping]$ pip3.6 install beautifulsoup
Collecting beautifulsoup
  Using cached BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-y_sslt56/beautifulsoup/setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-y_sslt56/beautifulsoup/
(pyenv) [test@SAKURA_VPS scraping]$ sudo pip3.6 install beautifulsoup
Collecting beautifulsoup
  Using cached BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-ed58db15/beautifulsoup/setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-ed58db15/beautifulsoup/

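The beautifulsoup package is for Python 2, so the Python 3 compatible beautifulsoup4 had to be installed

After some investigation, it turned out that the beautifulsoup package (BeautifulSoup 3) targets Python 2, so for Python 3 I needed to install beautifulsoup4 instead.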
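
The install failed because setup.py in the old package uses the Python 2 print statement, which is a syntax error in Python 3. This can be reproduced without pip by compiling the offending line, as in the sketch below. The helper name is_valid_python3 is my own.

```python
def is_valid_python3(source):
    """Return True if source compiles under the running Python 3 interpreter."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# The line from setup.py that pip tripped over, and its Python 3 form.
PY2_LINE = 'print "Unit tests have failed!"'
PY3_LINE = 'print("Unit tests have failed!")'

if __name__ == "__main__":
    print(is_valid_python3(PY2_LINE))
    print(is_valid_python3(PY3_LINE))
```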

(pyenv) [test@SAKURA_VPS scraping]$ pip3.6 install beautifulsoup4   ← install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
100% |████████████████████████████████| 92kB 2.3MB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0
(pyenv) [test@SAKURA_VPS scraping]$ python
Python 3.6.3 (default, Oct 11 2017, 18:17:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup   ← BeautifulSoup imported successfully
>>> exit()
(pyenv) [test@SAKURA_VPS scraping]$

A simple scraping example with BeautifulSoup

Knowing that 5channel will block it, I check what error message is output.

Since web scraping of 5channel is prohibited, this is the last attempt.

(pyenv) [test@SAKURA_VPS scraping]$ vi beautifulsoup.py
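from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://egg.5ch.net/test/read.cgi/bizplus/1511266895/")
# Name the parser explicitly so bs4 does not have to guess one.
bsObj = BeautifulSoup(html.read(), "html.parser")

# Print the first <h1> element of the page.
print(bsObj.h1)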

Run it.

This time "urllib.error.HTTPError: HTTP Error 403: Forbidden" was output.

(pyenv) [test@SAKURA_VPS scraping]$ python beautifulsoup.py
Traceback (most recent call last):
  File "beautifulsoup.py", line 5, in <module>
    html = urlopen("http://egg.5ch.net/test/read.cgi/bizplus/1511266895/")
  File "/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib64/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

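[BeautifulSoup] Program with a user agent

The following program specifies request headers such as the user agent and referer in detail. This time the target is https://www.yahoo.co.jp.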

import time
import urllib.request

# Only urllib.request is needed to send the request; the decoded HTML
# can then be passed to BeautifulSoup for parsing.
url = 'https://www.yahoo.co.jp'

# Request headers can be specified in detail like this.
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    # The HTTP header name is spelled "Referer" (single "r" in the middle).
    "Referer": "https://www.google.co.jp",
    "Accept-Language": "ja;q=1.0",
}

req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)
time.sleep(3)

data = res.read()
time.sleep(3)
decoded_data = data.decode('utf-8')

time.sleep(3)
print(decoded_data)

The result is as follows.

(pyenv) [test@SAKURA_VPS scraping]$ python beautifulsoup.py
<html lang="ja" class="is-iOS is-ltIOS7 is-vtestIdMtop114">

<div class="Header" id="headerBody">
  <div class="FlexBox FlexBox--middle js-Header__body">
    <h1 class="FlexBox__item Header__logo">
      <div data-react="HeaderYlogo">
        <a href="https://m.yahoo.co.jp/" data-ylk="rsec:header;slk:logo;pos:1">
          <div class="Icon Icon--yahooJapan" aria-hidden="true">
          <span class="util-displayHiddenVisually">Yahoo! JAPAN
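
Before sending anything, it can help to confirm which headers a Request object will actually carry. The sketch below builds a request and inspects it offline. The helper name build_request and the user agent string "ExampleScraper/0.1" are hypothetical. Note that urllib stores header names in capitalized form, so "User-Agent" is looked up as "User-agent", and that the standard HTTP header is spelled "Referer".

```python
import urllib.request


def build_request(url, user_agent, referer):
    """Build a Request carrying custom headers without sending it."""
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept-Language": "ja;q=1.0",
    }
    return urllib.request.Request(url=url, headers=headers)


req = build_request(
    "https://www.yahoo.co.jp",
    "ExampleScraper/0.1",          # hypothetical user agent string
    "https://www.google.co.jp",
)

if __name__ == "__main__":
    # header_items() lists every header the request will carry.
    for name, value in req.header_items():
        print(name, ":", value)
```

Passing the resulting object to `urllib.request.urlopen(req)` sends the request with those headers.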

[Selenium] Program with a user agent

Next is a program with a user agent using Selenium WebDriver.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

url = 'https://yahoo.co.jp/'

# Start from PhantomJS's default capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25")
driver = webdriver.PhantomJS(desired_capabilities=dcap)
time.sleep(3)
driver.get(url)

time.sleep(3)
# Print the page HTML and the final URL after any redirects.
print(driver.page_source)
print(driver.current_url)
driver.quit()

The result is as follows.

(pyenv) [test@SAKURA_VPS scraping]$ python test_selenium.py
<html lang="ja" class="is-iOS is-ltIOS7 is-vtestIdMtop114">
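
PhantomJS is no longer maintained, and newer Selenium releases have dropped webdriver.PhantomJS, so the program above only runs on older Selenium versions. As a rough sketch of the same idea with headless Chrome, assuming a recent Selenium release plus Chrome and a matching driver are installed, the user agent can be passed as a Chrome command-line switch. The function names below are my own.

```python
def headless_chrome_arguments(user_agent):
    """Return the Chrome switches for headless mode with a custom user agent."""
    return ["--headless", "--user-agent=" + user_agent]


def make_headless_chrome(user_agent):
    """Start headless Chrome with the given user agent (requires Selenium and Chrome)."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    print(headless_chrome_arguments("ExampleScraper/0.1"))
```

The driver returned by `make_headless_chrome(...)` can then be used the same way as above: `driver.get(url)`, `driver.page_source`, and finally `driver.quit()`.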