[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 2]

November 23, 2017 / February 11, 2019

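This is the second installment on scraping with Python 3.6, Selenium WebDriver, and a headless browser.

I normally work on site as an infrastructure engineer, but to take the next step I want to steadily build up my programming skills as well.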

Articles in this series (AWSインフラ研究所):

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 1]
https://go-journey.club/archives/6643

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 2] (this article)
https://go-journey.club/archives/6659

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 3]
https://go-journey.club/archives/6681

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (find operations) [Part 4]
https://go-journey.club/archives/6696

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (receiving the URL as an argument) [Part 5]
https://go-journey.club/archives/6706

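Switching to the virtual environment

Switch to the virtual environment (that is, activate it).

[test@SAKURA_VPS pyenv]$ source pyenv/bin/activate
(pyenv) [test@SAKURA_VPS pyenv]$   ← Now inside the virtual environment.

That said, I did wonder whether switching to a virtual environment is really necessary.

After all, Python 3.6 is already installed separately on this machine.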
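
When it is unclear which interpreter a script is using, a quick check from Python itself can help. The following is a minimal sketch, not part of the original setup; the helper name in_virtualenv is made up for illustration. It relies on the fact that inside a virtual environment sys.prefix differs from the prefix of the base installation.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older releases of the virtualenv tool set sys.real_prefix; the venv
    # module (and newer virtualenv) expose sys.base_prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


if __name__ == "__main__":
    print("interpreter :", sys.executable)
    print("version     :", sys.version.split()[0])
    print("virtual env :", in_virtualenv())
```

Running this before and after "source pyenv/bin/activate" shows whether the activated environment or the system-wide Python 3.6 is the one actually in use.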

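The simplest program

This is the simplest possible program.

Whenever errors pile up and nothing seems to work, I come back to this as a baseline.

Note, however, that while Yahoo renders fine, many other sites reject the request.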

from selenium import webdriver

# Load the page in PhantomJS (a headless browser) and print the resulting HTML
url = 'https://yahoo.co.jp/'
driver = webdriver.PhantomJS()
driver.get(url)

print(driver.page_source)

Sample run:

(pyenv) [test@SAKURA_VPS scraping]$ python test_selenium.py
<html lang="ja">

Setting the User Agent in Selenium

Some sites reject requests that do not carry a User Agent.

So we set one explicitly.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Access the main page of Japanese Wikipedia
url = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8'

# Copy the PhantomJS capability template and set a custom User Agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
print(driver.page_source)

Output:

(pyenv) [test@SAKURA_VPS scraping]$ python test_selenium.py
<!DOCTYPE html><html class="client-js ve-not-available" lang="ja" dir="ltr"><head>
<meta charset="UTF-8">
<title>Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"メインページ","wgTitle":"メインページ","wgCurRevisionId":65264604,"wgRevisionId":65264604,"wgArticleId":253348,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"ja","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"ja","wgMonthNames":["","1月","2月","3月","4月","5月","6月","7月","8月","9月","10月","11月","12月"],"wgMonthNamesShort":["","1月","2月","3月","4月","5月","6月","7月","8月","9月","10月","11月","12月"],"wgRelevantPageName":"メインページ","wgRelevantArticleId":253348,"wgRequestId":"WhW0ZApAAEEAAFSiTB0AAACT","wgIsProbablyEditable":false,"wgRelevantPageIsProbablyEditable":false,"wgRestrictionEdit":["sysop"],"wgRestrictionMove":["sysop"],"wgIsMainPage":true,"wgWikiEditorEnabledModules":{"toolbar":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgPopupsShouldSendModuleToUser":true,"wgPopupsConflictsWithNavPopupGadget":false,"wgVisualEditor":{"pageLanguageCode":"ja","pageLanguageDir":"ltr","pageVariantFallbacks":"ja","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"ja","wgMFExpandAllSectionsUserOption":false,"wgMFDisplayWikibaseDescriptions":{"search":true,"nearby":true,"watchlist":true,"tagline":true},"wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSCurrentAutonym":"日本語","wgNoticeProject":"wikipedia","wgCentralNoticeCookiesToDelete":[],"wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgWikibaseItemId":"Q5296","wgCentralAuthMobileDomain":false,"wgCodeMirrorEnabled":false,"wgVisualEditorToolbarScrollOffset":0,"wgVisualEditorUnsupportedEditParams":["undo","undoafter","veswitched"],"wgEditSubmitButtonLabelPublish":true});mw.loader.state({"ext.globalCssJs.user.styles":"ready","ext.globalCssJs.site.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","user":"ready","user.options":"loading","user.tokens":"loading","ext.cite.styles":"ready","ext.categoryTree.css":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","mediawiki.sectionAnchor":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready","ext.globalCssJs.user":"ready","ext.globalCssJs.site":"ready"});mw.loader.implement("user.options@0sbylvi",function($,jQuery,require,module){mw.user.options.set({"variant":"ja"});});mw.loader.implement("user.tokens@1dqfd7l",function ( $, jQuery, require, module ) {

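Verifying that requests really use the configured User Agent

One thing I wondered was whether the site is really being accessed with this User Agent in the first place.

To find out, capture packets with the tcpdump command while the scraper runs, then analyze their contents.

Switch to the root account (tcpdump cannot capture packets as a regular user) and capture only port 80, that is, unencrypted HTTP traffic.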

[root@SAKURA_VPS ~]# tcpdump port 80 -i eth0 -w test111.cap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C4100 packets captured
6202 packets received by filter
2102 packets dropped by kernel
[root@SAKURA_VPS ~]#

Copy the capture file "test111.cap" to a local PC and analyze it with Wireshark.

The capture does contain the string "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87", so the requests appear to be sent with the specified User Agent.

To be sure, change the User Agent string and try again.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

url = 'http://www.metro.tokyo.jp/'

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/1111115.0 (X11; Linux x86_64) AppleWebKit/111111153 "  # changed
    "(KHTML, like Gecko) test Chrome/15.0.871111111111111"          # changed
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
print(driver.page_source)

Run the program again while tcpdump is capturing.

[root@SAKURA_VPS ~]# tcpdump port 80 -i eth0 -w test11111.cap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C3937 packets captured
6128 packets received by filter
2191 packets dropped by kernel
[root@SAKURA_VPS ~]#

Looking inside the packets, the User Agent has indeed changed to the new string.
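
The same question (which User Agent actually reaches the server?) can also be checked without a packet capture, by pointing the client at a tiny local server that echoes the header back. The sketch below is an illustration under assumptions, not the method used in this article: the names EchoUserAgentHandler and fetch_reported_user_agent are made up, and urllib stands in for the browser so that the example runs with the standard library alone.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgentHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the User-Agent header the client sent."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def fetch_reported_user_agent(user_agent):
    """Start a throwaway local server, request it once, return what it saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgentHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # Bypass any proxy configured in the environment for this local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"
    print(fetch_reported_user_agent(ua))
```

With Selenium, the idea would be to keep such a server running, call driver.get() on its URL, and look for the configured string in driver.page_source.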

What is DesiredCapabilities?

The examples above use DesiredCapabilities to specify the User Agent, but it can configure many other settings as well, such as:

Browser type (Firefox, Google Chrome, etc.)
Browser options
Browser version
User Agent
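
The scripts above call dict(DesiredCapabilities.PHANTOMJS) before setting the User Agent. A plausible reason is that the template is shared, so copying it keeps the change local to one driver. The sketch below illustrates that pattern with a plain dictionary; PHANTOMJS_TEMPLATE and its contents are assumptions standing in for the real template, and with_user_agent is a hypothetical helper.

```python
# Hypothetical stand-in for a capability template such as
# DesiredCapabilities.PHANTOMJS; the keys and values here are assumptions.
PHANTOMJS_TEMPLATE = {
    "browserName": "phantomjs",
    "version": "",
    "platform": "ANY",
    "javascriptEnabled": True,
}

UA_KEY = "phantomjs.page.settings.userAgent"


def with_user_agent(template, user_agent):
    """Return a copy of the template with a PhantomJS User Agent setting added."""
    caps = dict(template)  # shallow copy, so the shared template stays untouched
    caps[UA_KEY] = user_agent
    return caps


if __name__ == "__main__":
    caps = with_user_agent(PHANTOMJS_TEMPLATE, "ExampleAgent/1.0")
    print("copy has UA key     :", UA_KEY in caps)
    print("template has UA key :", UA_KEY in PHANTOMJS_TEMPLATE)
```

The resulting dictionary is what would be passed as desired_capabilities, as in the scripts above.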

Python dictionaries

Enclosed in {}
Made up of key/value pairs
A key and its value are separated by a colon (:)
Key/value pairs are separated from each other by commas (,)

Sample program

(pyenv) [test@SAKURA_VPS scraping]$ vi test_dict.py

test = {'NHK': 1, '日テレ': 4, 'テレビ朝日': 5, 'TBS': 6, 'テレビ東京': 7, 'フジテレビ': 8}

print(test)
print(test['NHK'])
print(test['日テレ'])
print(test['テレビ朝日'])
print(test['テレビ東京'])

Run the program.

(pyenv) [test@SAKURA_VPS scraping]$ python test_dict.py
{'NHK': 1, '日テレ': 4, 'テレビ朝日': 5, 'TBS': 6, 'テレビ東京': 7, 'フジテレビ': 8}
1
4
5
7
(pyenv) [test@SAKURA_VPS scraping]$
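
Building on the sample above, here is a short sketch of a few other everyday dictionary operations: lookups that tolerate a missing key, membership tests, adding an entry, and iteration. The variable name channels and the helper channel_of are made up for this illustration.

```python
channels = {'NHK': 1, '日テレ': 4, 'テレビ朝日': 5, 'TBS': 6, 'テレビ東京': 7, 'フジテレビ': 8}


def channel_of(station, default=None):
    """Look up a station; return default instead of raising KeyError."""
    return channels.get(station, default)


if __name__ == "__main__":
    print(channel_of('TBS'))            # key exists
    print(channel_of('unknown'))        # key missing, falls back to the default
    print('NHK' in channels)            # membership test on keys

    channels['NHK Eテレ'] = 2            # add (or overwrite) an entry
    for station, number in channels.items():
        print(station, number)
```

Indexing with square brackets, as in test['NHK'], raises KeyError for a missing key, so get() is the safer choice when the key may be absent.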

The "No such file or directory: 'geckodriver': 'geckodriver'" error when using Firefox

If geckodriver is not installed, Selenium fails with "FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'", as shown below.

(pyenv) [test@SAKURA_VPS scraping]$ python test_selenium.py
Traceback (most recent call last):
  File "/home/test/pyenv/lib64/python3.6/site-packages/selenium/webdriver/common/service.py", line 74, in start
    stdout=self.log_file, stderr=self.log_file)
  File "/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_selenium.py", line 4, in <module>
    browser = Firefox()
  File "/home/test/pyenv/lib64/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 144, in __init__
    self.service.start()
  File "/home/test/pyenv/lib64/python3.6/site-packages/selenium/webdriver/common/service.py", line 81, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

(pyenv) [test@SAKURA_VPS scraping]$

Selenium needs a browser-specific driver that acts as its interface to the browser.

Firefox, for example, requires the "geckodriver" driver.

[Example]

For Firefox

https://github.com/mozilla/geckodriver/releases

Download "geckodriver-v0.19.1-linux64.tar.gz".

(pyenv) [test@SAKURA_VPS ~]$ wget https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz
--2017-11-23 00:34:58-- https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz
Resolving github.com (github.com)... 192.30.255.112, 192.30.255.113
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/25354393/e31e4c22-be6f-11e7-9bc7-dedc3490a7fd?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20171122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20171122T153458Z&X-Amz-Expires=300&X-Amz-Signature=baa732b1a15f4a37be0641185287f13969c8c56d2b18d4ea476a7a185339f6a9&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dgeckodriver-v0.19.1-linux64.tar.gz&response-content-type=application%2Foctet-stream [following]
--2017-11-23 00:34:58-- https://github-production-release-asset-2e65be.s3.amazonaws.com/25354393/e31e4c22-be6f-11e7-9bc7-dedc3490a7fd?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20171122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20171122T153458Z&X-Amz-Expires=300&X-Amz-Signature=baa732b1a15f4a37be0641185287f13969c8c56d2b18d4ea476a7a185339f6a9&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dgeckodriver-v0.19.1-linux64.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 54.231.49.72
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|54.231.49.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2301226 (2.2M) [application/octet-stream]
Saving to: 'geckodriver-v0.19.1-linux64.tar.gz'

100%[==============================================================>] 2,301,226 1.24MB/s in 1.8s

2017-11-23 00:35:01 (1.24 MB/s) - 'geckodriver-v0.19.1-linux64.tar.gz' saved [2301226/2301226]

(pyenv) [test@SAKURA_VPS ~]$

Extract the downloaded "geckodriver-v0.19.1-linux64.tar.gz" and copy the binary to "/usr/local/bin".

(pyenv) [test@SAKURA_VPS ~]$ tar xvfz geckodriver-v0.19.1-linux64.tar.gz
geckodriver
(pyenv) [test@SAKURA_VPS ~]$ sudo cp -ip geckodriver /usr/local/bin/
(pyenv) [test@SAKURA_VPS ~]$ ls -l /usr/local/bin/geckodriver
-rwxrwxr-x 1 test test 7194178 Nov  1 04:15 /usr/local/bin/geckodriver
(pyenv) [test@SAKURA_VPS ~]$
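
Since the error message says the executable "needs to be in PATH", it can be handy to check that from Python before starting the browser. The following is a minimal sketch using the standard library's shutil.which; the helper name find_driver is made up for this illustration.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("geckodriver is not on PATH; copy it to a PATH directory such as /usr/local/bin")
    else:
        print("geckodriver found at", path)
```

If this prints a path under /usr/local/bin after the copy above, Selenium should be able to locate the driver as well, provided it runs with the same PATH.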

Killing leftover firefox processes in bulk

After several rounds of debugging, firefox processes that never exited pile up, as shown below.

(pyenv) [test@SAKURA_VPS scraping]$ ps -ef | grep firefox
test      4357     1  0 00:41 pts/2    00:01:13 /usr/lib64/firefox/firefox -marionette -profile /tmp/rust_mozprofile.JWEO0B0JEMzx
test      4458  4357  0 00:41 pts/2    00:03:20 /usr/lib64/firefox/plugin-container -greomni /usr/lib64 firefox/omni.ja -appomni /usr/lib64/firefox/browser/omni.ja -appdir /usr/lib64/firefox/browser 4357 tab
test      4857     1  0 00:48 pts/2    00:00:58 /usr/lib64/firefox/firefox -marionette -profile /tmp/rust_mozprofile.qxESEd4ArZRq
test      4959  4857  0 00:48 pts/2    00:00:02 /usr/lib64/firefox/plugin-container -greomni /usr/lib64 firefox/omni.ja -appomni /usr/lib64/firefox/browser/omni.ja -appdir /usr/lib64/firefox/browser 4857 tab
test     11339  4564  0 07:29 pts/4    00:00:00 grep --color=auto firefox
(pyenv) [test@SAKURA_VPS scraping]$

To kill them all at once, look up the processes with pgrep and pass the resulting PIDs to the kill command through xargs.

(pyenv) [test@SAKURA_VPS scraping]$ pgrep firefox | xargs kill -9
(pyenv) [test@SAKURA_VPS scraping]$ ps -ef | grep firefox
test 11424 4564 0 07:34 pts/4 00:00:00 grep --color=auto firefox
(pyenv) [test@SAKURA_VPS scraping]$
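
One common reason browser processes linger is a script that stops on an exception before it ever calls quit() on the driver. A possible way to reduce the need for manual cleanup is to wrap the driver in a context manager that always calls quit(). The sketch below is an assumption-laden illustration, not code from this article: managed_driver is a hypothetical helper, and FakeDriver is a stand-in so the example runs without a browser. With Selenium, the factory would be something like webdriver.Firefox.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on the way out."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even when the body raises


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only for this demo."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        raise RuntimeError("simulated failure while loading " + url)

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    created = []

    def factory():
        created.append(FakeDriver())
        return created[-1]

    try:
        with managed_driver(factory) as driver:
            driver.get("https://example.com/")
    except RuntimeError as error:
        print("error:", error)
    print("quit() called:", created[0].closed)
```

This does not help with processes that are already orphaned; for those, the pgrep and kill approach above is still needed.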