[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (Receiving the URL as an Argument) [Part 5]

November 26, 2017 / February 11, 2019

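This is another installment on web scraping with Python 3.6.

Instead of hard-coding the URL, we write a program that receives it as a command-line argument and checks the arguments.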

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 1]
https://go-journey.club/archives/6643

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 2]
https://go-journey.club/archives/6659

[Python] Scraping with Python 3.6, Selenium WebDriver, and headless [Part 3]
https://go-journey.club/archives/6681

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (find operations) [Part 4]
https://go-journey.club/archives/6696

[Python] Scraping with Python 3.6, Selenium WebDriver, and PhantomJS (Receiving the URL as an Argument) [Part 5]
https://go-journey.club/archives/6706

Development environment

OS: CentOS 7

Python version: 3.6

Virtual environment: venv

Module versions (installed with pip)

beautifulsoup4 (4.6.0)
pip (9.0.1)
pytz (2017.3)
requests (2.18.4)
selenium (3.7.0)
setuptools (28.8.0)
urllib3 (1.22)

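Switching to the virtual environment

As before, we use the virtual environment created in the earlier articles listed above.

Activate the virtual environment.

[test@SAKURA_VPS pyenv]$ source pyenv/bin/activate
(pyenv) [test@SAKURA_VPS pyenv]$   ← the prompt shows that the virtual environment is active

To leave the virtual environment, run deactivate.

(pyenv) [test@SAKURA_VPS pyenv]$ deactivate
[test@SAKURA_VPS pyenv]$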

Receiving the URL to scrape as an argument

I described how to receive command-line arguments in an earlier article, so we use that as a reference.

[Ansible] [Python] A program that parses JSON data collected with Ansible and converts it to CSV and Excel formats
https://go-journey.club/archives/4841

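A program that receives a command-line argument

Add the following code to receive a command-line argument.

#coding:utf-8
# Import sys to access argv.
import sys

# Store the command-line arguments in args.
args = sys.argv
# Take the second element (index 1, the first argument after the script name) as page_url.
page_url = args[1]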

The resulting Python program looks like this.

#coding:utf-8
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
# Import sys to access argv.
import sys

# Store the command-line arguments in args.
args = sys.argv
# Take the second element (index 1, the first argument after the script name) as page_url.
page_url = args[1]

dcap = dict(DesiredCapabilities.PHANTOMJS)

# User agent
# Android Chrome
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Linux; Android 4.2.2; WX10K Build/103.0.2f30) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.109 Mobile Safari/537.36")

driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(page_url)

# Wait for the page to finish loading.
time.sleep(10)
print(driver.page_source)

# Write the result to a file.
f = open('test.txt','w')
f.write(driver.page_source)
f.close()

# Shut down the webdriver. quit() also stops the PhantomJS process,
# whereas close() only closes the current window.
driver.quit()
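
As a variation on the file-output step above, here is a minimal sketch that writes the text through a with block and an explicit UTF-8 encoding. The file is then closed even if the write fails, and pages containing Japanese text do not depend on the locale's default encoding. The helper name save_text and the sample HTML string are assumptions made for illustration; in the real script the text would be driver.page_source.

```python
import os
import tempfile


def save_text(text, path):
    """Write text to path as UTF-8; the with block closes the file for us."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Stand-in for driver.page_source so the sketch runs without a browser.
sample_html = "<html><body>テスト</body></html>"

# Write into the system temp directory to avoid cluttering the working directory.
out_path = os.path.join(tempfile.gettempdir(), "test.txt")
save_text(sample_html, out_path)

# Read the file back to confirm the content survived the round trip.
with open(out_path, encoding="utf-8") as f:
    print(f.read() == sample_html)
```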

A program that prints a message when the command-line argument is missing

If you run the program without the command-line argument, you get the following "list index out of range" error.

The traceback does not make the cause obvious at a glance, so we modify the program to report that an argument is missing.

Note: the program will also print the message when too many arguments are given.

(pyenv) [test@SAKURA_VPS scraping]$ python sele_test.py
Traceback (most recent call last):
  File "sele_test.py", line 11, in <module>
    page_url = args[1]
IndexError: list index out of range
(pyenv) [test@SAKURA_VPS scraping]$

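The program needs to do the following:

Count the command-line arguments
Check that the count matches what the program requires (neither too many nor too few)
Print an error message explaining the cause when the count does not match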
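
The three steps above can be sketched as a small function. This is only an illustration under stated assumptions: the names get_url_arg and usage_message are hypothetical and do not appear in the article's program, which is built up step by step in the rest of this section.

```python
def get_url_arg(argv, expected_len=2):
    """Return the URL argument, or None when the argument count is wrong.

    argv[0] is always the script name, so exactly one URL argument
    means len(argv) == 2.
    """
    if len(argv) != expected_len:
        return None
    return argv[1]


def usage_message(script_name):
    """Build the message shown when the argument count is wrong."""
    return "Usage: python {} [URL]".format(script_name)


# Explicit lists stand in for sys.argv so the sketch can run anywhere.
print(get_url_arg(["sele_test.py", "https://yahoo.co.jp"]))
print(get_url_arg(["sele_test.py"]))
print(usage_message("sele_test.py"))
```

Passing the list in as a parameter, rather than reading sys.argv inside the function, keeps the check easy to try out with different inputs.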

First, print args to check what it contains.

# Store the command-line arguments in args.
args = sys.argv
print(args)  # Check what args contains.

Here is the output.

The values are enclosed in [] (square brackets), so args is a list.

(pyenv) [test@SAKURA_VPS scraping]$ python sele_test.py https://yahoo.co.jp
['sele_test.py', 'https://yahoo.co.jp']
(pyenv) [test@SAKURA_VPS scraping]$

Next, count the elements in the list.

# Store the command-line arguments in args.
args = sys.argv
print(args)  # Check what args contains.

# Count the elements in the list.
print(len(args))  # Python 3.6

Run the program and check the result.

(pyenv) [test@SAKURA_VPS scraping]$ python sele_test.py https://yahoo.co.jp
['sele_test.py', 'https://yahoo.co.jp']
2   ← the list has 2 elements

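From here, the program continues only when the list has exactly 2 elements (the script name plus one URL); otherwise it prints how to use the command.

The final program looks like this.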

(pyenv) [test@SAKURA_VPS scraping]$ vi sele_test.py
#coding:utf-8
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
# Import sys to access argv.
import sys

# Store the command-line arguments in args.
args = sys.argv
print(args)

# Count the elements in the list.
#print(len(args))
arg_count = len(args)

# Check the number of elements.
if arg_count != 2:
    print('\n')
    print('Usage: python [program.py] [URL]')
    print('A URL argument (e.g. https://yahoo.co.jp) is required.')
    print('\n')
    # Exit the program with a non-zero status to signal the error.
    sys.exit(1)

# Take the second element (index 1, the first argument after the script name) as page_url.
page_url = args[1]

dcap = dict(DesiredCapabilities.PHANTOMJS)

driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(page_url)

# Wait for the page to finish loading.
time.sleep(10)
#print(driver.page_source)

# Write the result to a file.
f = open('text.txt','w')
f.write(driver.page_source)
f.close()

# Shut down the webdriver. quit() also stops the PhantomJS process,
# whereas close() only closes the current window.
driver.quit()

Now check that it works.

[With the wrong arguments]

(pyenv) [test@SAKURA_VPS scraping]$ python sele_test.py
['sele_test.py']

Usage: python [program.py] [URL]
A URL argument (e.g. https://yahoo.co.jp) is required.

(pyenv) [test@SAKURA_VPS scraping]$

[With the correct argument]

(pyenv) [test@SAKURA_VPS scraping]$ python sele_test.py https://yahoo.co.jp
['sele_test.py', 'https://yahoo.co.jp']
(pyenv) [test@SAKURA_VPS scraping]$

Reference site

http://sheemaa.hatenablog.jp/entry/2013/11/11/130521

Troubleshooting

A sudden error.

The program had been running fine with selenium until now, and then it suddenly failed.

I had not installed or uninstalled anything, so what is the cause?

(pyenv) [test@SAKURA_VPS scraping]$ python selenium.py
Traceback (most recent call last):
  File "selenium.py", line 2, in <module>
    from selenium import webdriver
  File "/home/test/pyenv/scraping/selenium.py", line 2, in <module>
    from selenium import webdriver
ImportError: cannot import name 'webdriver'
(pyenv) [test@SAKURA_VPS scraping]$ python
Python 3.6.3 (default, Oct 11 2017, 18:17:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test/pyenv/scraping/selenium.py", line 2, in <module>
    from selenium import webdriver
ImportError: cannot import name 'webdriver'
>>>

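For reference, the failing line of the program is line 2, from selenium import webdriver, as the traceback shows.

I listed the installed packages with the pip command to check.

selenium is indeed installed.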

(pyenv) [test@SAKURA_VPS scraping]$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
beautifulsoup4 (4.6.0)
certifi (2017.11.5)
chardet (3.0.4)
Django (1.11.7)
idna (2.6)
mod-wsgi (4.5.20)
mod-wsgi-httpd (2.4.27.1)
pip (9.0.1)
pytz (2017.3)
requests (2.18.4)
selenium (3.7.0)   ← selenium is installed in the virtual environment
setuptools (28.8.0)
urllib3 (1.22)
(pyenv) [test@SAKURA_VPS scraping]$

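Cause found (naming the file selenium.py was the problem)

Looking at the error below made the cause clear.

A plain import makes Python load /home/test/pyenv/scraping/selenium.py rather than the installed selenium package, and that file has no webdriver to import.

In other words, naming the program selenium.py was the cause. The directory of the script (or the current directory in the interactive interpreter) comes first on the module search path, so the local file shadows the installed package. Renaming the script to something other than selenium.py avoids the conflict.

(pyenv) [test@SAKURA_VPS scraping]$ python
Python 3.6.3 (default, Oct 11 2017, 18:17:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test/pyenv/scraping/selenium.py", line 2, in <module>   ← Python is reading /home/test/pyenv/scraping/selenium.py here
    from selenium import webdriver
ImportError: cannot import name 'webdriver'
>>>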
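
When an import fails like this even though the package is installed, it can help to ask Python which file it would actually load. The following is a minimal diagnostic sketch, not part of the original program: the helper names module_origin and has_shadowing_file are hypothetical, and the standard-library json module is used as the example so the code runs without selenium installed.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file Python would load for 'import name', or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def has_shadowing_file(name, directory="."):
    """Return True if directory contains name.py, which would shadow a package."""
    return os.path.isfile(os.path.join(directory, name + ".py"))


# Shows the path of the standard-library json package.
print(module_origin("json"))

# Checks the current directory for a selenium.py that would shadow the package.
print(has_shadowing_file("selenium"))
```

If module_origin("selenium") pointed at a file in the script's own directory instead of the virtual environment's site-packages, that would indicate the kind of shadowing described above.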