[Python] Searching for, extracting, replacing, and deleting a specific substring in a string (in, search, match, replace, sub)

Published December 3, 2017; updated February 11, 2019

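I keep looking this up, so I am writing it down here as a memo.

This post covers how to check whether a string contains a specific substring, and how to extract, replace, or delete it.

Python environment

Everything here runs on Python 3.6.

The OS is CentOS 7, and the programs run in a virtual environment with Python 3.6 installed.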

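Checking whether a string contains a specific substring (in)

This only checks whether the substring is present or not.

We check whether the string "aaaaaaatestaaaaaaaaa" contains "test". The expression evaluates to True if it does and to False if it does not.

Use the in operator for this.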

■Program example

[test@SAKURA_VPS scraping]$ cat test_20180320_01.py
#coding:utf-8

# Substring to search for
target_word = 'test'

# String to search in
words = 'aaaaaaatestaaaaaaaaa'

# Check whether the string contains the substring
if target_word in words:
    print('found')
else:
    print('not found')
[test@SAKURA_VPS scraping]$

■Execution result

[test@SAKURA_VPS scraping]$ python3.6 test_20180320_01.py
found
[test@SAKURA_VPS scraping]$

※Note that in cannot do regular-expression pattern matching. It only tells you whether a literal substring is present. For pattern matching, use re.search or re.match.
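As a small illustration of the note above, the following sketch contrasts in, str.find, and re.search on the same data. The sample string and the variable names are made up for this example.

```python
import re

words = 'aaatestaaa'

# "in" only answers yes or no
has_test = 'test' in words                  # True

# str.find returns the index of the first occurrence, or -1 if absent
position = words.find('test')               # 3

# "in" treats its left operand literally, so regex syntax is not interpreted
literal_hit = r't\w+t' in words             # False

# re.search interprets the same text as a pattern
pattern_hit = re.search(r't\w+t', words) is not None   # True
```

If all you need is a yes/no answer for a fixed substring, in is the simplest choice; re is only needed once the thing you are looking for is a pattern.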

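Searching for and extracting a string (search, match)

Use these when you want to find out whether a string contains text matching a pattern, and extract it.

Difference between the search function and the match function

search function　←　looks for a match at any position in the string
match function　←　looks for a match only at the beginning of the string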
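The difference can be seen with a short sketch. The sample string is invented for this example; only re.search and re.match from the standard library are used.

```python
import re

text = 'aaatestbbb'

# re.search scans forward and finds 'test' in the middle of the string
found_by_search = re.search(r'test', text)

# re.match only tries position 0, where the text starts with 'aaa'
found_by_match = re.match(r'test', text)

print(found_by_search is not None)   # True
print(found_by_match is not None)    # False

# re.match succeeds when the pattern fits the beginning of the string
leading_a = re.match(r'a+', text).group(0)
print(leading_a)                     # aaa
```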

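search function

Import the re module first (import re is required); it provides regular expressions.

The search function scans the whole string and returns a match object, which lets you:

check whether the string you want to extract exists
get the matched string
get the start position of the matched string
get the end position of the matched string
get the range (start and end) of the matched string

If several parts of the string match, only the first one is returned.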

■Program example

import re

before_words = 'aaaasssssssssdddddddtestlllllcccccllllll'

after_words = re.search('pattern', before_words)

■Program example: extracting a specific string

import re

before_words = 'aaaasssssssssdddddddtestlllllcccccllllll'

after_words = re.search('pattern', before_words)

# Print the extracted string

print(after_words.group(0))

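group(0)　←　returns the entire matched string
group(1)　←　returns the text captured by the first pair of parentheses in the pattern
group(2)　←　returns the text captured by the second pair of parentheses in the pattern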
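The items that a match object exposes (matched text, start, end, range) can be sketched as follows. The short sample string is an assumption chosen so that the positions are easy to check by eye.

```python
import re

text = 'aaatestbbbtest'
m = re.search(r'test', text)

matched_text = m.group(0)   # 'test' (only the first occurrence is returned)
start = m.start()           # 3
end = m.end()               # 7
span = m.span()             # (3, 7)

# The span can be used to slice the original string
same_text = text[start:end]
print(matched_text, start, end, span, same_text)
```

Note that end is the index just past the last matched character, which is why text[start:end] gives back the match.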

re.search and re.match accept full regular expressions, not just literal text.

Write the regular expression like this:

■Program example

[test@SAKURA_VPS test]$ cat test.py
#coding:utf-8
import re

before_words = 'aaaasssssssssddd123ddddtestlllllcccccllllll'
after_words = re.search(r'(.*)(\d)(.*)', before_words)

# Print the extracted strings
print('Entire match group(0) : ', after_words.group(0))
print('1st capture group(1)  : ', after_words.group(1))
print('2nd capture group(2)  : ', after_words.group(2))
print('3rd capture group(3)  : ', after_words.group(3))
[test@SAKURA_VPS test]$

■Execution result

[test@SAKURA_VPS test]$ python3.6 test.py
Entire match group(0) :  aaaasssssssssddd123ddddtestlllllcccccllllll
1st capture group(1)  :  aaaasssssssssddd12
2nd capture group(2)  :  3
3rd capture group(3)  :  ddddtestlllllcccccllllll
[test@SAKURA_VPS test]$

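※group(2) is "3" because the first (.*) is greedy: it takes as many characters as it can and leaves only the last digit for (\d). As a result, "12" ends up at the end of group(1).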
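The effect of greedy matching is easier to see on a short string. The sketch below uses an invented sample string and compares the greedy (.*) with the lazy (.*?), which is standard re syntax.

```python
import re

text = 'abc123def'

# Greedy: the first (.*) grabs as much as it can, leaving one digit for (\d)
greedy = re.search(r'(.*)(\d)(.*)', text).groups()
print(greedy)         # ('abc12', '3', 'def')

# Lazy: (.*?) grabs as little as it can, so (\d) gets the first digit
lazy = re.search(r'(.*?)(\d)(.*)', text).groups()
print(lazy)           # ('abc', '1', '23def')

# A lazy prefix followed by \d+ captures the whole run of digits
whole_number = re.search(r'(.*?)(\d+)(.*)', text).groups()
print(whole_number)   # ('abc', '123', 'def')
```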

With the following pattern, all three digits are captured.

■Program example

[test@SAKURA_VPS test]$ cat test.py
#coding:utf-8
import re

before_words = 'aaaasssssssssddd123ddddtestlllllcccccllllll'
after_words = re.search(r'(.*)(\d{3})(.*)', before_words)

# Print the extracted strings
print('Entire match group(0) : ', after_words.group(0))
print('1st capture group(1)  : ', after_words.group(1))
print('2nd capture group(2)  : ', after_words.group(2))
print('3rd capture group(3)  : ', after_words.group(3))
[test@SAKURA_VPS test]$

■Execution result

[test@SAKURA_VPS test]$ python3.6 test.py
Entire match group(0) :  aaaasssssssssddd123ddddtestlllllcccccllllll
1st capture group(1)  :  aaaasssssssssddd
2nd capture group(2)  :  123　←　"123" was captured.
3rd capture group(3)  :  ddddtestlllllcccccllllll
[test@SAKURA_VPS test]$

\d{3}　←　matches exactly three digits.

\d{3}-\d{4}　←　matches a Japanese postal code (for example, 〒175-0000)
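As a rough sketch of how the postal-code pattern might be used, the helper below wraps it in a function. The name find_postal_code and the surrounding lookarounds are additions for this example: the lookarounds are one way to avoid picking up seven digits from the middle of a longer number.

```python
import re

# Lookarounds keep the match from starting or ending inside a longer digit run
POSTAL_CODE = re.compile(r'(?<!\d)\d{3}-\d{4}(?!\d)')


def find_postal_code(text):
    """Return the first postal code found in text, or None."""
    m = POSTAL_CODE.search(text)
    return m.group(0) if m else None


print(find_postal_code('〒175-0000 Tokyo'))   # 175-0000
print(find_postal_code('tel 1234-56789'))     # None
```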

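■Program example

Extracting a 3-digit number

import re

bango = 'test1test11test123testtesttest'

# (.*?) is lazy, so it does not swallow the first digits of the number
bango_new_group = re.search(r'(.*?)(\d{3})[^0-9]', bango)
bango_new = bango_new_group.group(2)

→　Running this extracts "123".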

Error (AttributeError: 'NoneType' object has no attribute 'group')

This section covers the case where "AttributeError: 'NoneType' object has no attribute 'group'" is raised.

■Program example

[test@SAKURA_VPS test]$ cat test.py
#coding:utf-8
import re

before_words = 'aaaasssssssssddd123ddddtestlllllcccccllllll'
after_words = re.search(r'(.*)(\d{4})(.*)', before_words)  # \d{4} is used on purpose so that nothing matches

# Print the extracted strings
print('Entire match group(0) : ', after_words.group(0))
print('1st capture group(1)  : ', after_words.group(1))
print('2nd capture group(2)  : ', after_words.group(2))
print('3rd capture group(3)  : ', after_words.group(3))
[test@SAKURA_VPS test]$

■Execution result

[test@SAKURA_VPS test]$ python3.6 test.py
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print('Entire match group(0) : ', after_words.group(0))
AttributeError: 'NoneType' object has no attribute 'group'
[test@SAKURA_VPS test]$

The cause is that Python's search and match functions return None when the pattern does not match.

Calling group() on None is what raises the error.

If the pattern might not match, first check whether the return value is None.

When the pattern might not match

Check whether the return value is None, as follows.

[test@SAKURA_VPS test]$ cat test.py
#coding:utf-8
import re

before_words = 'aaaasssssssssddd123ddddtestlllllcccccllllll'
after_words = re.search(r'(.*)(\d{4})(.*)', before_words)

if after_words is not None:  # not None (the pattern matched): print each group(x)
    # Print the extracted strings
    print('Entire match group(0) : ', after_words.group(0))
    print('1st capture group(1)  : ', after_words.group(1))
    print('2nd capture group(2)  : ', after_words.group(2))
    print('3rd capture group(3)  : ', after_words.group(3))
else:  # None (no match): do not call group(x)
    print('The pattern did not match.')

[test@SAKURA_VPS test]$
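If the same None check is needed in many places, it could be wrapped in a small helper. The function name extract_group and its default parameter are hypothetical, introduced only for this sketch; they are not part of the re module.

```python
import re


def extract_group(pattern, text, group=0, default=None):
    """Return the requested group, or default when the pattern does not match."""
    m = re.search(pattern, text)
    if m is None:
        return default
    return m.group(group)


before_words = 'aaaasssssssssddd123ddddtestlllllcccccllllll'

three_digits = extract_group(r'\d{3}', before_words)              # '123'
four_digits = extract_group(r'\d{4}', before_words, default='')   # '' (no match)
print(repr(three_digits), repr(four_digits))
```

Returning a default instead of raising keeps the calling code short, at the cost of hiding the difference between "no match" and "matched an empty string", so pick a default that cannot be confused with a real result.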

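match function

Import the re module first.

The match function is similar to the search function.

The difference between them is:

match function　←　checks whether the pattern matches at the beginning of the string
search function　←　checks whether the pattern matches anywhere in the string

What you can do with the result is almost the same as with search:

check whether the string you want to extract exists
get the matched string
get the start position of the matched string
get the end position of the matched string
get the range (start and end) of the matched string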

■Program example

import re

before_words = 'aaaasssssssssdddddddtestlllllcccccllllll'

after_words = re.match('pattern', before_words)
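For a concrete case, re.match suits strings whose interesting part is known to come first. The log-like sample lines below are invented for this sketch.

```python
import re

DATE_PREFIX = r'(\d{4})-(\d{2})-(\d{2})'

# The pattern fits the beginning of the string, so re.match succeeds
m = re.match(DATE_PREFIX, '2017-12-03 backup finished')
year, month, day = m.group(1), m.group(2), m.group(3)
span = m.span()
print(year, month, day, span)   # 2017 12 03 (0, 10)

# The same date in the middle of a string is not found by re.match
no_match = re.match(DATE_PREFIX, 'backup 2017-12-03')
print(no_match)                 # None
```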

Replacing a specific string

You can use the replace method or the sub function.

replace method

string.replace('old', 'new')

sub function

import re  # regular expressions

text = 'string'

re.sub('regex', 'new', text)

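■Conditions

With a str pattern, the target must also be a str.

Numbers must be converted with str() first; bytes objects need a bytes pattern instead.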
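The two approaches can be compared side by side. The phone-number string below is sample data made up for this sketch; count is the standard optional argument of re.sub.

```python
import re

phone = 'tel: 03-1234-5678'

# str.replace: literal text only
no_hyphens = phone.replace('-', '')
print(no_hyphens)    # tel: 0312345678

# re.sub: the first argument is a regular expression
masked = re.sub(r'\d', '*', phone)
print(masked)        # tel: **-****-****

# count limits the number of replacements
first_only = re.sub(r'\d+', 'N', phone, count=1)
print(first_only)    # tel: N-1234-5678

# A number has to be converted to str before it can be processed
year = 2017
year_text = re.sub(r'17$', '18', str(year))
print(year_text)     # 2018
```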

For concrete programs that use re.sub(), see also the following article.

[Python 3] How to replace text with re.sub()

https://go-journey.club/archives/7142

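Deleting a specific string

As with replacement, you can use the replace method or the sub function; pass an empty string as the replacement.

replace method

string.replace('old', 'new')

Delete the string "2017":

'test2017'.replace('2017', '')

sub function

import re  # regular expressions

text = 'string'

re.sub('regex', 'new', text)

■Conditions

With a str pattern, the target must also be a str.

Numbers must be converted with str() first; bytes objects need a bytes pattern instead.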
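A few deletion cases, sketched with invented sample strings. In each one the replacement is simply the empty string.

```python
import re

# Delete a fixed substring
without_year = 'test2017'.replace('2017', '')
print(without_year)      # test

# Delete every run of digits, whatever the number is
without_digits = re.sub(r'\d+', '', 'test2017and2018')
print(without_digits)    # testand

# Delete only a trailing 4-digit number
without_suffix = re.sub(r'\d{4}$', '', 'room101_2017')
print(without_suffix)    # room101_
```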

Extracting and replacing only one part of a string

This is a common situation. Here is an example.

■Original strings

2016年　誕生日　ショートケーキ　2,400円

2017年　誕生日　チョコレートケーキ　3,000円

2018年　誕生日　チーズケーキ　2,600円

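From these strings, we pick out only the cake name on the 2017 line (チョコレートケーキ, chocolate cake) and replace it with レアチーズケーキ (no-bake cheesecake):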

2017年　誕生日　レアチーズケーキ　3,000円

■Approach

Group (capture) the parts to keep in the re.sub pattern
Match the target part with a regular expression
Replace only the target part

■Program example

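#coding:utf-8
import re

test = '2017年　誕生日　チョコレートケーキ　3,000円'
print(test)

# \1 and \2 put the two captured groups back around the new cake name
test_new = re.sub(r'(2017年　誕生日)　[^ -~｡-ﾟ]+　(3,000円)', r'\1　レアチーズケーキ　\2', test)
print(test_new)

※[^ -~｡-ﾟ]+ matches one or more characters that are neither printable ASCII nor half-width katakana, that is, full-width characters (kanji included).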

■Execution result

(pyenv) [test@SAKURA_VPS scraping]$ python test.py
2017年　誕生日　チョコレートケーキ　3,000円
2017年　誕生日　レアチーズケーキ　3,000円　←　The cake name has changed.
(pyenv) [test@SAKURA_VPS scraping]$

This grouping technique is very handy when writing programs.
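The same grouping idea can be applied to all three lines at once, so that only the 2017 line changes. This is a sketch under a few assumptions: the records are joined with newlines, the columns are separated by a full-width space, and the helper name replace_cake is hypothetical.

```python
import re

FW_SPACE = '\u3000'  # full-width space used as the column separator

rows = [
    ['2016年', '誕生日', 'ショートケーキ', '2,400円'],
    ['2017年', '誕生日', 'チョコレートケーキ', '3,000円'],
    ['2018年', '誕生日', 'チーズケーキ', '2,600円'],
]
records = '\n'.join(FW_SPACE.join(row) for row in rows)


def replace_cake(text, year, new_cake):
    """Replace the third column on the line that starts with the given year."""
    pattern = r'^({year}{sp}誕生日{sp})[^{sp}\n]+({sp}[\d,]+円)$'.format(
        year=re.escape(year), sp=FW_SPACE)
    # A function is used as the replacement so new_cake is inserted literally
    return re.sub(pattern,
                  lambda m: m.group(1) + new_cake + m.group(2),
                  text,
                  flags=re.MULTILINE)


updated = replace_cake(records, '2017年', 'レアチーズケーキ')
print(updated)
```

With re.MULTILINE, ^ and $ match at the start and end of every line, which is what restricts the replacement to the single line for the requested year.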

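Error (AttributeError: 'str' object has no attribute 'sub')

■Cause

The syntax is wrong: sub was called on the string instead of on the re module.

test_new = test.sub(r'(2017年　誕生日)　[^ -~｡-ﾟ]+　(3,000円)', r'\1　レアチーズケーキ　\2', test)　←　sub is called on the string test instead of on re.

Correct example

test_new = re.sub(r'(2017年　誕生日)　[^ -~｡-ﾟ]+　(3,000円)', r'\1　レアチーズケーキ　\2', test)　←　re.sub is called, and the string is passed as the third argument.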

Reference sites

How to use regular expressions in Python

https://qiita.com/wanwanland/items/ce272419dde2f95cdabc

[Regular expressions] Matching only half-width characters and only full-width characters

http://2011428.blog.fc2.com/blog-entry-79.html
