NumPy Archives

ベンフォードの法則（Benford’s law）とは

某国の大統領選挙で大規模な不正が行われた可能性があるという話が一部で盛大に盛り上がっていますが、その中で「ベンフォードの法則」（Benford’s law）の話が出きます。

同法則についての詳しい解説や練習問題などは Benford’s Law: Theory and Applications [1] などの成書にあたるのが手っ取り早いと思いますが、高校数学の復習も兼ねて、ここでは細かい理屈はスッ飛ばして法則の概略だけ簡単に紹介したいと思います。

まずは、日本の国家会計のデータをサンプルとして実験をしてみます(*1)。

(*1) 末尾に実験用の Python scripts を置いておきました。

Benford’s law の主張

人為的に操作していない（インチキをしていない）数値データに現れる数字（0 ~ 9）の出現パターンはランダムだと思われがちですが、実はそうではなくて(*2)一定の「癖（クセ）」がある:

すなわち、数値の先頭 \(d\) 桁に特定の数字（先頭 1 桁なら 1~9、2 桁なら 10~99、3 桁なら 100~999、など）が出現する確率\(P(d)\)は次式（Equation 1）に従う — というのが Benford’s law の主張です。

\(P(d) = log_{10}(\frac{d+1}{d})\) Equation 1

ちょっと計算（Script 1）すればわかりますが、この法則が成り立つとすると、例えば、会計の帳簿だとか選挙の投票データなどを持って来てそこに並んでいる数値を見ると、先頭 1 桁が \(1\) である確率 \(P(1)\)は \(log_{10}2=0.301\)、\(2\)である確率 \(P(2)\)は \(log_{10}1.5=0.176\) などとなるはずです（なんか直感と違うなぁ🙄）。

(*2) などと能書きを垂れていますが、実はランダムな場合も少なくありません。当たり前の話ですが、Benford’s law が当てはまるかどうかは母集団の性質に依存します – なんてことを言い出したら話がまとまらないので、細かい話は次回以降に 😸

Benford’s law の応用

Benford’s law が成り立つなら、もし誰かがデータを改竄すれば数字の並び方の傾向が Benford’s law から外れるので、インチキがバレるはずです。

と言うことで、これを利用して、会計監査や税務調査、あるいは今回のような選挙監視の場面で、データ改竄の端緒をつかむために Benford’s law が用いられているようで、実際、市販の会計ソフトの伝票チェック機能に Benford’s law が応用されている例もあります(*3)。

検査の実務マニュアル的な資料としては、Association of Certified Fraud Examiners (ACFE) の資格受験用教材 Using Benford’s Law to Detect Fraud [2] などがあるので、斜め読みすると面白いかもしれません(*4)。

(*3) 国税査察官（いわゆる「マルサ」の中の人）[3] などは、押収した大量の帳票類をパラパラ漫画のように一瞥しただけでアヤシイ伝票を見つけ出すという伝説がありますが、このレベルになると厳しい研修の結果、網膜回路にハードウェアレベルで Benford’s law が焼き付けられているのかもしれません 🙀
(*4) 同ウェブページの Sample Chapter [PDF 340 MB] を見ればおおよその検査手順を知ることができます。

論より RUN

Benford’s law なんて本当に成り立っているのか？なんか胡散臭いんですけど？

ここは論より証拠ということで、サンプルとして日本の令和元（2019）年度一般会計歳入歳出決算[4]の数字（Script 2）を使い、先頭 1 桁、2 桁及び 3 桁（\(d=1, 2, 3\)）についてテストしてみました（Script 3, 4, 5）。結果を Fig. 1 (a, b, c) に示します。

それぞれ、青線が Benford’s law による確率（Probability）、橙線が実測値です。

こうしてみると、結構良い感じで当てはまっていますね。日本政府の会計不正を白日の下に暴き出そうという今回の計画ははかなくも頓挫してしましました 😭

Fig. 1: 令和元年度一般会計歳入歳出決算の不正をキビシク監査 🕵️‍♀️

日本の選挙に不正はないのか？

今回の令和元年度一般会計歳入歳出決算については、Benford’s law が綺麗に当てはまってしまったので拍子抜け 😞

そこで捲土重来、次回は日本の政治の闇に鋭く切り込むべく、第25回参議院議員通常選挙結果のデータ[5]を使って実験をしてみます。

Python Scripts

Benford’s law の確率計算

検査に使う桁数（引数）を与えると、{先頭 k 桁の数字の並び d: このパターンが出現する確率} を Dictionary として返します。

(*5) 以下で使う packages や小道具類は Google Colab に default で既にインストールされているので、そちらで実験を行うと些細な error で引っかかることがなくお勧めです。

なお、変数は上の関数から下の関数にバケツリレー式に渡しているので、Script 1, 2, 3, 4, 5, 6, 7 の順に実行しないと error になります。

Script 1 Benford’s law の確率計算

import numpy as np


def benford_table(k=1):
    """Generates a dictionary of Benford's Set
    for the first k digit(s)"""
    # First k digit(s) as a key => Probability as a value
    benford_dict = {}
    lower, upper = 10 ** (k - 1), 10 ** k  # lower/upper bound of the bins.
    for d in range(lower, upper):
        benford_dict[d] = np.log10((d + 1) / d)

    return benford_dict


if __name__ == '__main__':
    bfd_dict = benford_table(1)  # Sample: first-digit table
    # Draw a table
    print('d', '\t', 'P(d)')
    for i in bfd_dict:
        print(i, '\t', f'{bfd_dict[i]:.5f}')

# Output sample (k=1)
d 	 P(d)
1 	 0.30103
2 	 0.17609
3 	 0.12494
4 	 0.09691
5 	 0.07918
6 	 0.06695
7 	 0.05799
8 	 0.05115
9 	 0.04576

令和元年度決算書（一般会計歳出決算）データのダウンロードと前処理

サンプルとして、令和元年度決算書（一般会計歳出決算）をダウンロードして pandas.DataFrame にしておきます。

なお、実験のたびに毎回新しいファイルをダウンロードするのも行儀が悪いので、ダウンロードしたデータを Excel file として作業ディレクトリに保存し、必要があればそちらを使うことにします。

Script 2 令和元年度決算書（一般会計歳出決算）データのダウンロード

import requests
import pandas as pd
from urllib.parse import urlparse
from pathlib import Path, PurePath


def get_excel(my_excel_url):
    """Downloads an Excel file and
    converts the first sheet to a pandas.DataFrame"""
    df = pd.read_excel(requests.get(my_excel_url).content)

    return df


if __name__ == '__main__':
    # [令和元年度決算書](https://www.bb.mof.go.jp/hdocs/bxss010br1a.html)
    STLMT_FY2019 = 'https://www.bb.mof.go.jp/server/2019/excel/DL201977001.xls'
    file_name = PurePath(urlparse(STLMT_FY2019).path).name

    dl_path = PurePath('.')
    file_path = dl_path.joinpath(file_name)

    if not Path(file_path).is_file():
        df_raw = get_excel(STLMT_FY2019)  # Downloads using the function.
        df_raw.to_excel(file_path, index=False)  # Backup the file.

    else:
        print(file_path, 'is already downloaded.')
        df_raw = pd.read_excel(file_path)

    print('DataFrame(df_raw) is ready')

List にして改めて数値を眺めると、ゼロや負の値も混ざっています。

このままだと後でトラブルの原因になりそうなので、次のステップで 3 桁以上の正の整数だけを相手にすることにします。

Script 3 DataFrame の中身を一つの list にまとめる

import numpy as np
import pandas as pd


# Implodes values in the selected DF columns to a list
def df2list(source_df, col_filter: list) -> list:
    """Extracts cell values from a DataFrame and list them."""
    df_filtered = source_df.iloc[:, col_filter]  # Filter for columns.

    array_filtered = np.array(df_filtered)  # Convert to numpy.ndarray
    # DataFrame(2D) to list(1D)
    row_size, col_size = array_filtered.shape  # (m, n) array to...
    value_list_length = row_size * col_size  # a list with length of m*n.
    value_list = array_filtered.reshape(1, value_list_length).tolist()[0]

    return value_list


if __name__ == '__main__':
    columns_to_use = [9, 10, 11, 12, 13, 14]
    my_value_list = df2list(df_raw, columns_to_use)


# Confirm
print('\t', '\t'.join([i[:5] for i in df_raw.columns[columns_to_use]]))
for i in range(len(my_value_list)):
    print(f'{my_value_list[i]:>16}', end='')
    if i % len(columns_to_use) == len(columns_to_use)-1:
        print()

最初の n 桁を抽出する

Array 要素の数値から、指定番目の数字を取り出して dictionary にしておきます。

ただし、3 桁未満の数値を分析してもあまり意味がないので、それらを filter して落としています。

数字の取り出し方は、Using Benford’s Law to Detect Fraud(ACFE) p.47 を参考にしていますが、最後の ‘The last two digits test’ は省略しました。

The first digit test
The second digit test
The first two digits test
The first three digits test
The last two digits test

Script 4 最初の n 桁を抽出する

def head_hunter(input_array, cut_off=100):
    """Returns first n digit(s) from a list or 1D-np.array.
    Values less than cut_off value will be discarded."""
    filtered_array = list(filter(lambda x: x >= cut_off, input_array))
    print('Length of in/out array:', len(input_array), len(filtered_array))

    heads_dict = {
        'first_one_digit': [int(str(int(i))[0]) for i in filtered_array],
        'second_md_digit': [int(str(int(i))[1]) for i in filtered_array],
        'first_two_digits': [int(str(int(i))[:2]) for i in filtered_array],
        'first_thr_digits': [int(str(int(i))[:3]) for i in filtered_array]
        # 'last_two_digits': [int(str(int(i))[-2:]) for i in filtered_array]
    }

    return heads_dict  # Comprises four lists above.


if __name__ == '__main__':
    my_first_one_digit = head_hunter(my_value_list)['first_one_digit']
    my_first_two_digits = head_hunter(my_value_list)['first_two_digits']
    my_first_thr_digits = head_hunter(my_value_list)['first_thr_digits']
    # Confirm
    # print(my_first_one_digit)
    # print(my_first_two_digits)
    # print(my_first_thr_digits)

先頭桁のパターンで分類、Histogram data 化する

Array に含まれる値の分布（density=True）を Benford’s law の階級（bins）にしたがって整理します。

Script 5 渡された Array の中身の値を分類してカウント（histogram）

import numpy as np


def bfd_hist(input_array):
    """Groups the data in the input_array into bins"""
    # Check the integrity of the input_array.
    digit_min = int(np.log10(min(input_array)) + 1)
    digit_max = int(np.log10(max(input_array)) + 1)
    if digit_min == digit_max:  # of digits must be the same.
        digit = digit_max
        print('OK: An array of',
              str(digit) + '-digit numbers is passed to the function.')
        # Define bin boundaries.
        lower, upper = 10 ** (digit - 1), 10 ** digit
        hist, bin_edges = np.histogram(input_array,
                                       bins=range(lower, upper + 1),
                                       density=True
                                       )
        bfd_hist, bfd_bins = hist, bin_edges[:-1]  # of bins = eds-1

        return bfd_hist, bfd_bins

    else:
        print('Error: Illegal array')


if __name__ == '__main__':
    my_fone_hist, my_fone_bins = bfd_hist(my_first_one_digit)
    my_ftwo_hist, my_ftwo_bins = bfd_hist(my_first_two_digits)
    my_fthr_hist, my_fthr_bins = bfd_hist(my_first_thr_digits)

Script 6 集計表作成

# The first digit test
number_of_digits = 1
table_dict = benford_table(number_of_digits)

df_STLMT_FY2019_fone_test = pd.DataFrame(
    {
        'First Digit': table_dict.keys(),
        'Benford\'s Set': table_dict.values(),
        'Data Set X': my_fone_hist,
        'Deviation': np.array(list(table_dict.values())) - my_fone_hist
    }
)

# The first two digits test
number_of_digits = 2
table_dict = benford_table(number_of_digits)

df_STLMT_FY2019_ftwo_test = pd.DataFrame(
    {
        'First Two Digits': table_dict.keys(),
        'Benford\'s Set': table_dict.values(),
        'Data Set X': my_ftwo_hist,
        'Deviation': np.array(list(table_dict.values())) - my_ftwo_hist
    }
)

# The first three digits test
number_of_digits = 3
table_dict = benford_table(number_of_digits)

df_STLMT_FY2019_fthr_test = pd.DataFrame(
    {
        'First Three Digits': table_dict.keys(),
        'Benford\'s Set': table_dict.values(),
        'Data Set X': my_fthr_hist,
        'Deviation': np.array(list(table_dict.values())) - my_fthr_hist
    }
)

# Confirm
print('\n', df_STLMT_FY2019_fone_test)
print('\n', df_STLMT_FY2019_ftwo_test)
print('\n', df_STLMT_FY2019_fthr_test)

Script 7 グラフ描画（on Jupyter Notebook/lab or Colab）

df_STLMT_FY2019_fone_test.plot(x=0, y=[1, 2])
df_STLMT_FY2019_ftwo_test.plot(x=0, y=[1, 2])
df_STLMT_FY2019_fthr_test.plot(x=0, y=[1, 2])

おまけ

今回使用した Python scripts の Jupyter Notebook 版を benford_01.ipynb_.zip [41 KB] に置いておきました (Preview)。
次回使用するデータの準備として、第25回参議院議員通常選挙結果の Excel を総務省の公式サイトからダウンロードして整理し、pandas.DataFrame に収める script を sangiin_votes.ipynb_.zip [4 KB] に置いておきました (Preview)。

References

[1] Miller, S. J. (2015). Benford’s Law: Theory and Applications: Princeton University Press.
[2] Using Benford’s Law to Detect Fraud: Association of Certified Fraud Examiners, Inc.
[3] 国税庁 税務大学校
[4] 財務省 (2020) 令和元年度決算書
[5] 総務省 (2020) 第25回参議院議員通常選挙結果