DynamodbのLimitが思った動きでなくて嵌った話(Dynamodbの基本的なクエリ等)

Dynamodbを使うようになったのですが、RDBMSに比べるとDynamodbは雰囲気でやっているところが多く、表題のLimitオプションの挙動で嵌ったのでboto3での例とあわせてまとめておきます。

対象テーブル

order_table = {
    'TableName': 'order_table',
    'KeySchema': [
        {
            'AttributeName': 'user_id',
            'KeyType': 'HASH'
        }, 
        {
            'AttributeName': 'ordered_at',
            'KeyType': 'RANGE'
        }
    ],
    'AttributeDefinitions': [
        {
            'AttributeName': 'user_id',
            'AttributeType': 'N'
        }, 
        {
            'AttributeName': 'ordered_at',
            'AttributeType': 'S'
        }
    ],
    'ProvisionedThroughput': {
        'ReadCapacityUnits': 100,
        'WriteCapacityUnits': 100
    }
}

user_idとordered_atをKeyとしてもつorder_tableを定義します。そして以下のように1000行ほどダミーデータを投入しておきます。

table = dynamodb.Table('order_table')
with table.batch_writer() as batch:
    for i in tqdm(range(1000)):
        batch.put_item(
            Item = {
                'user_id': i % 50,
                'ordered_at': (datetime.datetime(2018, 2, 1) + datetime.timedelta(days=(i))).isoformat(),
                'shop_id': i% 20,
                'category': 'category_{}'.format(i%3),
                'order_items': [
                    'item_{}'.format(i %20)
                ]
            }
        )

それぞれのデータはこんな感じのフォーマットで各user_idごとに20件ずつ入っています。

{
    'category': 'category_1',
    'order_items': ['item_1'],
    'ordered_at': '2018-02-02T00:00:00',
    'shop_id': Decimal('1'),
    'user_id': Decimal('1'),
}

一応ちゃんと1000件(20件 * 50ユーザ)入っていることも確認しておきます。

table.scan().get('Count')  # -> 1000

基本的なクエリ

とりあえずいくつかクエリを実行しておきます。

Key属性での絞込

# user_id = 1 のデータを取得
table.query(
    KeyConditionExpression=Key('user_id').eq(1)
)

{'Count': 20,
 'Items': [{'category': 'category_1',
   'order_items': ['item_1'],
   'ordered_at': '2018-02-02T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')},
  {'category': 'category_0',
   'order_items': ['item_11'],
   'ordered_at': '2018-03-24T00:00:00',
   'shop_id': Decimal('11'),
   'user_id': Decimal('1')},
  {'category': 'category_2',
   'order_items': ['item_1'],
   'ordered_at': '2018-05-13T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')},
   :

user_id = 120件の結果が戻っています。さらに日付で2018-12-31T23:59:59までという絞込をかけます。

table.query(
    KeyConditionExpression=Key('user_id').eq(1) & Key('ordered_at').lt('2018-12-31T23:59:59')
)

{'Count': 7,
 'Items': [{'category': 'category_1',
   'order_items': ['item_1'],
   'ordered_at': '2018-02-02T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')},
  {'category': 'category_0',
   'order_items': ['item_11'],
   'ordered_at': '2018-03-24T00:00:00',
   'shop_id': Decimal('11'),
   'user_id': Decimal('1')},
  {'category': 'category_2',
   'order_items': ['item_1'],
   :

2018年の注文は7件であることがわかりました。またこのうちで5件だけ取得するにはLimit=5をつければ良いです。

table.query(
    KeyConditionExpression=Key('user_id').eq(1) & Key('ordered_at').lt('2018-12-31T23:59:59'),
    Limit=5
)

{'Count': 5,
 'Items': [{'category': 'category_1',
   'order_items': ['item_1'],
   'ordered_at': '2018-02-02T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')},
  {'category': 'category_0',
   'order_items': ['item_11'],
   'ordered_at': '2018-03-24T00:00:00',
   'shop_id': Decimal('11'),
   'user_id': Decimal('1')},
  {'category': 'category_2',
   'order_items': ['item_1'],
   'ordered_at': '2018-05-13T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')},
  {'category': 'category_1',
   'order_items': ['item_11'],
   'ordered_at': '2018-07-02T00:00:00',
   'shop_id': Decimal('11'),
   'user_id': Decimal('1')},
  {'category': 'category_0',
   'order_items': ['item_1'],
   'ordered_at': '2018-08-21T00:00:00',
   'shop_id': Decimal('1'),
   'user_id': Decimal('1')}],

このケースでのLimitは意図通りの動きです

非Key属性での絞込

先程はuser_idやordered_atというKey属性での絞込でした。非Key属性だけで絞込をするにはscanとFilterExpressionを使用します。

table.scan(
    FilterExpression=Key('category').eq('category_1')
).get('Count')  # -> 333

category=category_1は333件であることがわかります。ただしscanはRDBMSでいうところのFULL SCANをした上での絞込をしているため、負荷の高い処理であることに注意します

また、Key属性と非Key属性を併用する場合はqueryでKeyConditionExpressionとFilterExpressionを指定します。

# user_id = 1 and category = 'category_1'
table.query(
    KeyConditionExpression=Key('user_id').eq(1),
    FilterExpression=Key('category').eq('category_1'),
)

Limitに気持ちが伝わらないケース

結論から言うと、FilterExpressionとLimitの併用のケースがほぼ意図通りにならないので注意しましょう。という話です。

例えば、先程333件が戻った以下のscanを例にします。

table.scan(
    FilterExpression=Key('category').eq('category_1')
).get('Count')

これにLimit=5をつけたときはCount:5が戻ると思ってしまいます。

table.scan(
    FilterExpression=Key('category').eq('category_1'),
    Limit=5
).get('Count')

しかし意図に反して、このscanは2を戻しました。結果は状況次第ですが、5以下が戻るのは同じです。これはFilterExpressionとLimitの評価順によるものです。

http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#DynamoDB.Client.query

FilterExpression is applied after a Query finishes, but before the results are returned. A FilterExpression cannot contain partition key or sort key attributes. You need to specify those attributes in the KeyConditionExpression .

FilterExpressionはクエリが終了して取得された結果に反映されるというものです。一方でLimitはクエリ自体が戻す行数を制限します。そのため、先に評価されるのはLimitなのです。

つまり、先の例をあげるとorder_tableから5件を取得(Limit=5)し、その5件についてFilterExpression=Key('category').eq('category_1')が適用されたため最終的に2件しか取得されなかったのです。

一方でKeyConditionExpressionはクエリ自体に組み込まれるため、KeyConditionExpressionとLimitの併用はKeyConditionExpressionを満たす結果をLimit分だけ取り出すという意図した挙動になっているのです。

どうすればいいのか

非効率的ではありますが、以下は意図した結果になります。

table.scan(
    FilterExpression=Key('category').eq('category_1')
).get('Items', [])[:5]

もっと効率的にやるのであれば非Key属性であるcategoryをKeyConditionExpressionで処理できればいいので、categoryを含んだセカンダリーインデックスを作成することが有効です。このあたりはテーブル設計時点で事前にわかっていないこともあるので、難しいところですね。

[Dd]enzow(ill)? with DB and Python

DBとか資格とかPythonとかの話をつらつらと