Designing relational system for large scale

Question

I've been having some difficulty scaling up the application and decided to ask a question here.

Consider a relational database (say mysql). Let's say it allows users to make posts and these are stored in the post table (has fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.

This process will use timestamp as the index since it has the highest cardinality and correctly so. So, beyond looking into the indexes, it'll take literally 1 row fetch from disk to complete this task. Awesome!

But let's say it's been 1 million more posts (in the system) by other users since you last posted. Then, in order to get your latest post, the database will peg the index on timestamp again, and it's not like we know how many posts have happened since then (or should we at least manually estimate and set preferred key)? Then we wasted looking into a million and one rows just to fetch a single row.

Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.

Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?

Assuming you have enough users to get to one million posts then an index on `posterid` should also be selective, if not more selective than an index on `date`. — Joel Brown, Dec 25 '12 at 15:25
@JoelBrown I've added clarity by changing `date` to `timestamp`. It would be accurate down to a second. So, cardinality of timestamp would roughly equal to size of the table. It will **never** pick posterid as basis of the index unless manually told to. — Grumpy, Dec 25 '12 at 15:42
Peter - If you are querying the most recent posts by all users then the default behaviour of the query optimizer is desirable. If you know that you are looking for just the posts from a particular author (or even list of authors) then using a query hint to force selection by `posterid` seems like a pretty good idea no matter how long it's been since that author has posted. — Joel Brown, Dec 25 '12 at 15:47
@JoelBrown List of users is actually one of the features that I'm trying to fulfill. And I feel that the sum of posts from a list will be sizable. Then, we'd be forced to pull up a large row for every cases, whereas a user who hasn't come for a long time and pulling a million rows would be rarer. It may potentially be worse on average than just serving a million every so often. I think the problem is a bit of stuck between a rock and a hard place. — Grumpy, Dec 25 '12 at 15:57

score 3 · Accepted Answer · edited May 23 '17 at 11:43

Indexing

If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.

Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding user's oldest/newest post can be done in a single index seek, which is proportional to the B-Tree height, which is proportional to log(N) where N is number of indexed rows.

To understand why, take a look at Anatomy of an SQL Index.

Clustering

The leafs of a "normal" B-Tree index hold "pointers" (physical addresses) to indexed rows, while the rows themselves reside in a separate data structure called "table heap". The heap can be eliminated by storing rows directly in leafs of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.

In this particular case, the table could be created like this:

CREATE TABLE T (
    posterid int,
    `timestamp` DATETIME,
    data VARCHAR(50),
    PRIMARY KEY (posterid, `timestamp`)
);

The MySQL/InnoDB clusters all its tables and uses primary key as clustering key. We haven't used the surrogate key (postid) since secondary indexes in clustered tables can be expensive and we already have the natural key. If you really need the surrogate key, consider making it alternate key and keeping the clustering established through the natural key.

Thank you. I haven't really strongly through of clusters created by primary keys before and it sounds like it makes perfect sense. — Grumpy, Dec 26 '12 at 05:58

goat · Answer 2 · 2012-12-25T16:50:41.920

For queries like

where posterid = 5
order by timestamp

or

where posterid in (4, 578, 222299, ...etc...)
order by timestamp

make an index on (posterid, timestamp) and the database should pick it all by itself.

edit - i just tried this with mysql

CREATE TABLE `posts` (
    `id` INT(11) NOT NULL,
    `ts` INT NOT NULL,
    `data` VARCHAR(100) NULL DEFAULT NULL,
    INDEX `id_ts` (`id`, `ts`),
    INDEX `id` (`id`),
    INDEX `ts` (`ts`),
    INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB

I filled it with a lot of data, and

explain
select * from posts where id = 5 order by ts

picks the id_ts index

amit · Answer 3 · 2012-12-25T16:04:22.290

Assuming you use hash tables to implement your Data Base - yes. Hash tables are not ordered, and you have no other way but to iterate all elements in order to find the maximal.

However, if you use some ordered DS, such as a B+ tree (which is actually pretty optimized for disks and thus data bases), it is a different story.

You can store elements in your B+ tree ordered by user (primary order/comparator) and date (secondary comparator, descending). Once you have this DS, finding the first element can be achieved in O(log(n)) disk seeks by finding the first element matching the primary criteria (user-id).

I am not familiar with the implementations of data bases, but AFAIK, some of them do allow you to create an index, based on a B+ tree - and by doing so, you can achieve finding the last post of a user more efficiently.

P.S.

To be exact, the concept of "greatest" element or ordering is not well defined in Relational Algebra. There is no max operator. To get the max element of a table R with a single column a one should actually create the Cartesian product of that table and find this entry. There is no max nor sort operator in strict relational algebra (though it does exist in SQL)

(Assuming set, and not multiset semantics):
MAX = R \ Project(Select(R x R, R1.a < R2.a),R1.a)

mysql uses b trees as one of the basic forms of indexes on additional indexes you create. But the problem is that it will pick timestamp (date) as it's index due to higher cardinality. And they cannot use two indexes (on user and on date) at the same time because indexes say nothing about the other index. — Grumpy, Dec 25 '12 at 15:44
@Peter: By creating a dummy column `_`, (and assuming the order of the timestamp lexicographically is the same as integer order - can be achieved with adding leading zeros), and using this as the key - one can achieve an easy index using an "existing" field and without creating a new comparator. — amit, Dec 25 '12 at 15:54
Sorry, but I guess I left a bit of a detail in use cases for me in particular -- which I now edited original question and responded to Joel. I cannot use userid_timestamp because the pull would occasionally need multiple userids. Additionally, such index wouldn't work on db systems like mysql/postgresql because it would order on user first which negates the whole problem of reading too many previous rows. — Grumpy, Dec 25 '12 at 16:08

Designing relational system for large scale

3 Answers3

Indexing

Clustering