I am learning the Transformer. Here is the PyTorch documentation for MultiheadAttention. In its implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
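For concreteness, here is what that constraint looks like when constructing the module (the sizes are just illustrative, and the exact error message may vary across PyTorch versions):

```python
import torch.nn as nn

# 512 is divisible by 8, so this constructs fine
attn_ok = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# 512 is NOT divisible by 10, so this trips the assert above
attn_bad = nn.MultiheadAttention(embed_dim=512, num_heads=10)
# AssertionError: embed_dim must be divisible by num_heads
```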
Why require the constraint that embed_dim must be divisible by num_heads? If we go back to the multi-head attention equations:
Assume:
Q, K, V are n x embed_dim matrices; all the weight matrices W are embed_dim x head_dim.
Then the concatenation [head_1, ..., head_h] is an n x (num_heads * head_dim) matrix;
W^O has size (num_heads * head_dim) x embed_dim;
so [head_1, ..., head_h] * W^O is an n x embed_dim output.
I don't see why embed_dim must be divisible by num_heads.
Say we have num_heads=10000; the result is still n x embed_dim, since the matrix-matrix product with W^O absorbs that dimension.
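To make the point concrete, here is a minimal sketch (plain tensor ops, not PyTorch's actual implementation; all sizes are made up) where head_dim is chosen independently of embed_dim, and the output still comes out as n x embed_dim:

```python
import torch

# head_dim is deliberately NOT embed_dim // num_heads
n, embed_dim, num_heads, head_dim = 5, 12, 10, 7

x = torch.randn(n, embed_dim)

# per-head projections: each W is embed_dim x head_dim
W_q = torch.randn(num_heads, embed_dim, head_dim)
W_k = torch.randn(num_heads, embed_dim, head_dim)
W_v = torch.randn(num_heads, embed_dim, head_dim)
# output projection: (num_heads * head_dim) x embed_dim
W_o = torch.randn(num_heads * head_dim, embed_dim)

heads = []
for i in range(num_heads):
    Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]   # each n x head_dim
    scores = (Q @ K.T) / head_dim ** 0.5           # n x n
    heads.append(scores.softmax(dim=-1) @ V)       # n x head_dim

out = torch.cat(heads, dim=-1) @ W_o               # n x (num_heads*head_dim) -> n x embed_dim
print(out.shape)                                   # torch.Size([5, 12])
```

Shape-wise nothing breaks here, which is why the divisibility requirement confuses me.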

