I was searching for an efficient thread-safe queue implementation last week for my project at work, and I came up with the term "Lock-free Queue". It seems to be already a widely studied topic (Google suggested me a few papers) but I think they are too difficult to understand. After that I found this page which introduced and explained the author's lock-free queue implementation. Although it has certain limitations (only one producer and one consumer are allowed), its implementation is very simple and I think the author's explanation is simple and clear. What's more, it fits my case (even with those limitations).
For my own reference, I put part of my code (a little bit modified from what's described by the author) below. With CUSTOM_QUEUE_NODE_REUSE flag, the enqueue function will not do the lazy removal as the author suggested. Instead it directly reuses nodes which were previously dequeued. The drawback of this modification is the actual memory usage of the queue will always be the same as when it reached its longest queue size.
Update: Since C++ increment and decrement operations are not atomic and it causes bug in the code below, I removed the related code. If one needs to deal with queue size, he has to find atomic functions for increment and decrement.