Spotting user-defined flexible keywords in real time is challenging because the keywords are represented in text. In this work, we propose a novel architecture that efficiently detects flexible keywords, based on the following ideas. We construct a representative acoustic embedding of a keyword using grapheme-to-phone conversion. The phone-to-embedding conversion is done by looking up an embedding dictionary, which is built by averaging the audio-encoder embeddings corresponding to each phone during training. The key benefit of our approach is that the text embedding and the audio embedding lie in the same space; hence, their comparison is semantically more accurate than when an independent text encoder is employed. We therefore adopt nearest-neighbor search in the embedding space to find the most likely keyword in the user-defined flexible keyword list.
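The pipeline described above can be illustrated with a minimal sketch. It is not the architecture itself: the toy lexicon stands in for a real grapheme-to-phone converter, the 2-D vectors stand in for audio-encoder outputs, and mean pooling of phone embeddings and cosine similarity are assumptions made only for illustration. All function names (`build_phone_dictionary`, `keyword_embedding`, `nearest_keyword`) are hypothetical.

```python
import math

# Hypothetical toy lexicon standing in for a real grapheme-to-phone converter.
TOY_LEXICON = {"hi": ["HH", "AY"], "bye": ["B", "AY"]}


def build_phone_dictionary(labelled_frames):
    """Average the audio-encoder embeddings observed for each phone."""
    sums, counts = {}, {}
    for phone, emb in labelled_frames:
        acc = sums.setdefault(phone, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[phone] = counts.get(phone, 0) + 1
    return {p: [v / counts[p] for v in acc] for p, acc in sums.items()}


def keyword_embedding(keyword, phone_dict, lexicon=TOY_LEXICON):
    """Look up each phone's embedding and mean-pool them.

    Mean pooling is an assumption of this sketch, not a stated design choice.
    """
    vectors = [phone_dict[p] for p in lexicon[keyword]]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_keyword(audio_emb, keywords, phone_dict):
    """Return the keyword whose text-derived embedding is closest to the audio."""
    return max(
        keywords,
        key=lambda k: cosine(audio_emb, keyword_embedding(k, phone_dict)),
    )


# Toy "training" frames: (phone label, audio-encoder embedding).
frames = [
    ("HH", [1.0, 0.0]), ("HH", [0.8, 0.2]),
    ("AY", [0.0, 1.0]), ("AY", [0.2, 0.8]),
    ("B", [-1.0, 0.0]), ("B", [-0.8, -0.2]),
]
phone_dict = build_phone_dictionary(frames)
best = nearest_keyword([0.6, 0.4], ["hi", "bye"], phone_dict)
```

Because the keyword embeddings are built from averaged audio-encoder outputs, the query and the candidates are compared in a single space, so enrolling a new keyword in this sketch only requires a dictionary lookup rather than a separate text encoder.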