
MiraclePlus: Efficient Long-context Generation

Event LLM


Today I attended a discussion led by Betty at MiraclePlus on LLM Long Context, and it just kept getting better and better. If it weren’t for the occasion, I would have been applauding non-stop; it felt like meeting a kindred spirit.

Betty Chen, an Assistant Professor at CMU who leads the Infini-AI Lab, has been focusing on efficient long-context generation in her recent papers, exploring co-design across both modeling and systems:

  • TriForce: Dynamic attention compression + Top-k sparse KV cache
  • MagicPIG addresses two issues with TriForce:
    • Using locality-sensitive hashing (LSH; I’m still not entirely sure about this part) to accurately find the top-k sparse KV entries.
    • Designing the LLM serving system around CPU+GPU heterogeneous computing to tackle the memory-capacity limit. In my opinion, this is its biggest highlight. CPU and GPU compute throughput differs by orders of magnitude, but memory bandwidth differs by only about 10x, and the arithmetic intensity of decoding attention is around 1 FLOP per byte (a back-of-envelope estimate follows the list), so why not use cheap, abundant host memory? Naturally, the sparse attention computation should run on the CPU (see the toy sketch below).
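
The “intensity around 1” claim is easy to sanity-check. Here is a back-of-envelope estimate (my own numbers, not from the talk): per generated token, decode attention reads the K and V caches once and does roughly two FLOPs per element read, so the ratio of compute to bytes moved stays near 1 regardless of context length.

```python
# Back-of-envelope arithmetic intensity of decode-time attention.
# Shapes are assumed for illustration; plug in your own model's numbers.
L = 128_000          # cached context length (tokens)
H = 32               # attention heads
d = 128              # head dimension
bytes_per_elem = 2   # fp16 / bf16

flops    = 4 * L * H * d                     # q·K^T and softmax(...)·V, ~2 FLOPs each per element
kv_bytes = 2 * L * H * d * bytes_per_elem    # K and V are each read once from memory

print(f"arithmetic intensity = {flops / kv_bytes:.2f} FLOP/byte")  # prints 1.00
```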

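To make the CPU-offload idea concrete, here is a minimal toy sketch (my own illustration under simplified assumptions, not MagicPIG’s actual implementation): every cached key gets a random-hyperplane LSH signature once, the decode query is hashed the same way, and attention then runs on the CPU over only the best-matching candidate tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 128, 100_000                  # head dim and cached context length (assumed)
n_bits = 16                          # LSH signature length

K = rng.standard_normal((L, d)).astype(np.float32)            # cached keys (host memory)
V = rng.standard_normal((L, d)).astype(np.float32)            # cached values (host memory)
planes = rng.standard_normal((n_bits, d)).astype(np.float32)  # shared random hyperplanes

def signature(x):
    """Random-hyperplane LSH: one bit per hyperplane (sign of the projection)."""
    return (x @ planes.T) > 0

key_sig = signature(K)               # (L, n_bits), built once as the cache is filled

def sparse_decode_attention(q, budget=2048):
    """Attend only over the `budget` keys whose signatures best match the query's."""
    q_sig = signature(q)                                   # (n_bits,)
    matches = (key_sig == q_sig).sum(axis=1)               # Hamming similarity to every key
    cand = np.argpartition(-matches, budget)[:budget]      # candidate token ids
    scores = K[cand] @ q / np.sqrt(d)                      # (budget,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[cand]                                     # (d,) attention output

q = rng.standard_normal(d).astype(np.float32)              # current decode-step query
print(sparse_decode_attention(q).shape)                    # (128,)
```

In a real system the GPU keeps running the dense parts of the model, while the CPU, which sits next to the large and cheap DRAM holding the KV cache, handles this sparse lookup and attention step.
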
Here are a few takeaways from the forum:

  • Top-4: RAG and long context are not at odds. Multi-level retrieval systems and Graph RAG can ease the difficulty of extracting logic from long contexts, but they don’t fundamentally solve LLMs’ lack of deep logical understanding; that hardest problem will likely have to be addressed at the model level.
  • Top-3: For LLMs to commercialize, co-design of inference algorithms and systems (software and hardware) is the way forward. I completely agree with Betty’s point: AI exploration should not be led by hardware, because the loss of diversity would be unthinkable. Existing LLM systems need co-design at both the model and hardware levels; we can’t let specific hardware become a bottleneck. There needs to be a balance across hardware beyond GPUs: CPUs, NPUs, and TPUs all have their place.

    Can’t wait to see what Google and Groq have in store :-)

  • Top-2: Memory, more memory, and more cheap memory are all you need.
  • Top-1: Seeing a professor trained in the U.S. academic system and industry persistently advocating for diversity and innovation was truly inspiring and powerful. It’s hard not to be drawn to such scientific heritage. :)