Last night, I was very excited because I had managed to set up my TTS server (piper1-based) locally. It took me a whole day to make it work properly, including applying an unmerged patch2 to improve its performance in a CUDA environment. I also plugged it into my custom-built terminal chatbot client and my Emacs editor. To run the server as a tireless daemon servicea, I wrote a systemd service unit file for it.
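As a sketch of what such a unit file might look like (the paths, user, model name, and the exact server command are placeholders for my setup, not the actual file; recent Piper releases ship a `piper.http_server` module, but check your installed version):

```ini
# /etc/systemd/system/piper-tts.service -- illustrative only; adjust paths and user
[Unit]
Description=Local Piper TTS HTTP server
After=network.target

[Service]
Type=simple
User=tts
# Placeholder command; replace with however your Piper server is launched
ExecStart=/opt/piper/venv/bin/python -m piper.http_server --model /opt/piper/models/voice.onnx --port 5000
Restart=on-failure
RestartSec=3

[Install]
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload` followed by `systemctl enable --now piper-tts.service` keeps the server running and restarts it if it crashes.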
After finishing the task, my home lab server reached a state worth noting: ASR, LLM, and TTS are all running locally, with no cloud API or internet access needed. Moreover, all three services are CUDA-accelerated, which means they can respond with lower latency than a round trip to a cloud service.
I believe this setup is a crucial piece of infrastructure that does not get enough recognition, which is what prompted me to write about it.
Nowadays, computing power and chips are hot topics; everyone discusses them. Yes, today's AI/graphics chips are powerful and versatile. However, tech companies mostly put this power in the cloud, a highly controllable environment. If you aren't a gamer, you likely won't feel the need for a dedicated graphics or AI card.
In the coming years, the situation could change. ChatGPT and its peers are showing people how incredible AI capabilities can be. Most users access AI through a website or mobile app, and although people say it's amazing, this access method places a limitation on its full potential.
In the name of AI safety, I doubt that the leading AI companies will be willing to release that full potential to the public.
So that leaves an anticipated vacuum.
When technology is unstable, people are less inclined to use it. This is particularly true when it involves the expectation of natural feedback. When we talk with a person, we naturally expect quick responses, so when speech is the way we use technology, that expectation comes into play. In contrast, when writing or typing, our tolerance for latency is much higher.
So a locally running, high-performance ASR/LLM/TTS stack will be a game changer for the "talking/language/speech as the main UI" concept to succeed.
Another scenario that can serve as an analogy is cloud gaming. Even when pushed by companies like Google, cloud gaming failed, because when people play games, their tolerance for delay or pauses is much lower than when searching, shopping, writing, or coding. When gaming, we mobilize our natural instincts, and the same applies when we talk.
The potential of the technology is immense. Currently, I can talk with a robot in speech for language-learning purposes and summon basic Emacs editor commands by voice. However, it's still just a foundation. Since it runs locally, privacy concerns are lessened, and the LLM can access my local data more easily. The robot (scripts that wrap the local LLM) could access my exercise data, financial information, and daily routines. This creates a truly personalized assistant experience beyond even the most advanced cloud ChatGPTs.
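The local-data idea can be sketched in a few lines. Everything here is hypothetical: the file paths and category names are invented placeholders, not my actual data layout; the point is only that a wrapper script can fold whatever local files exist into the prompt it hands to the local LLM.

```python
from pathlib import Path

# Hypothetical local data sources; these paths are placeholders, not real files.
CONTEXT_SOURCES = {
    "exercise": Path("~/data/exercise.json"),
    "finance": Path("~/data/finance.json"),
}


def gather_context(sources=CONTEXT_SOURCES) -> str:
    """Concatenate whatever local data exists into a context block.
    Missing files are simply skipped, so the assistant degrades gracefully."""
    parts = []
    for name, path in sources.items():
        path = path.expanduser()
        if path.exists():
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)


def personalized_prompt(user_text: str) -> str:
    """Prefix the user's message with personal context before sending it
    to the local LLM -- something a cloud service can't read off my disk."""
    context = gather_context()
    header = "You are my personal assistant. Local context:\n" + (context or "(none)")
    return f"{header}\n\nuser: {user_text}\nassistant:"
```

Because everything stays on the machine, no third party ever sees the files being read.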
I even think it will become a foundation for development. ASR and TTS services naturally come with simple APIs. An LLM is a little more complex, but at its core input/output layer, it just does text completion. Therefore, the combination of ASR/LLM/TTS could become a standard supported by multiple vendors.
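A minimal sketch of what gluing the three simple APIs together could look like. The endpoint URLs and JSON shapes below are assumptions standing in for whatever the local servers expose (an ASR server, an OpenAI-style completion route such as the one llama.cpp's server offers, and a Piper HTTP endpoint); nothing here is a specific product's documented API.

```python
import json
import urllib.request

ASR_URL = "http://localhost:9000/transcribe"      # hypothetical ASR endpoint
LLM_URL = "http://localhost:8080/v1/completions"  # assumed OpenAI-style route
TTS_URL = "http://localhost:5000/"                # hypothetical TTS endpoint


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body to a local service and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def build_prompt(history: list[tuple[str, str]], user_text: str) -> str:
    """Flatten chat history into a plain prompt: at its core input/output
    layer, the LLM just completes text."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {user_text}")
    lines.append("assistant:")
    return "\n".join(lines)


def speak_reply(history: list[tuple[str, str]], user_text: str) -> str:
    """One round trip: prompt the local LLM, then hand the reply to TTS."""
    completion = post_json(LLM_URL, {"prompt": build_prompt(history, user_text),
                                     "max_tokens": 256})
    reply = completion["choices"][0]["text"].strip()
    post_json(TTS_URL, {"text": reply})  # ask the TTS daemon to speak it
    return reply
```

Because each stage is just HTTP plus JSON, any vendor's ASR, LLM, or TTS server could slot into the same pipeline.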
After my feedback, the fix for the extremely long inference time when using CUDA has been merged.