So, a mate of mine starts rambling about this project idea of his — keeping it vague for now 'cause he's gonna try and market it or something — but basically it’s gonna involve AI, voice, the works. Since I’ve been off work recovering from some delightful dental surgery (10/10 don’t recommend), he asked if I could whip up a basic offline AI to help with his prototype.

One week later, in between games and wrangling the kids, I’ve somehow ended up knee-deep in a full-on desktop AI assistant. I’m calling it Version 0.8 for now, with my “MVP” version being 1.0.

Right now it uses FFmpeg, Whisper, LLaMA3, and Coqui TTS. It handles both text and voice input/output, caches WAVs, convos, user settings, and has a few colour themes 'cause who doesn’t love a bit of flair. Currently working on per-conversation caching and trying to make convos reference each other — which is as fun as it sounds.

Also, the AI voice? Sounds like a half-baked call centre operator. Absolutely cooked. I’m adding more voice options soon so it stops sounding like a robo-Karen trying to upsell me internet plans.

Performance-wise, I’ve managed to take voice response from "go make a cuppa" times down to about 6–8 seconds, thanks to streaming chunked WAVs and throwing the GPU at it. Still not lightning, but hey, it’s no longer yelling into the void and waiting for enlightenment.

Anyway, point is — since I was putting together a train anyway, thought I’d ask: anyone got feature ideas? Already blown past what my mate expected, so I’ve got a pretty hefty roadmap going. But I’m all ears for wild suggestions, practical or ridiculous.

Here is your entry to a progressive train. Good Luck and Enjoy ^^

Just finished adding multiple conversations, user-defined conversation titles, conversation tabs, persistent (cached) conversations, and conversation deletion ^^ Currently the entire app is 755 megabytes. Let's watch that expand >.<

3 weeks ago*


That's unironically amazing. In the same time period, I almost learned to adjust the wig angle on a Second Life avatar.

In re: nothing, your use of both "cuppa" and "hillbilly" in the same thread makes you seem culturally and geographically ambiguous.

3 weeks ago

And here I thought the two were related and we would finally have an AI that acted like it was on drugs all the time :/

3 weeks ago

If you heard the voice my AI currently has, your thoughts might very well be real ^^

3 weeks ago

Bump

3 weeks ago

Unable to help with your AI project, but the train is great! Thanks!

3 weeks ago

Bump

3 weeks ago

🎤😱I like cucumbers but not Mike.🥒😋
However, I am a bit intrigued by the textual response.
May things go well for you.🍀

3 weeks ago

Nice

keeping it vague for now 'cause he's gonna try and market it or something

assuming that means open-source projects like github etc is likely off the table at least for the foreseeable future. sad but i get it

anyone got feature ideas?

I don't know enough about it to be able to tell if I'm in your target demographic but I have used some speech recognition stuff in the past (been awhile... mostly it was Windows Speech Recognition aka WSR and some 3rd party programs that relied on it way back in the Windows 7 days). I am aware of Whisper and keep meaning to dive into AI tech, but I haven't actually studied or even played around with it yet. That said, I think I am familiar enough as a user to at least suggest some potential features (again take or leave 'em as needed for whatever fits for you).

  • A "push to talk" option. I've had times where it's nice to leave things running, but it can also be VERY frustrating when you have a phone call, a meeting, someone physically coming over, etc. and it decides to pull words out of a conversation that was never intended for the speech recognition to operate on.
  • In some cases, being able to output the spoken text to a file is nice. There are already similar projects on GitHub that do stuff like that with Whisper, e.g. for creating subtitles for YouTube (or other) videos. But depending on what you are doing, sometimes it can be nice to be able to print out a list from whatever specific app you're using too.
  • If you are writing in a fairly system-agnostic language like python3 (which a lot of AI stuff is in), then IMHO it's always a good idea to put in the extra time to make sure nothing is OS-specific. And if you DO get lazy and write something OS-specific anyway, do it in a wrapper function so that it's easier to change later if you ever decide to. Then even if you don't immediately target, say, Mac or Linux, there's less overhead involved in offering it to a wider number of users. In particular, avoiding WinForms and related packages helps immensely with portability. Even in dotnet, it seems like WinForms is responsible for a large number of projects not being easily cross-platform where they otherwise would be. Projects without Windows-specific stuff like WinForms can generally be run by Linux users under Wine/Proton even if you don't specifically code for them... as many game devs have found out since Proton/SteamDeck started gaining popularity.
  • Being able to remap some words is sometimes nice. Haven't explored Whisper enough to know if there's already functionality for this but I know in WSR, there were sometimes issues where you would say one word and it would interpret it as something else. Being able to define custom overrides can be nice in situations like this.
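
To make the remapping idea concrete, here's a rough Python sketch of custom overrides applied to the text after transcription. It's only an illustration under my own assumptions: `build_remapper` and the example override table are made-up names, and nothing here is a built-in Whisper feature. It just post-processes whatever text the recogniser returns.

```python
import re

def build_remapper(overrides):
    """Return a function that swaps misheard words/phrases for the intended ones.

    `overrides` maps what the recogniser heard -> what the user meant
    (a hypothetical, user-defined table).
    """
    lookup = {heard.lower(): meant for heard, meant in overrides.items()}
    if not lookup:
        return lambda transcript: transcript  # nothing to remap

    # Longest phrases first so multi-word entries win over shorter ones.
    phrases = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def remap(transcript):
        # Single pass, whole-word matches only, case-insensitive.
        return pattern.sub(lambda m: lookup[m.group(0).lower()], transcript)

    return remap


remap = build_remapper({"cookie tts": "Coqui TTS", "lama": "LLaMA"})
print(remap("I use Cookie TTS and lama"))
```

Doing it in one regex pass means a replacement can never be re-matched by another override, which is one less surprise when the table grows.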
3 weeks ago*

You're pretty spot on with a few things I have looked at.

Push to talk is already implemented as a feature. I just haven't mapped a key to it yet. I looked into PTT because I currently have it set to a 5-second talk input time. This time can be lengthened, but for testing purposes 5 seconds is enough. I was having issues implementing PTT properly. Depending on the toggled key, it wasn't allowing Whisper to close out the transcription, so it would almost idle out as if still waiting for input. Some keys worked, some didn't. So I stopped and figured I'll fix it up later. Let's say it's half implemented lol

Outputting the spoken text to a file already happens. It currently saves text logs, terminal logs and all the audio files. The audio files are saved in chunks, and with the tick of a box in the settings it will also save the fully built WAV file, piecing all the chunks together.
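
For the curious, piecing chunks back into one WAV can be sketched with Python's standard `wave` module. This is a minimal illustration rather than the app's actual code: `merge_wav_chunks` is a made-up name, and it assumes every chunk shares the same channel count, sample width and sample rate.

```python
import wave

def merge_wav_chunks(chunk_paths, out_path):
    """Concatenate WAV chunks (all in the same audio format) into one file."""
    chunk_paths = list(chunk_paths)
    if not chunk_paths:
        raise ValueError("no chunks to merge")

    expected = None
    with wave.open(str(out_path), "wb") as out:
        for path in chunk_paths:
            with wave.open(str(path), "rb") as chunk:
                # (channels, sample width in bytes, frame rate)
                fmt = tuple(chunk.getparams()[:3])
                if expected is None:
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has a different audio format")
                # Append this chunk's raw audio frames to the output file.
                out.writeframes(chunk.readframes(chunk.getnframes()))
```

Calling `merge_wav_chunks(["chunk0.wav", "chunk1.wav"], "full.wav")` (file names are placeholders) would write the chunks back to back in the order given.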

I'm not making anything OS-specific, but that's more because the one who asked me to make it was too vague and couldn't specify what OS he was looking at using. So right now it could be packaged for almost anything.

As for remapping words, I haven't really looked too far into this. I have considered making it always listening. Could do wake words similar to the whole "Hey Google", "Alexa" etc. So remapping words for certain functions shouldn't be hard, especially considering it's literally a transcription defining a named item or action etc.

3 weeks ago

Thanks for the train!

3 weeks ago

What a nice train to celebrate the conception of HAL 9000 bump!

3 weeks ago

Nice train, thanks and bump!

3 weeks ago

Bump!

3 weeks ago

Bump!

No recommendations, but I look forward to when my PC will be able to insult me for my life choices in real time!

3 weeks ago

Thank you for the chance!

3 weeks ago

Bump

3 weeks ago

Bump!

3 weeks ago

Right now it uses FFmpeg, Whisper, LLaMA3, and Coqui TTS

I have no idea what the hell this means, but have a bump

3 weeks ago

oh my train bump. this is long.

I have no suggestions for AI since I barely use it, though I probably should when I see how impressive some stuff is... Only thing that's stopped me from using it more for work is that the internal AI that we're allowed to use doesn't make it easy to find the section of a webpage it's citing from to give its (probably made-up) answer.

3 weeks ago

well, thanks for the train ^^"

3 weeks ago

Thank you thank you, and great work! I hope your mouth is recovering well

3 weeks ago

I don't use AI so I can't help you with your assistant, but thanks for the looong train! :)

3 weeks ago

Bumping into this train!

3 weeks ago

Good luck on your project 🙂

3 weeks ago

B00p

3 weeks ago

Bump :)

3 weeks ago

I do use a spicy AI chat, and have tried a few others of the like, including RPG ones, and ChatGPT for deeper topics. The number 1 thing I need is for it to not JUST rely on chat history for memory. So many times it's forgotten details because it relied on chat logs only. This is especially important for RPGs. So yeah, a really decent, integrated memory, where it can move important information into a permanent repository until it's no longer required, but also something the user can manipulate/edit/delete/add to as needed.

3 weeks ago

The problem you have with that aspect is that every AI will lose itself as the conversations lengthen. Holding that off for as long as possible simply boosts the minimum system requirements well and truly past most people's system specs. You could quite literally have memory retention and attention to detail, with the program capable of a specific, detailed function, i.e. you could theoretically program the AI to perform as a D&D GM while retaining all information regarding characters, locations, rolls and more. However, I almost guarantee a standalone computer wouldn't be able to meet the requirements if I was to do that, and even then it would get confused the moment you tried to implement it with Magic the Gathering details, for example.

This is one of the reasons why every AI program, even things like ChatGPT (including for those who pay for it), will inevitably lose itself in conversation. Just imagining the hardware requirements puts my recent $32,000-odd dental bill to shame ^^
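
To show why old details drop out, here's a tiny Python sketch of the usual trick: only the newest messages that fit a fixed budget get sent to the model, and everything older falls off. It's an illustration under assumptions, not how any particular app does it. `trim_history` and the message shape are made up, and word count is a crude stand-in for real token counting.

```python
def trim_history(messages, max_words):
    """Keep the newest messages whose combined word count fits the budget.

    `messages` is a list of dicts like {"role": ..., "text": ...},
    oldest first (a hypothetical shape for this sketch).
    """
    kept = []
    used = 0
    # Walk backwards from the newest message until the budget runs out.
    for msg in reversed(messages):
        cost = len(msg["text"].split())
        if used + cost > max_words:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore oldest-first order
    return kept
```

A bigger budget keeps more of the conversation but needs more memory and compute per reply, which is the hardware trade-off in a nutshell.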

3 weeks ago

I'm not suggesting everything needs to be stored that way. You have a lot of fluff and world-building stuff that can more easily stay in chat logs, but things like location, inventory, powers/skills/abilities, important decisions, etc. can be moved into permanent memory. ChatGPT has memory, even though it doesn't seem to be handled well IMHO.
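
Roughly what I mean, as a small Python sketch: key facts live in a little JSON file the user can edit, separate from the chat log, and get placed at the top of each prompt. Everything here is hypothetical (the `MemoryStore` class, its method names, the file layout); it's just one simple way such a store could look.

```python
import json
from pathlib import Path

class MemoryStore:
    """User-editable key facts kept outside the chat log (hypothetical design)."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.facts = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.facts = {}

    def set(self, key, value):
        # Add or overwrite a fact, then persist straight away.
        self.facts[key] = value
        self._save()

    def delete(self, key):
        # Drop a fact once it's no longer required.
        self.facts.pop(key, None)
        self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.facts, indent=2), encoding="utf-8")

    def as_prompt(self):
        # Text block meant to be placed ahead of the recent chat history.
        if not self.facts:
            return ""
        lines = [f"- {key}: {value}" for key, value in self.facts.items()]
        return "Known facts:\n" + "\n".join(lines)
```

Because the facts are re-sent with every prompt, they don't depend on how much of the chat log still fits, and the user can fix or remove anything the AI got wrong.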

3 weeks ago

You have somewhat piqued my curiosity. May I ask what RPG it is you are trying to play with AI? I feel like I could make an AI-driven program that would allow this. I honestly don't know what limitations I would be facing, and that's the curiosity part. However, I need to know what RPG specifically it is. More could be added to it later I suppose, but I would start with just one if I was going to look into it. ^^

3 weeks ago

I don't recall which ones I've tried now. I haven't exactly tried to run a specific RPG, like D&D, through an AI; it's been more general RPing via AI. I know I've tried via ChatGPT, but it forgets its own rules/decisions :)

3 weeks ago