"I'm just fed up with these IVR calls. Can't we do something to end this endless spiral of pressing 1 and waiting on hold?" said Nisham Ikka, frustrated after losing money and getting stuck in the maze of automated menus.
That’s when we had the thought—why not use AI to fix this mess? But was it even possible? Could it be feasible or cost-effective? We had no idea.
So we did what we usually do—sat down, brainstormed for hours, and finally decided to give it a shot.
The next question was: who’s going to lead the development?
Since the project would revolve heavily around AI—and I was just wrapping up v1 of our other product, NeuraQuery, an AI-powered productivity tool for researchers (currently in market evaluation)—it naturally landed on my plate.
And that’s how it was decided. I’d lead the project, and we named it: Dialgen.AI.
🎀 From Stack Confusion to Python Conversion
As every other developer does, I had my initial thoughts: Where do I even start?
My go-to stack usually involves TypeScript, Node.js, React, Express, and lately, a lot of Next.js. But this time, the stakes were different. I needed to build something scalable—really scalable. After all, we’re talking about handling real-time phone calls here. That means concurrency, fault tolerance, and cost-efficiency.
Naturally, since it’s 2025, I asked GPT and Claude for their “expert” advice—got their suggestions, mixed in a little developer intuition, and guess what the answer was?
Python. 😂😂
Yep, I decided to start with the backend—and in Python, of all things.
My thinking was simple: we’re building a tool that, in layman's terms, lets an AI talk to a human over a phone call. Sounds straightforward enough, right? 😌
But being both a developer and an entrepreneur, I knew that before building anything at full scale, I had to validate the idea and make sure it was financially feasible. That’s why I started with the backend. Landing pages are easy to build, but I’ve never been a fan of the “sell first, build later” philosophy that some startups follow.
The Tech Breakdown
Dialgen.AI, at its core, needs to:
- Connect to a phone call
- Listen to the user (Speech to Text)
- Pass that to an LLM for understanding and response
- Convert the response back to speech
- Send it back to the caller
So I needed four key components:
- A telephony service
- A Speech-to-Text model
- An LLM (Large Language Model)
- A Text-to-Speech model
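Before picking vendors, it helped to see the whole loop as code. Here's a minimal TypeScript sketch of how the four pieces chain together; the interfaces and the `handleTurn` function are hypothetical names for illustration, not any vendor's API:

```ts
// One conversational turn: caller audio in, agent audio out.
// All names here are hypothetical stand-ins for real providers.
type AudioChunk = Buffer;

interface SpeechToText { transcribe(audio: AudioChunk): Promise<string>; }
interface LanguageModel { reply(transcript: string): Promise<string>; }
interface TextToSpeech { synthesize(text: string): Promise<AudioChunk>; }

async function handleTurn(
  callerAudio: AudioChunk,
  stt: SpeechToText,
  llm: LanguageModel,
  tts: TextToSpeech,
): Promise<AudioChunk> {
  const transcript = await stt.transcribe(callerAudio); // Listen (STT)
  const answer = await llm.reply(transcript);           // Understand + respond (LLM)
  return tts.synthesize(answer);                        // Speak (TTS)
}
```

Each provider slots in behind one of those interfaces, which pays off later when it's time to swap vendors.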
For the telephony layer, I went with what I already knew—Twilio. I’d used it before for phone call integration, and to be honest, their $20 free trial credits for new accounts didn’t hurt either.
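For reference, kicking off an outbound call with Twilio's Node SDK is only a few lines. The phone numbers and webhook URL below are placeholders:

```ts
import twilio from "twilio";

// Credentials come from the Twilio console; numbers below are placeholders.
const client = twilio(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN,
);

async function placeCall(): Promise<void> {
  const call = await client.calls.create({
    to: "+15550001111",               // who the agent should ring
    from: "+15550002222",             // your Twilio number
    url: "https://example.com/twiml", // webhook returning TwiML that drives the call
  });
  console.log("Call started:", call.sid);
}

placeCall().catch(console.error);
```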
🎀 Day 1: Research, Ramen, and a Rough Prototype
To kick off the development phase, I started with what any dev worth their salt does first—research. That was the morning of Day 1.
- For telephony, Twilio was fixed (it can do both inbound and outbound calls), plus free credit 🤩
- WebSockets would be required for two-way communication, so that part was a no-brainer (see the sketch after this list).
- For STT (Speech to Text), I knew about Whisper: open-source, pretty good results, and already proven in the market, so it went on my list. Parakeet was another option, and Deepgram was one of the hosted STT providers I came across.
- For the LLM, I had some open-source models in mind. Our priority was speed and fewer rubbish responses. Of course, 4o-mini was on the list, along with models from Groq, which offers free but rate-limited APIs for developers. Pretty good. (Running an LLM locally wasn't possible, and personally, I don't think that's reasonable either.)
- Text to Speech was by far the hard part. Most open-source models don't run in real time on my humble GTX 1650 Ti laptop 🤧. I still shortlisted some models. Deepgram also happens to provide a TTS model, but I wasn't a fan of the voice quality. Eleven Labs was a strong contender, but expensive and limited by its API-based usage.
  (Side note: Deepgram also dropped $200 in credits. Tempting? Absolutely. 💰💰)
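Since the project ended up on Node.js anyway, here's roughly what that two-way WebSocket work looks like in TypeScript: a sketch of the receiving end of a Twilio Media Stream. The port number and the STT hand-off are placeholders:

```ts
import { WebSocketServer } from "ws";

// Twilio connects here (via a <Connect><Stream> TwiML verb) and pushes
// the caller's audio as JSON-wrapped, base64-encoded frames.
const wss = new WebSocketServer({ port: 8080 }); // port is arbitrary here

wss.on("connection", (ws) => {
  let streamSid = "";

  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());

    switch (msg.event) {
      case "start": {
        // Twilio announces the stream; keep the ID to address audio back.
        streamSid = msg.start.streamSid;
        break;
      }
      case "media": {
        // Base64-encoded 8 kHz mu-law caller audio; forward this to STT.
        const audio = Buffer.from(msg.media.payload, "base64");
        void audio; // ...pipe into the STT provider here...
        break;
      }
      case "stop": {
        ws.close();
        break;
      }
    }
  });

  // Synthesized speech goes back over the same socket, e.g.:
  // ws.send(JSON.stringify({ event: "media", streamSid, media: { payload } }));
});
```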
Wrapping Up Day 1
By noon, I had all the pieces of the puzzle laid out. Now the task was simple: put it all together. 😂
I kicked off development and, obviously, I wasn’t going to do this blind—I brought in Claude to help me structure the early stages. Within a few hours, I had a basic setup running on my machine:
✅ STT + LLM (working, but slow and unreliable)
❌ No telephony integration yet (was saving that for later, since it’s relatively straightforward)
🔄 Started working on TTS integration
Eventually, I gave Kokoro TTS a try—it worked but wasn’t real-time on my machine. Honestly, laziness won that round… So I went straight to Deepgram for TTS.
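Wiring Deepgram in was mostly one HTTP call. A minimal sketch, assuming the Aura model ID below (check their docs for current voices):

```ts
import { writeFile } from "node:fs/promises";

// Calls Deepgram's TTS REST endpoint and saves the audio to disk.
// The model name is an example; swap in whichever voice you prefer.
async function speak(text: string): Promise<void> {
  const res = await fetch(
    "https://api.deepgram.com/v1/speak?model=aura-asteria-en",
    {
      method: "POST",
      headers: {
        Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text }),
    },
  );
  if (!res.ok) throw new Error(`Deepgram TTS failed: ${res.status}`);
  await writeFile("reply.mp3", Buffer.from(await res.arrayBuffer()));
}

speak("Hello! You have reached Dialgen.").catch(console.error);
```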
🎀 Day 2: The TTS Day
Soon after gym, I opened up Deepgram’s docs and started the integration work. At this point, I had a working STT → LLM → TTS flow (still no telephony yet), and everything was running fine.
But yeah… regret started creeping in for not doing the telephony integration right from the beginning.
Turns out, it relied on WebSockets, and once I got into handling WebSockets in Python—pain.
Why did I even choose Python again? 🤧
But giving up? Nah, not an option. Developer ego kicked in 🙂.
🎀 Day 3

Somehow I dragged through and made it work.
Was it reliable? Not at all.
But did it work? Yes.
Proof of Concept: ✅
At this point, it was pretty clear—migrating to Node.js was the move.
(Not that I hate Python... but you know how it is 🙂)
While scrolling through Twilio docs and blogs, I found a gem—a starter repo that used Deepgram for STT, an LLM, and even their TTS—all in Node.js.
Cloned the repo. Installed the dependencies. Started hacking.
Did a lot of cleanup and refactoring, made it handle concurrent users, and integrated Vercel’s AI SDK (which was so smooth 😌). That let me easily switch between different LLMs on the fly.
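That on-the-fly switching is the AI SDK's whole pitch: every provider sits behind the same `generateText` call. A rough sketch, with example model IDs that may have moved on since:

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { groq } from "@ai-sdk/groq";

// One map of interchangeable models; the rest of the pipeline never
// knows which provider is answering. Model IDs below are examples.
const models = {
  "4o-mini": openai("gpt-4o-mini"),
  llama: groq("llama-3.3-70b-versatile"),
};

async function reply(transcript: string, which: keyof typeof models) {
  const { text } = await generateText({
    model: models[which],
    system: "You are a polite phone agent. Keep answers short.",
    prompt: transcript,
  });
  return text;
}
```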
And just like that:
✅ Outbound? Working.
✅ Inbound? Working.
Dialgen was finally starting to talk.
Yep. It was running and doing its thing.
STT was fine, the LLM switched between Llama and 4o-mini, and TTS came from Deepgram.
There were issues, though: cross-talk, latency, and most importantly, the voice just wasn't that good.
So the question was: what alternative works similarly to Deepgram, supports a WebSocket API, and doesn't take much effort to switch to?
Kokoro was the best option so far, but like the saying goes, “saving the best for last!” 😌
My answer at the time? Eleven Labs.
Costly, yeah. But had to give it a try because the voices are just too damn good.
So I went for it.
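To keep that kind of vendor swap cheap, the pipeline only ever talks to a tiny TTS interface. Both classes below are hypothetical sketches, not the vendors' SDKs:

```ts
// Hide each vendor behind the same tiny interface, so the call pipeline
// never knows which voice is on duty. Bodies are stubbed for illustration.
interface TtsProvider {
  synthesize(text: string): Promise<Buffer>;
}

class DeepgramTts implements TtsProvider {
  async synthesize(text: string): Promise<Buffer> {
    // ...call Deepgram's /v1/speak endpoint as sketched earlier...
    return Buffer.alloc(0);
  }
}

class ElevenLabsTts implements TtsProvider {
  async synthesize(text: string): Promise<Buffer> {
    // ...call ElevenLabs' text-to-speech API here...
    return Buffer.alloc(0);
  }
}

// Switching vendors becomes a one-line config change:
const tts: TtsProvider =
  process.env.TTS_PROVIDER === "elevenlabs"
    ? new ElevenLabsTts()
    : new DeepgramTts();
```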
Soon enough, Proof of Concept v2 happened. 😌
Not too shabby—just... costly.
But the confidence boost was real. We knew it—this can actually work.
Just needed to make it better. And more reliable.
And in the meantime,
our team had come up with a market plan for Dialgen.AI: make it a self-serve platform where users can create their own agents, connect telephony, import contacts, and boom, there you have it. Any scenario, no issue.
A complete AI Calling Suite, powered by AI Agents, right at your fingertips.
ANY SCENARIO. ZERO DOWNTIME. HIGHLY CUSTOMISABLE. 24/7 SERVICE.
🎀 The next steps – the inevitables:
- Throw away the starter template and start from scratch
- TypeScript mandatory
- No ElevenLabs or Deepgram; borrow an RTX 3060 machine to run TTS and STT locally 😌
- LLM is non-negotiable (AI SDK is our go-to choice)
- A Dashboard and a Landing Page
We’re just getting started. Dialgen.AI is already proving what’s possible when you blend good ideas, stubborn dev energy, and the right tools. Now it’s time to make it battle-ready.
If you’re as tired of old-school IVRs as we are—and want to be among the first to experience AI-powered calls that actually work—join the waitlist on Dialgen.AI.
We’re cooking something 🔥. You’ll want a front-row seat.