The grounds
A few years ago, the first publicly available large language models (LLMs) appeared. Back then, they struggled with basic concepts and failed to understand context. Things have changed dramatically since.
Switching to the transformer architecture made it possible to build models that store and connect information across billions of parameters. In simpler terms, researchers and developers created algorithms that let a few lines of code store, connect, and apply knowledge - essentially teaching computers to recognize patterns and reason with them.
We use the same principles - but on a smaller, more practical scale. Our aim is not to build general-purpose AIs that try to answer everything. Instead, we focus on specialized models that serve narrow, domain-specific answers - because no one needs an AI that wastes time on irrelevant questions.
Training
In an ideal world, we would train every model from scratch for each client to make it truly specific. But full training requires enormous amounts of data, computing power, and time - which makes it too expensive and inefficient for most use cases.
Instead, we rely on fine-tuning. Basically, we teach existing pre-trained models to adapt and perform specialized tasks.
We choose a high-quality base model and customize it to achieve nearly the same performance as training from scratch. There are some limitations, of course, but they aren't something you will notice once we deploy it. So, how do we do it?
Full Fine-Tuning
This is the most powerful method, but not our preferred one because of its heavy, uneven resource demands. However, some use cases do need full fine-tuning.
How does this method work?
Basically, we retrain all the model's parameters - billions of them - so it learns your data from the ground up. This gives us complete control over the model's behavior and output, but requires significant computing resources.
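The core idea - gradients flow to every parameter - can be shown with a toy sketch. This is purely illustrative: a two-parameter linear "model" stands in for billions of weights, and plain gradient descent on a squared error stands in for a real training loop.

```python
# Toy illustration of full fine-tuning: EVERY parameter is updated.
# A 2-parameter model (w, b) stands in for billions of weights.

def train_full(data, epochs=200, lr=0.05):
    w, b = 0.0, 0.0  # the "pre-trained" starting values (here: zeros)
    for _ in range(epochs):
        for x, y in data:
            pred = w * x + b
            err = pred - y
            # Gradients of squared error reach BOTH parameters:
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return w, b

# Fit y = 2x + 1 from a few examples
data = [(0, 1), (1, 3), (2, 5)]
w, b = train_full(data)
```

In a real model the same update touches every layer, which is why full fine-tuning needs so much memory and compute: the optimizer state alone can be several times the size of the model.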
QLoRA
Or, in full, Quantized Low-Rank Adaptation - our go-to method for building custom models. Basically, we fine-tune a model by training only a small fraction of its parameters, while reaching up to 90% of the quality of full fine-tuning.
Here’s how it works in simpler terms:
We take a base model, "freeze" most of it, shrink it through quantization (to 4-bit or 8-bit weights), and then add and train small adapter layers that learn the new information.
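The steps above can be sketched in miniature. This is a toy illustration of the idea, not our production pipeline: a 2x2 weight matrix is "quantized" by snapping values to a coarse grid (standing in for 4-bit storage), it is then frozen, and only two small rank-1 adapter vectors are trained so that the effective weight becomes the frozen base plus a learned low-rank update.

```python
# Toy QLoRA sketch: frozen quantized base W, trainable low-rank
# adapters A and B; effective weight = W[i][j] + B[i] * A[j].

def quantize(value, step=0.5):
    # Crude uniform quantization standing in for 4-bit storage:
    # snap each frozen weight to a coarse grid.
    return round(value / step) * step

# Frozen, quantized 2x2 base weights (billions of them in practice)
W = [[quantize(0.73), quantize(-0.21)],
     [quantize(0.48), quantize(1.12)]]

# Trainable rank-1 adapters: the ONLY parameters that learn
A = [0.0, 0.0]
B = [0.1, 0.1]

def fit_adapters(W, T, A, B, lr=0.1, epochs=2000):
    """Gradient descent on the adapters only; W is never touched."""
    n = len(W)
    for _ in range(epochs):
        gA, gB = [0.0] * n, [0.0] * n
        for i in range(n):
            for j in range(n):
                err = W[i][j] + B[i] * A[j] - T[i][j]
                gA[j] += 2 * err * B[i]
                gB[i] += 2 * err * A[j]
        for k in range(n):
            A[k] -= lr * gA[k]
            B[k] -= lr * gB[k]
    return A, B

# Target behaviour: the base weights shifted by a small task-specific delta
T = [[W[0][0] + 0.4, W[0][1] + 0.2],
     [W[1][0] + 0.4, W[1][1] + 0.2]]
A, B = fit_adapters(W, T, A, B)
```

The point of the sketch: the model's new behaviour lives entirely in the tiny adapters, which is why QLoRA fits on modest hardware while the frozen base stays compact in low precision.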
Adding RAG
To further enhance a model's behaviour, we add RAG (Retrieval-Augmented Generation). It lets the model pull data from verified sources - company documents or databases.
In practice, if a hospital uses a custom RAG-based model, it will only respond to queries about medical operations or internal documentation - not unrelated topics like physics or sports. This makes the model more reliable and safer to use.
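A minimal sketch of the retrieval step makes this concrete. The document names and texts below are invented for illustration, and simple keyword overlap stands in for the vector-similarity search used in real RAG systems; the key behaviour is that when no verified document matches, the system declines instead of guessing.

```python
# Toy RAG retrieval: score verified documents against the query by
# keyword overlap, and refuse to answer when nothing relevant matches.

DOCS = {  # hypothetical internal documents
    "discharge_policy": "patients are discharged after final ward review",
    "mri_booking": "mri scans must be booked through the radiology desk",
}

def retrieve(query, docs, min_overlap=2):
    q = set(query.lower().split())
    best_name, best_score = None, 0
    for name, text in docs.items():
        score = len(q & set(text.split()))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None

def answer(query):
    doc = retrieve(query, DOCS)
    if doc is None:
        return "Sorry, that is outside the scope of our documentation."
    # In production the retrieved text is prepended to the model's
    # prompt; here we just report which document grounds the answer.
    return f"Grounded in: {doc}"

print(answer("how are mri scans booked"))    # Grounded in: mri_booking
print(answer("who won the football match"))  # out-of-scope refusal
```

Because every answer is anchored to a retrieved document, off-topic questions never reach the generation step at all.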
Humanization
Finally, we "humanize" our models. In short, we apply ranking- or reinforcement-based optimization to produce more natural, user-friendly answers.
This step helps the AI learn what “good” answers look like - tone, length, and clarity.
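A toy version of the ranking idea looks like this. The hand-written reward function below is an invented stand-in: in real preference optimization the reward is learned from human rankings rather than coded by hand, but the principle - score candidate answers on tone, length, and clarity, then prefer the highest-ranked - is the same.

```python
# Toy ranking-based preference: score candidates on tone, length,
# and clarity, then pick the best. Real systems learn this reward
# from human rankings instead of hand-coding it.

def reward(answer):
    score = 0
    words = answer.split()
    if 5 <= len(words) <= 25:   # concise but complete
        score += 2
    if answer.endswith("."):    # finished sentence
        score += 1
    if not answer.isupper():    # no shouting
        score += 1
    return score

candidates = [
    "YES.",
    "Your appointment is confirmed for Tuesday at 10 am.",
    "confirmed i guess maybe check later or something not sure honestly " * 3,
]

best = max(candidates, key=reward)
print(best)  # the polite, concise confirmation wins
```

During training, the model is nudged toward answers the reward ranks highly, which is how it internalizes what a "good" answer looks like.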
Ready to work with us?