Limitations and opportunities of large language models: a product manager's guide to AI adoption

Tags: model productization, product thinking, product manager, LLM

Note: This article was written on September 1, 2024, and the conclusions in the article may change over time.

In addition, this article is aimed at readers who are product managers on non-algorithm teams. Some details are omitted to keep the article readable, the focus is on engineering implementation rather than academic discussion, and it makes no claim to be authoritative.

1. Understanding the engineering of AI products

Frankly speaking, in 2024 the development of products around big models has fallen a bit short of earlier expectations. In the BI field, for example, Chat BI has been hyped loudly, but the results after deployment have not been good. This is normal: people always overestimate the value a technology brings in the short term and underestimate the value it brings over the long term.

There are objective reasons for this. Getting a new technology base applied across every corner of an industry is itself a process, because the technology has to compete with existing solutions, as in Yu Jun's well-known demand formula:

User value = new experience - old experience - replacement cost.

It's a fact that many times, even with new technology, the benefits may not be as great as one might think.

Another reason lies in practitioners' understanding: even in some large internet companies, most people's understanding of the actual strengths and weaknesses of big models lags a generation behind the facts.

Because technology is advancing so quickly now, and there are all sorts of paths to practice, some people will think this stuff is omnipotent, and some people will think this stuff doesn't work at all.

Why do different people assess this thing so differently? In large part because they don't understand the difference between the big model as an API and the big model as a product.

The big model itself can be viewed as a function, an API; it does nothing unless it is called. The big model product is the thing that actually faces the user.

For example, if I hand the big model API an Excel file, it will tell me, sorry, I have no way to read the contents of this file. But in Kimi's chat box we can ask Kimi to interpret the contents of that Excel file. Why the difference?

Because Kimi is a big model product. Behind it runs a Moonshot-v1 model; Kimi Chat reads your Excel file and converts it into something like XML before handing it to the big model. (That is my guess.)

When a model is engineered into a product, many restrictions are usually added. These restrictions may be imposed at the product level rather than by the API itself. For example, many products limit the size of the PDF a user can upload in order to control cost; if you use the API there is no such restriction, or the limit can be set much higher, provided you first convert the PDF into a file format the model can understand.
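To make the API-versus-product distinction concrete, here is a minimal sketch of what that product-level engineering might look like (my guess at the general shape, not Kimi's actual implementation): the spreadsheet is flattened into plain text the model can read, and oversized input is truncated by a product-level rule that the API itself does not impose. The helper names and the limit are hypothetical.

```python
# Minimal sketch: product-layer preprocessing before the model API is called.
# Assumption: the upload is an .xlsx file and openpyxl is installed.
import openpyxl

MAX_CHARS = 50_000  # product-level limit chosen to control cost, not an API rule

def excel_to_text(path: str) -> str:
    """Flatten a workbook into plain text that a text-only model can consume."""
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    lines = []
    for sheet in wb.worksheets:
        lines.append(f"# Sheet: {sheet.title}")
        for row in sheet.iter_rows(values_only=True):
            lines.append("\t".join("" if cell is None else str(cell) for cell in row))
    return "\n".join(lines)

def build_prompt(path: str, question: str) -> str:
    text = excel_to_text(path)[:MAX_CHARS]  # truncation is a product decision
    return f"The user uploaded a spreadsheet:\n{text}\n\nQuestion: {question}"
```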

Products on the market do a great deal of engineering transformation, even Function Call work. Using only the product is not conducive to a product manager understanding the strengths and weaknesses of the big model, and therefore not conducive to applying the big model to improve existing products.

So why do I think product managers should pay more attention to the big model itself (the API) than to big model products? Because the engineering transformation in between, from API to product, is exactly what product managers need to focus on most.

The big model is like a brain; engineers and product managers need to design the five senses, torso, and limbs for it. A strong brain with crippled hands is still crippled, so engineers and product managers matter a great deal to whether an AI product is good; and well-developed limbs attached to a simple mind ultimately cannot solve the user's problem either.

If anything, the former may be even worse for users.

To make great AI products, you don't just need great big models, you need great engineers and product managers to assist the big models.

This requires product managers to know two things very well:

  1. What are the limitations of large models at this stage, and which of these limitations can be addressed by model iteration and which cannot.
  2. What is the real value of big models in a business sense when analyzed from a more underlying business perspective? Note that the emphasis here is on the business perspective, not on getting product managers to read papers.

2. What are the limitations of large models?

2.1, some problems that may never be solved

2.1.1, Cost, Performance and Responsiveness

The more performance you want from a large model, the more computation you have to pay for.

Compute cost raises two problems:

  • Direct monetary cost;
  • Responsiveness;

The diagram below shows the architecture of Apple Intelligence, with two models on the end and a larger model in the cloud based on privacy cloud computing.

Why does Apple engineer this combination of small and large models?

This architecture was adopted because Apple wants the responsiveness of the big model to match Siri's current performance, because mobile devices are inherently constrained by power consumption, and because Apple places a high value on privacy and wants 80% of issues to be resolved locally on the user's device.

To run Meta's latest open-source Llama 3.1, the 70B version requires roughly 70 GB of video memory, and the 405B version may require around 400 GB; it might take the equivalent of 100 iPhones running in parallel to host such models.
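As a rough, back-of-envelope sanity check on those figures: weight memory is roughly parameter count times bytes per parameter (ignoring KV cache and activations), so the cited numbers correspond to int8-quantized weights.

```python
# Back-of-envelope weight memory: params x bytes per param (KV cache and
# activations ignored). The cited ~70 GB / ~400 GB figures match int8 weights.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (8, 70, 405):
    print(f"Llama 3.1 {params}B: "
          f"fp16 ~{weight_memory_gb(params, 2):.0f} GB, "
          f"int8 ~{weight_memory_gb(params, 1):.0f} GB, "
          f"int4 ~{weight_memory_gb(params, 0.5):.0f} GB")
```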

This kind of small-model/large-model routing design needs product managers. Which problems are suited to the small model and which need the large model is clearly not a question for R&D alone; product managers should participate and take responsibility for the following (see the routing sketch after this list):

  • Collect the current user's Query;
  • Categorize Query in terms of resolution difficulty, privacy, requirements for timeliness, and requirements for accuracy;
  • Designing benchmark tests to obtain criteria for the demarcation of size models;
  • Continuous tracking optimization;
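Here is the routing sketch referred to above: a minimal illustration of how the benchmark-derived criteria might turn into routing logic. The model names, fields, and threshold are all hypothetical; the real criteria are exactly what the product manager's benchmarking work should produce.

```python
# Minimal sketch of small-model/large-model routing. All names and the 0.4
# threshold are hypothetical; real thresholds come from the benchmark work above.
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    contains_private_data: bool
    needs_live_web_data: bool
    estimated_difficulty: float  # 0..1, e.g. from a cheap on-device classifier

def route(q: Query) -> str:
    if q.contains_private_data:
        return "on_device_small_model"   # privacy first: never leave the device
    if q.needs_live_web_data:
        return "cloud_large_model"       # the on-device model has no web access
    if q.estimated_difficulty < 0.4:
        return "on_device_small_model"   # fast and cheap for easy queries
    return "cloud_large_model"

print(route(Query("Summarize this local note", True, False, 0.2)))
```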

The local-versus-cloud tug of war will be with us for a long time yet, and it is an opportunity for product managers.

2.1.2, Window size and instability

We often see announcements that some big model now supports a 128K context, to great excitement.

Just as often, we see complaints that some big model's hallucination problem is severe, to great indignation.

What does context mean? It is the maximum amount of information a large model can receive within a single request. When we chat with ChatGPT, we find that it sometimes forgets what it said earlier in the conversation; that is because the conversation has exceeded the context window.

Hallucination, on the other hand, means the big model is prone to talking nonsense and making up things that do not exist, especially once it has forgotten what you said earlier and you ask it a similar question.

Much like a scumbag boyfriend you have been dating for a while.

You ask, "What's my name?"

He replied, "Of course it's called darling."

Actually, he couldn't remember your name, so he made something up. Incredible, this thing really is human-like.

According to NVIDIA's paper "RULER: What's the Real Context Size of Your Long-Context Language Models?", most advertised context windows are essentially inflated: near the limit of their advertised length, the accuracy of the various big models is not guaranteed.

Say a model advertises that it supports a 128K context (meaning it can almost read a 200,000-word novel). In reality, if you plant a few random sentences in the novel and then ask the big model questions about them, there is a fair chance it cannot answer, and its performance decays as the context grows.

As shown below, in the case of GPT4, performance starts to plummet when the context exceeds 64k:

Practically speaking, I think these models will perform even worse than you think.

I had the Claude 3.5 Sonnet model analyze a piece of SQL: a complex 700-line query, though the overall logic was fairly simple and almost every line was commented. Even so, Sonnet started talking nonsense, claiming that a certain table did not exist in the SQL. I can't rule out that the problem was my calling Sonnet from inside Monica's client; perhaps Monica adds some prompt of her own that interferes with the model.

How can we avoid the influence and interference of long contexts while still solving the user's problem? This is something that also requires the product manager's involvement, for example:

  • Investigate whether it is possible to slice long text into multiple paragraphs without affecting the final result;
  • Researching how to plug in some memory banks to AI that can remember for extra long periods of time;

For example, there is an article on Juejin, "8 Optimization Ways to Keep AI in Long-Term Memory in Multi-Round Conversations (with Cases and Code)", which covers 8 mainstream approaches; which to choose should be decided by the product manager based on the business scenario.

Article at https://juejin.cn/post/7329732000087736360
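One of the mainstream approaches that article discusses is compressing older turns into a summary so the conversation stays inside the window. A minimal sketch of that idea, assuming a hypothetical `llm()` helper that sends a prompt to whatever chat model you use:

```python
# Minimal sketch: when the transcript nears the context limit, summarize the
# oldest turns into a compact memory note and keep only recent turns verbatim.
def compress_history(turns: list[str], llm, max_chars: int = 8000) -> list[str]:
    transcript = "\n".join(turns)
    if len(transcript) <= max_chars:
        return turns                      # still fits, nothing to do
    old, recent = turns[:-6], turns[-6:]  # keep the last 6 turns verbatim
    summary = llm(
        "Summarize the following conversation into bullet points, keeping "
        "names, numbers and decisions:\n" + "\n".join(old)
    )
    return [f"[memory summary]\n{summary}"] + recent
```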

A final chat about why I think the issue of context windows and instability is a difficult one to solve in the long run.

The problem of context window size has been somewhat mitigated over time, but NVIDIA's paper also shows that window size and the stable, hallucination-free extraction of content are largely in tension, much like precision and recall in recommender systems.

This means we may not be able to have both for a long time, unless a model suddenly appears that solves hallucination on the one hand and guarantees a huge window on the other.

In practice we often need to avoid extreme cases (such as the 700-line SQL parsing failure I ran into), and reducing the context size is an important way to do that. In addition, the model's performance is not the same under different tests; that is, the severity of the hallucination problem differs across business scenarios.

The maximum window that a model can accommodate and the effective working window are two separate concepts, and the effective window size can be very inconsistent from task to task.

I certainly hope I am wrong, but at this point I do not see any model breakthrough on this issue. A company called Magic has announced a model claiming a 100-million-token context window, but as of this writing (2024.9.1) it has not released a paper or anything more tangible.

Still, the maximum window and the effective working window are two concepts.

In addition, the development of multimodality somehow exacerbates the problem of undersized windows.

2.1.3, The model cannot call itself as a function

Sometimes people try to compose loops inside the prompt, for example: here is an XML document, traverse it for me. Typically, big models do not faithfully carry out this kind of request.

The reason is simple: the model cannot call itself as a function, and because of hallucination it cannot reply precisely for each item; it may even mix N rows of data together when analyzing them, so loop-traversal requests like this are usually not fulfilled.

The reason self-calls are not supported is also simple: within a single request, if loops were supported, a single API call could trigger the big model hundreds or thousands of times internally, and the API provider could not bear that cost.

Because the big model itself is highly unstable, we very much need to control it with loops and conditional logic, and the lack of self-calls means that even the simplest traversal operation has to be engineered externally.
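A minimal sketch of what that external engineering looks like, assuming a hypothetical `llm()` helper: the loop lives in our code, and the model only ever sees one record per request, so hallucination cannot smear rows together.

```python
# Minimal sketch: do the traversal outside the model, one record per request.
import xml.etree.ElementTree as ET

def analyze_records(xml_text: str, llm) -> list[str]:
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall(".//record"):   # the loop lives in our code
        snippet = ET.tostring(record, encoding="unicode")
        results.append(llm("Analyze this single record and flag anomalies:\n" + snippet))
    return results                             # one model call per record
```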

2.2, some engineering difficulties

2.2.1. The no-longer-interconnected Internet

Apple ushered in the mobile internet era, but it also created one of its most criticized phenomena: walled gardens.

Originally, most websites were built for search engines and humans, meaning crawlers could freely access more than 90% of a site's content.

This is crucial for AI. Here is an example of the difference in answer quality between Doubao and Tencent Yuanbao for the same question:

It is clear that Doubao's answer quality is worse. It is fair to say that the latest advance in the RAG space is Microsoft's open-source GraphRAG, a point not reflected at all in Doubao's answer.

It is rather amusing that Tencent Hunyuan cites Volcano Engine, while Doubao cites some obscure no-name outlet.

Doubao's model capability is stronger than Tencent's Hunyuan model (in the words of Tencent insiders, "even a dog wouldn't use" Hunyuan), so why, judging from the final results, is Doubao's answer worse than Hunyuan's?

Because Toutiao's data is not as good as WeChat's.

To address the fact that the internet is no longer interconnected, Apple wants to make the UI more friendly to large models at the operating-system level, and published a paper called "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs" (https://arxiv.org/pdf/2404.05719). But I think more open APIs and content are the way forward, since Apple's interoperability is limited to the iOS ecosystem.

And for product managers these are natural spaces to play with:

  • Where to get better data;
  • How to get AI to call other people's APIs and use the results for its own purposes;
  • How to study and understand Apple's latest Ferret-UI;

These are propositions well worth examining.

2.2.2, Paternalistic vendors

All the big models ship with a safety mechanism, and that mechanism is baked into the model; the API has no switch to turn it off. You can choose to lower the safety level, but there is no way to disable it. There are, of course, many jailbreak techniques on the market, but these count as vulnerabilities and are easily blocked once the vendor discovers them.

For example, if you tell the big model "I lost an argument, teach me how to curse someone out," it will refuse. Personally, I think baking the safety mechanism into the model without offering a switch is rather paternalistic, but there is no way around it.

So there are many locally deployed models on the market whose selling point is precisely the absence of a safety mechanism: porn, gambling, drugs, violence, 18+ content, whatever you want. That is human nature, and it is also an opportunity worth product managers' attention.

Additionally there is a concern that the same content has different thresholds for being safe under different languages, as an example:

When translating the Xidan man-bun story into English or Spanish via Google Gemini Pro 1.5, the model reports an error saying the content is too pornographic and refuses to generate it, yet the Japanese version goes through without a problem.

What does it show? It shows that the Japanese corpus is really sick, and indirectly it can show that Japanese people are indeed the sickest people in the world.

2.3. Issues that currently exist, but may be addressed in the future

2.3.1, Weak intent understanding/creative writing/reasoning

The ability of large models to understand intent, create and reason is still far from the top human level overall.

Trying to get big models to do anything "creative" requires very strong prompt engineering.

The gap between weak and strong prompts can indeed be very large, but I think that as models iterate, our prompts will have less and less impact on the quality of the generated results; their main effect will be on improving accuracy.

Of course, if there is some generational difference between the two models, there must be a qualitative difference in the results generated:

So should you invest heavily in optimizing the model's prompts? I think it depends on what the optimization is for.

If it is to ensure the stability and consistency of the format and output, it is very necessary, because our product business often depends on that consistency; for example, the big model's output format must be JSON so that downstream systems can render it properly. A sketch of this kind of format enforcement follows.
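A minimal sketch of that format enforcement, with a hypothetical `llm()` helper: demand JSON in the prompt, validate it, and retry once if parsing fails, so downstream systems always receive the shape they expect.

```python
# Minimal sketch: prompt for JSON, validate, retry once on failure.
import json

def ask_for_json(llm, question: str, retries: int = 1) -> dict:
    prompt = (
        "Answer the question below. Respond with ONLY a JSON object of the form "
        '{"answer": string, "confidence": number between 0 and 1}.\n' + question
    )
    for _ in range(retries + 1):
        raw = llm(prompt)
        try:
            return json.loads(raw)  # downstream systems rely on this shape
        except json.JSONDecodeError:
            prompt += "\nYour last reply was not valid JSON. Reply with valid JSON only."
    raise ValueError("model never produced valid JSON")
```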

If it is to improve quality, I don't think it is necessary, because the model itself will be upgraded, and those upgrades will certainly bring more improvement than painstaking prompt tweaking.

https://github.com/Kevin-free/chatgpt-prompt-engineering-for-developers

This is Andrew Ng's prompt-engineering course, probably the most authoritative prompt-engineering course on the market today, available in both English and Chinese.

In addition, for long-chain SOPs, workflows, and reasoning processes, I would recommend implementing them with multiple AI Agents rather than trying to solve them inside a single round of conversation, for the reasons already laid out in the limitations above.

2.3.2, cross-modal data reading/generation capabilities

If there is a video here and you want AI to summarize what the video is about, how should you implement it?

Take BibiGPT, a well-known open-source project with 5.1K stars, as an example. The earliest version of this project did essentially one thing (reverse-guessed from its behavior): recognize subtitles with OCR while converting the video to audio and running ASR to get the text, then let GPT-3.5 summarize it.

Project address: https://github.com/JimmyLv/BibiGPT-v1

Of course, today's version of the project is no longer that simple; for example, it presumably takes many video screenshots and asks a model with multimodal support to identify the key content in them.

But back to the first version of BibiGPT: this is exactly how it achieved video-to-text.

Such a step is now theoretically unnecessary, since Google's latest model, Gemini, already supports parsing video directly, though it is expensive to use. Here is Google's official documentation on Gemini's handling of video, audio, and images:

https://cloud.google.com/vertex-ai/generative-ai/docs/samples/generativeaionvertexai-gemini-all-modalities?hl=zh-cn

Personally, I don't recommend over-engineering cross-modal conversion, because the biggest problem with addressing cross-modality by engineering means is information loss. Model iteration will almost certainly solve the cross-modal problem end to end; we should focus on the problems listed above that may never be solved, and not race the model head-on, that is a race we cannot win.

But it needs to be emphasized that extracting the text of a blog page and converting it to Markdown, or converting a PDF to Markdown, is not cross-modal work; it is just data cleansing, and the two must be strictly distinguished.

The matter of data cleansing is best tackled with an engineering approach.
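A minimal data-cleansing sketch (engineering, not cross-modal work): strip an HTML page down to headings, paragraphs, and list items in Markdown-ish plain text before handing it to a model. It assumes BeautifulSoup is installed; a real pipeline would also handle tables, code blocks, and encoding issues.

```python
# Minimal sketch: HTML page -> Markdown-ish plain text for a model to read.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def html_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                     # drop obvious page noise
    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name in ("h1", "h2", "h3"):
            lines.append("#" * int(el.name[1]) + " " + text)
        elif el.name == "li":
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```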

3. What are the underlying strengths of the big model, seen from the perspective of Understanding Media

Note: this section expands on McLuhan's Understanding Media.

To understand big models and the business value of AIGC, it is important to be able to understand the medium first.

Because what the big model produces is essentially content, a deeper understanding of the big model requires a clearer understanding of content and of the medium. I think figuring out some of the underlying logic of content matters more for applying big models than figuring out what the big model is.

For product managers, business scenarios are always more worthy of in-depth study than technical means.

Before I get into some of the boring concepts, I'd like to tell a little story about the medium to make it easier to understand.

3.1. A short story about the medium

In real life, we may have a hard time understanding the concept of medium, but in the art world, the concept of medium is actually deconstructed quite thoroughly and laid out rather nakedly.

In 2017, the renowned MoMA organized a solo retrospective of the photographer Stephen Shore.

In the second half of the retrospective, the photographs were not in frames; instead the gallery was lined with iPads, used to view the photos Shore had taken on his iPhone and posted to Instagram. The iPads served as the frames for those photos.

The role of the media, like agenda-setting in the social sciences, can profoundly affect the way everyone sees things.

Shore's exhibition lays this proposition bare for all to see. By doing so, Shore wanted to show that looking at a photograph, whatever its content, is a different experience on an iPad than it is as a print.

When you see a photo in a museum, no matter how badly it was shot, as long as it is delicately printed, enlarged, mounted on a wall, and placed next to a label noting it has been auctioned, anyone looking at it will probably think: wow, this is a masterpiece!

When you swipe past a photo on Instagram, you think: oh, a photo.

Now Shore puts a photo inside a museum, but that photo has to be viewed on an iPad, and that stark contrast prompts people to think about how much influence the medium really has on the content.

If you look at it from the content creator's perspective, now that you've produced a piece of content and want its value to be amplified as much as possible, shouldn't you be exporting that content to as many mediums as possible?

Because different people prefer different mediums, and the same person gets different feelings from seeing the same content in different mediums, this is a business opportunity.

For example, if you make a short video, isn't it best to post it to Douyin, Xiaohongshu, and Bilibili, and best to post the transcript to your WeChat official account as well?

But in reality only the top content producers can afford to do this in such detail. Why? Because converting content between mediums has a cost.

Even a video moved straight from Douyin to Bilibili leaves the audience with a poor impression, because horizontal versus vertical and long versus short are different things; if a creator wants the best presentation on every platform, the cost is very high.

In my own experience, if you look closely at the videos the same creators post on both Bilibili and Douyin, you will find that even when the content is identical, the Douyin versions are generally edited to be shorter.

Finally, to facilitate the discussion below, I will define a few concepts according to my own understanding. These definitions are not rigorous; they exist only for the convenience of this article.

  • Modality: the mode of interaction between human beings and the real world, usually closely linked to the organs of perception, common modalities are text, static/motion images, sound, etc.;
  • Content: Content is the product of human data acquisition, processing and reprocessing of the real world through the organs of perception;
  • Medium: a paradigm for carrying, arranging, and disseminating specific content; 10 photographs are placed in order inside a museum and displayed as an exhibition. In this sentence, the photographs are the medium (because the photograph itself is a piece of paper, it is material), the 10 photographs are the arrangement, the museum and the exhibition can also be considered as a medium, only the images inside the photographs are the content;
  • Internet platforms: a specific medium that is characterized by strict digital constraints on the format, presentation, and distribution logic of the medium, and they do not usually produce their own content;

3.2, Content has a native medium

Every piece of content comes with a native medium. Because the human brain has limited context, when an author creates something he has to store the intermediate stages in a medium, so that the content can be re-read for the author's own review and quality checks. Without the medium as storage, the author cannot make sense of what he has created.

So we can also assume that a piece of content cannot exist independently of the medium.

The medium used during creation can be called the native medium, and a piece of content usually has one and only one native medium, though it may have secondary mediums; for a radio speech, the native medium is audio, supplemented by a transcript.

A piece of content only fully realizes its author's intent when presented in its native medium; conversely, publishing content in a non-native medium incurs a great deal of information loss.

In general the most popular content within a medium or internet platform is almost invariably content that treats that medium as native.

That is why Douyin and Bilibili content translate to each other so poorly.

Bilibili was first a website, and its videos are landscape, because the monitors used to browse websites are naturally landscape, and monitors are landscape because our two eyes are arranged horizontally rather than vertically.

Douyin was an app from its inception and was built around shooting video on a phone, so Douyin videos are naturally vertical, because people hold their phones vertically.

If today's mainstream phones had been defined not by the iPhone but by Sharp in Japan, maybe Douyin would not exist at all.

This difference in medium is like an insurmountable rift.

The above seems like common sense, but it's entirely possible to apply this analytical thinking to other content. Almost any content product can be analyzed within this framework.

A podcast that would read like a dull conversation verbatim can be a great listen, like the podcasts that sell "chat" and "banter", because a podcast carries tone and emotion that are very hard to convey in a transcript.

Conversely, if a radio speech is given by a speaker who never focused on the content, never rehearsed it to review it in stages, only knows how to read it word for word, and is overly attached to the words themselves, the speech will sound dry and weak; it would have been smoother to simply send the text to readers, because it was created in words rather than in sound.

On Xiaohongshu, professional stand-up comedians express similar views, for similar reasons.

Great speakers often choose to write the outline first, deliver it orally, and then polish the transcript, as a way to ensure the audience experience.

3.3, The essential differences between mediums

What are the fundamental differences between the different mediums?

Personally, I have observed two main ones so far: modality and instantaneity.

Medium = modality * instantaneity.

Modality, the mode of human interaction with the real world, is usually closely linked to the organs of perception, and common modalities are text, static/motion images, and sound.

These three basic modalities are rooted in human vision and hearing; the cone of experience theory, which holds that most human learning relies on vision and hearing, happens to align with them.

Of course it could be a chicken-and-egg relationship. Different modalities carry different amounts of information: text is the most abstract and carries the least information, while images are the most concrete and carry the most. That is why people often say reading novels lets you use your imagination while watching TV adaptations constrains you; precisely because text carries little information, it leaves room for imagination.

Of course, the information content here refers to "absolute information content", e.g. a text file is smaller than an image file, but that doesn't mean that reading a book is any less efficient than looking at a picture, because there is a limit to the amount of information that a human being can take in from a piece of content.

It's like talking to someone who must be more informative than communicating via email, because that person has micro-expressions and gestures, but not everyone can access and receive that information.

Instantaneity is another fundamental characteristic of a medium: it is the cost to the viewer of going back and reviewing a particular slice of the content when it is carried by that medium.

Here's a set of mediums and their magnitude of instantaneity lined up; the more instantaneity, the higher the cost of recall:

Single images = short text < group images < long graphics < videos on streaming platforms < podcasts on podcasting platforms < cinema movies < music at concerts < offline talk shows.

Why are offline stand-up shows the hardest to replicate? Because the creation happens alongside on-site eureka moments and intimate interaction with the audience; you can never step into the same river twice.

For a single image, it's difficult to get a 100% replica, but at least it can be printed based on a specific process, and then viewed under light with a corresponding brightness and color temperature to get a near-original effect.

The more instantaneous the medium, the greater the emotional demands on both creator and viewer (a passage of text can be cold, but a podcast cannot be lifeless), and the more likely the medium is to require that creation and transmission merge into one.

Take stand-up comedy again: the work is only fully realized on stage, so the creative process and the dissemination process are one and the same.

At the same time, the more a medium emphasizes arrangement, the more instantaneity shows: emphasizing arrangement means that readers who skip around or jump back cannot recover the same experience from context alone, and only by re-reading the piece in full, in the order it was arranged, can they get anything close to a first reading.

3.4. The significance of AIGC is to lower the threshold of content across mediums and even across modalities

At work I often wonder why documentation gets written and yet people still come and ask questions.

The reason is actually simple: as a medium, a person is friendlier to people than a document is. In some scenarios the questioner's question is simple and reading the whole document would be heavy, but for the answerer it is uneconomical to answer the same question over and over. This contradiction is well suited to being solved by AI.

Many times when we find a piece of content uncomfortable to read, it may not be the content itself, but the medium of that content that is the cause.

In the British drama Yes, Minister, Humphrey once said that ministerial speeches are just plain boring because the goal of Cabinet ministerial speechwriting is not to please the audience on stage, but to get into the newspapers.

So why politicians' speeches on TV are so boring is clear to everyone, because most of them are reading material that "will be sent out in writing".

In theory, if we want a piece of content to be distributed across as many channels as possible, someone has to translate it between mediums, and that is very expensive. For example, turning a piece of content whose native medium is text into a podcast recording is costly, because the translation requires adding information (such as tone and emotion), which itself borders on creation.

Another example: for a public figure, delivering a speech without targeted training usually turns out poorly, because the writer wrote for the written medium while the audience receives the information through sound. Sound carries more information than dry text: tone, pace, intonation. Expecting the speaker to improvise all of that is asking a great deal.

Because if the native medium of a piece of content is highly instantaneous, odds are that means it's going to contain more information, whether it's on a choreographic level or an emotional level.

But now AIGC can largely take over 80% of the most tedious of this work. For example, converting text to speech can be done with the Doubao TTS large model, which is quite expressive.

Before AIGC was born, this was an almost unsolvable problem that must have required human recordings.

3.5. Why we need to understand the business value of big models from a media perspective

In fact, just about 1 year ago, I tried to summarize what big models can do, and at that time, the summarized uses were:

I. Summarizing: analyzing a large passage according to specific requirements and giving conclusions based on its content;

II. Expansion: expanding a small amount of content into a large passage based on specific requirements and paradigms;

III. Translation: losslessly converting one passage into another form according to specific requirements;

IV. Intent understanding: the large language model has a very strong ability to recognize intent and can understand the user's intent very well;

These summaries are not wrong, but they have a couple of more fatal problems.

I. They only cover the text modality and do not consider multimodal cases;

II. They are generalizations and are not guaranteed to be logically MECE;

If we proceed inductively, we conclude that big models can do this and cannot do that, and we can give endless examples; but induction is not very reliable if you want to figure out what this thing is good at, what it is not good at, and where the ceilings are.

If we look at the big model from the medium's point of view, we can see that it has several capabilities that previous technologies did not have:

I. It is able to understand content to some extent, but it is still difficult for it to create content out of thin air;

II. Based on that understanding, it can modify one piece of content into another better suited to a given medium, which is what we usually call summarizing, expansion, and translation;

III. Based on that understanding, it can transform one piece of content into another modality, which is what we usually call text-to-image;

IV. Based on its learning from a huge amount of material, it can add the most appropriate information to content as it undergoes a medium or modality transformation;

V. Because it has learned so much, it can be very effective if it can be controlled with precise intent;

So let's go back to the vignette above and review the ordering of the medium's instantaneity:

Single images = short text < group images < long graphics < videos on streaming platforms < podcasts on podcasting platforms < cinema movies < music at concerts < offline talk shows.

Before AIGC was born, we might only be able to convert the right side to the left side.

After the birth of AIGC, converting what is on the left into what is on the right became possible, because we gained the ability to create something from nothing.

That's what AIGC is all about on a media level, and this is groundbreaking from a production standpoint.

Or take the vertical-versus-horizontal example above: Bilibili videos are horizontal and Douyin videos are vertical, so how can creators convert between them at low cost? The answer is to use AI to generate and outpaint the frame.

4. Using the evolution of RAG to explore the strengths and weaknesses of making products around big models

4.1. What is AI Agent?

Google and Princeton jointly published a paper, ReAct: Synergizing Reasoning and Acting in Language Models, which is recognized as the seminal work on LLM-based agents.

The researchers found that in question-answering and fact-verification tasks, ReAct overcomes the hallucination and error-propagation problems that are prevalent in pure reasoning by interacting with a simple Wikipedia API.

This is far more effective than further training the model. Why? The big model's brain is already very strong, and further training often runs into severely diminishing marginal returns; giving it an API is like giving the brain "five senses", and it suddenly evolves.
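A minimal sketch of the ReAct idea under heavy simplification: the model alternates thoughts and actions, and every action is answered with a real observation from an external tool (here a lookup function standing in for the Wikipedia API used in the paper). `llm()` and `wiki_lookup()` are hypothetical helpers.

```python
# Minimal sketch of a ReAct-style loop: the model proposes actions, our code
# executes them and feeds real observations back, until it emits Finish[...].
def react(question: str, llm, wiki_lookup, max_steps: int = 5) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(
            "Continue with either 'Action: Search[<term>]' or 'Finish[<answer>]'.\n"
            + scratchpad
        )
        scratchpad += step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        if "Search[" in step:
            term = step.split("Search[", 1)[1].split("]", 1)[0]
            scratchpad += f"Observation: {wiki_lookup(term)}\n"  # grounded, external fact
    return "no answer within step budget"
```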

4.2, AutoGPT, the first AI Agent to go viral

AutoGPT is arguably the first AI Agent to actually break into the mainstream.

It tries to design a process in which any problem can be solved with a common meta-approach, and each module responsible for part of the solution is driven by GPT-4.

The designers of AutoGPT believe that almost all problem solving in this world follows similar steps: define the problem, define the steps needed to solve it, complete the task, check it, and summarize.

So following this SOP, they designed an AI Agent whose modules pass information to one another, each module being a model with its own independent memory, as if several humans were dividing up the work: one specializing in clarifying the problem, another in breaking it down.

AutoGPT is an open-source AI agent application released on GitHub by Significant Gravitas on March 30, 2023. It uses GPT-4 as its driving foundation, allowing the AI to act autonomously without a user prompt for every action, and it is popular for its simplicity and ease of use. After only three weeks on GitHub, its star count soared to nearly 100,000, surpassing PyTorch (65K), making it the fastest-rising star project in open source.

Auto-GPT is developed on top of the OpenAI API. Its core goal is to use GPT-4's reasoning ability to solve broader and more complex problems with minimal human input and prompting. Concretely, the program accesses the internet to search and gather information, uses GPT-4 to generate text and code, and uses GPT-3.5 to store and summarize files.

But it quickly became clear that this AI Agent was flawed: it could easily fall into infinite loops, and it did not handle uncertain, exploratory problems well. Still, the idea itself was very suggestive.

Extended reading, Auto GPT works:

https://www.leewayhertz.com/autogpt

4.3, Difference between RAG and AutoGPT

RAG stands for Retrieval-Augmented Generation, i.e. Retrieve + Generate, which makes clear that the main role of this SOP is to retrieve information from a specific place and then present it in a more user-friendly form.

If a product has a dozen or so explanatory documents, then RAG is like a customer service agent who has familiarized himself with the documentation.

The simplest RAG can be found in the principles of the first version of the AI search engine ThinkAny:

The MVP version is very simple to implement: full-stack development with Next.js, one page plus two API endpoints.

The two endpoints are:

  1. /api/rag-search
This endpoint calls the serper.dev API to get Google search results. Its input is the user's query, and its output is the top 10 sources Google returned.

  2. /api/chat
This endpoint uses OpenAI's gpt-3.5-turbo as the base model, stitches the title + snippet of the 10 retrieved results from the previous step into a context, sets up the prompt, and asks the big model to produce the Q&A output.

The above text is quoted from:

https://mp.weixin.qq.com/s/25eXZi1QgGYIPpXeDzkQrg
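Rendered as a sketch (in Python rather than the Next.js the ThinkAny author used), the two endpoints boil down to a search step and a generation step. `serper_search()` and `llm()` are hypothetical stand-ins for the serper.dev call and the gpt-3.5-turbo call:

```python
# Minimal sketch of the two-step flow quoted above: retrieve, then generate.
def rag_search(query: str, serper_search) -> list[dict]:
    return serper_search(query)[:10]   # top 10 results: title, snippet, link

def chat(query: str, sources: list[dict], llm) -> str:
    context = "\n".join(
        f"[{i + 1}] {s['title']}: {s['snippet']}" for i, s in enumerate(sources)
    )
    prompt = (
        "Answer the question using only the sources below, citing them as [n].\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```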

RAG can be regarded as a kind of AI Agent for a specific business scenario, and it differs from AutoGPT in three main ways:

  1. RAG's process is a serial pipeline rather than a loop; it has no self-check-and-regenerate step, partly for response speed and partly to avoid the infinite loops that self-checking can cause;
  2. RAG's process is retrieval + generation, and the retrieval part is done not by the big model but by traditional search techniques (vector databases, keyword matching). This is a world away from AutoGPT, where almost every key node is handled by GPT-4, and it reflects a realization: in scenarios that demand a tight context window and precisely ranked output, GPT on its own does not work well;
  3. RAG is not a jack of all trades and was not designed to solve every problem; it mainly answers the question "how do I give the answer quickly": given 10 documents, how to quickly find the answer the user needs;

At this point in its development, people had realized one thing: there may be no one-size-fits-all AI Agent in the world, which means there is no universal wish-granting machine.

Few people in engineering, at least for now, will again throw manpower at an almighty Agent; academia, however, is still going strong on it.

Some people may be confused by the concepts, thinking RAG and AI Agent are two completely unrelated things; RAG traces back to a 2020 paper and is much older than the AI Agent idea. But from an engineering perspective I think they are very similar: both emphasize using tools, external data, and storage mechanisms to make up for the AI's deficiencies. The only difference is that RAG does not require the AI to discover and plan tasks on its own.

Articles on the market now, especially those aimed at non-technical readers, focus more on concepts than on implementation, which is not a good sign. Engineering is full of conventions: an advertising system's DMP is nominally a Data Management Platform, yet in practice it handles almost nothing but audience data, and you cannot spend your days debating the gap between the name and what the thing actually does.

This article introduces RAG by starting from the implementation idea: step by step giving the AI brain a set of tools, and eventually even handing part of the process design over to the AI.

I hope this shows a step-by-step progression from the abstract to the concrete. To keep the article readable, the order of the projects is not strictly chronological, but I think this order is the friendliest and most inspiring for readers.

4.4. What are the flaws of RAG?

The paper Seven Failure Points When Engineering a Retrieval Augmented Generation System lists seven weaknesses of RAG.

The paper can be found at https://arxiv.org/abs/2401.05856

Looking at these failure points, we notice something: many of them have nothing to do with the big model itself, for example the Top-K ranking of retrieved documents not being precise enough, or information being unreadable because the document format is wrong or the document is too dirty.

As we said earlier in the metaphor, the big model is essentially a brain, and the AI product is the brain paired with the five senses, torso and limbs.

You cannot optimize only the brain and neglect the limbs; that is laziness. And don't treat the big model as an all-purpose wish-granting machine.

Let me conclude with a children's song that summarizes the core idea of this document:

Man has two treasures, his hands and his brain. The hands can do the work, the brain can think. Use your hands and your brain to create.

4.5, GraphRAG: an evolved version of RAG

GraphRAG is an open-source RAG framework from Microsoft that can be viewed as a further variation and iteration of the RAG SOP. Based on public references, I would guess this set of technical ideas is widely used in Microsoft's Copilot system, though I have not found material confirming it.

What is the difference between Graph RAG and the original RAG?

A traditional RAG mainly does these three steps:

  1. Vectorize the knowledge to be searched (can be processed offline);
  2. When a user asks a question or keyword, find the most relevant content via a similarity query (must be an online service);
  3. Take the Top-N most relevant results as context, combine them with the user's question, and hand it all to the big model to generate the answer (must be an online service);

Only the third step in the whole procedure is using a large model.
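A minimal sketch of steps 1 and 2, assuming a hypothetical `embed()` call that returns a vector for a piece of text: documents are embedded (offline in a real system), and at query time the question is embedded and the Top-N chunks by cosine similarity are returned. Only step 3 would involve the big model.

```python
# Minimal sketch of vector retrieval: embed, then rank by cosine similarity.
import numpy as np

def top_n(query: str, docs: list[str], embed, n: int = 3) -> list[str]:
    doc_vecs = np.array([embed(d) for d in docs])   # done offline in a real system
    q = np.array(embed(query))
    sims = doc_vecs @ q / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [docs[i] for i in np.argsort(-sims)[:n]]  # most similar chunks first
```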

Similarity queries are actually rather unsatisfying, because they retrieve text based entirely on its degree of surface similarity and have little to do with the deeper semantics of the text. This tends to make the final RAG results worse.

Of course, many things can go wrong across these three steps, like the 7 RAG failure points mentioned above, but retrieval accuracy is one of the easiest to optimize and has the biggest payoff. So how does Microsoft do it?

Microsoft believes that users may not phrase their queries well and that retrieval needs to be smarter, so it replaces purely vectorized retrieval with retrieval assisted by a knowledge graph.

The main work of Graph RAG is the following three steps:

  1. Extract the knowledge to be searched into triples (subject-predicate-object); this extraction requires LLM involvement, and the results are stored in a graph database (can be processed offline);
  2. Take the keywords in the user's question, use the LLM to expand them into synonyms, near-synonyms, and so on, then search the graph database for relevant documents (must be an online service);
  3. Take the retrieved content as context, combine it with the user's question, and hand it to the big model to generate the answer (must be an online service);

The overall steps and architecture did not change much, but the big model was introduced as two independent processing steps at key points.

Note that the independent processing step is critical here.
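A minimal sketch of those two independent LLM steps, with a hypothetical `llm()` helper: triples are extracted offline into the graph store, and the user's keywords are expanded online before the graph is searched. The real GraphRAG pipeline is far richer (entity communities, hierarchical summaries, and so on); this only illustrates where the model sits in the flow.

```python
# Minimal sketch of GraphRAG's two extra LLM steps: offline triple extraction
# and online keyword expansion. Output parsing is deliberately naive.
def extract_triples(passage: str, llm) -> list[tuple[str, ...]]:
    raw = llm(
        "Extract (subject, predicate, object) triples from the text, "
        "one per line as: subject | predicate | object\n" + passage
    )
    return [
        tuple(part.strip() for part in line.split("|"))
        for line in raw.splitlines() if line.count("|") == 2
    ]

def expand_keywords(keyword: str, llm) -> list[str]:
    raw = llm(f"List synonyms and closely related terms for '{keyword}', comma separated.")
    return [keyword] + [t.strip() for t in raw.split(",") if t.strip()]
```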

4.6, RAG evolves again (fine carving)

After Microsoft released GraphRAG, many people tried it and found a pile of problems, so someone published this paper: A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning.

The paper can be found at https://arxiv.org/abs/2408.05141

Hybrid RAG has done a couple of things to build on the original.

Work 1: Before the document goes through the Loader, the LLM runs a round of formatting on it.

Work 2: A question classifier.

The question classifier itself is an SVM classifier, but its training data is labeled not by hand but by a large model, as a way to reduce cost and overhead.
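A minimal sketch of that idea, with a hypothetical `llm()` helper and simplified labels: the LLM labels the training questions once, then a cheap TF-IDF + SVM classifier handles every live request so the LLM is not needed at query time.

```python
# Minimal sketch: LLM-labeled training data feeding a cheap SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_classifier(questions: list[str], llm):
    labels = [
        llm("Classify this question as 'retrieval', 'reasoning' or 'calculation'. "
            "Answer with one word only: " + q).strip().lower()
        for q in questions
    ]                                      # the LLM replaces manual labeling
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(questions, labels)             # the cheap model is what runs online
    return clf
```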

Work 3: Using the large model to call external computational libraries.

Of course, there are many more thoughts written inside this paper, so if you are interested, you can read the paper yourself.

This paper is still of high quality, and five of the six authors are scholars at the Artificial Intelligence Lab at Peking University.

5. Why we once overestimated the impact of big models

5.1, What was the problem with the end-to-end scenario we originally envisioned?

The industrialized process by which humans produce video today is:

  1. Look at what popular videos on the market have in common, often via hashtags;
  2. Produce scripts based on those popular tags;
  3. Film;
  4. Publish, watch the data feedback, and iterate;

When big models first appeared, there was a hypothesis: if the big model can generate videos directly, could we have it watch most of the videos on Douyin, train on them, and then have it produce some of its own?

Let users like the videos these big models produce, then feed the highly liked videos back in as positive feedback to keep generating videos.

If this path worked, it would be absolutely disruptive for content platforms like Douyin. Looking at it now, it does not work. Part of the reason is that video-generation models are still poor in quality and hard to control, but I think the bigger reasons are:

  1. The models' context window limits;
  2. The models' cost is too high;

These two are near-unsolvable problems that will likely remain hard for a long time (say, 10 years), so I don't think this end-to-end scenario is feasible.

But there's no doubting that AI-generated video is sure to be an important addition to video workers' future work sessions.

5.2, Are all apps worth redoing with AI?

Obviously not, because 90% of the things in this world already work well enough with rules, while models are inefficient and unstable.

More often than not, our requirement for a product is that it is simple, efficient, and can be relied upon, and the instability of the model at this stage dooms it to be unreliable in many scenarios, which makes it difficult to talk about replacing the original product.

Or reread the well-known demand formula given by Yu Jun:

User value = new experience - old experience - replacement cost.

It is a fact that many times, even with new technology, the benefits may not be as great as one might think. People who believe every app is worth redoing with AI are clearly overestimating AI, and many people like to draw inappropriate analogies between AI and the mobile internet.

From the point of view of the architecture of the von Neumann computer, the mobile Internet directly changed the input and output devices of the computer, which first brought about a revolutionary change in interaction, but this is not the most important.

The real value of the mobile Internet from a market perspective is not the revolution in interaction, but the dramatic reduction in the cost of user access to the Internet, which has doubled the number of users and extended their Internet usage time.

The above two revolutionary changes are clearly beyond the reach of this wave of AI, which can neither expand the market nor change the shape of computing devices in the short term.

The change big models bring will be mostly on the production side, which will in turn affect the consumption side, as discussed in the chapter on understanding the medium above. What they bring is a great increase in productivity, but the problem we face now is more one of insufficient demand and excess production.

5.3. Is LUI a panacea?

Some say this round of AI will revolutionize interaction, and all we need from today is a super app with a chat box!

I can already draw a conclusion: whoever says this must be an armchair player. If such a person were in my company and I were his manager, I would make him do his promotion defense with no PPT and no documents, spoken delivery only.

Every medium and interaction has its own best practices and unique value, and LUI is obviously not a panacea. Programmers writing code still care about how good the IDE's GUI is, so where do the Language UI believers get the confidence that it can completely replace the GUI?

The biggest problem with LUI is inefficiency, let me use enterprise software as an example:

Take Excel, PowerPoint, or even BI software: besides lowering the threshold, one very important role of software is to "constrain" the user.

Constraint is not a pejorative term in this scenario, it's a positive one. All of the functionality built into these pieces of software has been iterated over a long period of time, and inside it is the experience and best practices of countless software developers and the organizations that use the software.

When you open Sensors Analysis and DataFinder, two behavioral analytics products from different companies, you will find their functionality is very similar: event analysis, funnel analysis, retention analysis, and so on. I even suspect the habits of the enterprise customers who use them are highly similar, because the functionality condensed into these products is itself best practice.

For an operations person with little statistical training, describing clearly to GPT's text box that they need a funnel analysis is just too hard; using Sensors Analysis is simpler.

GPT can also guide users to some extent if it has access to the detailed calculation logic of funnel analysis, but only if the user can at least say the words "funnel analysis"; in reality, many people cannot clearly describe what they want without an interface in front of them.

LUIs are not a panacea, and it is a mistake to over-apply them, or even to equate them with "AI native"; they only make sense within particular paradigms.


5.4. Is complex UI a detour?

The leading open-source text-to-image UI on the market today is ComfyUI, shown below:

ComfyUI is a far cry from the tap-out-a-sentence text-to-image AI we imagine in our heads; it feels more like a low-code platform. Why?

Because industry needs precise control, and the semantics of precise control cannot be expressed accurately in prompts alone; they have to be expressed through a GUI.

For example, in the diagram above, what is actually being done is replacing one building with another while changing everything else as little as possible. That kind of precise control is very difficult for a model that is essentially driven by probability, which is why the UI is so complicated.

So the question is, is this UI above a detour?

In fact, at first glance I thought this was outrageous: if I used Photoshop or Xingtu (Hypic) I could do something similar, and the UI shouldn't need to be this complicated. But after thinking about it, I think it is very valuable.

I think handing this kind of UI directly to end users would be a detour, but if we wrap the operation above into a scenario and give the user a simple action, say, uploading two images, wouldn't that be one of the simplest products imaginable?

This means the functionality I mentioned above in Xingtu and Photoshop could in the future be built entirely with a UI like this and then packaged for the user.

So the UI above is fully available to professional users, and this UI can be turned into an engine through which hundreds of features within the product can be produced in a continuous stream.

This is the real value of the big model, which is a hundred-fold increase in productivity.

5.5, What is the real value of big models?

What is the real value of big models if they neither bring qualitative change to interactions, nor are they an all-purpose wish-granting machine, nor can they even generate Feed streams end-to-end?

The real value of big models is to give people with extraordinary creativity, like product managers and engineers, an amplifier.

It is a general-purpose model that needs no training and is, with high probability, stronger than a proprietary model you would train yourself.

This means that product managers and engineers who can think through the SOP and figure out which problems can be solved with a big model will be able to solve them at relatively low cost; things that once required training your own private model can now be solved with a public one.

Many costs that once loomed like mountains are now surmountable. Money and data volume are no longer roadblocks, because you no longer need to train your own model; you just need to be able to write prompts and build workflows.

6. What does this mean for how product managers work?

6.1. Business! Business! Business! Data! Data! Data!

Fresh data, and a business that continuously generates fresh data, are what keep a big model alive.

However awesome your RAG engineering is, if the underlying database sucks, the results will be terrible.

Conversely, even a frustrating model is still a good tool if the data is good enough.

In the age of AI, data problems will only become more prominent; AI search done inside a shit mountain of data is, at best, building another shit-stirrer.

At all times, the business that consistently produces real transaction data and great content is more important than the big model itself.

6.2. Reading papers is not how product managers avoid FOMO; hands-on experiments are

I think one fairly clear fact is that the industry is currently overestimating the role of big models in the short term.

A while back I wrote a line that said: every app is worth rethinking with AI, and once you actually think it through it becomes clear that only 10% of them are worth redoing.

The short-term flood of resources largely stems from the FOMO of investors and decision makers at large companies, which is unhealthy in a way. That FOMO is transmitted to product managers too, so many of them start reading papers (the author of this article included). In my view, though, reading papers is not actually very useful for a product manager; it is mostly for the fun of it.

Papers are hammers, but what matters more to a PM is where the nails are. So instead of caring about what academia just published, care about which AI products are hitting #1 on Product Hunt and actually making money.

Instead of FOMO-ing, why not get your hands dirty and use these things more? When the mobile Internet was just emerging, most product managers installed hundreds of apps on their phones and studied them every day. Use the best AI products on the market and try to reverse-engineer them; there is a lot to gain.

In writing this article, close to 20,000 words, I found that big models have a great many pitfalls, and for each pitfall you can find related papers. I did not read those papers out of idle boredom: I kept tinkering with AI, kept running into problems here and there, had to search, and that is how I found the papers.

It turns out everyone runs into the same problems; some people simply have the research spirit, dig into the problem in depth, and then write a paper about it. Honestly, people like that are enviable to the point of being infuriating.

If you look closely, you will notice that most of the papers I read are engineering papers rather than algorithm papers, because, as I have emphasized, the weaknesses of big models have to be compensated for by engineering, and engineering is built by R&D and product managers. It does not matter much if you cannot read the algorithm papers, but if you cannot read the engineering papers, you may genuinely need to reflect on that.

6.3. Product managers must learn to call the API themselves

As mentioned earlier, the strength of future AI product teams will depend not on whose model is stronger (open-source models will very likely end up the strongest anyway), but on who can use AI models well.

And who can use AI models well depends on which team is good at running experiments: whoever can run more experiments, faster, wins.

Some people may ask whether it is okay to just use Coze or Kimi Chat instead, and my answer is no.

Because both of those are already AI products, and the gap between a product and the bare API is very large. When you hand Kimi Chat a PDF and it interprets it quickly and well, how do you know whether that is because the model is awesome, or because their PDF-to-Markdown data cleanup is good?

This is why a product manager must be able to run quick experiments against the bare API.
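To make that concrete, here is a minimal sketch of a bare-bones experiment in Python, assuming an OpenAI-compatible chat-completions endpoint; the URL, model name, and prompt are placeholders, not recommendations:

```python
# Minimal bare-bones experiment: call a chat-completions style API directly,
# with no product layer (file parsing, cleanup, retrieval) in between, so you
# can see exactly what goes in and what comes out.
import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # any OpenAI-compatible endpoint
API_KEY = "sk-..."  # your own key

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Feed it raw text you extracted yourself, so you know whether the quality
    # comes from the model or from someone else's preprocessing.
    print(ask("Summarize the key risks in this contract clause: ..."))
```

The point is that nothing sits between you and the model, so whatever quality you observe is attributable to the model and your own prompt rather than to someone else's preprocessing.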

So how do you get this ability quickly if you can't write code? Use a low-code platform like Dify or n8n for that.

Personally, I find n8n more reliable. n8n is an open-source, free, upgraded alternative to Coze: a visual, low-code automated workflow platform that lets people who cannot write code experience the fun and the results of AI development.

With it, it is easy to build complex automated workflows that Coze and similar tools cannot handle, such as Webhook triggers, integrations with thousands of third-party services, and custom HTTP requests.

And because n8n did not start with this wave of AI, its ecosystem is more mature, with more official integrations and a more active community than the workflow tools that emerged with this AI wave (such as Dify).

n8n gives the big model its five senses, torso, and limbs; it is the pair of hands for the brain that creates.

Here I recommend a Chinese tutorial for n8n, "Modern Magic Made Simple and Easy to Understand", which is arguably the best Chinese n8n tutorial on the market today.

Tutorial at https://n8n.akashio.com/

6.4. Some suggestions for AI Agent practice

6.4.1 Key Ideas for Designing Agents

Imagine 20 interns working for you: how would you assign tasks? What are these interns like?

  • They can execute tasks
  • They cannot break complex problems down on their own
  • Their execution order can go wrong; if you don't specify the order, the project may stall
  • Their memory is average; give them too many tasks at once and they may forget some
  • They can be wrong, so it is better to have cross-validation

So if you think of AI as one tireless intern after another, what can they do? Design the SOP and the outputs so that they can handle the high-volume, repetitive work, of course.

One big difference between interns and AI, of course, is that AI is actually much better at solving specific, well-defined problems than interns, or even than quite a few regular employees.

It is also a lot more expensive: a single round-trip GPT-4 API call to answer one question can cost around 3 cents.

The fact that humans are cheaper than machines may become a big reason (just kidding) why we don't all get replaced by AI.

6.4.2. Split tasks as finely as possible so they don't interfere with each other

There are two advantages to breaking a complex problem into multiple simple problems and then letting the model handle it:

First, as described above, the instability of a big model is strongly and positively correlated with the amount of text it receives; splitting the problem smaller means each sub-task gives the model less text to deal with.

Second, there is no need to use a big model at all for some simple problems, and it can even be cheaper, faster, and maybe even better to use a simple model or pure logic to make a judgment.
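As a rough illustration of both points, here is a hedged Python sketch: a deterministic sub-problem is handled with plain logic, and only the genuinely fuzzy sub-problem goes to the model with a small, focused prompt. The ticket format, the `ORD-` pattern, and the `ask()` helper (any function that wraps a model call, such as the sketch in section 6.3) are all hypothetical:

```python
# Split a messy support ticket into small, independent sub-tasks instead of
# throwing the whole thing at one giant prompt.
import re

def extract_order_id(ticket: str) -> str | None:
    # Deterministic sub-problem: a regex is cheaper, faster, and more reliable
    # than any model here.
    m = re.search(r"\bORD-\d{6}\b", ticket)
    return m.group(0) if m else None

def classify_sentiment(ticket: str, ask) -> str:
    # Fuzzy sub-problem: use the model, but give it one small task and only
    # the text it needs, so the prompt stays short and the output stays stable.
    prompt = (
        "Classify the sentiment of this customer message as positive, "
        "neutral or negative. Reply with a single word.\n\n" + ticket
    )
    return ask(prompt).strip().lower()

def handle_ticket(ticket: str, ask) -> dict:
    return {
        "order_id": extract_order_id(ticket),          # pure logic, no model
        "sentiment": classify_sentiment(ticket, ask),  # one small model call
    }
```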

6.4.3 Distinguishing between offline and online tasks

Taking the Graph RAG architecture as an example, it is divided into three steps:

  1. The knowledge to be searched is extracted into triples (subject-predicate-object); this extraction requires LLM involvement, and the results are deposited into a graph database (can be processed offline);
  2. The user's query keywords are expanded with an LLM into synonyms, near-synonyms, and so on, and the graph database is then searched for relevant documents (must be an online service);
  3. The retrieved content is passed to the big model as context, together with the user's question, to generate the answer (must be an online service).

The first of these steps is an offline task. Offline tasks mean you can spend more time doing fine-grained processing of the data; for example, we can use an open-source but powerful big model and run the job on our own servers, reducing cost while maintaining quality.
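As a hedged sketch of what that offline extraction step could look like, here is a minimal batch job in Python; the prompt wording, the JSON output format, and the `ask()` model wrapper are illustrative assumptions, not the Graph RAG paper's actual implementation:

```python
# Offline step of a Graph-RAG-style pipeline: extract (subject, predicate, object)
# triples from documents in a batch job, so the expensive LLM work happens once
# and the results can be stored in a graph database and reused at query time.
import json

EXTRACT_PROMPT = """Extract factual knowledge from the text below as a JSON list of
[subject, predicate, object] triples. Output JSON only.

Text:
{text}
"""

def extract_triples(text: str, ask) -> list[tuple[str, str, str]]:
    raw = ask(EXTRACT_PROMPT.format(text=text))
    try:
        return [tuple(t) for t in json.loads(raw)]
    except (json.JSONDecodeError, TypeError):
        # An offline job can afford retries or a manual review queue;
        # here we simply skip the document.
        return []

def build_graph(documents: list[str], ask) -> list[tuple[str, str, str]]:
    triples = []
    for doc in documents:
        triples.extend(extract_triples(doc, ask))
    return triples  # in practice, write these into your graph database
```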

Online tasks, on the other hand, need to be more time-sensitive, and if the online task itself is not very complex, you can choose a lighter-weight model to keep response times fast.

At the same time, the results of offline tasks are stored and reused, so spending on a better model there is essentially a fixed-cost investment, whereas online tasks are tied directly to each user interaction and are essentially a marginal cost.

6.4.4. Every offline task can be considered for a model-based solution

In Hybrid RAG, the work of converting HTML to Markdown is done with a Python library; in fact, if you are purely after quality, you can consider having a big model do it directly.

Of course whether to use Python libraries or big models is really dependent on cost and effectiveness considerations, which also need to be demonstrated through experimentation.

My own experience is that for this kind of data cleaning it works well to let a Python library do a first pass and then have the big model do a second cleaning pass; the results may be better. But very often the library pass alone is clean enough, and the occasional stray formatting or encoding glitch does not actually affect the big model's later judgments, in which case the second pass hardly matters.
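Here is a minimal sketch of that two-pass cleaning, assuming the `html2text` library for the first pass and the same hypothetical `ask()` wrapper for the optional model pass; whether the second pass pays for itself is exactly the kind of thing to verify by experiment:

```python
# Pass 1: deterministic HTML -> Markdown conversion with a Python library.
# Pass 2 (optional): let a model tidy up whatever the library leaves behind.
import html2text

def html_to_md(html: str) -> str:
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # drop image noise for RAG purposes
    converter.body_width = 0        # no hard line wrapping
    return converter.handle(html)

def llm_cleanup(markdown: str, ask) -> str:
    prompt = (
        "Clean up the following Markdown: remove leftover navigation text, "
        "broken encoding artifacts and duplicated lines. Do not change the "
        "actual content.\n\n" + markdown
    )
    return ask(prompt)

def prepare_document(html: str, ask=None) -> str:
    md = html_to_md(html)
    # Often the library pass alone is clean enough; only pay for the model
    # pass if experiments show it improves downstream answers.
    return llm_cleanup(md, ask) if ask else md
```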

All in all, big models are very capable in specific, well-scoped sub-tasks; in fact, if cost is not a consideration, almost every sub-task can be solved with a big model.

6.4.5. Trade off cost against effectiveness

It is well known that big-model answers contain a certain amount of randomness, so how do you deal with that? Repeat the same question several times, of course.

For example, the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" describes how the authors evaluated two RAG methods: human labeling was too much trouble, so they even handed the job of checking the samples to the big model, which is really awesome (and rich).

Each annotation has to be repeated 4-5 times to make sure the result is correct, and naturally costs 4-5 times as much.
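A hedged sketch of this "repeat and vote" idea: ask the same judgment question several times and keep the majority answer; `ask()` is again any hypothetical model wrapper, and the repetitions are exactly what multiply the cost 4-5 times:

```python
# Reduce the randomness of a single model judgment by asking the same question
# several times and keeping the majority answer. Cost scales linearly with n.
from collections import Counter

def majority_judgment(question: str, ask, n: int = 5) -> str:
    answers = [ask(question).strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    # If even the winning answer gets a weak vote (e.g. 2 out of 5), flag it
    # for a human instead of trusting the model.
    return winner if votes > n // 2 else "needs_human_review"
```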

For those designing products, judging how to balance cost and effectiveness becomes very important.

6.4.6. Boundaries and best practices between Agents, fine-tuning, and prompts

We can understand it this way: map prompt engineering, Agent construction, and model fine-tuning onto how we would hand a task to a person.

  • Prompt engineering is how carefully and precisely you describe the task when you assign it.
  • Agent building is breaking the task down into an SOP and telling the person which tools are available inside the company.
  • Model fine-tuning is sending the person off for training.

So to accomplish a task, all three means may need to be used, but the cost of all three is different.

Which optimization means to use to accomplish a given task is still very fuzzy territory for today's engineers, algorithm people, and product people alike. It is like sailing or mining, or like clinical medicine: deciding which means to use is essentially an "experiment", not a "deduction".

That is why each of the papers listed above has to present a large number of benchmark tests: the people who designed these Agents do not themselves know whether the results are good, and they need experiments to verify them.

So I think the future of which teams are strong and which are weak in applying AI will really depend on:

  • whether the team has a solid benchmark with reference answers;
  • whether the team has a platform that lets it validate a design quickly.

Whoever experiments fastest wins; engineering productization is only the last step.

Finally, my personal advice is: don't fine-tune if you can avoid it. Fine-tuning changes the model's parameter layer, which is costly and cannot be migrated. And fine-tuning essentially means competing with the big-model teams on algorithms and data, an arms race that makes no sense in the long run.

Article source: https://mvread.blog/archives/465

