The Agents of AI: Data Analysis with LLMs and LangChain Agents

Ashutosh Kumar
9 min read · May 7, 2023


Remember Agent Smith from the ‘Matrix’ trilogy?

Large Language Models (LLMs) are all the rage these days. There have been a lot of them in the past few weeks, with every major company releasing a version claimed to be better than the others. The question is: what next? LLMs can power a ChatGPT-style assistant, for sure, but can they do anything else? Yes, many more things, and the answer lies in a framework called LangChain and its Agents.

What is LangChain?

For an overview of LangChain, here is a previous post.

LangChain is a framework that sits between Large Language Models (LLMs) and Tools (Google Drive, Python, Wikipedia, Calculator, Wolfram Alpha, etc.). It also connects to a custom database via a Vector Store (Pinecone, Milvus, Chroma, etc.) and has Agents that can chain together different actions.

Tools: Connect LangChain to external sources. A tool is something like a function that performs a specific external duty (external to the LLM), like a Google search, a database lookup, a Python REPL, mathematical calculations, or a Wikipedia lookup. They are the interface between the LLM and external sources.

Agents: Think of Agents as ‘Bots’ that make AI do things for you. They are the interface between the LLM and the tools, and they figure out both the task (what needs to be done) and the tool (what is right for this specific task). There are a lot of predefined agents that can already do a lot of things, and you can also build your own agent for a specific task.

Vector Database: This topic deserves a separate post, but here is a primer. A vector database stores unstructured data along with the embeddings associated with it, and the data is clustered together based on similarity. It uses a combination of algorithms that perform Approximate Nearest Neighbour (ANN) search, enabling fast retrieval of relevant information to feed to LLMs.
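To make the idea concrete, here is a minimal sketch of a vector store lookup using LangChain's Chroma integration. The stored texts and the query are made up for illustration, you would need the chromadb package installed, and openai_api_key is assumed to hold your OpenAI key (as in the notebook later in this post).

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Embed a few example texts and store them in an in-memory Chroma collection
texts = ["RFM stands for Recency, Frequency, Monetary value.",
         "Churn is when a customer stops buying from you."]
db = Chroma.from_texts(texts, OpenAIEmbeddings(openai_api_key=openai_api_key))

# ANN search returns the stored texts most similar to the query
db.similarity_search("What does RFM mean?", k=1)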

Here is a simplified diagram of the LangChain / Tools / Agents / LLMs / Vector DB ecosystem:

A flow diagram of how a LangChain framework works

What Tools are available for use?

Here is a list of all the available tools. More tools are being developed every day; think of them as apps on the App Store. Each tool is designed to do one specific task very well.

Some key tools are:

Python REPL: A Python shell used to evaluate and execute Python commands. It takes Python code as input and outputs the result. The input code can be generated by another tool in the chain.

Wolfram-Alpha: The Wolfram Alpha search plugin can answer complex mathematics, physics, or other factual queries. It takes a search query as input.

Wikipedia: Searches Wikipedia and returns results.

Terminal: Executes commands in a terminal. Input should be valid commands, and the output will be any output from running that command.

llm-math: Answers questions about math.

There are a lot of these tools available; have a look at the link above and see for yourself. Some of them are referenced later in this post, and a minimal example of wiring tools to an agent follows below.
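Here is a hedged sketch of loading two of the tools above into an agent. The question at the end is just an illustration, and the Wikipedia tool additionally needs the wikipedia Python package installed.

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key=openai_api_key, temperature=0)

# "llm-math" needs an LLM of its own to turn questions into calculations
tools = load_tools(["wikipedia", "llm-math"], llm=llm)

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("In what year was Wikipedia launched, and what is that year squared?")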

Why are these developments significant?

While there are a lot of use cases for LangChain and its Agents, I would like to focus on one aspect: automating data analysis with the LangChain framework, Large Language Models, Tools, and Agents.

Imagine you are a retailer and you have your customers’ transactional data. You want to generate some basic insights from the data (like average spend by gender and age group, what products each type of customer buys, which city and store have the highest sales for a product, and so on). In addition to insights, you want to create an RFM (Recency, Frequency, Monetary value) segmentation on the data so that you can identify your best customers.

This problem statement sounds like the type of work a data analyst should do, and that is exactly what it is. A data analyst will take the data, write some SQL, Python, or R code, generate the insights, and create the RFM segmentation. It will cost the retailer some money, and the analyst may take a week to deliver the results.

Not Anymore!

Rather than writing about it, let me just show exactly how this can be done, with the help of a notebook.

# Install dependencies
!pip -q install langchain openai

openai_api_key = "<<INSERT YOUR OPENAI KEY HERE>>"

Dataset: Black Friday Sales from Kaggle. Any other transactional dataset can also be used here.

Building a CSV Agent

This agent calls the Pandas DataFrame agent under the hood, which in turn calls the Python agent, which executes the LLM-generated Python code.

from langchain.agents import create_csv_agent
from langchain.llms import OpenAI

agent = create_csv_agent(OpenAI(openai_api_key=openai_api_key, temperature=0),
                         '/content/train.csv',
                         verbose=True)

# Inspect the prompt template the agent uses under the hood
agent.agent.llm_chain.prompt.template

Output:

You are working with a pandas dataframe in Python. The name of the dataframe is df. You should use the tools below to answer the question posed of you:

python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated — make sure it does not look abbreviated before using it in your answer.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
… (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

This is the result of print(df.head()): {df}

Begin!
Question: {input}
{agent_scratchpad}

That’s it!

You can now query your dataset with simple prompts in plain English, not in SQL or Python. Leave it to the agent to write the Python code; you can see the generated code in the verbose output.

You can also verify the actual result yourself, for example from df.shape.
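For instance, you might ask questions like these (the particular questions are my own illustrations and assume Black Friday columns such as Gender, Age, and Purchase; any plain-English question about the CSV works the same way):

# Each call makes the agent write and execute pandas code, then return the answer
agent.run("How many rows and columns does the dataset have?")
agent.run("What is the average purchase amount by gender and age group?")
agent.run("Which product category has the highest total sales?")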

Not only can it do basic EDA, it can also do intermediate-level tasks like RFM segmentation (this one is from a different dataset, with dates).
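To give a sense of what such a segmentation involves, here is a minimal pandas sketch of a classic quartile-based RFM segmentation. It is my own illustration rather than the agent's actual output, and the column names customer_id, order_date, and amount are assumptions.

import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["order_date"])
snapshot = df["order_date"].max() + pd.Timedelta(days=1)

# Recency, Frequency, and Monetary value per customer
rfm = df.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Score each dimension into quartiles (rank() breaks ties so qcut
# does not fail on duplicate bin edges), then combine into a segment
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1])
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4])
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4])
rfm["segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)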

The future (not so far away)

As of now there are some limitations to these agents and tools: they cannot output a dataframe, they cannot build complex models (yet), and they cannot follow a custom methodology (like a different version of RFM segmentation). The main point is that all these tasks can already be done independently with LLMs like GPT-4, but they are not yet integrated into an agent.

LLMs’ ability to build Machine Learning models

For example, if you ask the GPT-4 model to write a production-grade API for a product recommendation system, it can do it quite well (or at least 95% of the work).

Here is the output from GPT-4:
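As an illustration of the kind of code it produces (this sketch is mine, not GPT-4's actual output, and it assumes a FastAPI server plus a hypothetical precomputed recommendations.csv table from an offline model):

from fastapi import FastAPI, HTTPException
import pandas as pd

app = FastAPI()

# Hypothetical precomputed table with one row per (user_id, product_id) pair,
# e.g. produced by a collaborative-filtering model trained offline
recs = pd.read_csv("recommendations.csv").groupby("user_id")["product_id"].apply(list)

@app.get("/recommendations/{user_id}")
def recommend(user_id: int, k: int = 5):
    # Return the top-k precomputed recommendations for a known user
    if user_id not in recs.index:
        raise HTTPException(status_code=404, detail="Unknown user")
    return {"user_id": user_id, "products": recs.loc[user_id][:k]}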

The only thing left is to integrate this code with the dataset; the framework is already there. All you need is an agent that connects the model to the data.

LLMs connecting to the Internet via Plugins

Another key thing here is the integration of LLMs with the internet. This can revolutionize the field of data science: an LLM can already run a Google, Wikipedia, or Wolfram Alpha search to look up sophisticated methodologies, and write code for them in any language (Python, Scala, C++, PySpark, etc.). Not only that, you can give it the documentation of your specific methodology or a research paper, and the LLM will churn out the code for you in less than a minute, rather than a PhD-level data scientist coming up with working code in a week.

Connect to Action: Send campaigns to your customers

There is already a list of plugins available for GPT models from OpenAI, and you can expect more soon. These plugins connect to existing platforms like Expedia, Instacart, and OpenTable, and can book tickets for you, order groceries, and make your dinner reservation. More and more plugins will come; think of this as the early days of the App Store. You will soon have a plugin to connect to Salesforce, or Adobe, or any tool, microservice, or website.

Existing OpenAI Plugins

ChatGPT Code Interpreter

This is probably the most interesting plugin from OpenAI. It is in a limited alpha and not much information is available, but people who have used it are blown away by its ability to write code. You don’t have to remember the code or any visualisation library; you can just ask the code interpreter to “make the charts more beautiful” and it’ll do that.

Read more about ChatGPT code interpreter here.

Here are some snippets (courtesy of Twitter user @emollick):

How this will change the future of data analysts and data scientists

The problem statement above (taking in data, building a product recommendation model, and sending a campaign to the specific customers who will generate the most ROI) is currently handled by a team of data analysts and data scientists. The effort depends a lot on factors like data size and complexity, but let’s assume a simple use case: one data analyst and one data scientist will do it in around 10 days, and then a team of delivery experts will connect it to a campaign orchestration tool and execute the campaign in another 2–3 days.

With the integration of LLMs, LangChain, plugins, agents, and tools, a non-technical person like a brand’s marketing manager will just write the requirement in plain English:

“Send a campaign of 20% off to my most profitable customers who are repeat buyers from the top 3 segments generated by RFM segmentation, and execute the campaign at 3 PM tomorrow via SMS and email. Also send 15% off to the top 25% of customers who are about to churn, where top churn customers are ranked by their lifetime value contribution in the last 2 years. Both campaigns should include products recommended by a product recommendation engine, and the message should be personalized with their {first_name} and {recommended_product}.”

The framework will do the rest, from creating the segments to building the models and executing the campaigns. It will probably take 5 minutes in total to build the product recommendation, lifetime value, and churn models, create the RFM segments, and do all the data processing, campaign execution, and orchestration. It may need some human intervention for verification or other data-related tasks, but 95% of the work will be done by the AI. The components have already been built, and we will probably have something like this ready in another 3–6 months.

Why would a company spend thousands of dollars to build and execute this with a team of data scientists, when they can do all of it at a fraction of the cost? Think about it!


Ashutosh Kumar

Data Science @ Epsilon; interested in technology, data, algorithms, and blockchain. Reach out to me at ashu.iitkgp@gmail.com