How to automate your data science tasks with Large Language Models

Ashutosh Kumar
9 min read · Jun 11, 2023

Automate your analysis, visualization and model code with LLMs and LangChain agents

Will AI replace data analysts?

Ever since the introduction of ChatGPT in November 2022, one question continues to resonate: 'Will AI take over my job?'. There is no clear-cut answer, primarily because AI, while excelling at certain tasks, can fall short at others within the same domain.

This article delves into AI's competencies within the realm of data science, specifically its proficiency in data analysis. We showcase different ways in which Large Language Models (LLMs) and Generative AI can work with transactional data, unlocking potential insights from it, and focus on the possibilities offered by AI integration in data analysis.

The use case — why is this a big thing

Imagine a retail business or an e-commerce giant, or any entity that generates transactional data. The Chief Marketing Officer (CMO) or marketing manager may want to assess the performance of their campaigns or monitor customer behavior during a sales season. Often, these businesses engage a third-party data science service provider. They hand over their business issues or tasks to these experts, who analyze the data and provide insights. This process could take anywhere from 2 to 4 weeks, even for straightforward tasks.

However, with a Large Language Model (LLM) at their disposal — be it OpenAI's or any open-source LLM that integrates with their data — the process becomes much simpler. Instead of writing a SQL query like “Select count(*) from database.table_name where condition1 and condition2 group by gender, age, …” (ok, this query is not complex, but think from the point of view of someone who has no clue about SQL or Python), they can just write “how many customers have purchased in brand x this campaign season, plot the graph by age, gender, etc.” This plain-English instruction to the LLM fetches results instantly, without any SQL knowledge or long waits required!
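
To make this concrete, here is a minimal sketch of that workflow using LangChain's SQL chain (as it existed in the LangChain version used in this post). The database URI, table and question are hypothetical placeholders for illustration:

from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

# Hypothetical transactions database; point this at your own warehouse
db = SQLDatabase.from_uri("sqlite:///transactions.db")
llm = OpenAI(temperature=0, openai_api_key="YOUR-API-KEY")
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

# The chain writes the SQL, runs it, and returns a plain-English answer
db_chain.run("How many customers purchased brand x this campaign season, by age and gender?")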

The Labors of Hercules (or of the data scientists)

The value of preliminary exploration and preparation in data analysis cannot be overstated. Prior to creating any machine learning model, it is essential to meticulously process and explore the data. This exploration, also known as Exploratory Data Analysis (EDA), often includes various operations that will vary based on the nature of the data, the problem being addressed, and numerous other factors.

For a detailed analysis, you’ll need to clean the data (missing values/outliers), observe variable interactions, correlations, scatter plots, density plots, etc. You may also need to test your hypotheses with a t-test, or create some great visualizations that tell the story of the data. Building a machine learning model could be the next step, but doing the exploratory data analysis properly is equally important.

Listed below are some of these tasks — not an exhaustive list, since what is needed depends on the data, the problem, and so on. The point is simply to show some of the tasks that could be done with LLMs:

  1. Metadata Extraction — rows, columns, number of numerical and categorical columns, column names
  2. Data Cleaning — finding missing values for each column, outlier detection by understanding percentile distributions and boxplots
  3. Univariate Analysis — probability distributions of numeric fields (density plots), histograms of certain columns, other plots and charts
  4. Bivariate Analysis — visualizations with two or more variables, scatter plots, correlation tables, heatmaps for correlation
  5. Creating Visualizations — different charts, changing the color, label or chart type to display a specific chart for specific data and filters
  6. Segmentation of customers (RFM or other)
  7. SQL Queries — generating specific results
  8. Hypothesis Testing — running a t-test to check whether a hypothesis holds on the data

SQL analysts are no longer needed

In numerous organizations, the role of a junior analyst frequently involves taking raw data, applying SQL for data extraction and manipulation, performing exploratory data analysis (EDA), and creating visualizations, before passing the insights on for further decision-making.

However, with the advent of Large Language Models (LLMs), the landscape of data analysis is transforming. There is no longer a need to manually write SQL queries or basic Python code. Instead, one simply needs to “prompt” the LLM, giving it clear instructions in plain, uncomplicated English. The model then handles all the intricate coding work, streamlining and simplifying the process significantly.

Why OpenAI GPT models?

There are a lot of LLMs: some are open source, and others, like GPT-4, are proprietary. It is totally up to you which LLM you use for your task; there is no compulsion to use OpenAI’s models. However, OpenAI's model is easy to use, highly accurate in my experience, and works well for quick prototyping, so that is what I am using here.

I will cover how to build the same application with an open-source LLM in the next post.

Let’s dive into the code

Note: The code blocks are visible here, but the outputs of the code blocks are either not visible or quite small. Somehow these images are not getting uploaded properly. Feel free to go to the Colab notebook linked below to view the outputs; meanwhile, I’ll try to fix this issue.

All the code is available in the Colab notebook here. Feel free to view it and run it in your environment.

#Install Dependencies
!pip install langchain openai

import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from getpass import getpass
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI

# Read the key interactively rather than hard-coding it in the notebook
openai_api_key = getpass("Enter your OpenAI API key: ")

You can get your own OpenAI API key; if you are unsure how, please follow the instructions here.

Data: We will be using a very common open-source dataset for this demo — the Titanic survivors dataset, which is freely available here. This is a straightforward, simple analysis, but as stated before, the idea is not to showcase complexity but to show what is possible.

While using the data, be mindful that sending data to the OpenAI APIs may compromise data security, and the data may be used for further training of OpenAI models. If you are using your client’s or your company’s data, get explicit permission before using the OpenAI APIs.

df = pd.read_csv('titanic_train.csv')
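
A quick peek confirms the load (the Kaggle Titanic file has columns such as PassengerId, Survived, Pclass, Sex, Age and Fare):

print(df.shape)  # number of rows and columns
df.head()        # first few records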

Build a LangChain Agent

# Agent that can write and run pandas code against our dataframe
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0, openai_api_key=openai_api_key), df, verbose=True)
# A plain LLM handle, kept around for direct prompting
openai = OpenAI(temperature=0.0, openai_api_key=openai_api_key)

And you are all set to let AI work for you! No need to remember any Python or SQL; just write in plain English and get the results.

Let’s use some queries to generate the EDA plots and numbers. These queries may look simplistic in terms of complexity, but the point here is just to show what is possible — feel free to try more complex queries on your own data!

1. Metadata Extraction and Data Exploration

results = agent("how many records are there? Give me rows and columns")
results = agent("What are Column names?")
results = agent("Which columns are categorical and which are numeric?")

2. Basic Exploration

Missing Values

results = agent("Are there any missing values in columns")
results = agent("What is the frequency of unique values in sex and age column")

Outlier Detection

query = """Are there any outliers in terms of Fare? Find out using a boxplot"""
results = agent(query)
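
Behind the scenes, the agent generates something along these lines (a sketch; matplotlib was imported as plt above):

print(df["Fare"].describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99]))  # percentile distribution
df.boxplot(column="Fare")  # points beyond the whiskers are candidate outliers
plt.show()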

3. Probability Calculations

results = agent("What is the probability of survival among male and females")
results = agent("What is the probability of survival for females over 35 years of age")

4. Data Visualizations

results = agent("Draw a bar plot to show number of survivors by gender")
results = agent("Draw a histgram on fares with bins from 0 to 10, 10 to 20,
and so on. Make it in red color")

5. Plotting and comparing density plots

results = agent("Compare the distribution of fares between males and females
with Kde plots")

6. Bivariate Analysis

results = agent("Plot is the correlation between survivor, age and fare in 
a heatmap, use the seaborn library for vizualization, and tab10 color palette")
results = agent("Show the scatter plot between fare and age, color coded
by sex, color palette Set3")
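
For comparison, the seaborn code the agent produces for these two prompts is roughly the following (a sketch):

sns.heatmap(df[["Survived", "Age", "Fare"]].corr(), annot=True, cmap="tab10")
plt.show()
sns.scatterplot(data=df, x="Age", y="Fare", hue="Sex", palette="Set3")
plt.show()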

7. Hypothesis Testing

The null hypothesis (H₀) is a statement of no effect, no difference, or no correlation within the population. The alternative hypothesis (H₁) states the contrary: that an effect, a difference, or a correlation does exist within the population.

Let us test a hypothesis with the help of the LLM, without writing a single line of code:

H₀: The fares of those who survived the Titanic are the same as the fares of those who did not survive.

H₁: The fares of those who survived the Titanic are different from the fares of those who did not survive.

Quick question: if you are a data scientist and don’t do this regularly, how long would it take you to google the code for it? :)

query = """Do t-test for the following hypothesis.
Null Hypothesis: Fares of those who survived Titanic is same as those who did not survive.
Alternate Hypothesis: Fares of those who survived Titanic is different than those who did not survive
"""
results = agent(query)

Final Answer: The t-test result is a statistic of 7.939 and a p-value of 6.12e-15, which indicates that the fares of those who survived Titanic is significantly different than those who did not survive.
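
For reference, the agent's work here boils down to a two-sample t-test with scipy (my reconstruction, not the agent's verbatim code):

from scipy import stats

fares_survived = df.loc[df["Survived"] == 1, "Fare"].dropna()
fares_not_survived = df.loc[df["Survived"] == 0, "Fare"].dropna()

# Two-sample t-test; a small p-value means we reject the null hypothesis
t_stat, p_value = stats.ttest_ind(fares_survived, fares_not_survived)
print(t_stat, p_value)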

8. Clustering with K-means
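
The full code for this step is in the Colab notebook; as a sketch, a prompt along these lines asks the agent to segment the passengers (my wording, not the notebook's exact query):

query = """Cluster the passengers into 3 segments using KMeans on Age and Fare.
Impute missing ages with the median first, then report the size and the
average age and fare of each cluster."""
results = agent(query)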

9. Generating Model Code

Build a Random Forest Classifier

query = """Build a random forest model to predict the probability of 
survival given this data.. Give me the coefficients of all variales and
tell me the accuracy metrics from the model evaluation
"""
results = agent(query)

Compare multiple ML models


query = """With survided as a target variable, compare the accuracy,
precision and recall of logistic regression, xgboost, and random forest model"""
results = agent(query)
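
Under the hood, the agent writes something equivalent to this scikit-learn comparison (a sketch; I leave xgboost out here since it needs a separate pip install xgboost, and the feature set and imputation are my own choices for illustration):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# A simple numeric feature set with median imputation
features = df[["Pclass", "Age", "Fare"]].fillna(df[["Pclass", "Age", "Fare"]].median())
X_train, X_test, y_train, y_test = train_test_split(
    features, df["Survived"], test_size=0.2, random_state=42)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=42))]:
    preds = model.fit(X_train, y_train).predict(X_test)
    print(name,
          accuracy_score(y_test, preds),
          precision_score(y_test, preds),
          recall_score(y_test, preds))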

What does this mean for the future of data analysis?

In my opinion, the day is not far off when either database providers will integrate LLMs into their data warehouses, or numerous companies will emerge offering everyone this flexibility of “asking questions in plain English”. This may mean the elimination of SQL coding and of almost all manual database-querying workflows.

Keep in mind that it has not even been 3–4 months since these tools and technologies came to market. Think about what will happen in the next 6 months, 12 months, or 2 years!


Ashutosh Kumar

Data Science @ Epsilon; interested in technology, data, algorithms and blockchain. Reach out to me at ashu.iitkgp@gmail.com