LLMs Meet SQL: Revolutionizing Data Querying with Natural Language Processing

Using Prompt Engineering, SQL Agents, and Much More

Image by the Author

Introduction:

Image by the Author

Welcome to the exciting world of combining large language models (LLMs) with structured data like tables and SQL databases! Imagine having a super-smart assistant that can talk to databases in their own language, making it super easy for us to get the information we need. This isn’t just about asking questions and getting answers; it’s about creating tools that feel like magic.

In this article, we’re diving into how these powerful models can make our lives easier in many ways. We’ll explore how they can write database queries just by understanding our questions, help us build chatbots that answer from the data in our databases, and let us set up custom dashboards that show the information we care most about. But that’s not all — we’ll also discover more amazing things we can do when we combine the brains of LLMs with the organized world of structured data. So, get ready to unlock new possibilities and make interacting with data a breeze!

A simple query like this

Image by the Author

or a complex one like this

Image by the Author

Let's dive in.

Part 1: Choose the SQL Database, Create the Schema, and Load the Data

Part 2: Agents & SQL Agent

Part 3: Using Prompt Engineering

Part 4: SQL Query Validation

Part 5: Non-Descriptive or Language-Specific Table and Field Names in Databases

Part 6: Large-Scale Enterprise Challenges

Part 7: Text 2 Visualization

Part 8: Using Ollama for Text 2 SQL

Part 9: Text 2 SQL Evaluation

Part 10: Other Articles and Repositories to Refer To

Part 11: Text2SQL Using LlamaIndex

Part 13: Fine-tuning an LLM with a SQL Dataset

Part 14: Pandas AI

Part 15: Some More Relevant Papers

Image by the Author

Part 1: Choose the SQL Database, Create the Schema, and Load the Data

Image by the Author

SQL and Its Role in Data Management:

  • SQL stands for Structured Query Language, and it’s like the magic words you use to talk to databases — telling them what data to store, find, or change.
  • It’s super important because pretty much every app out there uses a database, and SQL is the go-to language to manage all that data.
Image by the Author

Choosing the Right Database:

  • Picture PostgreSQL and MySQL as different toolboxes for your data; SQLAlchemy, in contrast, is a Python toolkit that lets you talk to whichever database you pick.
  • MySQL is chosen here because it’s easy to use, quick, and fits projects of all sizes. It’s like your reliable kitchen knife — great for many tasks.
  • PostgreSQL is actually the crowd favorite for its robust features, but for this guide, we’re keeping it simple with MySQL.
  • If you’re working straight from a Jupyter notebook and want something more straightforward, SQLAlchemy is your buddy. It lets you talk to your database with Python code, no heavy lifting required (a short sketch follows below).

And that’s why MySQL is the pick for this project — simple, speedy, and suits everything from your weekend hobby to your full-blown business needs!
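As a minimal sketch of the SQLAlchemy route from a notebook, the snippet below connects to the SalesOrderSchema database we create later in this part and runs one query. It assumes the mysql-connector-python driver is installed; adjust the credentials to your setup.

from sqlalchemy import create_engine, text

# Connect to the MySQL schema used throughout this article
engine = create_engine("mysql+mysqlconnector://user:password@localhost/SalesOrderSchema")

with engine.connect() as conn:
    # Run a simple query and print the single scalar result
    result = conn.execute(text("SELECT COUNT(*) FROM Customer"))
    print(result.scalar())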

Download MySQL:

For our project purpose, we will create a sales order schema:

Schema:

In a relational database, a schema is like a blueprint that defines how your data is structured and organized. It includes details about tables, relationships, and data types, setting the stage for storing and retrieving data efficiently.

The SalesOrder schema is crafted to capture and reflect the intricacies of sales transactions. It’s designed to store everything from customer information and inventory status to detailed sales orders and supplier data.

The schema encompasses seven pivotal tables:

  1. Customer: Keeps track of customer details, purchase history, and contact information.
  2. Employee: Records information about the staff, including their roles, contact details, and salaries.
  3. InventoryLog: Monitors inventory changes, providing insights into stock levels and movement.
  4. LineItem: Details each item within a sales order, including price and quantity.
  5. Product: Catalogs products, their descriptions, prices, and stock quantities.
  6. SalesOrder: Central to the schema, this table records the sales transactions, including dates, statuses, and payment details.
  7. Supplier: Contains data on the vendors supplying the products, essential for managing the supply chain.
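The article does not list the DDL itself, so here is a minimal sketch of how one of these tables could be created with mysql-connector-python. The column names and types mirror the table info used later in this article; the primary key and AUTO_INCREMENT choices are assumptions.

import mysql.connector

# Connect to the MySQL server (no database selected yet)
conn = mysql.connector.connect(host="localhost", user="root", password="Your MySQL Password")
cursor = conn.cursor()

# Create the schema and one example table; the other six tables follow the same pattern
cursor.execute("CREATE DATABASE IF NOT EXISTS SalesOrderSchema")
cursor.execute("USE SalesOrderSchema")
cursor.execute("""
CREATE TABLE IF NOT EXISTS Customer (
    CustomerID INT AUTO_INCREMENT PRIMARY KEY,
    FirstName VARCHAR(100),
    LastName VARCHAR(100),
    Email VARCHAR(255),
    Phone VARCHAR(20),
    BillingAddress TEXT,
    ShippingAddress TEXT,
    CustomerSince DATE,
    IsActive TINYINT
)
""")

conn.commit()
cursor.close()
conn.close()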
Image by the Author
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_DEFAULT, CHARACTER_MAXIMUM_LENGTH 
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'SalesOrderSchema'
ORDER BY TABLE_NAME, ORDINAL_POSITION;
Image by the Author

Load the data:

To generate and load data for tables like Customer, Employee, and Product, one would:

Install Faker via pip.

Use Faker to create data for each table’s fields, considering the specific needs (e.g., names, addresses, product details).

Write Python scripts that insert this fake data into a MySQL database, utilizing libraries like mysql-connector-python or SQLAlchemy for database interaction.

This script is useful for populating a database with sample data for testing or development purposes.

Imports necessary libraries: The script uses mysql.connector to connect to a MySQL database and Faker to generate fake data.

Initializes Faker: Faker is set up to create realistic but fake data like names, emails, phone numbers, addresses, and dates.

Connects to the MySQL database: It establishes a connection to a MySQL database located on localhost with the database name SalesOrderSchema. The user is root, and you should replace “Your MySQL Password” with the actual password.

Creates a cursor object: A cursor is used to execute SQL commands through Python.

Generates and inserts data: For each of the 100 iterations, it generates fake data for a customer, including first name, last name, email, phone number, address, the date they became a customer, and whether they are an active customer. If the generated phone number is longer than 20 characters, it truncates it so that it fits into the database column. Both the billing and shipping addresses are set to the same generated address. It then inserts this data into the Customer table of the database using an SQL INSERT statement.

Commits the transaction: After all the data is inserted, the changes are saved to the database with conn.commit().

Closes the cursor and connection: Finally, it cleans up by closing the cursor and the connection to the database.

#The code for loading data into the customer table
#Customer Table
import mysql.connector
from faker import Faker

# Initialize Faker
fake = Faker()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor()

# Generate and insert data
for _ in range(100):  # Let's say we want to generate 100 records
    first_name = fake.first_name()
    last_name = fake.last_name()
    email = fake.email()
    phone = fake.phone_number()
    if len(phone) > 20:  # Assuming the 'Phone' column is VARCHAR(20)
        phone = phone[:20]  # Truncate phone number to fit into the column
    address = fake.address()
    customer_since = fake.date_between(start_date='-5y', end_date='today')
    is_active = fake.boolean()

    # Insert customer data
    cursor.execute("""
        INSERT INTO Customer (FirstName, LastName, Email, Phone, BillingAddress, ShippingAddress, CustomerSince, IsActive)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
    """, (first_name, last_name, email, phone, address, address, customer_since, is_active))

# Commit the transaction
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

Employee Table: Populate using Faker

#Employee Table
import mysql.connector
from faker import Faker

# Initialize Faker
fake = Faker()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor()

# Generate and insert 1000 employee records
for _ in range(1000):
    first_name = fake.first_name()
    last_name = fake.last_name()
    email = fake.email()
    phone = fake.phone_number()
    if len(phone) > 20:  # Truncate phone number if necessary
        phone = phone[:20]
    hire_date = fake.date_between(start_date='-5y', end_date='today')
    position = fake.job()
    salary = round(fake.random_number(digits=5), 2)  # Generate a 5 digit salary

    # Insert employee data
    cursor.execute("""
        INSERT INTO Employee (FirstName, LastName, Email, Phone, HireDate, Position, Salary)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
    """, (first_name, last_name, email, phone, hire_date, position, salary))

# Commit the transaction
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

print("1000 employee records inserted successfully.")
#Product Table
import mysql.connector
from faker import Faker
import random

# Initialize Faker
fake = Faker()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor()

# Generate and insert data into the Product table
for _ in range(1000):  # Generate 1000 product records
    product_name = fake.word().capitalize() + " " + fake.word().capitalize()
    description = fake.sentence(nb_words=10)
    unit_price = round(random.uniform(10, 500), 2)  # Random price between $10 and $500
    stock_quantity = random.randint(10, 1000)  # Random stock quantity between 10 and 1000
    reorder_level = random.randint(5, 50)  # Random reorder level between 5 and 50
    discontinued = random.choice([0, 1])  # Randomly choose between 0 (false) and 1 (true)

    # Insert product data
    cursor.execute("""
        INSERT INTO Product (ProductName, Description, UnitPrice, StockQuantity, ReorderLevel, Discontinued)
        VALUES (%s, %s, %s, %s, %s, %s)
    """, (product_name, description, unit_price, stock_quantity, reorder_level, discontinued))

# Commit the transaction
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

print("Products inserted successfully.")
#Supplier Table
import mysql.connector
from faker import Faker
import random

# Initialize Faker
fake = Faker()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor()

# Generate and insert data into the Supplier table
for _ in range(1000):  # Assuming you want to insert 1000 records
    company_name = fake.company()
    contact_name = fake.name()
    contact_title = fake.job()
    # Ensure ContactTitle does not exceed the column's max length, e.g., VARCHAR(50)
    contact_title = contact_title[:50] if len(contact_title) > 50 else contact_title
    address = fake.address().replace('\n', ', ')  # Replace newlines with commas for address
    phone = fake.phone_number()
    # Ensure phone does not exceed the column's max length, e.g., VARCHAR(20)
    phone = phone[:20] if len(phone) > 20 else phone
    email = fake.email()

    # Insert supplier data
    cursor.execute("""
        INSERT INTO Supplier (CompanyName, ContactName, ContactTitle, Address, Phone, Email)
        VALUES (%s, %s, %s, %s, %s, %s)
    """, (company_name, contact_name, contact_title, address, phone, email))

# Commit the transaction
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

print("Suppliers inserted successfully.")
#Sales Order Table
import mysql.connector
from faker import Faker
from datetime import timedelta
import random

fake = Faker()

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor(buffered=True)

# Fetch customer IDs
cursor.execute("SELECT CustomerID FROM Customer")
customer_ids = [id[0] for id in cursor.fetchall()]

# Insert data into SalesOrder
for _ in range(1000):  # Let's say we want to generate 1000 sales orders
    customer_id = random.choice(customer_ids)
    order_date = fake.date_between(start_date='-2y', end_date='today')
    required_date = order_date + timedelta(days=random.randint(1, 30))
    shipped_date = order_date + timedelta(days=random.randint(1, 30)) if random.choice([True, False]) else None
    status = random.choice(['Pending', 'Completed', 'Shipped'])
    is_paid = random.choice([True, False])

    cursor.execute("""
        INSERT INTO SalesOrder (CustomerID, OrderDate, RequiredDate, ShippedDate, Status, IsPaid)
        VALUES (%s, %s, %s, %s, %s, %s)
    """, (customer_id, order_date, required_date, shipped_date, status, is_paid))

conn.commit()

#Sales Order Line Item
# Fetch product IDs
cursor.execute("SELECT ProductID FROM Product")
product_ids = [id[0] for id in cursor.fetchall()]

# Fetch sales order IDs
cursor.execute("SELECT SalesOrderID FROM SalesOrder")
sales_order_ids = [id[0] for id in cursor.fetchall()]

# Insert data into LineItem
for _ in range(5000):  # Assuming multiple line items per order
    sales_order_id = random.choice(sales_order_ids)
    product_id = random.choice(product_ids)
    quantity = random.randint(1, 10)
    unit_price = round(random.uniform(10, 100), 2)  # Assuming you have this info or fetch it from the Product table
    total_price = quantity * unit_price

    cursor.execute("""
        INSERT INTO LineItem (SalesOrderID, ProductID, Quantity, UnitPrice, TotalPrice)
        VALUES (%s, %s, %s, %s, %s)
    """, (sales_order_id, product_id, quantity, unit_price, total_price))

conn.commit()
cursor.close()
conn.close()
#Inventory Table
import mysql.connector
from faker import Faker
import random

# Initialize Faker
fake = Faker()

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="Your MySQL Password",
    database="SalesOrderSchema"
)
cursor = conn.cursor()

# Fetch Product IDs
cursor.execute("SELECT ProductID FROM Product")
product_ids = [row[0] for row in cursor.fetchall()]

# Assuming you want to insert 1000 inventory log records
for _ in range(1000):
    product_id = random.choice(product_ids)  # Randomly select a product ID
    change_date = fake.date_between(start_date="-1y", end_date="today")
    quantity_change = random.randint(-100, 100)  # Assuming inventory can increase or decrease
    notes = "Inventory " + ("increased" if quantity_change > 0 else "decreased")

    # Insert inventory log data
    cursor.execute("""
        INSERT INTO InventoryLog (ProductID, ChangeDate, QuantityChange, Notes)
        VALUES (%s, %s, %s, %s)
    """, (product_id, change_date, quantity_change, notes))

# Commit the transaction
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

print("Inventory logs inserted successfully.")
Image by the Author

Text 2 SQL Flow 👆

Part 2: Agents & SQL Agent

Let's first build the text-to-SQL flow using a SQL agent.

What is an AI Agent?

Definition: An AI agent is a computer program designed to perform tasks by simulating some aspects of human intelligence. It can make decisions, interact with its environment, or solve problems without continuous human guidance.

Capabilities:

Decision-making: AI agents can evaluate situations and make choices based on the data they have or the rules programmed into them.

Problem-solving: They are capable of navigating through complex scenarios to achieve specific goals or find solutions to problems.

Learning: Some AI agents have the ability to learn from data or past experiences, improving their performance over time. This is often referred to as machine learning.

Types of AI agents:

Simple reflex agents: React to the current situation or environment based on pre-defined rules without considering the past or future.
Model-based reflex agents: Take into account the current state of the world and how it changes in response to actions, allowing for a more informed decision-making process.
Goal-based agents: Operate by considering future actions and their outcomes to achieve specific goals.
Utility-based agents: Evaluate the success of their actions based on a utility function, aiming to maximize their satisfaction or benefit.
Learning agents: Improve their performance and adapt to new situations by learning from their environment and past actions.

Applications:

Virtual assistants: Such as Siri or Alexa, that can perform tasks or services for an individual.
Autonomous vehicles: Cars or drones that navigate and operate without human intervention.
Recommendation systems: Like those used by Netflix or Amazon to suggest products or movies based on your preferences.
Healthcare: AI agents can assist in diagnosing diseases, predicting patient outcomes, or personalizing treatment plans.

Benefits:

Efficiency: They can automate and perform tasks faster and more accurately than humans in many cases.
Availability: AI agents are available 24/7, providing consistent service without the need for breaks or sleep.
Personalization: Capable of tailoring experiences, recommendations, and interactions to individual preferences.

Challenges:

Ethical and privacy concerns: Decisions on how AI agents use and share data need to be made carefully.
Dependence: Over-reliance on AI agents could impact human skills and employment.
Complexity in development and maintenance: Creating and updating AI agents requires significant expertise and resources.

Image by the Author
Image by the Author

SQL Agent:

Capabilities:

Natural Language Queries: Allows users to interact with databases through natural language, making it easier for non-technical users to extract information without knowing SQL syntax.
AI-Assisted Database Interaction: Enhances database interactions with AI, enabling more complex queries, data analysis, and insights extraction through conversational interfaces.
Integration with Language Models: Combines AI language models with SQL databases, facilitating the automatic generation of SQL queries from natural language inputs and interpreting the results for the user.

Components:

Language Models: Pre-trained AI models capable of understanding and generating human-like text.
Query Generation: Mechanism to translate natural language requests into SQL queries.
Result Interpretation: Converts SQL query results back into a human-readable format or summary.

Applications:

Data Exploration: Enables more intuitive data exploration and analysis for users without deep technical knowledge.
Business Intelligence: Facilitates the generation of reports and insights through conversational interfaces.
Automation: Streamlines interactions between users and databases, automating query generation and data retrieval processes.

Let's see how we can use an SQL agent and do the text-to-SQL conversion.

Image by the Author

The code is below.



import os
import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

# Set your OpenAI API key here
os.environ["OPENAI_API_KEY"] = "Your OpenAI API Key"

# Directly using database connection details
host = "localhost"
user = "root"
password = "Your MySQL Password"
database = "SalesOrderSchema"

# Setup database connection
db_uri = f"mysql+mysqlconnector://{user}:{password}@{host}/{database}"
db = SQLDatabase.from_uri(db_uri)
llm = ChatOpenAI(model="gpt-4", temperature=0)
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

# Streamlit app layout
st.title('SQL Chatbot')

# User input
user_query = st.text_area("Enter your SQL-related query:", "List Top 10 Employees by Salary?")

if st.button('Submit'):
    try:
        # Processing user input
        response = agent_executor.invoke({
            "agent_scratchpad": "",  # Assuming this needs to be an empty string if not used
            "input": user_query     # The agent expects the question under the "input" key
        })
        st.write("Response:")
        st.json(response)  # Use st.json to pretty print the response if it's a JSON
    except Exception as e:
        st.error(f"An error occurred: {e}")

About the Script:

Import libraries and modules: The script starts by importing necessary libraries such as os, streamlit (as st), and specific modules from langchain_openai and langchain_community for creating and managing the SQL chatbot.

Set OpenAI API key: It sets an environment variable OPENAI_API_KEY with your OpenAI API key, which is necessary for accessing OpenAI’s language models.

Database connection details: The script defines variables for database connection details including the host, user, password, and database name to connect to a MySQL database.

Setup database connection: Creates a connection to the specified MySQL database using the provided credentials and constructs a SQLDatabase object.

Initialize language model and SQL agent: It initializes a language model from OpenAI’s GPT-4 and creates a SQL agent. This agent can interpret SQL-related queries and interact with the database using natural language processing.

Image credit Streamlit docs

Streamlit app interface:

  1. Sets up a simple web interface titled ‘SQL Chatbot’.
  2. Provides a text area for users to input their SQL-related queries.
  3. Includes a Submit button for users to execute their queries.

Process and display responses:

  1. Upon clicking the Submit button, the app attempts to process the user’s query using the SQL agent.
  2. It formats and displays the SQL agent’s response as JSON in the Streamlit interface.
  3. If an error occurs during the process, it displays an error message.

Error handling: The script includes a try-except block to catch and display errors that may occur during the query processing or response generation phase.

How to run the Streamlit Python script:

Install Streamlit: If you haven't installed Streamlit yet, open your terminal (or Command Prompt/PowerShell on Windows) and run the following command:

pip install streamlit

Save your script: Make sure your Streamlit script (for example, app.py) is saved in a known directory on your computer.

Open a terminal: Navigate to the directory where your Streamlit script is saved. You can do this by using the cd command followed by the path to the directory. For example:

cd path/to/your/script

Run the Streamlit script: In the terminal, execute the following command (replace app.py with the name of your Streamlit script file if it's different):

streamlit run app.py

Access the web interface: After running the command, Streamlit will start the server and provide you with a local URL, usually something like http://localhost:8501. Open this URL in your web browser to view your Streamlit application.

Interact with your app: Use the web interface to interact with your Streamlit application. You can enter inputs, press buttons, and see the output of your script in real time.

Stop the Streamlit server: When you're done, you can stop the Streamlit server by pressing Ctrl+C in the terminal where the server is running.

These steps will allow you to run and interact with any Streamlit script, turning your Python scripts into interactive web applications easily.

Query:1

Generate a report showing current inventory levels, reorder level, and whether the product is below reorder level for each product; execute the query and display the results.
Image by the Author

Query:2

Calculate the lifetime value (total sales) of each customer who has made a 
purchase within the last three years, sorted by the highest value first and
display the top 5?
Image by the Author
Image by the Author

Part 3: Using Prompt Engineering

Image by the Author

Prompt engineering involves crafting the input given to the model in a way that guides it towards generating the desired output. This is particularly useful for text-to-SQL tasks where the goal is to translate natural language queries into precise SQL statements.

How It Works:

Examples as Guides: By including examples of natural language queries alongside their correct SQL translations in the prompt, you essentially provide the model with a template to follow. This method uses the principle of few-shot learning, where the model uses the provided examples to understand and generalize the task at hand.

Contextual Information: Adding information about the database schema, such as table names and relationships, can help the model generate more accurate SQL queries. This contextual backdrop aids the model in mapping natural language terms to their corresponding entities in the database.

Task Description: Starting the prompt with a clear task description (e.g., “Translate the following natural language queries into SQL:”) signals the model’s expected task, priming it for better performance on the specified task.

Example Prompt:

I'm an SQL assistant designed to translate natural language queries into SQL. 
Given a sales database schema with tables for Customer, Employee, InventoryLog,
LineItem, Product, SalesOrder, and Supplier, here are some examples:

- Natural language query: "Show me all products priced above $100."
- SQL translation: SELECT * FROM Product WHERE price > 100;

- Natural language query: "List all orders placed by CustomerID 123 in 2023."
- SQL translation: SELECT * FROM SalesOrder WHERE CustomerID = 123 AND YEAR(OrderDate) = 2023;

Translate the following query:
"Find the total sales for each product in March 2024."

For example, the system prompt will be:

Given the following database schema, generate a SQL query that corresponds 
to the user's request.

Tables and their relevant fields:
- Product (ProductID, ProductName)
- LineItem (ProductID, TotalPrice)
- SalesOrder (SalesOrderID, OrderDate)

The user is interested in the total sales for each product during the last
calendar year. The query should return the name of each product along with
the sum of the total price for all sales in that period. Ensure the query
considers only those line items linked to sales orders from the last year.

Example SQL template for reference:
"SELECT [columns] FROM [table] JOIN [other_table] ON [condition] WHERE
[condition] GROUP BY [column];"

Remember to alias table names where appropriate for readability and maintain
SQL best practices in your query structure.

and the user prompt:

"Show me the total sales for each product last year."

LLM produces the following output:

SELECT 
P.ProductName,
SUM(L.TotalPrice) AS TotalSales
FROM
Product P
JOIN LineItem L ON P.ProductID = L.ProductID
JOIN SalesOrder S ON L.SalesOrderID = S.SalesOrderID
WHERE
YEAR(S.OrderDate) = YEAR(CURDATE()) - 1
GROUP BY
P.ProductName;
Image by the Author

Choosing dialect-specific prompting in LangChain

from langchain.chains.sql_database.prompt import SQL_PROMPTS

list(SQL_PROMPTS)
['crate',
'duckdb',
'googlesql',
'mssql',
'mysql',
'mariadb',
'oracle',
'postgresql',
'sqlite',
'clickhouse',
'prestodb']

Since we are using MySQL, it will look like this:

from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature="0")
chain = create_sql_query_chain(llm, db)
chain.get_prompts()[0].pretty_print()

The prompt will be

You are a MySQL expert. Given an input question, first create a syntactically 
correct MySQL query to retrieve the answer, then execute the query and return
the answer to the input question.

- Use the LIMIT clause to query for at most 5 results unless the user
specifies a different number.
- Never query for all columns from a table. Select only the columns necessary
to answer the question.
- Enclose each column name in backticks (`) to prevent syntax errors due to
reserved keywords or special characters.
- Be mindful of the column names and their respective tables.
- Use CURDATE() for the current date if the question involves "today".

Structure your response like this:

Question: [The user's question]
MySQL Query: [The SQL query you generated]
Result: [The result of the SQL query, if available]
Answer: [The final answer based on the query results]

Only use the following tables in the SalesOrder schema: Customer, Employee,
InventoryLog, LineItem, Product, SalesOrder, Supplier. Here's some sample data from these tables for your reference: [include sample data].

Question: [input from user]

To get the table names, their schemas, and a sample of rows from each table, use the following:

from langchain_community.utilities import SQLDatabase

# Adjust the connection URI for MySQL
db = SQLDatabase.from_uri("mysql+mysqlconnector://user_id:password@localhost/SalesOrderSchema")

# Print the SQL dialect (should now reflect MySQL)
print(db.dialect)

# Print the usable table names in your MySQL database
print(db.get_usable_table_names())

# Run a sample query - adjust the table name as per your schema
# This is an example; replace `Artist` with an actual table name from your schema, like `Customer`
db.run("SELECT * FROM Customer LIMIT 10;")

mysql
['Customer', 'Employee', 'InventoryLog', 'LineItem', 'Product', 'SalesOrder', 'Supplier']
"[(1, 'Sandra', 'Cruz', 'rhonda24@example.net', '511-949-6987x21174',
'18018 Kyle Streets Apt. 606\\nShaneville, AZ 85788', '18018 Kyle Streets Apt.
606\\nShaneville, AZ 85788', datetime.date(2023, 5, 2), 0), (2, 'Robert',
'Williams', 'traciewall@example.net', '944-649-2491x60774', '926 Mitchell
Pass Apt. 342\\nBrianside, SC 83374', '926 Mitchell Pass Apt. 342\\nBrianside,
SC 83374', datetime.date(2020, 9, 1), 0), (3, 'John', 'Greene',
'travis92@example.org', '279.334.1551', '36019 Bill Manors Apt.
219\\nDominiquefort, AK 55904', '36019 Bill Manors Apt. 219\\nDominiquefort,
AK 55904', datetime.date(2021, 3, 15), 0), (4, 'Steven', 'Riley',
'greennathaniel@example.org', '+1-700-682-7696x189', '76545 Hebert
Crossing Suite 235\\nForbesbury, MH 14227', '76545 Hebert Crossing Suite
235\\nForbesbury, MH 14227', datetime.date(2022, 12, 5), 0), (5, 'Christina',
'Blake', 'christopher87@example.net', '584.263.4429', '8342 Shelly Fork\\nWest
Chasemouth, CT 81799', '8342 Shelly Fork\\nWest Chasemouth, CT 81799',
datetime.date(2019, 11, 12), 0), (6, 'Michael', 'Stevenson',
'lynnwilliams@example.org', '328-637-4320x7025', '7503 Mallory Mountains Apt.
199\\nMeganport, MI 81064', '7503 Mallory Mountains Apt. 199\\nMeganport, MI
81064', datetime.date(2024, 1, 1), 1), (7, 'Anna', 'Kramer', 'steven23@example
.org', '+1-202-719-6886x844', '295 Mcgee Fort\\nManningberg, PR 93309', '295
Mcgee Fort\\nManningberg, PR 93309', datetime.date(2022, 3, 6), 1), (8,
'Michael', 'Sullivan', 'bbailey@example.com', '988.368.5033', '772 Bruce
Motorway Suite 583\\nPowellbury, MH 42611', '772 Bruce Motorway Suite 583\\
nPowellbury, MH 42611', datetime.date(2019, 3, 23), 1), (9, 'Kevin', 'Moody',
'yoderjennifer@example.org', '3425196543', '371 Lee Lake\\nNew Michaelport,
CT 99382', '371 Lee Lake\\nNew Michaelport, CT 99382', datetime.date(2023, 12, 3), 1), (10, 'Jeremy', 'Mejia', 'spencersteven@example.org', '449.324.7097', '90137 Harris Garden\\nMatthewville, IA 39321', '90137 Harris Garden\\nMatthewville, IA 39321', datetime.date(2019, 5, 20), 1)]"
context = db.get_context()
print(list(context))
print(context["table_info"])

The output:

Image by the Author

We can use the above to update our prompt, or simply pass everything in. GPT-4 Turbo's context window is 128K tokens, so we are good.

prompt_with_context = chain.get_prompts()[0].partial(table_info=context["table_info"])
print(prompt_with_context.pretty_repr()[:1500])

For more details on prompting, please check:

How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings

Image by the Author
Image by the Author

Few Shot examples:

Here are some examples:

Image by the Author
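The examples variable used in the code below is simply a list of question/SQL pairs. Here is a short, assumed version, reconstructed from the examples that appear in the formatted prompt further down:

examples = [
    {"input": "List all customers.",
     "query": "SELECT * FROM Customer;"},
    {"input": "Find all orders placed by customer with ID 1.",
     "query": "SELECT * FROM SalesOrder WHERE CustomerID = 1;"},
    {"input": "List all products currently in stock.",
     "query": "SELECT * FROM Product WHERE StockQuantity > 0;"},
    {"input": "How many products are low on stock (below reorder level)?",
     "query": "SELECT COUNT(*) FROM Product WHERE StockQuantity < ReorderLevel;"},
    # ...add more pairs covering joins, aggregations, and date filters
]
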
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

example_prompt = PromptTemplate.from_template("User input: {input}\nSQL query: {query}")

prompt = FewShotPromptTemplate(
    examples=examples[:15],  # 'examples' is the list of input/query pairs defined above
    example_prompt=example_prompt,
    prefix="You are a MySQL expert. Given an input question, create a syntactically correct MySQL query to run. Unless otherwise specified, do not return more than {top_k} rows.\n\nHere is the relevant table info: {table_info}\n\nBelow are a number of examples of questions and their corresponding SQL queries.",
    suffix="User input: {input}\nSQL query: ",
    input_variables=["input", "top_k", "table_info"],
)


table_info = """
- Customer (CustomerID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), BillingAddress TEXT, ShippingAddress TEXT, CustomerSince DATE, IsActive TINYINT)
- Employee (EmployeeID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), HireDate DATE, Position VARCHAR(100), Salary DECIMAL)
- InventoryLog (LogID INT, ProductID INT, ChangeDate DATE, QuantityChange INT, Notes TEXT)
- LineItem (LineItemID INT, SalesOrderID INT, ProductID INT, Quantity INT, UnitPrice DECIMAL, Discount DECIMAL, TotalPrice DECIMAL)
- Product (ProductID INT, ProductName VARCHAR(255), Description TEXT, UnitPrice DECIMAL, StockQuantity INT, ReorderLevel INT, Discontinued TINYINT)
- SalesOrder (SalesOrderID INT, CustomerID INT, OrderDate DATE, RequiredDate DATE, ShippedDate DATE, Status VARCHAR(50), Comments TEXT, PaymentMethod VARCHAR(50), IsPaid TINYINT)
- Supplier (SupplierID INT, CompanyName VARCHAR(255), ContactName VARCHAR(100), ContactTitle VARCHAR(50), Address TEXT, Phone VARCHAR(20), Email VARCHAR(255))
"""

# Example usage
input_example = "List all products currently in stock."
top_k = 10
formatted_prompt = prompt.format(input=input_example, top_k=top_k, table_info=table_info)
print(formatted_prompt)

The output:

You are a MySQL expert. Given an input question, create a syntactically 
correct MySQL query to run. Unless otherwise specified, do not return more
than 10 rows.

Here is the relevant table info:
- Customer (CustomerID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), BillingAddress TEXT, ShippingAddress TEXT, CustomerSince DATE, IsActive TINYINT)
- Employee (EmployeeID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), HireDate DATE, Position VARCHAR(100), Salary DECIMAL)
- InventoryLog (LogID INT, ProductID INT, ChangeDate DATE, QuantityChange INT, Notes TEXT)
- LineItem (LineItemID INT, SalesOrderID INT, ProductID INT, Quantity INT, UnitPrice DECIMAL, Discount DECIMAL, TotalPrice DECIMAL)
- Product (ProductID INT, ProductName VARCHAR(255), Description TEXT, UnitPrice DECIMAL, StockQuantity INT, ReorderLevel INT, Discontinued TINYINT)
- SalesOrder (SalesOrderID INT, CustomerID INT, OrderDate DATE, RequiredDate DATE, ShippedDate DATE, Status VARCHAR(50), Comments TEXT, PaymentMethod VARCHAR(50), IsPaid TINYINT)
- Supplier (SupplierID INT, CompanyName VARCHAR(255), ContactName VARCHAR(100), ContactTitle VARCHAR(50), Address TEXT, Phone VARCHAR(20), Email VARCHAR(255))


Below are a number of examples of questions and their corresponding SQL queries.

User input: List all customers.
SQL query: SELECT * FROM Customer;

User input: Find all orders placed by customer with ID 1.
SQL query: SELECT * FROM SalesOrder WHERE CustomerID = 1;

User input: List all products currently in stock.
SQL query: SELECT * FROM Product WHERE StockQuantity > 0;

User input: Find the supplier for product with ID 10.
SQL query: SELECT s.CompanyName FROM Supplier s JOIN Product p ON s.SupplierID = p.SupplierID WHERE p.ProductID = 10;

User input: List the sales orders that have not been shipped yet.
SQL query: SELECT * FROM SalesOrder WHERE Status = 'Pending';

User input: How many employees work in the sales department?
SQL query: SELECT COUNT(*) FROM Employee WHERE Position LIKE '%sales%';

User input: List the top 5 most sold products.
SQL query: SELECT ProductID, SUM(Quantity) AS TotalQuantity FROM LineItem GROUP BY ProductID ORDER BY TotalQuantity DESC LIMIT 5;

User input: Find the total sales amount for orders completed this year.
SQL query: SELECT SUM(TotalPrice) FROM SalesOrder WHERE YEAR(OrderDate) = YEAR(CURDATE()) AND Status = 'Completed';

User input: List all suppliers from 'New York'.
SQL query: SELECT * FROM Supplier WHERE Address LIKE '%New York%';

User input: How many products are low on stock (below reorder level)?
SQL query: SELECT COUNT(*) FROM Product WHERE StockQuantity < ReorderLevel;

User input: List all orders made by customer named 'John Doe'.
SQL query: SELECT so.* FROM SalesOrder so JOIN Customer c ON so.CustomerID = c.CustomerID WHERE c.FirstName = 'John' AND c.LastName = 'Doe';

User input: Show inventory logs for product with ID 20.
SQL query: SELECT * FROM InventoryLog WHERE ProductID = 20;

User input: List all products currently in stock.
SQL query:
Image by the Author

Dynamic Few Shot examples:

Imagine you have a big box of tools, but you only want to pull out the ones most useful for the job at hand. The SemanticSimilarityExampleSelector does something similar. It looks at your question and picks out the most similar examples from a collection to help the model understand what you’re asking.

#pip install faiss-cpu
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    FAISS,
    k=5,
    input_keys=["input"],
)
example_selector.select_examples({"input": "List all customers?"})

Output:

[{'input': 'List all customers.', 'query': 'SELECT * FROM Customer;'},
{'input': "List all orders made by customer named 'John Doe'.",
'query': "SELECT so.* FROM SalesOrder so JOIN Customer c ON so.CustomerID = c.CustomerID WHERE c.FirstName = 'John' AND c.LastName = 'Doe';"},
{'input': 'List all products currently in stock.',
'query': 'SELECT * FROM Product WHERE StockQuantity > 0;'},
{'input': 'Find all orders placed by customer with ID 1.',
'query': 'SELECT * FROM SalesOrder WHERE CustomerID = 1;'},
{'input': "List all suppliers from 'New York'.",
'query': "SELECT * FROM Supplier WHERE Address LIKE '%New York%';"}]
prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="You are a MySQL expert. Given an input question, create a syntactically correct MySQL query to run. Unless otherwise specified, do not return more than {top_k} rows.\n\nHere is the relevant table info: {table_info}\n\nBelow are a number of examples of questions and their corresponding SQL queries.",
    suffix="User input: {input}\nSQL query: ",
    input_variables=["input", "top_k", "table_info"],
)

# Create the SQL query chain with the adjusted parameters
chain = create_sql_query_chain(llm, db, prompt)

# Invoke the chain with a relevant question for your database
chain.invoke({"question": "how many products are currently in stock?"})

Output:

'SELECT COUNT(*) FROM Product WHERE StockQuantity > 0;'
Image by the Author

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.

How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings

A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability

Exploring Chain-of-Thought Style Prompting for Text-to-SQL

Image by the Author
Image by Gemini

Key Points to Remember:

Understand the Schema: Familiarize yourself with the database structure, knowing the tables, their relationships, and the data types of each column.

Clear User Prompts: Encourage users to provide clear and specific details in their queries. For instance, instead of asking “Give me sales numbers,” they should ask “What are the total sales for product X in the last quarter?”

Crafting System Prompts: Design system prompts that instruct the LLM on how to interpret user questions and structure SQL queries. This can include specifying the SQL dialect, the format of the output, and any constraints like date ranges or specific fields.

Handling Multiple Tables: When dealing with complex schemas, include instructions on joining tables and managing relationships between them. For example, if a user wants to know about sales, you might need to join Sales, Customers, and Products tables.

Incorporate Examples: Include a few examples of natural language queries transformed into SQL queries. This provides the LLM with a pattern to follow.

Test and Iterate: Test your prompts with a variety of queries to ensure they produce the correct SQL statements. Be prepared to refine your prompts based on these tests.

User Prompt: "How much did each product contribute to sales last year?"

System Prompt: "To answer, join the Product and SalesOrder tables, filter
orders from last year, and sum the sales for each product."
Example SQL: "SELECT Product.ProductName, SUM(SalesOrder.TotalSales)
FROM Product JOIN SalesOrder ON Product.ProductID = SalesOrder.ProductID
WHERE YEAR(SalesOrder.OrderDate) = YEAR(CURDATE()) - 1 GROUP BY Product.
ProductName;"
User Prompt: "List the top 5 customers by sales volume."

System Prompt: "Identify customers with the highest sales volume by
joining Customer and SalesOrder tables, grouping by customer, and
ordering by the sum of sales."
Example SQL: "SELECT Customer.FirstName, Customer.LastName,
SUM(SalesOrder.TotalSales) AS TotalSales FROM Customer JOIN SalesOrder
ON Customer.CustomerID = SalesOrder.CustomerID GROUP BY Customer.CustomerID
ORDER BY TotalSales DESC LIMIT 5;"
User Prompt: "What's the average sale per product category last month?"

System Prompt: "Calculate the average sale by joining Product, SalesOrder,
and possibly Category tables, filter by last month's date, and group by
category."
Example SQL: "Assuming you have a Category table, the query might look
like 'SELECT Category.Name, AVG(SalesOrder.TotalSales) FROM SalesOrder
JOIN Product ON SalesOrder.ProductID = Product.ProductID JOIN Category ON
Product.CategoryID = Category.CategoryID WHERE MONTH(SalesOrder.OrderDate) =
MONTH(CURRENT_DATE - INTERVAL 1 MONTH) GROUP BY Category.Name;'"
Image by the Author

How to Generate Dynamic Prompts:

1. Analyze the User Prompt: Determine the intent and required data from the user’s question.
2. Select Relevant Tables: Based on the intent, identify which tables and fields from the database are needed.
3. Generate System Prompt: Use templates or predefined patterns to create a system prompt that guides the LLM towards generating the correct SQL query. This can involve specifying the task, the tables involved, and any particular SQL syntax or functions that should be used.

# Define a mapping between user intents and database schema elements
intent_to_tables = {
    "total sales last year": {
        "tables": ["SalesOrder", "LineItem", "Product"],
        "fields": ["ProductName", "SUM(TotalPrice)"],
        "conditions": ["YEAR(OrderDate) = YEAR(CURDATE()) - 1"]
    }
    # Add more intents and corresponding schema elements here
}

def generate_system_prompt(user_prompt):
    # Analyze user prompt to determine intent
    # For simplicity, assume intent is directly provided
    intent = user_prompt  # In practice, use NLP techniques to determine intent

    # Retrieve relevant tables, fields, and conditions based on intent
    schema_info = intent_to_tables.get(intent, {})

    # Generate system prompt
    system_prompt = f"Generate a SQL query to calculate {intent}. "
    system_prompt += f"Use tables: {', '.join(schema_info.get('tables', []))}. "
    system_prompt += f"Select fields: {', '.join(schema_info.get('fields', []))}. "
    system_prompt += f"Under conditions: {', '.join(schema_info.get('conditions', []))}."

    return system_prompt

# Example usage
user_prompt = "total sales last year"
system_prompt = generate_system_prompt(user_prompt)
print(system_prompt)
Generate a SQL query to calculate total sales last year. Use tables: 
SalesOrder, LineItem, Product. Select fields: ProductName, SUM(TotalPrice).
Under conditions: YEAR(OrderDate) = YEAR(CURDATE()) - 1.

A Comprehensive System Prompt:

Image by the Author
Database Schema Overview: Details the structure of the SalesOrder database schema, listing tables and their purposes.
SQL Query Generation Guidelines: Offers principles for creating efficient and accurate SQL queries, covering JOINs, WHERE clauses, and aggregate functions.
Handling Specific Cases: Instructions for managing NULL values and ensuring date ranges are inclusive.
Sample Records for Each Table: Provides examples from the Supplier, Product, Employee, Customer, InventoryLog, LineItem, and SalesOrder tables to illustrate the type of data stored.
Adaptation Instructions: Encourages tailoring queries to the specific analytical needs, with examples of common queries and their rationales.
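As a rough sketch of how these pieces can be stitched together programmatically, the helper below simply concatenates the five components described above into one system prompt; the function and its argument names are hypothetical.

def build_system_prompt(schema_overview, guidelines, special_cases, sample_records, adaptation_notes):
    # Hypothetical helper: joins the five components described above
    # into a single system prompt string for the LLM
    sections = [
        "Database Schema Overview:\n" + schema_overview,
        "SQL Query Generation Guidelines:\n" + guidelines,
        "Handling Specific Cases:\n" + special_cases,
        "Sample Records for Each Table:\n" + sample_records,
        "Adaptation Instructions:\n" + adaptation_notes,
    ]
    return "\n\n".join(sections)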
Image by the Author

Part 4: SQL Query Validation

Query Validation:

Ask the LLM to validate the generated query by adding the validation rules in the system prompt.

Double check the user's {dialect} query for common mistakes, including:
- Using NOT IN with NULL values
- Using UNION when UNION ALL should have been used
- Using BETWEEN for exclusive ranges
- Data type mismatch in predicates
- Properly quoting identifiers
- Using the correct number of arguments for functions
- Casting to the correct data type
- Using the proper columns for joins

If there are any of the above mistakes, rewrite the query. If there are no mistakes, just reproduce the original query.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

system = """Given the database schema below, generate a MySQL query based on the user's question. Remember to calculate totals from line items if needed and ensure all date ranges are inclusive unless stated otherwise. Correctly aggregate data for summarization, such as averaging order totals.
- Customer (CustomerID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), BillingAddress TEXT, ShippingAddress TEXT, CustomerSince DATE, IsActive TINYINT)
- Employee (EmployeeID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), HireDate DATE, Position VARCHAR(100), Salary DECIMAL)
- InventoryLog (LogID INT, ProductID INT, ChangeDate DATE, QuantityChange INT, Notes TEXT)
- LineItem (LineItemID INT, SalesOrderID INT, ProductID INT, Quantity INT, UnitPrice DECIMAL, Discount DECIMAL, TotalPrice DECIMAL)
- Product (ProductID INT, ProductName VARCHAR(255), Description TEXT, UnitPrice DECIMAL, StockQuantity INT, ReorderLevel INT, Discontinued TINYINT)
- SalesOrder (SalesOrderID INT, CustomerID INT, OrderDate DATE, RequiredDate DATE, ShippedDate DATE, Status VARCHAR(50), Comments TEXT, PaymentMethod VARCHAR(50), IsPaid TINYINT)
- Supplier (SupplierID INT, CompanyName VARCHAR(255), ContactName VARCHAR(100), ContactTitle VARCHAR(50), Address TEXT, Phone VARCHAR(20), Email VARCHAR(255))

Example task: Calculate the average total price of orders from customers without a registered phone number, for orders placed within a specific period.
Example query: "SELECT AVG(sum_li.TotalPrice) FROM SalesOrder JOIN (SELECT SalesOrderID, SUM(TotalPrice) AS TotalPrice FROM LineItem GROUP BY SalesOrderID) sum_li ON SalesOrder.SalesOrderID = sum_li.SalesOrderID JOIN Customer ON SalesOrder.CustomerID = Customer.CustomerID WHERE Customer.Phone IS NULL AND SalesOrder.OrderDate BETWEEN '2003-01-01' AND '2009-12-31';"
Double check the user's {dialect} query for common mistakes, including:
- Using NOT IN with NULL values
- Using UNION when UNION ALL should have been used
- Using BETWEEN for exclusive ranges
- Data type mismatch in predicates
- Properly quoting identifiers
- Using the correct number of arguments for functions
- Casting to the correct data type
- Using the proper columns for joins

If there are any of the above mistakes, rewrite the query. If there are no mistakes, just reproduce the original query.

Output the final SQL query only. Do not include any other text; output only the SQL query."""
prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("human", "{query}")]
).partial(dialect=db.dialect)
validation_chain = prompt | llm | StrOutputParser()

full_chain = {"query": chain} | validation_chain
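
A minimal usage sketch of the combined chain, following the same LangChain pattern as above (the question text is just an example):

# Generate a query with the SQL chain, then pass it through the validation prompt
validated_query = full_chain.invoke(
    {"question": "What is the average order total for customers without a phone number?"}
)
print(validated_query)

# Optionally execute the validated query against the database
print(db.run(validated_query))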
Image by the Author

Part 5: Non-Descriptive or Language-Specific Table and Field Names in Databases

For databases with non-descriptive or language-specific table and field names, creating a comprehensive mapping or dictionary that translates these technical identifiers into more understandable terms can be very helpful. This mapping can serve as a bridge between the natural language queries of the users and the actual database schema, enabling the LLM to generate accurate SQL queries.

Create a Translation Mapping: Develop a comprehensive mapping that translates technical identifiers (table names, field names) into more understandable or English terms. This helps bridge the gap between natural language queries and the actual database schema.

Incorporate Mapping into LLM Input: Integrate this mapping directly into the system prompt or use it to preprocess user queries, making it easier for the LLM to understand and generate the correct SQL queries.

Fine-tuning LLM with Custom Data: Consider fine-tuning the LLM on examples that include both the technical identifiers and their translations, to improve its ability to handle such cases directly.

Use Annotated Examples: In prompts or as part of the training data, include examples where technical identifiers are used in natural language questions and their corresponding SQL queries, providing context for the LLM.

# Example mapping of technical identifiers to understandable terms
mapping = {
    "kunnr": "customer",
    "lifnr": "vendor",
    # Add more mappings as needed
}

def translate_query(user_query, mapping):
    # Replace the understandable terms in the user's question with the
    # technical identifiers the database actually uses
    for technical_term, common_term in mapping.items():
        user_query = user_query.replace(common_term, technical_term)
    return user_query

# Example usage
user_query = "Show sales for vendor last year"
translated_query = translate_query(user_query, mapping)
# Now translated_query can be passed to the LLM for SQL generation
Image by the Author

Part 6: Large-Scale Enterprise Challenges

Image by the Author

Large-scale enterprise databases face several challenges:

Complex Schema: Enterprises often have complex database schemas with hundreds of tables and relationships, making it difficult for LLMs to generate accurate SQL queries without extensive domain knowledge.

Non-Standard Naming Conventions: As mentioned, databases might use non-intuitive naming conventions or different languages, requiring mappings or additional context for effective query generation.

Security and Privacy: Directly executing generated SQL queries on real databases can pose security risks, including potential data exposure or injection attacks (a minimal guardrail sketch follows below).

Performance Concerns: Generating and executing SQL queries on large databases can be resource-intensive, affecting database performance.

Data Freshness: Enterprises require real-time data, but LLMs might generate queries that don’t account for the most recent data updates, leading to outdated insights.
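
On the security point above, one minimal (assumed) guardrail is to refuse anything other than a single read-only SELECT statement before an LLM-generated query ever reaches the database:

def is_safe_select(sql_query: str) -> bool:
    # Minimal guardrail sketch: allow only a single read-only SELECT statement
    stripped = sql_query.strip().rstrip(";")
    lowered = stripped.lower()
    forbidden = ("insert", "update", "delete", "drop", "alter", "truncate", "grant")
    return (
        lowered.startswith("select")
        and ";" not in stripped  # reject stacked statements
        and not any(word in lowered for word in forbidden)
    )

# Example: only run an LLM-generated query if it passes the check
generated_sql = "SELECT ProductName FROM Product WHERE StockQuantity < ReorderLevel"
if is_safe_select(generated_sql):
    print(db.run(generated_sql))  # 'db' is the SQLDatabase object used earlier in this article
else:
    print("Query rejected by the read-only guardrail.")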

Image by the Author

Part 7: Text 2 Visualization

Image by Gemini

Designing a text-to-visualization tool with an LLM involves:

Understanding User Queries: Parse natural language inputs to discern the user’s intent and the data they want to visualize.
Mapping to Database Queries: Convert the parsed intent into SQL queries to fetch the relevant data from a database.
Choosing Visualization Types: Based on the query results and user intent, select appropriate visualization types (bar charts, line graphs, pie charts).
Generating Visualizations: Use data visualization libraries (e.g., Matplotlib, Plotly) in Python to create the visual representation of the queried data.
Integrating with LLMs: Leverage LLMs to refine query understanding, suggest visualization types, and improve user interaction through feedback loops.
User Interface Design: Create a user-friendly interface that allows for easy input of text queries and displays visualizations effectively.

Pseudo Code:

from your_llm_library import LLM  # Hypothetical LLM library
import matplotlib.pyplot as plt
import pandas as pd
import sql_connector  # Hypothetical module to connect and execute SQL queries

# Initialize your LLM
llm = LLM(api_key="your_api_key")

def query_database(sql_query):
    # Connect to your database and execute the SQL query
    # Return the results as a DataFrame
    connection = sql_connector.connect(host="your_host", database="your_db", user="your_user", password="your_password")
    return pd.read_sql(sql_query, connection)

def generate_visualization(data, visualization_type):
    # Generate visualization based on the type and data
    if visualization_type == "bar":
        data.plot(kind="bar")
        plt.show()
    # Add more visualization types as needed

def text_to_sql(text_input):
    # Use LLM to convert text input to SQL query
    sql_query = llm.generate_sql_query(text_input)
    return sql_query

def text_to_visualization(text_input):
    # Convert text input to SQL query
    sql_query = text_to_sql(text_input)

    # Query database
    data = query_database(sql_query)

    # Determine the type of visualization based on the data or user input
    visualization_type = "bar"  # This could be dynamically determined

    # Generate visualization
    generate_visualization(data, visualization_type)

# Example usage
text_input = "Show total sales per product last year"
text_to_visualization(text_input)

This pseudocode outlines the steps to take a user’s natural language input, convert it to an SQL query, fetch data from a database, and generate a visualization based on the fetched data. The actual implementation would depend on the LLM library you’re using, your database setup, and the Python data visualization libraries you prefer.

This is proof-of-concept code that you can keep improving on.

from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import pandas as pd

# Adjust the connection URI for MySQL (replace the user id and password with your own)
db = SQLDatabase.from_uri("mysql+mysqlconnector://your_user_id:your_password@localhost/SalesOrderSchema")

# Assuming 'db' is your SQLDatabase instance from langchain_community.utilities
# and execute_query is a function that executes SQL and returns a pandas DataFrame

# Initialize the LLM with GPT-4
llm = ChatOpenAI(model="gpt-4", temperature=0)

# Create a SQL query chain with the LLM and your database configuration
chain = create_sql_query_chain(llm, db)

# Define a system message for generating SQL queries based on the schema
system = """
Given the database schema below, generate a MySQL query based on the user's question. Remember to calculate totals from line items if needed and ensure all date ranges are inclusive unless stated otherwise. Correctly aggregate data for summarization, such as averaging order totals.
- Customer (CustomerID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), BillingAddress TEXT, ShippingAddress TEXT, CustomerSince DATE, IsActive TINYINT)
- Employee (EmployeeID INT, FirstName VARCHAR(100), LastName VARCHAR(100), Email VARCHAR(255), Phone VARCHAR(20), HireDate DATE, Position VARCHAR(100), Salary DECIMAL)
- InventoryLog (LogID INT, ProductID INT, ChangeDate DATE, QuantityChange INT, Notes TEXT)
- LineItem (LineItemID INT, SalesOrderID INT, ProductID INT, Quantity INT, UnitPrice DECIMAL, Discount DECIMAL, TotalPrice DECIMAL)
- Product (ProductID INT, ProductName VARCHAR(255), Description TEXT, UnitPrice DECIMAL, StockQuantity INT, ReorderLevel INT, Discontinued TINYINT)
- SalesOrder (SalesOrderID INT, CustomerID INT, OrderDate DATE, RequiredDate DATE, ShippedDate DATE, Status VARCHAR(50), Comments TEXT, PaymentMethod VARCHAR(50), IsPaid TINYINT)
- Supplier (SupplierID INT, CompanyName VARCHAR(255), ContactName VARCHAR(100), ContactTitle VARCHAR(50), Address TEXT, Phone VARCHAR(20), Email VARCHAR(255))
"""

prompt = ChatPromptTemplate.from_messages(
[("system", system)]
).partial(dialect=db.dialect)

# Assume 'execute_query' is a function to execute the generated SQL query and return a DataFrame
def execute_query(sql_query):
    # Implementation of executing the SQL query and returning a pandas DataFrame
    # (left as a stub here; db.run() is used below instead)
    pass

# Use the chain to generate a SQL query based on a natural language input
query_input = "Show me the total sales for each product last year"
query_response = chain.invoke({"question": query_input})

# Execute the generated SQL query
#df = execute_query(query_response)
query_results = db.run(query_response)



import pandas as pd
from decimal import Decimal

# Dummy data for illustration; replace this with your actual query_results
#query_results_str = "[('Reflect Sea', Decimal('25.31')), ('Avoid American', Decimal('514.63'))]"
query_results_str = query_results
query_results_str = query_results_str.replace("Decimal('", "").replace("')", "")

# Try to safely evaluate the string to a list of tuples
try:
    query_results_evaluated = eval(query_results_str, {'Decimal': Decimal})
except Exception as e:
    print(f"Error during evaluation: {e}")
    query_results_evaluated = []

# Convert Decimal to float and prepare for DataFrame
query_results_converted = [(name, float(total)) if isinstance(total, Decimal) else (name, total) for name, total in query_results_evaluated]

# Create DataFrame
df = pd.DataFrame(query_results_converted, columns=['ProductName', 'TotalSales'])

df_json = df.to_json(orient='split', index=False)

# Continue with your workflow...
# Prepare the prompt template for generating Matplotlib visualization code
prompt_template = ChatPromptTemplate.from_template(
"Given the following data, generate Python code using Matplotlib to create a suitable visualization that best represents the data's insights. Decide on the type of chart (e.g., bar chart, pie chart, line graph) that would best display the data. Provide a brief explanation for why you chose this type of visualization. Here is the data: {data}"
)


# Initialize the model with GPT-4
model = ChatOpenAI(model="gpt-4")

# Initialize the output parser to extract string responses
output_parser = StrOutputParser()

# Create the chain: prompt + model + output parser
visualization_chain = prompt_template | model | output_parser

# Invoke the chain with the DataFrame JSON as input
visualization_code = visualization_chain.invoke({"data": df_json})

# Print the generated code for review
#print(visualization_code)


# Extract the Python code between the ```python fence and the closing ``` fence
mixed_text = visualization_code
code_start = mixed_text.find("python") + len("python")
code_end = mixed_text.find("```", code_start)
generated_code = mixed_text[code_start:code_end].strip()

#print(generated_code)
# Caution: exec runs arbitrary generated code; review it before executing in production
exec(generated_code)

Output — if you enable all the print statements, you can see the output below.

SELECT `Product`.`ProductName`, SUM(`LineItem`.`TotalPrice`) AS `TotalSales`
FROM `Product`
JOIN `LineItem` ON `Product`.`ProductID` = `LineItem`.`ProductID`
JOIN `SalesOrder` ON `LineItem`.`SalesOrderID` = `SalesOrder`.`SalesOrderID`
WHERE YEAR(`SalesOrder`.`OrderDate`) = YEAR(CURDATE()) - 1
GROUP BY `Product`.`ProductName`
LIMIT 5;
[('Reflect Sea', Decimal('25.31')), ('Avoid American', Decimal('514.63')), ('Certain Identify', Decimal('1260.98')), ('Impact Agreement', Decimal('518.32')), ('Million Agreement', Decimal('250.02'))]
The data provided can be represented using a bar chart. A bar chart is suitable because we are comparing the total sales of different product names. Here is a Python code that uses Matplotlib to visualize this data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# data
data_dict = {"columns":["ProductName","TotalSales"],"data":[["Reflect Sea",25.31],["Avoid American",514.63],["Certain Identify",1260.98],["Impact Agreement",518.32],["Million Agreement",250.02]]}

# convert the data to pandas DataFrame
df = pd.DataFrame(data_dict['data'], columns=data_dict['columns'])

# create bar chart
plt.figure(figsize=(10,6))
plt.barh(df['ProductName'], df['TotalSales'], color='blue')
plt.xlabel('Total Sales')
plt.ylabel('Product Name')
plt.title('Total Sales by Product Name')
plt.show()
```

This code first converts the data into a pandas DataFrame, which makes it easier to manipulate and plot. It then creates a horizontal bar chart with the 'ProductName' column on the y-axis and the 'TotalSales' column on the x-axis. The size of the figure is set to 10x6 for better clarity, and labels and a title are added for context. The color of the bars is set to blue, but you can change this to any color you prefer.

Code Overview:

Initialize the LLM: Using GPT-4, you set up an instance ready to process inputs.
Create a SQL Query Chain: Combining the LLM with your database, this chain is designed to translate natural language questions into SQL queries.
Define a System Message: This provides context about your database schema to help generate accurate SQL queries.
Generate SQL Query: Converts a user’s natural language input (“Show me the total sales for each product last year”) into an executable SQL query.
Execute SQL Query: Runs the generated query against the database to fetch the required data.
Prepare Data for Visualization: Converts the query results into a format suitable for visualization.
Generate Visualization Code: Crafts Python code for creating a visualization (e.g., using Matplotlib) that represents the data insights.
Execute Visualization Code: Runs the generated Python code to produce the visualization.

Image by the Author

Part:8-Using Ollama for Text 2 SQL:

You can also use Ollama to download models to your own machine and try them out. Install Ollama on your machine before running the commands below.

I tried the Google Gemma model.

!ollama pull gemma
!ollama run gemma

System prompt:

v_sys = """
You are a MySQL expert and you are going to generate MySQL queries based on the user's question.
Given the database schema below, generate a MySQL query based on the user's question. Ensure to consider totals from line items, inclusive date ranges, and correct data aggregation for summarization. Remember to handle joins, groupings, and orderings effectively.

Database schema:
- Customer (CustomerID, FirstName, LastName, Email, Phone, BillingAddress, ShippingAddress, CustomerSince, IsActive)
- Employee (EmployeeID, FirstName, LastName, Email, Phone, HireDate, Position, Salary)
- InventoryLog (LogID, ProductID, ChangeDate, QuantityChange, Notes)
- LineItem (LineItemID, SalesOrderID, ProductID, Quantity, UnitPrice, Discount, TotalPrice)
- Product (ProductID, ProductName, Description, UnitPrice, StockQuantity, ReorderLevel, Discontinued)
- SalesOrder (SalesOrderID, CustomerID, OrderDate, RequiredDate, ShippedDate, Status, Comments, PaymentMethod, IsPaid)
- Supplier (SupplierID, CompanyName, ContactName, ContactTitle, Address, Phone, Email)

Guidelines for SQL query generation:
1. **Ensure Efficiency and Performance**: Opt for JOINs over subqueries where possible, use indexes effectively, and mention any specific performance considerations to keep in mind.
2. **Adapt to Specific Analytical Needs**: Tailor WHERE clauses, JOIN operations, and aggregate functions to precisely meet the analytical question being asked.
3. **Complexity and Variations**: Include a range from simple to complex queries, illustrating different SQL functionalities such as aggregate functions, string manipulation, and conditional logic.
4. **Handling Specific Cases**: Provide clear instructions on managing NULL values, ensuring date ranges are inclusive, and handling special data integrity issues or edge cases.
5. **Explanation and Rationale**: After each generated query, briefly
explain why this query structure was chosen and how it addresses the
analytical need, enhancing understanding and ensuring alignment with
requirements. """

import ollama

r = ollama.generate(
model='gemma',
system= v_sys,
prompt="""List Top 10 Employees by Salary?"""
)

print(r['response'])
## List Top 10 Employees by Salary

```sql
SELECT * FROM Employee ORDER BY Salary DESC LIMIT 10;
```

**Rationale:**

This query selects all employees and sorts them in descending order based on their salaries. The `ORDER BY` clause specifies the sorting criteria, and the `LIMIT 10` clause restricts the results to the top 10 employees.

**Notes:**

* This query assumes that the `Salary` column in the `Employee` table contains
numeric values representing salaries.
* The query does not filter employees based on any specific criteria. To
restrict the results to a specific group of employees, you can add additional conditions to the `WHERE` clause.

**Additional Considerations:**

* Index on the `Salary` column to improve query performance.
* Handle NULL values in the `Salary` column appropriately.
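Gemma only returns the SQL text; to actually execute it, you can pull the statement out of the markdown fence and run it against the same SalesOrderSchema database. Below is a small helper, purely my own sketch, which assumes the `db` SQLDatabase instance created in Part 7 is still available:

```python
import re

def run_generated_sql(ollama_response: str):
    # Pull the SQL out of a ```sql fenced block if present; otherwise use the raw text
    match = re.search(r"```sql\s*(.*?)```", ollama_response, re.DOTALL)
    sql = match.group(1).strip() if match else ollama_response.strip()
    print("Running:", sql)
    # Execute against the SalesOrderSchema database via the langchain SQLDatabase wrapper
    return db.run(sql)

# Example: execute the query Gemma produced above
# print(run_generated_sql(r['response']))
```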

They also have a couple of models fine-tuned for text-to-SQL; a quick way to try one is shown below.
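For example, the Ollama model library lists SQL-focused models such as Defog's SQLCoder (the exact model names available may change over time); the pull/run pattern is the same as for Gemma:

!ollama pull sqlcoder
!ollama run sqlcoder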

Image Credit Ollama Repository
Image by the Author

Part:9-Text 2 SQL Evaluation:

Image by Gemini
  1. BIRD: A Big Bench for Large-Scale Database Grounded Text-to-SQLs
Image credit Bird Benchmark

2. Spider 1.0 — Yale Semantic Parsing and Text-to-SQL Challenge

3. WikiSQL

4. Defog- SQL Evaluation

How they evaluate:

Our testing procedure comprises the following steps. For each question/query pair:

  1. We generate a SQL query (possibly from an LLM).
  2. We run both the "gold" query and the generated query on their respective database to obtain 2 dataframes with the results.
  3. We compare the 2 dataframes using an "exact" and a "subset" match.
  4. We log these alongside other metrics of interest (e.g. tokens used, latency) and aggregate the results for reporting.

You can check out their repository for the full details; a minimal sketch of the comparison step follows.
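To make the comparison step concrete, here is a minimal sketch (my own simplification, not Defog's actual implementation) of what an "exact" and a "subset" DataFrame match could look like:

```python
import pandas as pd

def _normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Sort rows and reset the index so that row order does not affect the comparison
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

def exact_match(df_gold: pd.DataFrame, df_gen: pd.DataFrame) -> bool:
    # Exact match: same shape and same cell values, ignoring row order and column names
    if df_gold.shape != df_gen.shape:
        return False
    return bool((_normalize(df_gold).values == _normalize(df_gen).values).all())

def subset_match(df_gold: pd.DataFrame, df_gen: pd.DataFrame) -> bool:
    # Subset match: every gold column's values appear as some column of the generated
    # result (the generated query is allowed to return extra columns)
    gen_cols = [sorted(df_gen[c].tolist()) for c in df_gen.columns]
    return all(sorted(df_gold[c].tolist()) in gen_cols for c in df_gold.columns)

# Tiny usage example with made-up data
gold = pd.DataFrame({"name": ["A", "B"], "total": [10.0, 20.0]})
gen = pd.DataFrame({"product": ["B", "A"], "sales": [20.0, 10.0], "extra": [1, 2]})
print(exact_match(gold, gen), subset_match(gold, gen))  # False True
```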

Image by the Author
Image by the Author

Part: 11 — Text2SQL Using Llamaindex.

Let's see how we can do text2SQL using Llamaindex.

Image credit Llamaindex Documentation

Let's work with the same database:

  • MySQL database.
  • Sales Order Schema
  • 7 tables created earlier.
  1. pip install all the required libraries.
!pip install mysql-connector-python SQLAlchemy pandas

%pip install llama-index-embeddings-openai

!pip install llama-index-llms-openai
!pip install pyvis networkx

2. Provide the OpenAI key:

import os

os.environ["OPENAI_API_KEY"] = "Your API Key"

3. Connect to MySQL database:

from sqlalchemy import create_engine

# SQLAlchemy connection string for MySQL
database_url = "mysql+mysqlconnector://"Your userID":"Password"@localhost/SalesOrderSchema"

# Create engine
mysql_engine = create_engine(database_url)

  • Importing create_engine from SQLAlchemy: This line imports the create_engine function from the SQLAlchemy library, which is used to create a connection engine to your database.
  • Defining the Database URL: database_url is a string that specifies the connection details for your MySQL database. It includes the database adapter (mysql+mysqlconnector), your username, your password, the host (localhost), and the database name (SalesOrderSchema). The format of this URL is: dialect+driver://username:password@host/database.
  • Creating the Engine: mysql_engine = create_engine(database_url) uses the create_engine function to create an SQLAlchemy engine. This engine is an object that manages connections to the database, using the connection details provided in the database_url. The engine is responsible for connecting to the database whenever a connection is required but does not establish a connection at the time of its creation.
  • Purpose and Usage: The engine created by create_engine is a core component of SQLAlchemy's SQL Expression Language and ORM (Object-Relational Mapping) capabilities. It can be used to execute raw SQL queries, interact with the database using an ORM, and more. This setup allows you to use SQLAlchemy's powerful and flexible tools for database operations, abstracting away many of the complexities of direct database access (a quick connectivity check is shown below).
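As a quick (optional) sanity check that the engine can actually reach the database, you can open a connection and run a trivial query; the Customer table comes from the SalesOrderSchema created earlier:

```python
from sqlalchemy import text

# Open a connection, count rows in Customer, and close the connection automatically
with mysql_engine.connect() as conn:
    row_count = conn.execute(text("SELECT COUNT(*) FROM Customer")).scalar()
    print(f"Customer rows: {row_count}")
```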

4. Table information:

table_infos = [
{"table_name": "Customer", "table_summary": "Holds customer information including contact details and addresses."},
{"table_name": "Employee", "table_summary": "Contains employee records, their positions, and salary information."},
{"table_name": "InventoryLog", "table_summary": "Tracks inventory changes, including product quantities and relevant notes."},
{"table_name": "LineItem", "table_summary": "Details each item within a sales order, including pricing and quantities."},
{"table_name": "Product", "table_summary": "Lists available products, their descriptions, prices, and stock levels."},
{"table_name": "SalesOrder", "table_summary": "Records customer orders, including order dates, shipping information, and payment status."},
{"table_name": "Supplier", "table_summary": "Stores information about suppliers, including company and contact details."}
]

  • List of Dictionaries: table_infos is a list, a collection that is ordered and changeable. Each item in the list is a dictionary representing a table in your database.
  • Dictionary Structure: Each dictionary in the list has two keys: table_name and table_summary. table_name is a string that holds the name of the table in the database; this is how the table is identified in SQL queries. table_summary is a string that provides a brief description of what the table contains or represents. This summary gives context about the table's role within the database, which can be helpful for documentation, code readability, or providing insights to a system that might use these descriptions to understand the database schema better.
  • Purpose: This structure is designed to provide metadata about the tables in your database in a structured format. It can be used for documentation purposes, to aid in generating dynamic queries, or to interface with systems that require a description of the database schema.
  • Example Use Case: If you're using a system that converts natural language queries into SQL queries (like the text-to-SQL functionalities discussed earlier), table_infos can provide the system with necessary context about each table. For example, knowing that the Customer table holds customer information, including contact details and addresses, might help the system generate more accurate queries when a user asks for customer contact information.
  • Extensibility: This approach is easily extendable. If your database schema changes (e.g., adding a new table or changing the purpose of an existing table), you can simply update the table_infos list to reflect those changes.

5. Object index + retriever to store table schemas:

from llama_index.core.objects import (
SQLTableNodeMapping,
ObjectIndex,
SQLTableSchema,
)
from llama_index.core import SQLDatabase, VectorStoreIndex

sql_database = SQLDatabase(engine=mysql_engine)

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
SQLTableSchema(table_name=t['table_name'], context_str=t['table_summary'])
for t in table_infos
]
obj_index = ObjectIndex.from_objects(
table_schema_objs,
table_node_mapping,
VectorStoreIndex,
)
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

  • Import Statements: The code begins by importing various classes from the llama_index.core package. These classes include SQLTableNodeMapping, ObjectIndex, SQLTableSchema, SQLDatabase, and VectorStoreIndex. Each of these plays a role in creating an indexed representation of your database schema for use within the LlamaIndex framework.
  • Creating an SQLDatabase Instance: sql_database = SQLDatabase(engine=mysql_engine) creates an instance of SQLDatabase using the mysql_engine you previously defined. This SQLDatabase object provides an interface for interacting with your MySQL database within the LlamaIndex system, abstracting away direct SQL operations.
  • Initializing a Table Node Mapping: table_node_mapping = SQLTableNodeMapping(sql_database) initializes an object responsible for mapping between the logical structure of your database tables (as nodes) and the physical database represented by sql_database. This mapping is crucial for operations that require understanding the database schema, such as generating SQL queries from natural language inputs.
  • Creating Table Schema Objects: The list comprehension [SQLTableSchema(table_name=t['table_name'], context_str=t['table_summary']) for t in table_infos] iterates over the table_infos list you defined. For each entry, it creates an instance of SQLTableSchema, which represents the schema of a table in your database, including its name and a contextual summary. This forms the basis for understanding and interacting with each table within the LlamaIndex system.
  • Creating an Object Index: obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex) creates an indexed representation of your database schema. It uses the table schema objects (table_schema_objs), the table-node mapping (table_node_mapping), and a vector store index (VectorStoreIndex) to facilitate efficient retrieval and querying of schema information based on similarity or relevance to natural language queries.
  • Initializing a Retriever: obj_retriever = obj_index.as_retriever(similarity_top_k=3) initializes a retriever object from the object index. This retriever is configured to fetch the top-k (3 in this case) most relevant table schema objects based on a given query. This functionality is used to dynamically identify which tables should be considered when converting natural language queries to SQL, based on the similarity of the query to the contextual summaries of the tables.

6. SQLDatabase object to connect to the above tables + SQLRetriever:

from llama_index.core.retrievers import SQLRetriever
from typing import List
from llama_index.core.query_pipeline import FnComponent

sql_retriever = SQLRetriever(sql_database)


def get_table_context_str(table_schema_objs: List[SQLTableSchema]):
    """Get table context string."""
    context_strs = []
    for table_schema_obj in table_schema_objs:
        table_info = sql_database.get_single_table_info(
            table_schema_obj.table_name
        )
        if table_schema_obj.context_str:
            table_opt_context = " The table description is: "
            table_opt_context += table_schema_obj.context_str
            table_info += table_opt_context

        context_strs.append(table_info)
    return "\n\n".join(context_strs)


table_parser_component = FnComponent(fn=get_table_context_str)

  • SQLRetriever Initialization: sql_retriever = SQLRetriever(sql_database) creates an instance of SQLRetriever using the previously initialized sql_database object, so sql_retriever can execute SQL against the database and return the results, leveraging the schema information encapsulated by sql_database.
  • Defining a Function to Get Table Context Strings: get_table_context_str is a function that accepts a list of SQLTableSchema objects (table_schema_objs) and generates a context string for each table. This context string provides additional information about each table, which is useful when generating or understanding SQL queries derived from natural language inputs. Inside the function, it iterates over table_schema_objs, fetching additional table information from sql_database for each table schema object (table_schema_obj). If the table_schema_obj includes a context_str (a summary or description of the table), it appends this to the table information, enhancing the detail provided about each table. The resulting strings are collected into the context_strs list, which is then joined into a single string with double newlines (\n\n) as separators, making the final output easier to read or process further.
  • Creating a Table Parser Component: table_parser_component = FnComponent(fn=get_table_context_str) wraps the get_table_context_str function in an FnComponent, making it a component that can be integrated into a query processing pipeline. In the context of LlamaIndex, this component dynamically generates context-rich descriptions of the tables relevant to a given natural language query, supporting the conversion of that query into an accurate SQL statement (a quick preview call is shown below).
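As a quick illustration, you can call the wrapped function directly on a couple of the schema objects to preview the context string the component will feed into the prompt:

```python
# Preview the table context string for the first two schema objects (illustrative)
print(get_table_context_str(table_schema_objs[:2]))
```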

7. Text-to-SQL Prompt:

from llama_index.core.prompts.default_prompts import DEFAULT_TEXT_TO_SQL_PROMPT
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import FnComponent
from llama_index.core.llms import ChatResponse


def parse_response_to_sql(response: ChatResponse) -> str:
    """Parse response to SQL."""
    response = response.message.content
    sql_query_start = response.find("SQLQuery:")
    if sql_query_start != -1:
        response = response[sql_query_start:]
        # TODO: move to removeprefix after Python 3.9+
        if response.startswith("SQLQuery:"):
            response = response[len("SQLQuery:") :]
    sql_result_start = response.find("SQLResult:")
    if sql_result_start != -1:
        response = response[:sql_result_start]
    return response.strip().strip("```").strip()


sql_parser_component = FnComponent(fn=parse_response_to_sql)

text2sql_prompt = DEFAULT_TEXT_TO_SQL_PROMPT.partial_format(
dialect=mysql_engine.dialect.name
)
print(text2sql_prompt.template)

  • Parsing the Model's Response: parse_response_to_sql is a function that takes a response from a chat or language model (encapsulated in a ChatResponse object) and extracts an SQL query from it. This is necessary because the model's response might include additional information or formatting beyond just the SQL query. It searches the response for specific markers (SQLQuery: and SQLResult:) to isolate the SQL query portion of the text. This parsing strategy implies that the language model's responses are expected to follow a structured format where the SQL query and its result are clearly delineated.
  • Creating a SQL Parser Component: sql_parser_component = FnComponent(fn=parse_response_to_sql) wraps the parse_response_to_sql function in an FnComponent, allowing it to be integrated into a query processing pipeline. This setup facilitates the automated extraction of SQL queries from model responses, which can then be executed against a database or further processed.
  • Setting Up a Text-to-SQL Prompt: text2sql_prompt = DEFAULT_TEXT_TO_SQL_PROMPT.partial_format(dialect=mysql_engine.dialect.name) initializes a prompt template that is used to generate SQL queries from natural language questions. The template is customized to the specific SQL dialect of the database (in this case, MySQL), which is determined by the dialect attribute of the SQLAlchemy engine (mysql_engine). This customization ensures that the prompts generated for the language model are appropriate for the SQL syntax supported by MySQL, increasing the likelihood that the resulting SQL queries will be syntactically correct and executable.
  • Printing the Prompt Template: print(text2sql_prompt.template) prints the formatted prompt template to the console. This could be useful for debugging or understanding what prompts are being sent to the language model.

Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table, only ask for a few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Pay attention to which column is in which table. Also, qualify column names with the table name when needed. You are required to use the following format, each taking one line:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use tables listed below.
{schema}

Question: {query_str}
SQLQuery:

8. Response synthesis Prompt:

from llama_index.core.program import LLMTextCompletionProgram
from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.llms.openai import OpenAI


response_synthesis_prompt_str = (
"Given an input question, synthesize a response from the query results.\n"
"Query: {query_str}\n"
"SQL: {sql_query}\n"
"SQL Response: {context_str}\n"
"Response: "
)
response_synthesis_prompt = PromptTemplate(
response_synthesis_prompt_str,
)

  • Defining the Response Synthesis Prompt Template: response_synthesis_prompt_str is a string that defines a template for generating prompts to send to a language model. The template includes placeholders for an input question ({query_str}), the SQL query generated based on that question ({sql_query}), and the SQL response or result of executing the query ({context_str}). The template is structured to guide the language model in synthesizing a human-readable response based on this information. The purpose of this prompt structure is to provide the language model with all the necessary context to generate an informative response: it frames the question, shows the SQL query that represents the question's intent, and provides the query's results, asking the model to synthesize these components into a coherent answer.
  • Creating a PromptTemplate Instance: response_synthesis_prompt is an instance of PromptTemplate initialized with the response_synthesis_prompt_str. This object encapsulates the prompt template, making it easy to generate prompts dynamically by filling in the placeholders with specific query strings, SQL queries, and query results.
  • The response_synthesis_prompt is used to prompt a language model (GPT-4, initialized in the next step) to produce answers that are based on actual data retrieved from the database but are formatted and phrased in a way that is understandable and useful to humans.

9. LLM:

llm = OpenAI(model="gpt-4")

10. Define Query Pipeline:

from llama_index.core.query_pipeline import (
QueryPipeline as QP,
Link,
InputComponent,
CustomQueryComponent,
)

qp = QP(
modules={
"input": InputComponent(),
"table_retriever": obj_retriever,
"table_output_parser": table_parser_component,
"text2sql_prompt": text2sql_prompt,
"text2sql_llm": llm,
"sql_output_parser": sql_parser_component,
"sql_retriever": sql_retriever,
"response_synthesis_prompt": response_synthesis_prompt,
"response_synthesis_llm": llm,
},
verbose=True,
)

Components of the Query Pipeline

  • QueryPipeline (QP): This is the main structure organizing the flow of data and operations in the system. It defines how an input query is processed through various stages or components to produce an output. The verbose=True parameter indicates that the pipeline should provide detailed logs or output about its processing stages, which can be helpful for debugging or understanding the pipeline's behavior.
  • Modules:
  • Each key-value pair in the modules dictionary represents a component of the pipeline:
  • "input": An InputComponent that likely handles the initial reception of queries.
  • "table_retriever": Uses obj_retriever (previously defined) to identify relevant database tables based on the query.
  • "table_output_parser": Processes the output from the table retriever, likely formatting or further refining it.
  • "text2sql_prompt": Manages the generation of prompts for converting text queries into SQL queries.
  • "text2sql_llm": Represents a language model (LLM) component tasked with processing the text-to-SQL prompts.
  • "sql_output_parser": Parses the output of SQL queries, extracting relevant information or preparing it for further processing.
  • "sql_retriever": Executes SQL queries against the database and retrieves the results.
  • "response_synthesis_prompt": Handles the creation of prompts for synthesizing human-readable responses from SQL query results.
  • "response_synthesis_llm": A language model component that processes prompts for response synthesis.

How the Pipeline Works

  1. Input Handling: The pipeline starts by receiving a natural language query through the input component.
  2. Table Retrieval: It then uses the table_retriever to determine which database tables are relevant to the query.
  3. Output Parsing: The table_output_parser might format or extract useful information from the table retrieval step to inform the next stages.
  4. Prompt Generation and LLM Processing:
  • The text2sql_prompt creates a structured prompt based on the input query and possibly table information, which is then processed by the text2sql_llm to generate an SQL query.
  • After executing the SQL query and retrieving results (sql_retriever), the response_synthesis_prompt component creates another prompt designed to synthesize a human-readable response. This prompt is processed by the response_synthesis_llm, which uses the SQL results and the original query context to generate a final answer.

  5. Linkages: The QueryPipeline needs defined linkages between these components; they are added next with qp.add_chain and qp.add_link. These linkages dictate the flow of data and control between components, ensuring that the output of one component serves as the input to the next according to the pipeline's logic.

qp.add_chain(["input", "table_retriever", "table_output_parser"])
qp.add_link("input", "text2sql_prompt", dest_key="query_str")
qp.add_link("table_output_parser", "text2sql_prompt", dest_key="schema")
qp.add_chain(
["text2sql_prompt", "text2sql_llm", "sql_output_parser", "sql_retriever"]
)
qp.add_link(
"sql_output_parser", "response_synthesis_prompt", dest_key="sql_query"
)
qp.add_link(
"sql_retriever", "response_synthesis_prompt", dest_key="context_str"
)
qp.add_link("input", "response_synthesis_prompt", dest_key="query_str")
qp.add_link("response_synthesis_prompt", "response_synthesis_llm")

11. Pipeline Visualization:

from pyvis.network import Network

net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(qp.dag)
net.show("text2sql_dag.html")
Image by the Author

12. Execute some Queries:

Query: 1

response = qp.run(
query="List Top 10 Employees by Salary?"
)
print(str(response))
> Running module input with input: 
query: List Top 10 Employees by Salary?

> Running module table_retriever with input:
input: List Top 10 Employees by Salary?

> Running module table_output_parser with input:
table_schema_objs: [SQLTableSchema(table_name='Employee', context_str='Contains employee records, their positions, and salary information.'), SQLTableSchema(table_name='Customer', context_str='Holds customer information...

> Running module text2sql_prompt with input:
query_str: List Top 10 Employees by Salary?
schema: Table 'Employee' has columns: EmployeeID (INTEGER), FirstName (VARCHAR(100)), LastName (VARCHAR(100)), Email (VARCHAR(255)), Phone (VARCHAR(20)), HireDate (DATE), Position (VARCHAR(100)), Salary (DECI...

> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct mysql query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return ...

> Running module sql_output_parser with input:
response: assistant: SELECT EmployeeID, FirstName, LastName, Salary FROM Employee ORDER BY Salary DESC LIMIT 10
SQLResult:
EmployeeID | FirstName | LastName | Salary
1 | John | Doe | 100000
...

> Running module sql_retriever with input:
input: SELECT EmployeeID, FirstName, LastName, Salary FROM Employee ORDER BY Salary DESC LIMIT 10

> Running module response_synthesis_prompt with input:
query_str: List Top 10 Employees by Salary?
sql_query: SELECT EmployeeID, FirstName, LastName, Salary FROM Employee ORDER BY Salary DESC LIMIT 10
context_str: [NodeWithScore(node=TextNode(id_='02c6e159-d328-4bab-8911-eef1daf06bb2', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[(959, 'Je...

> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: List Top 10 Employees by Salary?
SQL: SELECT EmployeeID, FirstName, LastName, Salary FROM Employee ORDER BY Salary DESC LI...

assistant: The top 10 employees by salary are:
1. Jessica Marsh with a salary of $99,846.00
2. Mary Clarke with a salary of $99,607.00
3. Shelby Cochran with a salary of $99,463.00
4. Christine Mason with a salary of $99,100.00
5. Henry Robinson with a salary of $99,090.00
6. Donald Morris with a salary of $99,086.00
7. Ruth White with a salary of $99,065.00
The remaining employees' details are not fully provided in the query response.

Query: 2

response = qp.run(
query= """Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify the customer with the longest average shipping delay?""")
print(str(response))

> Running module input with input:
query: Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify the customer with the longest average shipping delay?

> Running module table_retriever with input:
input: Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify the customer with the longest average shipping delay?

> Running module table_output_parser with input:
table_schema_objs: [SQLTableSchema(table_name='SalesOrder', context_str='Records customer orders, including order dates, shipping information, and payment status.'), SQLTableSchema(table_name='Customer', context_str='Ho...

> Running module text2sql_prompt with input:
query_str: Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify the customer with the longest average shipping delay?
schema: Table 'SalesOrder' has columns: SalesOrderID (INTEGER), CustomerID (INTEGER), OrderDate (DATE), RequiredDate (DATE), ShippedDate (DATE), Status (VARCHAR(50)), Comments (TEXT), PaymentMethod (VARCHAR(5...

> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct mysql query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return ...

> Running module sql_output_parser with input:
response: assistant: SELECT AVG(DATEDIFF(SalesOrder.ShippedDate, SalesOrder.OrderDate)) AS AverageDays, Customer.FirstName, Customer.LastName
FROM SalesOrder
JOIN Customer ON SalesOrder.CustomerID = Customer....

> Running module sql_retriever with input:
input: SELECT AVG(DATEDIFF(SalesOrder.ShippedDate, SalesOrder.OrderDate)) AS AverageDays, Customer.FirstName, Customer.LastName
FROM SalesOrder
JOIN Customer ON SalesOrder.CustomerID = Customer.CustomerID ...

> Running module response_synthesis_prompt with input:
query_str: Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify the customer with the longest average shipping delay?
sql_query: SELECT AVG(DATEDIFF(SalesOrder.ShippedDate, SalesOrder.OrderDate)) AS AverageDays, Customer.FirstName, Customer.LastName
FROM SalesOrder
JOIN Customer ON SalesOrder.CustomerID = Customer.CustomerID ...
context_str: [NodeWithScore(node=TextNode(id_='1e988d3f-4019-451b-8b84-62144e09d15a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[(Decimal(...

> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: Calculate the average number of days between order placement and shipment for all orders shipped last year,
and identify ...

assistant: The average number of days between order placement and shipment for all orders shipped last year was 30 days. The customer with the longest average shipping delay was Jonathan Burke.

Query:3

response = qp.run(
query= """Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, sorted by the highest value first and display the top 5?""")
print(str(response))
> Running module input with input: 
query: Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, sorted by the highest value first and display the top 5?

> Running module table_retriever with input:
input: Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, sorted by the highest value first and display the top 5?

> Running module table_output_parser with input:
table_schema_objs: [SQLTableSchema(table_name='SalesOrder', context_str='Records customer orders, including order dates, shipping information, and payment status.'), SQLTableSchema(table_name='Customer', context_str='Ho...

> Running module text2sql_prompt with input:
query_str: Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, sorted by the highest value first and display the top 5?
schema: Table 'SalesOrder' has columns: SalesOrderID (INTEGER), CustomerID (INTEGER), OrderDate (DATE), RequiredDate (DATE), ShippedDate (DATE), Status (VARCHAR(50)), Comments (TEXT), PaymentMethod (VARCHAR(5...

> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct mysql query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return ...

> Running module sql_output_parser with input:
response: assistant: SELECT Customer.CustomerID, Customer.FirstName, Customer.LastName, SUM(LineItem.TotalPrice) as LifetimeValue
FROM Customer
JOIN SalesOrder ON Customer.CustomerID = SalesOrder.CustomerID
JOI...

> Running module sql_retriever with input:
input: SELECT Customer.CustomerID, Customer.FirstName, Customer.LastName, SUM(LineItem.TotalPrice) as LifetimeValue
FROM Customer
JOIN SalesOrder ON Customer.CustomerID = SalesOrder.CustomerID
JOIN LineItem ...

> Running module response_synthesis_prompt with input:
query_str: Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, sorted by the highest value first and display the top 5?
sql_query: SELECT Customer.CustomerID, Customer.FirstName, Customer.LastName, SUM(LineItem.TotalPrice) as LifetimeValue
FROM Customer
JOIN SalesOrder ON Customer.CustomerID = SalesOrder.CustomerID
JOIN LineItem ...
context_str: [NodeWithScore(node=TextNode(id_='c37b8a16-0132-48d5-9425-5eba1281d7a5', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[(61, 'Bra...

> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: Calculate the lifetime value (total sales) of each customer who has made a purchase
within the last three years, s...

assistant: The top five customers with the highest lifetime value (total sales) in the last three years are as follows:
1. Bradley Quinn with a total sales of $34,389.10
2. Laura Mueller with a total sales of $33,312.42
3. Michael Jordan with a total sales of $24,295.07
4. Don Lowery with a total sales of $24,055.13
5. Jerry Haas with a total sales of $23,851.82
Image by the Author

Part:13 — Finetuning a LLM with SQL Dataset:

Image credit Google blog

Please check the Colab notebook. I have provided comments for most of the cells. This is just to show how you can fine-tune a model; a minimal sketch of the key pieces follows the list below.

  • Install necessary Python libraries for the Gemma model, data handling, and fine-tuning.
  • Authenticate with Hugging Face to access models and datasets.
  • Load the Gemma model optimized for efficient inference.
  • Prepare and format the dataset for fine-tuning.
  • Configure LoRA for parameter-efficient fine-tuning.
  • Initialize and configure the fine-tuning trainer.
  • Execute the fine-tuning process.
  • Generate SQL queries from natural language prompts using the fine-tuned model.
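Below is a minimal, illustrative sketch of the core pieces (loading the model, attaching LoRA adapters, and formatting examples). The base model name, dataset column names, and hyperparameters are assumptions on my part; the Colab notebook contains the full, working pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2b"  # assumed base model; any causal LM can be substituted
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the base model with LoRA adapters so only a small fraction of parameters are trained
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

def format_example(example):
    # Turn one dataset record (assumed columns: question, context, answer) into a training string
    return (
        f"Question: {example['question']}\n"
        f"Schema: {example['context']}\n"
        f"SQL: {example['answer']}"
    )
```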

Please refer to Google's documentation for more information on fine-tuning Gemma models.

Image credit pandas documentation

Part:14 — Pandas AI:

Image by the Author

Checkout their doc:

According to their documentation:

PandasAI is a Python library that makes it easy to ask questions of your data (CSV, XLSX, PostgreSQL, MySQL, BigQuery, Databricks, Snowflake, etc.) in natural language. It helps you to explore, clean, and analyze your data using generative AI.

Beyond querying, PandasAI offers functionalities to visualize data through graphs, cleanse datasets by addressing missing values, and enhance data quality through feature generation, making it a comprehensive tool for data scientists and analysts.

Please check out Pandas AI:

# Using poetry (recommended)
poetry add pandasai

# Using pip
pip install pandasai
import pandas as pd
from pandasai import SmartDataframe

# Sample DataFrame
products_data = pd.DataFrame({
"category": ["Electronics", "Clothing", "Toys", "Electronics", "Clothing", "Toys", "Electronics", "Clothing", "Toys", "Electronics"],
"product": ["Laptop", "Jeans", "Teddy Bear", "Smartphone", "T-Shirt", "Puzzle", "Tablet", "Dress", "Board Game", "Camera"],
"price": [1200, 80, 20, 800, 25, 15, 500, 100, 30, 450],
"units_sold": [300, 1500, 800, 500, 2000, 1500, 400, 1200, 1000, 350]
})

# Instantiate a LLM
from pandasai.llm import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")

# Convert to SmartDataframe
df = SmartDataframe(products_data, config={"llm": llm})

# Ask the LLM a question
df.chat('What is the most profitable product category?')
# Question 1: Find the average price of products in each category.
response = df.chat('What is the average price of products in each category?')
print(response)
      category   price
0     Clothing   500.0
1  Electronics    15.0
2         Toys  1200.0
# Question 2: Determine the total revenue generated by each category.
response = df.chat('How much total revenue has each product category generated?')
print(response)
      category  revenue
0     Clothing   500000
1  Electronics    12000
2         Toys   600000
# Question 3: Ask for a comparison of units sold between two categories.
response = df.chat('How do sales of Electronics compare to Toys in terms of units sold?')
print(response)
Sales of Electronics are higher than Toys in terms of units sold.
# Question 4: Query for the most expensive product in the dataset.
response = df.chat('What is the most expensive product listed?')
print(response)
The most expensive product listed is Smartphone.
df.chat('Plot the total revenue generated by each product category, using different colors for each category')
Image by the Author
Image by the Author

Part:15:Some more relevant papers:

  • 1. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (https://arxiv.org/abs/2308.15363)
  • Provides a comprehensive evaluation of various prompt engineering methods for LLM-based text-to-SQL.
  • Presents DAIL-SQL, a new solution that tops the Spider leaderboard benchmark.
  • Explores the use and fine-tuning of open-source LLMs.
  • 2.Evaluating the Text-to-SQL Capabilities of Large Language Models (https://arxiv.org/abs/2204.00498)
  • Analyzes the capabilities of the Codex language model for text-to-SQL tasks.
  • Demonstrates strong performance without any fine-tuning on the Spider benchmark.
  • Explores the impact of few-shot examples for further improvement.
  • 3.Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis (https://arxiv.org/html/2401.12379v1)
  • Investigates how well LLMs can address the nuances of text-to-SQL, like dealing with aggregations.
  • Highlights the strengths and limitations of LLMs in this domain.
  • 4.Global Reasoning over Database Structures for Text-to-SQL Parsing (https://arxiv.org/abs/2004.07320)
  • Introduces an approach for global reasoning to better capture the relationships between different components of a database schema.
  • 5.RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (https://arxiv.org/abs/1911.04942)
  • Proposes a method integrating schema encoding and linking for improved accuracy in text-to-SQL systems.
  • 6.Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing (https://arxiv.org/abs/1805.07555)
  • Focuses on handling queries across multiple database domains.
  • 7.Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies (https://arxiv.org/abs/2305.12586)
  • Explores different prompt design strategies for prompting LLMs in the Text-to-SQL task.
  • Investigates how the selection of demonstrations and instruction formats can influence results.
  • 8.TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation (https://arxiv.org/abs/2106.09669)
  • Introduces TypeSQL, which incorporates type awareness into prompting for better understanding of schema constraints.
  • 9.Break It Down: A Question Decomposition Approach to Text-to-SQL(https://arxiv.org/abs/2008.06750)
  • Proposes a method for decomposing complex SQL queries into smaller, easier-to-translate sub-queries.
  • 10.RAT-SQL++: Relational-Aware Schema Encoding and Linking for Text-to-SQL Parsers (https://aclanthology.org/2020.acl-main.369.pdf)
  • Improves on the RAT-SQL model with enhanced relational encoding techniques to better handle complex SQL structures.
  • 11.Cross-Domain Text-to-SQL Generation from Conversational Context (https://arxiv.org/abs/2003.10024)
  • Focuses on text-to-SQL in conversational settings, where the domain of the query might evolve within a dialogue.

Conclusion:

Using Text2SQL at a large enterprise comes with a lot of challenges, such as making sure data stays safe, ensuring the generated queries are correct, and integrating this technology smoothly into existing systems. Despite these hurdles, the potential benefits for making data more accessible and decision-making more efficient are significant, and the Text2SQL space is evolving fast.

References:

All references are provided above.

ML/DS - Certified GCP Professional Machine Learning Engineer, Certified AWS Professional Machine Learning Specialty, Certified GCP Professional Data Engineer.