GPT-4 as your company’s data engineer

Written by Suyog Sonwalkar
Large language models like GPT-4 can already be used to write SQL queries from natural language questions. Even for simple prompts with limited knowledge of your database structure, GPT-4 will make reasonable assumptions and still generate SQL.

However, we can get much better results if we can provide an LLM with information about our database.

In this post, we’ll start by exploring some relatively straightforward prompt engineering to get significantly better results from GPT-4, and then discuss some more ideal scenarios that could be enabled by chaining prompts or by integrating the LLM more closely with the database itself.

Background

For simple prompts, like asking for the number of users created in the last week, GPT-4 will make some assumptions about your table and column names and come up with a response similar to the one below:

Write some SQL that can query the number of users created in the last week from my database
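
For example, GPT-4 might assume a users table with a created_at column and reply with an explanation of its assumptions plus a query along these lines (illustrative; exact output varies from run to run):

```
SELECT COUNT(*)
FROM users
WHERE created_at >= NOW() - INTERVAL 7 DAY;
```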

We can also reduce GPT-4’s verbosity by specifically asking it to only return SQL.

Write some SQL that can query the number of users created in the last week from my database. Do not explain your work, only return the SQL and make assumptions on table names and column names yourself.
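
With that instruction, the response is reduced to just the query, e.g. (again illustrative):

```
SELECT COUNT(*) FROM users WHERE created_at >= NOW() - INTERVAL 7 DAY;
```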

Issues

As we can already see from the examples above, there are a number of issues with trying to use this SQL directly against our database.

First, GPT-4 doesn’t know the schema of our database, including our table names, column names, column descriptions, and relationships between tables. This prevents us from directly copying the generated SQL into our query tool.

Second, GPT-4 doesn’t know if we have limitations on our SQL dialect based on the underlying database. Postgres, MySQL, Oracle, and others can have slight differences that become more nuanced as queries become more complex. Even in the basic answers above, Postgres would have a syntax error around the interval clause.
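
For example, the two dialects write intervals differently:

```
-- MySQL accepts a bare interval:
SELECT COUNT(*) FROM users WHERE created_at >= NOW() - INTERVAL 7 DAY;

-- Postgres expects the interval as a quoted string:
SELECT COUNT(*) FROM users WHERE created_at >= NOW() - INTERVAL '7 days';
```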

Third, in large organizations, the industry has moved towards data pipeline solutions like Airflow and data warehouses like Redshift, which can integrate with multiple data sources and run long-running jobs to answer more complex queries.

Thankfully, we can resolve the first two with some prompt engineering and a way of passing our database structure to GPT-4. For the third, we at LastMile AI want to start a conversation about how to develop industry standards to really leverage the power of LLMs across the entire data stack.

Prompt Engineering as a temporary solution

We can fix some of these issues with simple prompt engineering: informing the LLM which database we’re using and copying the table structure from a client.

In the following example, I use psql to retrieve the table information with a \d public."User" command and pass it to GPT-4 along with the question I want to answer.

Given a postgres table schema like below:

```
\d public."User";
                                    Table "public.User"
    Column     |              Type              | Collation | Nullable |      Default      
---------------+--------------------------------+-----------+----------+-------------------
 id            | text                           |           | not null | 
 email         | text                           |           |          | 
 name          | text                           |           |          | 
 attributes    | jsonb                          |           |          | 
 createdAt     | timestamp(3) without time zone |           | not null | CURRENT_TIMESTAMP
 updatedAt     | timestamp(3) without time zone |           | not null | CURRENT_TIMESTAMP
 emailVerified | timestamp(3) without time zone |           |          | 
 image         | text                           |           |          | 
```

I want to find the number of users who joined in the past week. Give me some SQL that will work against this table in my database. Only give me the SQL, do not explain your work. Make sure to quote column names.

This produces an exact SQL query that I can copy and paste into my psql client to get the result I’m looking for.
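
A query along these lines satisfies the prompt (GPT-4’s exact output can vary between runs):

```
SELECT COUNT(*)
FROM public."User"
WHERE "createdAt" >= NOW() - INTERVAL '7 days';
```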

We can take this further and provide another table’s information in order to perform simple relational queries as well.

I also have a Posts table like this:
```
\d public."Post"
                                           Table "public.Post"
        Column         |              Type              | Collation | Nullable |         Default          
-----------------------+--------------------------------+-----------+----------+--------------------------
 id                    | text                           |           | not null | 
 createdAt             | timestamp(3) without time zone |           | not null | CURRENT_TIMESTAMP
 updatedAt             | timestamp(3) without time zone |           | not null | 
 creatorId             | text                           |           | not null | 
 type                  | "PostType"                     |           | not null | 'INFERENCE'::"PostType"
 content               | text                           |           | not null | 
Indexes:
    "Post_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "Post_creatorId_fkey" FOREIGN KEY ("creatorId") REFERENCES "User"(id) ON UPDATE CASCADE ON DELETE RESTRICT
```

I now want to find how many trials the past week's new users created, give me the SQL for that. Only return the SQL, do not explain your work. Make sure to quote the column names.
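
GPT-4 can answer this with a join between the two tables. A sketch of the kind of query it produces, assuming trials are Post rows with a hypothetical 'TRIAL' value in the "PostType" enum:

```
SELECT COUNT(*)
FROM public."Post" p
JOIN public."User" u ON p."creatorId" = u."id"
WHERE u."createdAt" >= NOW() - INTERVAL '7 days'
  AND p."type" = 'TRIAL'; -- 'TRIAL' is an assumed "PostType" value
```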

Something not done here is passing column descriptions that explain what information each column contains. In the future, it may be beneficial to provide column descriptions optimized for an LLM. Essentially, in addition to human-readable descriptions, it might be valuable to have token-optimized descriptions specifically for LLM understanding.
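
In Postgres, for instance, such descriptions could be attached with COMMENT ON COLUMN (the description text below is just an illustrative sketch) and would then appear in \d+ output, ready to include in the prompt:

```
COMMENT ON COLUMN public."User"."attributes" IS 'Arbitrary per-user settings stored as JSON';
COMMENT ON COLUMN public."Post"."type" IS 'Category of the post, e.g. INFERENCE';

-- \d+ public."User" now shows a Description column alongside each field
```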

This relatively simple prompt-engineering solution can get you more reasonable responses; however, they are not always perfect. First, the queries may not be optimized. They may also have slight issues, like column names not being quoted properly.
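
The quoting issue is easy to hit in Postgres, where unquoted identifiers are folded to lowercase:

```
-- Fails: Postgres folds unquoted identifiers to lowercase, so createdAt becomes createdat
SELECT createdAt FROM public."User";

-- Works: quoting preserves the camelCase column name
SELECT "createdAt" FROM public."User";
```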

A larger issue is that you might not be able to pass your entire database schema to the LLM due to prompt token limits. In these cases, you may only want to pass the information for the exact tables you’re looking to get data from with your question.

Immediate Value

We can already see above that with some basic prompt engineering and passing database information to GPT-4, we can get workable code directly from the LLM. We can ask business questions in natural language about our data and use the queries generated to get the answers or metrics we care about.

Ideal solutions

What would an ideal solution look like for our examples above and for more complicated queries?

First, it would be ideal to have longer prompt limits so that larger schemas can be passed to the LLM. With Claude moving to 100k-token context windows, this may already be coming soon, although output performance and speed may still be limiting factors for extremely large contexts.

Maybe the LLM could chain together information about all your tables (\d in psql to get the list of relations), then drill into detail when you pose your question, without you having to manually input the schemas. This would require some way of chaining multiple prompts and responses together, which ideally would not happen in a chat interface.
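
For example, a first step could feed the model just the list of relations before pulling individual schemas (illustrative psql output; names and owner will differ):

```
\d
         List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | Post | table | postgres
 public | User | table | postgres
```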

In addition, what if a user didn’t have to manually enter any information about their schema, and a query tool like pgAdmin (or a similar tool for other databases) did this automatically and integrated a prompt UI?

Ideally, the solution would also need to take into account how your schemas change over time. A question answered against an out-of-date schema could produce incorrect SQL and have downstream implications.

Tools like BabyAGI and AutoGPT are attempting to use LLMs as part of automated agents — would we be comfortable with the LLM taking an action like actually executing SQL without human intervention?

Finally, in larger organizations, how would an ideal solution deal with data access permissions based on who is performing the queries? And how would a solution integrate with tools like Redshift?

Working towards better solutions

We are just getting started at LastMile AI and would love to hear from you, especially if you’re working towards better solutions integrating data and LLMs. You can reach us here:

We would also appreciate your feedback on our initial product offering, available at lastmileai.dev.

