NLP to SQL Queries | Ashna Arora

GitHub Repository: View on GitHub

Introduction

This project demonstrates how to build and deploy a FastAPI application that translates natural language questions into SQL queries and runs them against a dataset hosted in Amazon Athena. The entire pipeline is containerized, stored in Amazon ECR, and deployed on an AWS EKS cluster, accessible via a public endpoint.

Problem

Data analysts and non-technical stakeholders often struggle to query data if they don’t know SQL. A natural language interface can bridge this gap, allowing anyone to get insights from data using plain English. In this service, users can type a question in plain English (e.g., “Show me all products ordered in July”), and the system generates the corresponding SQL query, executes it against the database, and returns the results in a CSV file - all without touching SQL.

Approach

Application Development
- Built with FastAPI to handle HTTP requests.
- Integrated OpenAI API to translate natural language into SQL queries.
- Connected to Amazon Athena to execute SQL queries on S3-hosted data.
Containerization
- Created a multi-architecture Docker image supporting amd64 and arm64.
- Stored images in Amazon ECR.
Continuous Integration & Deployment
- Used GitHub Actions to automate:
  - Building the Docker image.
  - Pushing it to ECR.
  - Deploying updates to EKS.
Infrastructure
- Deployed on an AWS EKS cluster.
- Exposed via a LoadBalancer service to get a public URL.

Results

Successfully deployed a public API where users can send natural language questions and receive SQL results.
Achieved end-to-end automation from code commit to deployment.
Validated scalability on AWS infrastructure.

Challenges

GitHub Actions not triggering due to missing workflow events.
Docker architecture mismatch (built on ARM Mac, deployed on AMD AWS nodes).
OpenAI API quota issues causing runtime errors.
Kubernetes CrashLoopBackOff errors due to missing environment variables.
Cost management — keeping EKS running racks up charges quickly.

How I Solved

Added on: push triggers in GitHub Actions to automate builds.
Used --platform linux/amd64 in Docker builds to match AWS node architecture.
Stored sensitive credentials (OpenAI API key, AWS creds) in GitHub Secrets and Kubernetes secrets.
Implemented cleanup scripts to delete EKS clusters, ECR repos, and S3 data to avoid extra costs.

Conclusion

This project demonstrates the full lifecycle of deploying an NLP-to-SQL application in the cloud, from local development to production deployment on AWS. It also highlights the importance of CI/CD, cross-platform containerization, and cost management.

Future Improvements

Implement caching to reduce repeated OpenAI calls and save on API costs.
Add authentication and rate-limiting for the public API.
Expand support for multiple databases beyond Athena.
Enhance natural language processing for more complex queries.
Add monitoring and logging dashboards using AWS CloudWatch and Grafana.