Creating Synthetic Data In Snowflake Part II
Summary
This article, "Creating Synthetic Data In Snowflake Part II," walks through Snowflake SQL functions for generating synthetic data fields for testing and development. It shows how to create employee IDs, either sequentially with `Seq4()` or randomly with `Uniform()` and `Random()` (random IDs are not guaranteed to be unique), and how to generate alphanumeric employee names with `Randstr()`, formatted with `Upper()` and `Regexp_Replace()`. It also covers numerical fields such as employee join years, months, and days, generated with `Uniform()` within specified ranges (e.g., 2010-2025 for years, 1-28 for days, which keeps every day valid in every month). Further examples create 6-digit pincodes, normally distributed salaries with `Normal(mean, std_dev, Random())`, and categorical gender data using `Get()` on an array with `Uniform()` selecting the index. The article concludes by combining these techniques to populate an `Employee_Details_Final` table with 100,000 rows of synthetic data.
Key takeaway
Data engineers and data scientists who need to populate Snowflake tables with realistic, non-sensitive data for testing or development can use the SQL functions demonstrated in the article. The approach quickly produces large datasets covering IDs, names, dates, salaries, and categorical fields, so test environments can approximate the shape of production data without exposing real records. Watch for function limits, such as the 1-byte `Seq1()` wrapping around after 128 rows, and use `Try_To_Date` so that an invalid date string returns NULL instead of raising an error.
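The two cautions above can be checked directly. The following is a minimal sketch, assuming a Snowflake session; the row count of 1,000 and the sample date strings are illustrative choices, not values taken from the original article.

```sql
-- SEQ1() is a 1-byte sequence, so it wraps around and repeats values,
-- while the 4-byte SEQ4() stays distinct over the same 1,000 rows.
SELECT COUNT(*)             AS total_rows,
       COUNT(DISTINCT id1)  AS distinct_seq1,
       COUNT(DISTINCT id4)  AS distinct_seq4
FROM (
    SELECT SEQ1() AS id1,
           SEQ4() AS id4
    FROM TABLE(GENERATOR(ROWCOUNT => 1000))
);

-- TRY_TO_DATE returns NULL for a string that is not a valid date,
-- where TO_DATE would raise an error and fail the whole statement.
SELECT TRY_TO_DATE('2024-02-30', 'YYYY-MM-DD') AS invalid_date,  -- February 30 does not exist
       TRY_TO_DATE('2024-02-28', 'YYYY-MM-DD') AS valid_date;
```

In the first query, `distinct_seq1` should come out well below `total_rows`, which is the signal that `Seq1()` is too narrow for an ID column of that size.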
Key insights
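Snowflake's built-in functions `Seq`, `Uniform`, `Randstr`, `Normal`, and `Get` cover the main building blocks of synthetic data: sequential IDs, bounded random numbers, random strings, bell-curve values, and categorical picks, respectively.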
Principles
- Use `Seq` functions for sequential IDs.
- Employ `Uniform` for bounded random numbers.
- Leverage `Normal` for bell-curve distributed data.
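One way to confirm that generated columns follow these principles is to summarize them after generation. This is a sketch only: the salary mean of 60,000 and standard deviation of 15,000 are hypothetical parameters, and the column aliases are invented for illustration.

```sql
-- Sanity check: UNIFORM should stay inside its bounds (both ends inclusive),
-- and NORMAL should land close to the requested mean and standard deviation.
SELECT MIN(join_year)          AS min_year,       -- should not fall below 2010
       MAX(join_year)          AS max_year,       -- should not exceed 2025
       ROUND(AVG(salary))      AS avg_salary,     -- should be close to the mean passed to NORMAL
       ROUND(STDDEV(salary))   AS stddev_salary   -- should be close to the std_dev passed to NORMAL
FROM (
    SELECT UNIFORM(2010, 2025, RANDOM())   AS join_year,
           NORMAL(60000, 15000, RANDOM())  AS salary   -- hypothetical mean and std_dev
    FROM TABLE(GENERATOR(ROWCOUNT => 100000))
);
```

Because `Normal` is unbounded, a small share of rows can fall far from the mean; whether that matters depends on the test being run.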
Method
Generate synthetic data in Snowflake by combining functions like `Seq`, `Uniform`, `Random`, `Randstr`, `Normal`, and `Get` with `Generator(Rowcount => N)` to populate tables with various data types.
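The method can be sketched as a single `CREATE TABLE ... AS SELECT` over `Generator`. The table name and row count come from the summary above; the column names, string length, salary parameters, and gender categories are assumptions made for this example and may differ from the original article.

```sql
CREATE OR REPLACE TABLE Employee_Details_Final AS
SELECT emp_id,
       emp_name,
       join_year,
       join_month,
       join_day,
       -- Assemble a date from the parts; TRY_TO_DATE yields NULL rather than an error
       -- if a combination is ever invalid.
       TRY_TO_DATE(
           join_year::STRING || '-' ||
           LPAD(join_month::STRING, 2, '0') || '-' ||
           LPAD(join_day::STRING, 2, '0'),
           'YYYY-MM-DD'
       ) AS join_date,
       pincode,
       salary,
       gender
FROM (
    SELECT SEQ4() + 1                                   AS emp_id,      -- sequential ID starting at 1
           UPPER(RANDSTR(8, RANDOM()))                  AS emp_name,    -- 8-character alphanumeric string
           UNIFORM(2010, 2025, RANDOM())                AS join_year,
           UNIFORM(1, 12, RANDOM())                     AS join_month,
           UNIFORM(1, 28, RANDOM())                     AS join_day,    -- 1-28 is valid in every month
           UNIFORM(100000, 999999, RANDOM())            AS pincode,     -- always 6 digits
           ROUND(NORMAL(60000, 15000, RANDOM()), 2)     AS salary,      -- hypothetical mean and std_dev
           GET(ARRAY_CONSTRUCT('Male', 'Female', 'Other'),
               UNIFORM(0, 2, RANDOM()))::STRING         AS gender       -- array index is 0-based
    FROM TABLE(GENERATOR(ROWCOUNT => 100000))
);
```

Each `RANDOM()` call is evaluated separately, so the columns vary independently of one another. The date parts are computed in the inner query so that the outer query can reuse the same values when building `join_date`.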
In practice
- Generate unique IDs with `Seq4()`.
- Create random strings using `Randstr(length, Random())`.
- Simulate salaries with `Normal(mean, std_dev, Random())`.
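A small preview query makes it easier to see what each of these calls returns before and after formatting. This is a hedged sketch: the digit-stripping pattern is one possible use of `Regexp_Replace` and not necessarily the one in the original article, and the salary parameters are hypothetical. The `ROW_NUMBER()` column is included because Snowflake's documentation notes that `Seq` values are not guaranteed to be gap-free.

```sql
SELECT seq_id,
       ROW_NUMBER() OVER (ORDER BY seq_id)            AS gap_free_id,  -- 1, 2, 3, ... with no gaps
       raw_name,
       REGEXP_REPLACE(UPPER(raw_name), '[0-9]', '')   AS emp_name,     -- uppercase, digits removed
       raw_salary,
       ROUND(raw_salary, 2)                           AS salary        -- rounded to two decimals
FROM (
    SELECT SEQ4()                           AS seq_id,
           RANDSTR(10, RANDOM())            AS raw_name,    -- mixed-case letters and digits
           NORMAL(60000, 15000, RANDOM())   AS raw_salary   -- hypothetical mean and std_dev
    FROM TABLE(GENERATOR(ROWCOUNT => 5))
);
```

Removing digits shortens some names, so `emp_name` varies in length from row to row; a fixed-length name would need a different approach.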
Topics
- Snowflake
- Synthetic Data Generation
- SQL Functions
- Test Data Management
- Random Data Generation
Best for: Data Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.