Creating Synthetic Data In Snowflake Part II

Source: Data Science on Medium · Field: Technology & Digital (Data Science & Analytics, Software Development & Engineering, Artificial Intelligence & Machine Learning) · Depth: Intermediate, medium

Summary

The article walks through Snowflake SQL functions for generating synthetic data fields for testing and development. Employee IDs come from `Seq4()`, which yields unique sequential values, or from `Uniform()` with `Random()`, which yields random values that can repeat. Alphanumeric employee names are built with `Randstr()` and formatted with `Upper()` and `Regexp_Replace()`. Join years, months, and days are drawn with `Uniform()` over fixed ranges (for example 2010-2025 for years and 1-28 for days, so every day is valid in every month). The article also generates 6-digit pincodes, normally distributed salaries with `Normal(mean, std_dev, Random())`, and categorical gender values by using `Get()` on an array with a `Uniform()` index. It concludes by combining these techniques to populate an `Employee_Details_Final` table with 100,000 rows of synthetic data.

Key takeaway

Data Engineers and Data Scientists who need to populate Snowflake tables with realistic, non-sensitive data for testing or development can use the SQL functions demonstrated in the article. The approach quickly produces large datasets covering IDs, names, dates, salaries, and categorical fields, so test environments can approximate the shape of production data without exposing real records. Watch for function limits, such as `Seq1()` wrapping around after 128 values, and use `Try_To_Date` so that an invalid date string returns NULL instead of raising an error.
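
The two caveats can be illustrated with a minimal sketch. The column aliases, the row count of 300, and the sample date strings are assumptions chosen for illustration; they are not taken from the original article.

```sql
-- SEQ1() is a 1-byte sequence, so it runs out of distinct values quickly
-- and wraps around; SEQ4() (4 bytes) is the safer choice for row IDs.
SELECT
    SEQ1() AS id_1_byte,   -- wraps around once the 1-byte range is exhausted
    SEQ4() AS id_4_byte    -- does not wrap at this row count
FROM TABLE(GENERATOR(ROWCOUNT => 300));

-- TRY_TO_DATE returns NULL instead of raising an error when the
-- input string is not a valid date (here, 30 February).
SELECT
    TRY_TO_DATE('2024-02-30', 'YYYY-MM-DD') AS invalid_date,  -- NULL
    TRY_TO_DATE('2024-02-28', 'YYYY-MM-DD') AS valid_date;    -- 2024-02-28
```

Comparing the two ID columns in the first result makes the wrap-around visible, which is why a wider sequence function is usually preferable when the row count is large.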

Key insights

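Snowflake's built-in `Seq` family (`Seq1()` to `Seq8()`), `Uniform`, `Randstr`, `Normal`, and `Get` functions cover sequential, random numeric, random string, normally distributed, and categorical fields respectively.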

Principles

Method

Write each column as an expression built from `Seq`, `Uniform`, `Random`, `Randstr`, `Normal`, or `Get`, then select those expressions from `Table(Generator(Rowcount => N))`, which emits N rows, to populate a table with the required mix of data types.
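
The method can be sketched roughly as follows. The table name and row count come from the summary above; the column names, string length, regular expression, salary mean and standard deviation, and gender categories are all assumptions made for illustration and may differ from the original article.

```sql
-- A minimal sketch: one generated row per employee, 100,000 rows in total.
CREATE OR REPLACE TABLE Employee_Details_Final AS
SELECT
    -- Sequential ID; SEQ4() starts at 0, so add 1 (assumed convention)
    SEQ4() + 1                                                 AS employee_id,
    -- Random 8-character string, digits stripped, upper-cased (assumed format)
    UPPER(REGEXP_REPLACE(RANDSTR(8, RANDOM()), '[0-9]', ''))   AS employee_name,
    -- Join date parts; day capped at 28 so it is valid in every month
    UNIFORM(2010, 2025, RANDOM())                              AS join_year,
    UNIFORM(1, 12, RANDOM())                                   AS join_month,
    UNIFORM(1, 28, RANDOM())                                   AS join_day,
    -- 6-digit pincode
    UNIFORM(100000, 999999, RANDOM())                          AS pincode,
    -- Normally distributed salary (mean and std_dev are assumed values)
    ROUND(NORMAL(60000, 15000, RANDOM()), 2)                   AS salary,
    -- Categorical value picked by a random array index (assumed categories)
    GET(ARRAY_CONSTRUCT('Male', 'Female', 'Other'),
        UNIFORM(0, 2, RANDOM()))::VARCHAR                      AS gender
FROM TABLE(GENERATOR(ROWCOUNT => 100000));
```

Each call to `RANDOM()` is evaluated per row, so every column receives an independent draw. Because `GET` returns a VARIANT, the sketch casts the result to VARCHAR to store a plain string.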

Best for: Data Engineer, Data Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.