Snowflake User Guide¶
Overview¶
Snowflake is an elastically scalable cloud data warehouse. By leveraging the power of Snowflake, Spectus can provide an enhanced data exploration experience to our clients.
From 2024, Snowflake will be the main SQL engine used by Spectus users. Working interactively through the JupyterLab interface remains a core offering of the platform. In addition, a new UI is available in the form of Snowsight from Snowflake.
Snowsight Interface¶
Snowflake Snowsight has been integrated into the set of tools on Spectus Builder to provide a seamless solution for multiple processes, including:
- Easily referencing the on-platform Cuebiq data assets
- Visualizing and Monitoring Compute usage
- Scheduling Tasks/Procedures
In this User Guide you will find an overview of the main features of Snowsight relevant to Spectus users: 1. Projects, 2. Data, and 3. Admin. To learn more about the Snowsight interface, see the Snowsight User Guide provided by Snowflake.
Projects¶
Worksheets¶
Worksheets provide a simple way for you to write SQL queries (DML and DDL), see the results, and interact with them. With a worksheet, you can do any of the following:
- Run ad hoc queries and other DDL/DML operations.
- Write Snowpark Python code in a Python worksheet.
- Review the query history and results of queries that you executed.
- Examine multiple worksheets, each with its own separate session.
- Export the results for a selected statement, while results are still available.
For a full understanding of worksheet capabilities, see the Snowflake Documentation or the "How Do I...?" section below for specific workflows common to Spectus users.
Data¶
Databases¶
You can explore databases and objects, including tables, functions, views, and more using the database object explorer.
Through the Spectus platform, clients have access to three main databases:
- CUEBIQ_DATA consists of views of the data hosted directly by Cuebiq (change log). For a full introduction to the data available on the platform, please contact your Customer Success Manager or Book a Demo with our team.
- DEDICATED is a local database available for your team to create schemas, tables, etc. as required for your project. For more details see Dedicated Workspace.
- SNOWFLAKE is a database that stores information about your usage of the platform to facilitate monitoring.
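If you prefer to explore these databases from a SQL Worksheet rather than the object explorer, standard Snowflake SHOW commands work as well, for example:
SHOW DATABASES;
SHOW SCHEMAS IN DATABASE CUEBIQ_DATA;
SHOW TABLES IN SCHEMA CUEBIQ_DATA.PAAS_CDA_V3;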
Admin¶
Cost Management¶
Snowflake provides visibility into your team's usage through its Cost Management dashboard. This is available through the Admin >> Cost Management menu item.
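Beyond the dashboard, the same usage data can be queried directly from the SNOWFLAKE database mentioned above. A minimal sketch, assuming your role has access to the standard ACCOUNT_USAGE views:
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_30_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC;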
How do I...?¶
In this section you will find the workflows associated with some of the more common actions performed by Spectus users on the Snowflake platform:
- Connect to Snowflake SQL Engine from within JupyterLab
- Preview the contents of database tables
- Run a SQL query in Snowsight
- Create a scheduled Task
- Run Python Code in Snowsight
- Create a Python Stored Procedure
- Create a procedure with runtime variables
- Use the Data Event Trigger
- Import/Export a table via S3
- Integrate with git
- Create an output logger
- Connect via the Snowflake API
Connect to Snowflake SQL Engine within JupyterLab¶
SQL queries run through the JupyterLab interface are now powered by the Snowflake SQL query engine.
To connect to Snowflake, simply copy and paste the contents below into a notebook cell and run it. You will need to run this cell in each notebook each time you start a session on the platform.
%load_ext cuebiqmagic.magics
%init_cuebiq_data
snow_engine = get_ipython().user_ns['instance']
Once you run the command, you should receive a "Connected" message as below:
Preview the contents of database tables¶
New with the introduction of Snowflake is the ability to preview the contents of a table through the Graphical Interface.
From within the Data >> Databases screen:
- Select the database of interest (e.g. CUEBIQ_DATA)
- Select the schema of interest (e.g. PAAS_CDA_V3)
- Select the table of interest (e.g. GEOGRAPHY_REGISTRY)
- Select Data Preview
- Select a warehouse to run the preview through (default COMPUTE_WH)
- Click Preview
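If you prefer SQL, an equivalent ad hoc preview can be run from a worksheet with a simple LIMIT query, for example:
SELECT *
FROM cuebiq_data.paas_cda_v3.geography_registry
LIMIT 10;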
Run a SQL Query in Snowsight¶
Create a SQL Worksheet from Projects >> Worksheets
Within the SQL Worksheet you can run individual SQL queries by highlighting the query and running it with the blue arrow.
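For example, with two statements in the worksheet, highlighting only the second one and clicking the blue arrow runs just that statement:
SELECT CURRENT_WAREHOUSE(), CURRENT_ROLE();
SELECT COUNT(*) FROM cuebiq_data.paas_cda_v3.geography_registry;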
Create a Scheduled Task¶
Within a SQL Worksheet you can create a TASK. A task can execute any one of the following types of SQL code:
- Single SQL statement
- Call to a stored procedure
Create a SQL Worksheet from Projects >> Worksheets
- Use the CREATE TASK command to build the task either from a single SQL statement or a previously prepared procedure (see how to create a procedure below and the example task below)
- Highlight the command
- Click the blue Run button
Newly created or cloned tasks are created suspended. For information on resuming suspended tasks see ALTER TASK … RESUME.
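As a sketch (the task, schema, and procedure names here are placeholders), a task that calls a stored procedure on a daily CRON schedule, and is then resumed, could look like this:
CREATE TASK dedicated.my_schema.my_daily_task
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  CALL dedicated.my_schema.my_procedure();

-- Newly created tasks are suspended until explicitly resumed
ALTER TASK dedicated.my_schema.my_daily_task RESUME;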
You can also manage your created TASK under Data >> Databases. Select the … actions button to edit the task name or comment, clone the task, drop the task, or transfer ownership of the task to another role.
Running Python Code in Snowsight¶
Create a Python Worksheet from Projects >> Worksheets
Within the Python worksheet you can add code within the MAIN function. Additionally, you can import packages using the Anaconda package manager.
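A minimal sketch of a Python worksheet body, mirroring the handler structure used in the stored-procedure examples below (the table name is only an illustration):
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    # Read a table, keep a small sample, and return it so Snowsight displays the result
    df = session.table("cuebiq_data.paas_cda_v3.geography_registry").limit(10)
    return df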
Creating a Python Stored Procedure¶
The Snowflake guide on creating a procedure from a python worksheet can be found here.
You can now manage your created PROCEDURE under Data >> Databases.
This stored procedure can then be called using the SQL CALL command, and when creating a TASK, in a similar way to the method discussed above for SQL code (see Create a Scheduled Task).
For more information about using the CALL command with stored procedures, see the Snowflake CALL documentation page.
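For example, once a procedure named test has been deployed (as in the next section), it can be invoked directly, or used as the body of a task after the AS keyword:
CALL test();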
Create a procedure with runtime variables¶
- Create a new Python Worksheet from the Projects >> Worksheets screen and keep the session signature unchanged:
def main(session: snowpark.Session):
- Write your code within the main function (as outlined above) and use local variables at this stage.
- When ready to deploy, click the DEPLOY button (beside the blue RUN arrow) and complete the required details. Then select "OPEN IN WORKSHEETS" to view the procedure creation code in a SQL Worksheet.
Example code:
create procedure test()
returns Table()
language python
runtime_version = 3.8
packages =('snowflake-snowpark-python')
handler = 'main'
as 'import snowflake.snowpark as snowpark
def main(session: snowpark.Session):
param = 1
df = session.table(''snowflake_sample_data.tpch_sf10.lineitem'').limit(param)
return df';
To alter this to include runtime variables, the following three steps are required:
- Remove the local Python variables
- Change the main function signature
- Change the SQL stored procedure signature
After:
create procedure test(param INT)
returns Table()
language python
runtime_version = 3.8
packages =('snowflake-snowpark-python')
handler = 'main'
as 'import snowflake.snowpark as snowpark
def main(session: snowpark.Session, param:int):
df = session.table(''snowflake_sample_data.tpch_sf10.lineitem'').limit(param)
return df';
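Once deployed, the procedure can be invoked with a runtime value for param, for example:
CALL test(5);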
Use the Data Event Trigger¶
The data tables within the CUEBIQ_DATA
database are updated with new partitions daily. A Data Event trigger is available for the following tables:
- device_location
- stop
- trajectory
To trigger your code based on the creation of new partitions in these tables, follow the steps below:
- Define the code in Python that you wish to run when triggered
- Create the following procedure from within a SQL Worksheet:
CREATE OR REPLACE SCHEMA dedicated.my_schema;

CREATE OR REPLACE PROCEDURE dedicated.my_schema.process_daily_stop(processing_date INTEGER)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = 3.10
HANDLER = 'process'
PACKAGES = ('snowflake-snowpark-python')
EXECUTE AS OWNER
AS $$
import logging

logger = logging.getLogger("process_daily_stop")

def process(session, processing_date):
    logger.info(f"processing stops {processing_date}")
    # Enter your code here
    return "DONE"
$$;
- CALL the dedicated.dedicated_event.create_event_trigger() procedure. This was created by the Cuebiq team to set up the STREAM that links the creation of the new partition with the procedure that runs your script.
CALL dedicated.dedicated_event.create_event_trigger(
    exec_on_warehouse => 'COMPUTE_WH',
    data_event_table => 'cuebiq_data.paas_cda_v3_event.stop',
    procedure_to_exec => 'dedicated.my_schema.process_daily_stop'
);
- Verify the STREAM and TASK are created by running:
SHOW STREAMS IN SCHEMA dedicated.my_schema;
SHOW TASKS IN SCHEMA dedicated.my_schema;
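To confirm the trigger actually runs once new partitions land, one option (a sketch using Snowflake's standard TASK_HISTORY table function) is to inspect recent task executions:
SELECT name, state, scheduled_time, error_message
FROM TABLE(dedicated.information_schema.task_history())
ORDER BY scheduled_time DESC
LIMIT 20;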
Import/Export a table via S3¶
Within the DEDICATED database available on Snowflake, under SPECTUS_S3_IMPORT_EXPORT, there are two mounted stages that link directly to the export and import paths within the Cuebiq-hosted S3 bucket you were provided when joining Spectus.
To unload tables from other schemas in the DEDICATED database, use the COPY INTO command.
Below is example code to unload a table from Snowflake to a single uncompressed CSV file:
COPY INTO @dedicated.spectus_s3_export.spectus_s3_export/SUBPATH/FILENAME.csv
FROM dedicated.test_schema.test_table
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' COMPRESSION = NONE)
SINGLE = TRUE
MAX_FILE_SIZE = 4900000000
HEADER = TRUE;
Below is example code to load a single uncompressed CSV file into a Snowflake table. Note that SINGLE, MAX_FILE_SIZE, and HEADER apply only to unloading; when loading, the header row is skipped via SKIP_HEADER in the file format:
COPY INTO dedicated.test_schema.test_table
FROM @dedicated.spectus_s3_export.spectus_s3_export/SUBPATH/FILENAME.csv
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' COMPRESSION = NONE SKIP_HEADER = 1);
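To verify that an unloaded file landed on the stage (or to see which files are available to load), you can list the stage contents:
LIST @dedicated.spectus_s3_export.spectus_s3_export/SUBPATH/;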
Best practices to query tables¶
It is a common best practice in big data environments to rely on the partition columns to speed up query execution.
If you plan to query the device_location table, please make sure to process one date at a time to avoid a long-running query.
- For all tables, if you need to run an analysis over multiple dates, please specify them within round parentheses like this ('20241201', '20241202', etc.) instead of using the BETWEEN operator.
- Always specify the country_code partition when available.
- Always specify the data_provider partition when available.
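As an illustrative sketch of these practices (the date partition column is shown here as event_date, the filter values are placeholders, and the fully qualified table name assumes the PAAS_CDA_V3 schema shown earlier; substitute the actual partition column, values, and schema for the table you are querying):
SELECT COUNT(*)
FROM cuebiq_data.paas_cda_v3.device_location
WHERE event_date IN ('20241201', '20241202')
  AND country_code = 'US'
  AND data_provider = 'PROVIDER_ID';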
Integrate with Git¶
To learn more about using Git within the Snowsight interface see the Snowflake documentation here.
To set up a Git integration, run the following SQL snippet in a Snowsight SQL Worksheet after replacing the following variables:
- INTEGRATION_NAME
- ACCOUNT_NAME
- git_username
- github_access_token
- MY_GIT_REPO
- REPO_NAME
SQL snippet:
USE ROLE USER;
USE DATABASE dedicated;
USE SCHEMA dedicated.git_integration;
CALL dedicated.git_integration.create_git_api_integration(
integration_name => 'INTEGRATION_NAME',
git_account_url => 'https://github.com/ACCOUNT_NAME'
);
CREATE OR REPLACE SECRET dedicated.git_integration.git_credentials
TYPE = password
USERNAME = 'git_username'
PASSWORD = 'github_access_token';
CREATE OR REPLACE GIT REPOSITORY dedicated.git_integration.MY_GIT_REPO
API_INTEGRATION = INTEGRATION_NAME
GIT_CREDENTIALS = git_credentials
ORIGIN = 'https://github.com/ACCOUNT_NAME/REPO_NAME.git';
SHOW GIT REPOSITORIES IN DATABASE dedicated;
ALTER GIT REPOSITORY dedicated.git_integration.my_git_repo FETCH;
SHOW GIT BRANCHES IN dedicated.git_integration.my_git_repo;
LS @dedicated.git_integration.MY_GIT_REPO/branches/master;
The setup contains a call to a stored procedure, dedicated.git_integration.create_git_api_integration. This allows you to complete the setup autonomously with your standard permissions. You will find a schema within the dedicated catalog called git_integration that houses this stored procedure.
Create an SNS Integration¶
To create an SNS integration for notifications, use the CREATE_SNS_INTEGRATION stored procedure available in the DEDICATED.PUBLIC schema. The procedure accepts three parameters:
- integration_name: The name you want to give to your SNS integration
- aws_sns_role_arn: The ARN of the IAM role that Snowflake will assume
- aws_sns_topic_arn: The ARN of the SNS topic that will receive the notifications
Example:
CALL create_sns_integration(
integration_name => 'sns_test',
aws_sns_role_arn => 'arn:aws:iam::ACCOUNT_ID:role/service-role/sns_test',
aws_sns_topic_arn => 'arn:aws:sns:REGION:ACCOUNT_ID:test-topic'
);
Then, to retrieve the external ID, run:
DESCRIBE NOTIFICATION INTEGRATION sns_test;
Create an output logger¶
The dedicated.dedicated_event.event_table table has been created to store logged events within your Snowflake instance.
To create a logger for your Python outputs, first import the logging module into your Worksheet. Additionally, you must create a logger to track your log entries. See below for example code to create your logger and add an entry:
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col
import logging
def main(session: snowpark.Session):
# Your code goes here, inside the "main" handler.
logger = logging.getLogger("REPLACE_WITH_YOUR_LOGGER_NAME")
logger.info(f"Logging from Python")
Once logged, you can retrieve the logs from the event_table using SQL. See below for an example query to see your events logged in the dedicated.dedicated_event.event_table table:
SELECT
TIMESTAMP AS time,
RESOURCE_ATTRIBUTES['snow.executable.name'] as executable,
RECORD['severity_text'] AS severity,
VALUE AS message
FROM dedicated.dedicated_event.event_table
WHERE RECORD_TYPE = 'LOG' AND SCOPE['name'] = 'REPLACE_WITH_YOUR_LOGGER_NAME';
Connect via the Snowflake API¶
See the full documentation on the Spectus Snowflake API integration