Amazon redshift distribution key

9/19/2023

Joins are typically merges, rather than the index-based “nested loops” of an OLTP database: a decision support database take two large rowsets, orders them by the join column, and then merges the rows where that column matches. For one thing, most decision support queries start with a table-scan, something that’s anathema to OLTP developers.

These sorts of queries require a completely different execution engine than that used for OLTP. For example, “select total revenue by month for the last year, stratified by how long the user had been a customer at the time of their first transaction.” In a decision support database, by comparison, you tend to write queries that summarize entire tables, perhaps with joins to other tables. Your database will be set up with foreign-key constraints to ensure that your code can’t insert a row with an invalid user or product ID. That row is probably very small: it contains the user’s ID, the ID of a row in the PRODUCT table, and a quantity. For example, if you have an eCommerce application and a customer adds an item to their cart, this translates into a single insert operation: add a row to the USER_CART table. If you’re looking at Redshift with the background of an online transaction processing (OLTP) database developer, it may seem very strange: where are the indexes?Īn OLTP application typically “touches” only a few rows for each transaction. Decision Support versus OLTP, or “Why Redshift”īefore I get started, let’s set some context. Instead, I look at the user experience, from the perspective of a person who’s been working with Redshift for many years. I don’t currently have access to a production-scale dataset, so my performance numbers are based on dummy data and should be taken with a grain of salt. There’s also a new section in the Cluster Management guide, published after I started writing this post.Īs I said, I think it’s a great idea, and one that it could be useful to several of Chariot’s clients, so I decided to spend some of the $500 in “new user” credits that AWS provides and kick the tires. AWS released a blog post that announced the service, and if you’re signed up for re:Invent virtual, the ANT216 session goes into a little more detail, including comparisons with (the existing) Provisioned Redshift. Unfortunately, there’s not a lot of information out there. And for a lot of use-cases, I think that’s a great idea. Here we are in 2021, and AWS has just announced Redshift Serverless, in which you pay for the compute and storage that you use, rather than a fixed monthly cost for a fixed number of nodes with a fixed amount of storage. And the financial services company that I worked for at the time thought it was a bargain, because it could run analysis queries that no contemporary Oracle or Sybase system could even attempt. By comparison, in the early 90s I worked with a similar system that had 64 nodes, a then-astronomical 512 GB of disk, and cost three million dollars. Here was a massively parallel database system that could be rented for 25 cents per node-hour.

I imagine if we would find 'DS_BCAST_INNER' in query plan frequently, we'd better to consider 'DISTSTYLE ALL'.īut still not sure what can be best practice for master tables.Īny suggestion would be appreciated again.Amazon Redshift’s launch in 2012 was one of the “wow!” moments in my experience with AWS. MST_CUSTOMER (about 7 millions, 850MB in SQL Server) CUSTOMER_ID <- Primary Key, Unique and High CardinalityĬUSTOMER_CATEGORY <- Low Cardinality (like VIP, BZ) but not always uses in BI queryįirst, if we can use both columns for SORTKEY, which is the best order in this case? 1.Should Low cardinality comes first (CUST_CATEGORY, CUST_ID)Ģ.Hifh frequency comes first (CUST_ID, CUST_CATEGORY)ģ.should chose only single column eigher of CUST_ID or CUST_CATEGORYĪnd generally, may I ask what is the best combination in such case? 1.DISTSTYLE KEY & COMPOUND SORTKEY We have several master tables with differnet size, purpose and cardinalities and now we're verify what SORTKEY and DISTKEY combination would be more appropriate.įor example, we have two main key in our customer master table, frequent use in our BI queries and join key for other tables, as follows Įx.

Let me ask about what can be approptiate distkey(dist style) and sort key for master tables in Redshift.

0 Comments

I'm James. This is my year of travel.

Amazon redshift distribution key

Leave a Reply.

Author

Archives

Categories