A.I. Tools

How to Compare Two Tables For Equality in BigQuery | by Giorgos Myrianthous | Jan, 2023

Photo by Zakaria Ahada on Unsplash

Comparing tables in BigQuery is a crucial task when testing the results of data pipelines and queries prior to productionizing them. The ability to compare tables allows for the detection of any changes or discrepancies in the data, ensuring that the data remains accurate and consistent.

In this article we will demonstrate how to compare two (or more) tables on BigQuery and extract the records that differ (if any). More specifically, we will showcase how to compare tables with identical columns as well as tables with a different amount of columns.

First, let’s start by creating two tables with some dummy values that we will then be referencing throughout this tutorial in order to demonstrate a few different concepts.

— Create the first tableCREATE TABLE `temp.tableA` (`first_name` STRING,`last_name` STRING,`is_active` BOOL,`no_of_purchases` INT)INSERT `temp.tableA` (first_name, last_name, is_active, no_of_purchases)VALUES (‘Bob’, ‘Anderson’, True, 12),(‘Maria’, ‘Brown’, False, 0),(‘Andrew’, ‘White’, True, 4)

— Create the second tableCREATE TABLE `temp.tableB` (`first_name` STRING,`last_name` STRING,`is_active` BOOL,`no_of_purchases` INT)INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)VALUES (‘Bob’, ‘Anderson’, True, 12),(‘Maria’, ‘Brown’, False, 0),(‘Andrew’, ‘White’, True, 6),(‘John’, ‘Down’, False, 0)

Comparing records of tables with the same columns

Now that we have created our two example tables, you should have noticed that there are a couple of differences between them.

SELECT * FROM `temp.tableA`;

+————+———–+———–+—————–+| first_name | last_name | is_active | no_of_purchases |+————+———–+———–+—————–+| Bob | Anderson | true | 12 || Andrew | White | true | 4 || Maria | Brown | false | 0 |+————+———–+———–+—————–+

SELECT * FROM `temp.tableB`;

+————+———–+———–+—————–+| first_name | last_name | is_active | no_of_purchases |+————+———–+———–+—————–+| Bob | Anderson | true | 12 || Andrew | White | true | 6 || Maria | Brown | false | 0 || John | Down | false | 0 |+————+———–+———–+—————–+

Now assuming that table temp.tableB is the latest version of some dataset whereas temp.tableA is an older one and we would like to see the actual differences (in terms of records) between the two tables, all we need is the following query:

WITHtable_a AS (SELECT * FROM `temp.tableA`),table_b AS (SELECT * FROM `temp.tableB`),rows_mismatched AS (SELECT’tableA’ AS table_name,*FROM (SELECT*FROMtable_a EXCEPT DISTINCTSELECT*FROMtable_b )

UNION ALL

SELECT’tableB’ AS table_name,*FROM (SELECT*FROMtable_b EXCEPT DISTINCTSELECT*FROMtable_a ))

SELECT * FROM rows_mismatched

Now the result will contain all the differences observed between the tables along with a reference to the table name where the records were found.

In our specific examples, tables A and B were having a difference in two records; The first one seems to be the record for Andrew White since this person has a different value for no_of_purchases field. Additionally, table tableB has one additional record that is not even present on table tableA.

+————+————+———–+———–+—————–+| table_name | first_name | last_name | is_active | no_of_purchases |+————+————+———–+———–+—————–+| tableB | John | Down | false | 0 || tableB | Andrew | White | true | 6 || tableA | Andrew | White | true | 4 |+————+————+———–+———–+—————–+

Note: If you are not familiar with the WITH clause and Common Table Expressions (CTEs) in SQL, make sure to read the following article:

Comparing records of tables with different columns

Now let’s suppose you would like to compare the records between two tables having a different amount of columns. Obviously, we would have to do an apples-to-apples comparison meaning that we somehow need to extract only the common fields from both tables in order to be able to perform a meaningful comparison.

Let’s re-create our tables in order to generate some mis-matching columns so that we can then demonstrate how to deal with these cases:

— Create the first tableCREATE TABLE `temp.tableA` (`first_name` STRING,`last_name` STRING,`is_active` BOOL,`dob` STRING)INSERT `temp.tableA` (first_name, last_name, is_active, dob)VALUES (‘Bob’, ‘Anderson’, True, ’12/02/1993′),(‘Maria’, ‘Brown’, False, ’10/05/2000′),(‘Andrew’, ‘White’, True, ’14/12/1997′)

— Create the second tableCREATE TABLE `temp.tableB` (`first_name` STRING,`last_name` STRING,`is_active` BOOL,`no_of_purchases` INT)INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)VALUES (‘Bob’, ‘Anderson’, True, 12),(‘Maria’, ‘Brown’, True, 0),(‘Andrew’, ‘White’, True, 6),(‘John’, ‘Down’, False, 0)

Now our new tables have only three columns in common, namely first_name, last_name and is_active.

SELECT * FROM `temp.tableA`;

+————+———–+———–+————–+| first_name | last_name | is_active | dob |+————+———–+———–+————–+| Bob | Anderson | true | ’12/02/1993′ || Andrew | White | true | ’10/05/2000′ || Maria | Brown | false | ’14/12/1997′ |+————+———–+———–+————–+

SELECT * FROM `temp.tableB`;

+————+———–+———–+—————–+| first_name | last_name | is_active | no_of_purchases |+————+———–+———–+—————–+| Bob | Anderson | true | 12 || Andrew | White | true | 6 || Maria | Brown | false | 0 || John | Down | false | 0 |+————+———–+———–+—————–+

Now if we attempt to run the query we executed in the previous section where the two tables were having the same columns, we will end up with this error:

Column 4 in EXCEPT DISTINCT has incompatible types: STRING, INT64 at [13:7]

This is absolutely normal given that our tables no longer have matching columns. We need to slightly amend our initial query such that the very first CTEs will only select the mutual columns for every table. Our query will look as below:

WITHtable_a AS (SELECT first_name,last_name,is_activeFROM `temp.tableA`),table_b AS (SELECT first_name,last_name,is_active FROM `temp.tableB`),rows_mismatched AS (SELECT’tableA’ AS table_name,*FROM (SELECT*FROMtable_a EXCEPT DISTINCTSELECT*FROMtable_b )

UNION ALL

SELECT’tableB’ AS table_name,*FROM (SELECT*FROMtable_b EXCEPT DISTINCTSELECT*FROMtable_a ))

SELECT * FROM rows_mismatched

The tables created in this section were having the following mismatches (when considering only their mutual columns):

The record for Maria Brown has differences in column is_activeTable tableB has one additional record (John Down) which is not present in tableA

These differences can be observed in query results shared below:

+————+————+———–+———–+| table_name | first_name | last_name | is_active |+————+————+———–+———–+| tableB | Maria | Brown | false || tableB | John | Down | false | | tableA | Maria | Brown | true | +————+————+———–+———–+

Final Thoughts

In this article, we provided a comprehensive guide on how to compare tables in BigQuery. We highlighted the importance of this task in ensuring the accuracy and consistency of data and demonstrated multiple techniques for comparing tables with identical columns as well as tables with different amounts of columns. We also walked through the process of extracting records that differ between tables (if any).

Overall, this article aimed to equip readers with the necessary tools and knowledge to effectively and efficiently compare tables in BigQuery. I hope you found it useful!

Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.

Related articles you may also like


Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button
Translate »