
Values blocks

Values blocks are a really useful little bit of code in SQL Server. Basically, they are a block of defined values that you can use pretty much like any other data set.

The main place you may have encountered them before is as the source for an INSERT. Often when people need to add a set of records to a table I see something like this:

INSERT INTO dbo.Table1
    (
          Column1
        , Column2
    )
SELECT 1, 'some value'
UNION
SELECT 2, 'some other value';

Or even worse:

INSERT INTO dbo.Table1 (Column1, Column2)
SELECT 1, 'some value';
INSERT INTO dbo.Table1 (Column1, Column2)
SELECT 2, 'some other value';

The first attempt is ok, but it’s bulky, it needs you to keep typing UNION over and over, and because UNION (as opposed to UNION ALL) removes duplicates, it makes the engine do extra work and can even silently drop duplicate rows you meant to insert. The second attempt is actively inefficient, as each row is inserted individually instead of inserting everything as a set.

The cleaner way, that uses a VALUES block, is:

INSERT INTO dbo.Table1
    (
          Column1
        , Column2
    )
VALUES
      (1, 'some value')
    , (2, 'some other value');

This saves you from typing out the UNION all the time, and in my opinion looks neater on the page and makes your block of values easier to read.

The basic rules for a values block are:

  1. The data in each row is comma separated
  2. Each row of data is wrapped in a set of brackets
  3. The rows themselves are also separated by commas
  4. Each row has the same number of values (2 in the example above)
  5. Each value position has to hold data of the same type for every row (int, varchar in the example above)
  6. NULLs are allowed for any value

This use case is useful to know about by itself, but I think the more powerful use of values blocks comes when you start using them in other queries. To do this you need to treat them as a subquery, like in this example:

SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN
    (
        VALUES
              (1, 'some value')
            , (2, 'some other value')
    ) AS val(Column1, Column2)
    ON tbl.Column1 = val.Column1;

So, all you have to do is wrap the values block in brackets, alias it, and name the columns, and you can use it like any other subquery. The only thing that’s different here is the need to name the columns when you do the aliasing, but you do that simply by listing the names in brackets after the alias.

Interestingly, this is something you can do with regular subqueries as well. It’s probably not something you will use very often as you’re more likely to rename the column in the subquery, but it never hurts to know about these things.
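
For example, using dbo.Table1 from earlier, these two queries return the same thing:

SELECT
      sq.RenamedColumn
FROM
    (
        SELECT tbl.Column1 AS RenamedColumn
        FROM dbo.Table1 AS tbl
    ) AS sq;

SELECT
      sq.RenamedColumn
FROM
    (
        SELECT tbl.Column1
        FROM dbo.Table1 AS tbl
    ) AS sq(RenamedColumn);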

Finally, if I’m using a big values block in a SQL statement, it can be a bit unwieldy to have it in a subquery. It can dominate the rest of the statement, and if it’s long enough you won’t be able to see all of the statement on the screen, and the individual values rarely add much to your understanding of the code. That’s why I will often put the values block in a common table expression at the start of the statement. That also allows you to reuse it if you need to refer to it more than once. Example below:

WITH val AS
(
    SELECT
          val.Column1
        , val.Column2
    FROM
        (
            VALUES
                  (1, 'some value')
                , (2, 'some other value')
        ) AS val(Column1, Column2)
)
SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN val
    ON tbl.Column1 = val.Column1;

Another option here might be to put the values into a table variable or temp table, but those won’t be as efficient for the query to process. The only time I’d consider that is if there were several statements that wanted to access the same data set, and the data set was small.
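
For reference, a sketch of the table variable version of the earlier example (column types assumed):

DECLARE @val TABLE
(
      Column1 int NOT NULL
    , Column2 varchar(20) NOT NULL
);

INSERT INTO @val
    (
          Column1
        , Column2
    )
VALUES
      (1, 'some value')
    , (2, 'some other value');

SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN @val AS val
    ON tbl.Column1 = val.Column1;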

So, that’s pretty much everything I know about values blocks. Hope that was useful.

Code noise

I love the term code noise. It’s one of those terms that succinctly encapsulates a quite complex topic in a couple of words, and is instantly recognisable to anyone who’s encountered the problem even if they’ve never heard the term before.

Basically code noise is anything that pulls your attention away from what the code is supposed to be doing, or obscures the true nature of the code in some way. It’s not something we consider enough when writing T-SQL code, but I think there is a lot to be said for writing code the next person will be able to read.

As a small example, I was debugging something recently and found that all of the insert statements had ORDER BY clauses. I couldn’t work out why these were making me so angry (after all, an ORDER BY on an INSERT doesn’t hurt performance, and in fact doesn’t do anything at all) until one of the other devs in the office pointed out that it’s one example of the code noise the whole code base is filled with.
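
To give a flavour of it, the statements looked something like this (a made-up sketch, not the real code):

INSERT INTO dbo.TargetTable
    (
          Column1
        , Column2
    )
SELECT
      src.Column1
    , src.Column2
FROM dbo.SourceTable AS src
ORDER BY src.Column1; -- noise: a table has no inherent row order, so this achieves nothing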

More extreme examples are the tendency some developers have to load data into temporary table after temporary table, or to write nested subqueries 5 layers deep. Both of these things largely just hide where the actual logic of the code is, and make it a nightmare to debug. Here’s an example:

WITH WorkOrderRoutingPlusLocation
AS
    (
        SELECT
              WrkOrdrRtng.WorkOrderID
            , WrkOrdrRtng.ProductID
            , WrkOrdrRtng.ActualCost
            , WrkOrdrRtng.PlannedCost
            , WrkOrdrRtng.OperationSequence
            , Loc.Name AS LocationName
        FROM Production.WorkOrderRouting AS WrkOrdrRtng
        INNER JOIN Production.[Location] AS Loc
            ON WrkOrdrRtng.LocationID = Loc.LocationID
    )
SELECT
      WrkOrdr.WorkOrderID
    , WrkOrdr.ScrappedQty
    , WrkOrdrRtng.OperationSequence
    , Prod.Name AS ProductName
    , WrkOrdrRtng.LocationName AS LocationMovedTo
    , PrevWrkOrdrRtng.LocationName AS LocationMovedFrom
    , WrkOrdrRtng.PlannedCost
    , WrkOrdrRtng.ActualCost
    , CASE
          WHEN WrkOrdrRtng.ActualCost
              > WrkOrdrRtng.PlannedCost
          THEN 'Over budget'
          WHEN WrkOrdrRtng.ActualCost
              < WrkOrdrRtng.PlannedCost
          THEN 'Under budget'
          WHEN WrkOrdrRtng.ActualCost
              = WrkOrdrRtng.PlannedCost
          THEN 'On budget'
          ELSE NULL
      END AS BudgetStatus
FROM Production.WorkOrder AS WrkOrdr
INNER JOIN WorkOrderRoutingPlusLocation AS WrkOrdrRtng
    ON WrkOrdr.WorkOrderID = WrkOrdrRtng.WorkOrderID
INNER JOIN Production.Product AS Prod
    ON WrkOrdrRtng.ProductID = Prod.ProductID
LEFT JOIN WorkOrderRoutingPlusLocation AS PrevWrkOrdrRtng
    ON WrkOrdrRtng.WorkOrderID = PrevWrkOrdrRtng.WorkOrderID
    AND WrkOrdrRtng.ProductID = PrevWrkOrdrRtng.ProductID
    AND WrkOrdrRtng.OperationSequence 
        = PrevWrkOrdrRtng.OperationSequence + 1
INNER JOIN
    (
        SELECT
              WrkOrdrRtng.WorkOrderID
            , SUM(WrkOrdrRtng.PlannedCost) AS TotalPlannedCost
            , SUM(WrkOrdrRtng.ActualCost) AS TotalActualCost
        FROM Production.WorkOrderRouting AS WrkOrdrRtng
        GROUP BY
              WrkOrdrRtng.WorkOrderID
    ) AS sq_WrkOrdrRtngTotals
    ON WrkOrdr.WorkOrderID = sq_WrkOrdrRtngTotals.WorkOrderID
WHERE 1 = 1
    AND sq_WrkOrdrRtngTotals.TotalActualCost
        > sq_WrkOrdrRtngTotals.TotalPlannedCost
ORDER BY
      WrkOrdr.WorkOrderID
    , WrkOrdrRtng.OperationSequence

I think this is fairly self-explanatory code, even without any comments. There’s not much here that isn’t necessary, just the 1 = 1 in the WHERE clause, but that’s to help with debugging. The CTE is there because we use these tables joined together more than once in the query, and one of those times is the right side of a left join. The subquery is there because we genuinely want to look at things at a different level of aggregation to the main query. Everything else is joined together very logically.

Under other circumstances I would have formatted it slightly differently, but to make it fit well on the blog post I’ve tried to make it as thin as possible. To that end I’ve done things like splitting predicates across multiple lines, which I wouldn’t ordinarily do, but I don’t think that affects the readability of the code too much.

Now, consider this alternative way of writing this query:

SELECT
      Sub3.WorkOrderID
    , WrkOrdr.ScrappedQty
    , Sub3.OperationSequence
    , Prod.Name AS ProductName
    , LNam AS LocationMovedTo
    , L.[Name] AS LocationMovedFrom
FROM
    (
        SELECT
              Sub2.WorkOrderID
            , ProductID
            , BudgetStatus
            , OperationSequence
        FROM
            (
                SELECT
                      WorkOrderID
                    , ScrappedQty
                    , SUM(PlannedCost) AS TotalPlannedCost
                    , SUM(ActualCost) AS TotalActualCost
                FROM
                    (
                        SELECT DISTINCT
                              WrkOrdr.WorkOrderID
                            , ScrappedQty
                            , PlannedCost
                            , ActualCost
                        FROM Production.WorkOrder AS WrkOrdr
                        INNER JOIN Production.WorkOrderRouting AS WrkOrdrRtng
                            ON WrkOrdr.WorkOrderID
                                = WrkOrdrRtng.WorkOrderID
                    ) AS Sub1
                GROUP BY
                      WorkOrderID
                    , ScrappedQty
            ) AS Sub2
        INNER JOIN
            (
                SELECT
                      CASE
                          WHEN ActualCost > PlannedCost
                          THEN 'Over budget'
                          WHEN ActualCost < PlannedCost
                          THEN 'Under budget'
                          WHEN ActualCost = PlannedCost
                          THEN 'On budget'
                          ELSE NULL
                      END AS BudgetStatus
                    , WorkOrderID
                    , ProductID
                    , OperationSequence
                FROM Production.WorkOrderRouting AS WrkOrdrRtng
            ) AS Sub21
            ON Sub2.WorkOrderID = Sub21.WorkOrderID
        WHERE 1 = 1
            AND TotalPlannedCost < TotalActualCost
    ) AS Sub3
INNER JOIN Production.Product AS Prod
    ON Sub3.ProductID = Prod.ProductID
INNER JOIN Production.WorkOrder AS WrkOrdr
    ON Sub3.WorkOrderID = WrkOrdr.WorkOrderID
INNER JOIN Production.WorkOrderRouting AS WrkOrdrRtng
    ON Sub3.WorkOrderID = WrkOrdrRtng.WorkOrderID
    AND Sub3.ProductID = WrkOrdrRtng.ProductID
    AND Sub3.OperationSequence = WrkOrdrRtng.OperationSequence
INNER JOIN
    (
        SELECT
              L.LocationID
            , L.[Name] AS LNam
        FROM Production.[Location] AS L
    ) AS Sub4
    ON WrkOrdrRtng.LocationID = Sub4.LocationID
LEFT JOIN
    (
        SELECT
              WorkOrderID
            , ProductID
            , OperationSequence
            , LocationID
        FROM Production.WorkOrderRouting
    ) SubWOR
    ON WrkOrdrRtng.WorkOrderID = SubWOR.WorkOrderID
    AND WrkOrdrRtng.ProductID = SubWOR.ProductID
    AND WrkOrdrRtng.OperationSequence
        = SubWOR.OperationSequence + 1
LEFT JOIN Production.[Location] AS L
    ON SubWOR.LocationID = L.LocationID

Now, I think these two statements do the same thing, although to be honest I got a little lost writing this second one. It should be obvious, however, that the second statement is not nearly as clear, and that there is a lot of extra code around it making a lot of noise.

Obviously this is a made-up example, but it is similar to a lot of real-world examples I’ve seen. In particular, the overuse of subqueries (and subqueries inside subqueries inside subqueries) to filter or join data. The danger here, apart from it looking ugly, is that another developer comes along, can’t read the intention behind the original code because of all the noise, and just hacks something else onto the existing mess. You can see this has happened with the INNER JOIN to Production.WorkOrderRouting towards the bottom of the query, where someone has added the WorkOrderRouting table because they need to join from it to the Location table. The WorkOrderRouting table is already part of this query, in Sub21 inside of Sub3, but the new developer hasn’t been able to figure this out, or maybe they’re not sure how to bubble the LocationID up through all of the subqueries (especially as WorkOrderRouting also exists in Sub1 inside Sub2 inside Sub3, but can’t be bubbled up from there because there’s some aggregation along the way). Instead they’ve just hacked a new join to the table onto the existing mess, and everything has gotten that much harder to understand.

Another thing that’s obviously bad about this statement is the inconsistent naming. Sub1 is always a terrible name; you need to alias anything, but especially a subquery, with something meaningful. I like to prefix any subquery alias with sq_ so when you reference it elsewhere in the query you know you’re referencing a subquery. You also need to make all column names two-part. Where does LNam in the outer SELECT come from? Or OperationSequence in Sub3? Without two-part names for all columns, this can be a nightmare to figure out.

I want to end with an example of really bad code noise I found yesterday in my actual work. Table and column names are changed, but the rest of the code is as is:

DELETE FROM TriggerTable_A
WHERE A_GUID IN (SELECT tmp_Sync7.A_GUID
                        FROM tmp_Sync7
                        INNER JOIN TriggerTable_A a on tmp_Sync7.A_GUID = a.A_GUID
                        WHERE tmp_Sync7.UpdatedDate = a.UpdatedDate)

This looks a bit odd, but the main question you have looking at it is: what’s that tmp_Sync7 table, right? Well, almost 300 lines of code earlier we have this little snippet (and I checked a few times, and there is nothing in those 300 lines that does anything to tmp_Sync7):

IF EXISTS(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'tmp_Sync7') 
    DROP TABLE tmp_Sync7

SELECT * INTO tmp_Sync7 from TriggerTable_A

My brain pretty much exploded when I saw this. What the code is saying is: delete from TriggerTable_A where A_GUID is in a straight copy of TriggerTable_A joined back to TriggerTable_A. In other words, DELETE FROM TriggerTable_A. Rewriting it as that one simple statement also means you don’t have that nasty permanent tmp_Sync7 table taking up space in your database, and you improve performance because you’re doing a lot less work.

Bottom line is code noise makes it harder for other developers to read your code. It makes it harder for you to read your own code when you come back to it in 6 months to fix some crazy bug that’s only just started showing up. It can also make it harder for the query optimiser to process your code, meaning it takes longer to come up with an execution plan and has more chance of hitting the time-out limit and returning a sub-optimal plan. Both of those things hurt performance, but ultimately your drive to eliminate code noise shouldn’t just be driven by that. You should want to keep your code clean, simple, and elegant, so that when your fellow developers come to build on what you’ve done in 6 months or a year, they can easily understand what your code is doing and make clean, simple, elegant changes to it themselves.

T-SQL Tuesday #125: Unit testing databases – we need to do this!!

This is my second time taking part in T-SQL Tuesday. This week the topic is all about unit testing in databases, and whether it is valuable or not.

This is a topic that is quite close to my heart. I didn’t come from a computing background before I started working with SQL Server, so I was quite ignorant of a lot of the best practices that developers who have worked with other languages are aware of. Because of this, I had no idea about unit testing until I attended a talk at a SQL Saturday all about tSQLt. If anyone isn’t aware (as I wasn’t), tSQLt is a free open source unit testing framework for use in SQL Server databases. It is the basis of Redgate’s SQL Test software, and is the most used framework for writing unit tests in SQL Server.

Since then I’ve worked to try and get employers to adopt this as part of a standard development life cycle, with mixed success at best. My current employer is quite keen, but there are two major problems. First, we have a huge amount of legacy code that obviously has no unit tests in place; and second, the way people code is not conducive to unit testing.

It’s the second issue I want to talk about today; maybe I will cover writing unit tests for legacy systems in another blog someday. For now I want to discuss how the way you code may need to change if you adopt unit testing, and in particular how you need to start adopting a more modular style of coding.

What is a unit test?

Because there can be a lot of confusion about this, I thought it was best to start by defining unit tests. I see good unit tests as having a few main characteristics:

  1. They test against units of code (stored procedures, functions, views etc.)
  2. Each unit test only tests one thing.
  3. A unit test isolates all the dependencies before testing.
  4. Unit tests can be rerun in an automated way.
  5. Unit tests always pass or always fail for a given set of inputs.

In the original T-SQL Tuesday prompt post, these are listed as:

  1. Decisive – the unit test has all info to determine success/failure
  2. Valid – it produces a result that matches the intention of the code written
  3. Complete – it contains all information it needs to run correctly within the test harness
  4. Repeatable – always gives the same results if the test harness and code are same
  5. Isolated – is not affected by other tests run before nor affects the tests run after it
  6. Automated – requires only a start signal in order to run to completion

How this affects the code I write

Whichever set of criteria you want to use, the results for your style of coding start to look the same. Put simply, unit tests are easier to write when a single unit of code does one thing and one thing only. That thing can be extremely complicated (in which case you will probably have quite a few unit tests around it) or very simple (in which case it may not even need testing at all) but your unit of code should not be attempting to do more than one thing at once.

What do I mean by this? Well, if we look at the Wide World Importers Microsoft sample database, we see a stored procedure called Website.RecordColdRoomTemperatures:

CREATE PROCEDURE Website.RecordColdRoomTemperatures
      @SensorReadings Website.SensorDataList READONLY
AS
BEGIN
    BEGIN TRY

		DECLARE @NumberOfReadings int = (SELECT MAX(SensorDataListID) FROM @SensorReadings);
		DECLARE @Counter int = (SELECT MIN(SensorDataListID) FROM @SensorReadings);

		DECLARE @ColdRoomSensorNumber int;
		DECLARE @RecordedWhen datetime2(7);
		DECLARE @Temperature decimal(18,2);

		-- note that we cannot use a merge here because multiple readings might exist for each sensor

		WHILE @Counter <= @NumberOfReadings
		BEGIN
			SELECT @ColdRoomSensorNumber = ColdRoomSensorNumber,
			       @RecordedWhen = RecordedWhen,
				   @Temperature = Temperature
			FROM @SensorReadings
			WHERE SensorDataListID = @Counter;

			UPDATE Warehouse.ColdRoomTemperatures
				SET RecordedWhen = @RecordedWhen,
				    Temperature = @Temperature
			WHERE ColdRoomSensorNumber = @ColdRoomSensorNumber;

			IF @@ROWCOUNT = 0
			BEGIN
				INSERT Warehouse.ColdRoomTemperatures
					(ColdRoomSensorNumber, RecordedWhen, Temperature)
				VALUES (@ColdRoomSensorNumber, @RecordedWhen, @Temperature);
			END;

			SET @Counter += 1;
		END;

    END TRY
    BEGIN CATCH
        THROW 51000, N'Unable to apply the sensor data', 2;

        RETURN 1;
    END CATCH;
END;

We can see the procedure is doing a couple of things here. First it takes the input parameter @SensorReadings as a custom table data type and iterates through it. As it does this, it inserts or updates the Warehouse.ColdRoomTemperatures table with the values from the current row of the table variable. This is not the most awkward thing to test, but it could be made simpler if the UPDATE and conditional INSERT against Warehouse.ColdRoomTemperatures were put into their own stored procedure. Then the outer procedure would look more like this:

CREATE PROCEDURE Website.RecordColdRoomTemperatures
      @SensorReadings Website.SensorDataList READONLY
AS
BEGIN
    BEGIN TRY

		DECLARE @NumberOfReadings int = (SELECT MAX(SensorDataListID) FROM @SensorReadings);
		DECLARE @Counter int = (SELECT MIN(SensorDataListID) FROM @SensorReadings);

		DECLARE @ColdRoomSensorNumber int;
		DECLARE @RecordedWhen datetime2(7);
		DECLARE @Temperature decimal(18,2);

		WHILE @Counter <= @NumberOfReadings
		BEGIN
			SELECT @ColdRoomSensorNumber = ColdRoomSensorNumber,
			       @RecordedWhen = RecordedWhen,
				   @Temperature = Temperature
			FROM @SensorReadings
			WHERE SensorDataListID = @Counter;

			EXEC Warehouse.UpdateColdRoomTemperatureBySensorNumber
                  @ColdRoomSensorNumber = @ColdRoomSensorNumber
                , @RecordedWhen = @RecordedWhen
                , @Temperature = @Temperature;

			SET @Counter += 1;
		END;

    END TRY
    BEGIN CATCH
        THROW 51000, N'Unable to apply the sensor data', 2;

        RETURN 1;
    END CATCH;
END;
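
The inner procedure doesn’t exist in the sample database, but a sketch of it, with the name and parameters taken from the call above, might look like this:

-- hypothetical procedure, not part of Wide World Importers
CREATE PROCEDURE Warehouse.UpdateColdRoomTemperatureBySensorNumber
      @ColdRoomSensorNumber int
    , @RecordedWhen datetime2(7)
    , @Temperature decimal(18,2)
AS
BEGIN
    -- the UPDATE/INSERT pair lifted straight out of the original procedure
    UPDATE Warehouse.ColdRoomTemperatures
        SET RecordedWhen = @RecordedWhen,
            Temperature = @Temperature
    WHERE ColdRoomSensorNumber = @ColdRoomSensorNumber;

    IF @@ROWCOUNT = 0
    BEGIN
        INSERT Warehouse.ColdRoomTemperatures
            (ColdRoomSensorNumber, RecordedWhen, Temperature)
        VALUES (@ColdRoomSensorNumber, @RecordedWhen, @Temperature);
    END;
END;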

This way, we can write unit tests against the outer procedure to see if it is looping effectively, and separate unit tests against the inner procedure to test if it is updating the table correctly. A nice side-effect is if we want to write any other code that wants to merge a single row into the Warehouse.ColdRoomTemperatures table we can use the Warehouse.UpdateColdRoomTemperatureBySensorNumber stored procedure.
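
As a sketch of how one of the inner tests might look in tSQLt, assuming the inner procedure above has been created (tSQLt.FakeTable swaps the real table for an empty, constraint-free copy for the duration of the test; note that in some editions of the sample database this table is memory-optimised, which FakeTable can’t always handle):

EXEC tSQLt.NewTestClass 'ColdRoomTests';
GO

CREATE PROCEDURE ColdRoomTests.[test update proc inserts a row when the sensor is new]
AS
BEGIN
    -- isolate the dependency on the real table
    EXEC tSQLt.FakeTable 'Warehouse.ColdRoomTemperatures';

    EXEC Warehouse.UpdateColdRoomTemperatureBySensorNumber
          @ColdRoomSensorNumber = 1
        , @RecordedWhen = '2020-01-01 10:00'
        , @Temperature = 3.50;

    CREATE TABLE #Expected
    (
          ColdRoomSensorNumber int
        , RecordedWhen datetime2(7)
        , Temperature decimal(18,2)
    );
    INSERT INTO #Expected VALUES (1, '2020-01-01 10:00', 3.50);

    SELECT ColdRoomSensorNumber, RecordedWhen, Temperature
    INTO #Actual
    FROM Warehouse.ColdRoomTemperatures;

    EXEC tSQLt.AssertEqualsTable '#Expected', '#Actual';
END;
GO

EXEC tSQLt.Run 'ColdRoomTests';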

This, really, is what modular coding is all about. Making sure that each module in your code base is only trying to do one thing. You can then wrap unit tests around that module to make sure it does that thing well, and re-use it over and over again throughout your code whenever you need to do that one thing.

To keep with the example above, I don’t like the way the code does the UPDATE and then the INSERT if @@ROWCOUNT = 0. Despite the comment in the code, you can use a MERGE provided you only merge in the one row being added in the WHILE block. The end result of this change should function the same as the original code but look more elegant. If I have the code related to the Warehouse.ColdRoomTemperatures in its own stored procedure, with some unit tests around it, I can change that stored procedure however I like provided the unit tests still pass, confident that any calling stored procedures will still function the same.
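
A sketch of that version of the inner procedure’s body, using a one-row values block as the MERGE source:

MERGE Warehouse.ColdRoomTemperatures AS tgt
USING
    (
        VALUES (@ColdRoomSensorNumber, @RecordedWhen, @Temperature)
    ) AS src (ColdRoomSensorNumber, RecordedWhen, Temperature)
    ON tgt.ColdRoomSensorNumber = src.ColdRoomSensorNumber
WHEN MATCHED THEN
    UPDATE SET
          RecordedWhen = src.RecordedWhen
        , Temperature = src.Temperature
WHEN NOT MATCHED THEN
    INSERT (ColdRoomSensorNumber, RecordedWhen, Temperature)
    VALUES (src.ColdRoomSensorNumber, src.RecordedWhen, src.Temperature);
-- merges exactly one row per call, so the multiple-readings-per-sensor
-- problem from the original comment doesn't arise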

Orchestration procedures and code contracts

A key part of the modular approach to code is the orchestration procedure. These are the procedures that sit on top of everything else: they call other stored procedures, manage control flow, pass variables around, and so on. They do not do much themselves, but they decide what will be done. The example procedure above is a simple orchestration procedure, but they can get significantly more complex. They might function as the APIs of your database, called by external applications to do something or return some values.

They are quite simple to unit test, really. All you do is map the possible paths the orchestration process could take, depending on the values passed in and the values retrieved during the process. Then you write a test for each of these paths, e.g. if parameter A is set to 32 and function X returns less than 10,000, we should execute stored procedures 3, 5, and 17 with specific parameters. Whether function X, or any of the stored procedures, performs as expected is not something to worry about when testing an orchestration procedure; that is taken care of by the unit tests around those code units.
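
Here’s a sketch of what that looks like in tSQLt, reusing the ColdRoomTests class from earlier. tSQLt.SpyProcedure replaces the spied procedure with a stub that logs each call to an auto-generated ..._SpyProcedureLog table (I’m assuming here that the columns of Website.SensorDataList can be inserted into directly):

CREATE PROCEDURE ColdRoomTests.[test RecordColdRoomTemperatures calls the update proc once per reading]
AS
BEGIN
    -- replace the inner procedure with a spy that does nothing but log its calls
    EXEC tSQLt.SpyProcedure 'Warehouse.UpdateColdRoomTemperatureBySensorNumber';

    DECLARE @SensorReadings Website.SensorDataList;
    INSERT INTO @SensorReadings -- column list assumed from the type's usage above
        (SensorDataListID, ColdRoomSensorNumber, RecordedWhen, Temperature)
    VALUES
          (1, 1, '2020-01-01 10:00', 3.50)
        , (2, 2, '2020-01-01 10:00', 4.00);

    EXEC Website.RecordColdRoomTemperatures @SensorReadings = @SensorReadings;

    -- the spy logged one row per call, so two readings should mean two calls
    DECLARE @Calls int =
        (SELECT COUNT(*) FROM Warehouse.UpdateColdRoomTemperatureBySensorNumber_SpyProcedureLog);
    EXEC tSQLt.AssertEquals @Expected = 2, @Actual = @Calls;
END;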

This idea can also be thought of as a contract between a particular unit of code and the rest of the database. Defined by a specification (hopefully well documented somewhere, but we can’t expect miracles) and enforced by the unit tests, this contract says that the unit of code will behave in a particular way. You can change the unit however you like provided you don’t break the contract, and if at some point you find you have to, you will most likely be faced with the long, tedious job of trawling through every other bit of code that calls the unit to make sure the new contract still fits what the callers expect.

Summary

If you adopt unit testing, you may need to change your coding style to a more modular one in order to get the best from it. This does come with other benefits, however, like clearer separation of responsibilities in a database, easier code reuse, and better defined functional units of code. I think even without unit tests, more modular coding is the way to go when possible (and sometimes performance issues will make it impossible) for all of the reasons just mentioned. I also think that writing unit tests changes your perspective on the code you write: it helps you think about things like error handling, code contracts, and code reuse, and it will make you a better coder.

T-SQL Tuesday #123: Life hacks to make your day easier – Custom shortcuts in SSMS/ADS

T-SQL Tuesday logo

This is my first time taking part in T-SQL Tuesday. It’s something I’ve known about for a while but never joined in with before. One of my goals this year is to post at least once a week on this blog for 3 months, though, so I figure I should take every prompt I can. Plus this week’s topic is something I feel I can contribute to.

This week’s topic is about any hacks you have to make your day to day life easier, and I wanted to share a little trick someone showed me about 10 years ago, that I’ve found incredibly useful ever since.

The basic idea is that you can add your own query shortcuts to SSMS, and when you run them they append any highlighted text to the shortcut code. To explain, I’ll use the example I’ve found most useful:

One of the first things I do whenever I start a new job or install a new version of SSMS is set up Ctrl+5 as the shortcut for "SELECT TOP 100 * FROM ".

Once I’ve done that I can highlight any table or view in my code, and use Ctrl+5 to see 100 rows from that object, because it appends the highlighted text to the end of my shortcut text. But you can do more with it. Take this bit of code:

SELECT
      A.Col1
    , B.Col2
    , C.Col3
FROM TableA AS A
INNER JOIN TableB AS B
    ON A.ID = B.ID
LEFT JOIN
    (
        SELECT
              X.Col5
            , Y.Col3
            , X.ID
        FROM TableX AS X
        LEFT JOIN TableY AS Y
            ON X.ID = Y.ID
            AND X.Col3 = 'Some value'
    ) AS C
    ON A.ID = C.ID

I can highlight TableA, TableB, TableX, or TableY, and in each case Ctrl+5 will show me all of the columns in the table and a sample of the data. Or I can highlight TableA AS A INNER JOIN TableB AS B ON A.ID = B.ID and get a sample of all the columns available from that join. Or I can do something similar inside the subquery and see all the columns I have available to me in that subquery.

The other main shortcut I like to set is Ctrl+6 as "SELECT SUM(1) FROM ", to tell me how many rows are in a table.

If you want to set these up it’s very easy:

In SSMS go to Tools>Options>Keyboard>Query Shortcuts.

In Azure Data Studio it’s a little more hidden. You might think you’d be able to do this in File>Preferences>Keyboard Shortcuts, but that only gives you a list of pre-canned commands that you can assign to different shortcuts as you see fit. To write your own you need to go to Cog icon (bottom left)>Settings>Data>Query to create your own shortcuts. You need to be a bit careful because there may already be other commands assigned to your key combinations, so you will need to go into File>Preferences>Keyboard Shortcuts to move them to other key combinations, otherwise Data Studio will try to do both things when you use your shortcut.

Anyway, that’s it, hopefully someone will find this useful.

Useful link for .NET training

I’m interested in learning some .NET coding. I’ve always been a pure SQL developer, but more and more I feel my lack of .NET knowledge is holding me back. I’ve found a surprising lack of information online about how to get started with this, but I did find one set of tutorials from Microsoft that talks through the basics of making various different applications.

I’m not sure how much use they would be to a complete beginner, but as someone who’s reasonably proficient with C# coding, and aware of what a CSS file contributes to a website, but not really sure how the different pieces fit together, I found them really useful.

https://docs.microsoft.com/en-us/visualstudio/get-started/csharp/tutorial-console?view=vs-2019

Interesting NULL issue

I had a question today from someone in my team wondering why their code wasn’t returning the correct results. They’d written:

SELECT
      *
FROM TableA;

SELECT
      *
FROM TableA
WHERE 1 = 1
    AND ID1 IN (SELECT ID1 FROM TableB);

SELECT
      *
FROM TableA
WHERE 1 = 1
    AND ID1 NOT IN (SELECT ID1 FROM TableB);

The first query returned about 600 rows, the second returned 300ish, the third returned nothing.

I spotted some NULLs in TableA.ID1, and added an OR ID1 IS NULL to the WHERE clause of the third query. This produced some results but not enough.

We looked at it for a little while longer, and eventually found some NULLs in TableB.ID1. These were causing the problem, because a NULL represents an unknown value. When the query tries to evaluate whether any particular value of TableA.ID1 is not in the list of TableB.ID1s from the subquery, it can’t be sure, because one of the items in the list has an unknown value.
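
Here’s a minimal repro of the problem, using values blocks instead of the real tables:

SELECT
      A.ID1
FROM (VALUES (1), (2), (NULL)) AS A(ID1)
WHERE A.ID1 NOT IN (SELECT B.ID1 FROM (VALUES (1), (NULL)) AS B(ID1));

-- Returns no rows: for ID1 = 2 the check is effectively 2 <> 1 AND 2 <> NULL,
-- and 2 <> NULL evaluates to UNKNOWN, so the row can never qualify.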

In summary, if you are doing a NOT IN, make sure you don’t have a NULL in your list anywhere, or the predicate can never evaluate to true and (if the predicate is in the WHERE clause) your query won’t return any rows.

What’s the point of that object

I can be a bit of a purist when it comes to design principles, but I think this is an important one.

Every object, schema, database etc. that you create should have a purpose, and that purpose should contain a What and a Why.

So, in my data warehouse designs I have a Staging database. It’s not enough to say “the staging database exists to hold staged data”; we have to say “the staging database exists to stage (extract, validate, apply transformation logic to) data from source systems so that we can present that staged data for the data warehouse to consume”. By keeping this purpose in mind whenever we make any changes to the database, we make sure it is not filled with objects that are irrelevant to that purpose, and by keeping the purpose simple we avoid having the database attempt to do too much.

We can apply this to schemas, and to tables, and stored procs and functions and pretty much everything else.

This provides additional benefits around documenting, unit testing, and other best practices you should be following with databases.

LIKE and REPLICATE

The LIKE operator can be used for a lot of good things. Recently, however, I’ve discovered the REPLICATE function which adds another dimension to what you can do. Simply put, the REPLICATE function takes a string and an integer input, and outputs the string repeated a number of times determined by the integer. The key thing here is that the string can be a wildcard from the pattern searches you can do with a LIKE or PATINDEX in SQL Server.

For instance REPLICATE('[0-9]', 3) will output '[0-9][0-9][0-9]', and this can be used as one side of a LIKE operator. Or it can be concatenated with other text.

As an example, the following code will check whether the column Col1 holds a number with exactly two decimal places:

SELECT
      *
FROM Table1
WHERE 1 = 1
    AND Col1 LIKE '[-0-9]' + REPLICATE('[0-9]', LEN(Col1) - 4) + '.[0-9][0-9]'

The '[-0-9]' at the start checks that the first character is either numeric or a - sign. The REPLICATE part checks that the next X characters are numeric (where X is the length of the string minus 4). The final '.[0-9][0-9]' checks that the string ends with a decimal point followed by two digits.
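
For example, running the pattern against a few made-up values:

SELECT t.Col1
FROM (VALUES ('123.45'), ('-9.99'), ('12.3'), ('abc.12')) AS t(Col1)
WHERE t.Col1 LIKE '[-0-9]' + REPLICATE('[0-9]', LEN(t.Col1) - 4) + '.[0-9][0-9]';

-- Returns 123.45 and -9.99; 12.3 fails because there's only one digit after the
-- decimal point, and abc.12 fails the numeric character checks.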

NULLs

I just saw a question on SQLServerCentral about NULL behaviour and thought I could write something about the common gotchas when using NULL in T-SQL, and why NULLs behave the way they do.

To go into this, we need to understand the logic behind what a NULL represents. A NULL is used, or at least should be used, to denote the absence of information. It is not a 0 or a blank string or any other value. For instance, if you are looking at pizza orders and someone’s order comes to £0.00 for whatever reason (saved a lot of coupons, sleeping with the manager, etc.) then the OrderValue column should be set to 0.00. If, however, the order value is not known (maybe the system failed that day and orders are being input by hand) the OrderValue column should be set to NULL.
When you come to query this table, you would treat the 0 and the NULL differently: if you want a total of sales values for the month you would include the 0 in your total, but the NULL strictly makes that total unknowable. SQL Server acknowledges this, which is why a SUM over a column containing a NULL ignores the NULL values rather than treating them as 0, and warns you that ‘Null value is eliminated by an aggregate or other SET operation’.

That’s important behaviour, but it becomes even more important when looking at WHERE conditions (and HAVING conditions, and CASE expressions, and anything else that evaluates an expression). Usually an expression evaluates to TRUE or FALSE, so 1 = 1 evaluates to TRUE and 1 = 2 evaluates to FALSE. If one side of the expression is NULL then the expression evaluates to UNKNOWN, because in the case of 1 = NULL you are asking whether an unknown value equals 1, and obviously you don’t know the answer. This holds even if both sides of the expression are NULL, because in that case you are asking whether two unknown values equal each other.

This gets really important when you consider that WHERE clauses are only looking for rows where the expressions evaluate to TRUE.

Going back to the pizza orders, if you filter for orders where OrderValue = 9.99, any orders with an OrderValue of NULL will be excluded (the expression evaluates to UNKNOWN). Where people often fall down is that if you filter for orders where OrderValue <> 9.99, orders with an OrderValue of NULL will still be excluded. To include them you would need to filter for orders where OrderValue <> 9.99 OR OrderValue IS NULL.
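
To see this in action with some made-up order values:

SELECT o.OrderValue
FROM (VALUES (9.99), (5.00), (NULL)) AS o(OrderValue)
WHERE o.OrderValue <> 9.99;
-- returns only 5.00; the NULL row is excluded because NULL <> 9.99 is UNKNOWN

SELECT o.OrderValue
FROM (VALUES (9.99), (5.00), (NULL)) AS o(OrderValue)
WHERE o.OrderValue <> 9.99 OR o.OrderValue IS NULL;
-- returns 5.00 and NULL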


This may seem like a long-winded explanation for something that’s pretty simple, but what I’ve tried to do in this post is lay out the fundamentals of why NULLs work as they do. Hopefully you can then work out for yourself how NULLs will work in a particular situation.

Stuff I’ve learned from my degree part 1

I want to write about some of the things I’ve learned as part of my Open University degree so far.
The degree is in Computing & IT and Statistics, and I started it for a couple of reasons.

Firstly, I had reached a point in my career where all of my peers seemed to have a much better understanding of computing in general than I did and I needed a way to catch up, and secondly I noticed that very few people in the BI industry have any statistical knowledge and I thought that would be a good way to stand out. In both cases, I think it’s been beneficial but I wanted to write about some of the computing fundamentals and how they helped me.

So, before the course, I was aware that a bit was a unit of memory, and that byte, megabyte, gigabyte etc. were also units of memory, and that computers thought in 1s and 0s, but I never understood what that meant or the implications for data storage and data types until I did this course.

What I learned was that a bit is a single 1 or 0, and a byte is a collection of 8 of these bits. Why 8? Because that gives you 256 distinct combinations, which is enough to represent the most commonly used letters, numbers and symbols in English (so, in ASCII, 01000001 = A, 01000010 = B, 01000011 = C and so on).

Suddenly at this point a lightbulb went off in my head. A single character in a Char or Varchar field is represented by one of these bytes. From there, the differences between Char and Varchar became clear in terms of data storage.

A Char field is padded out to its declared length with a specific byte (the space character), which is why a Char field always takes up the same amount of space on each row. The computer reads the long string of 1s and 0s that represents the whole field, slicing it at every 8th bit and translating the slices into characters, and because the column always occupies its declared number of bytes, it knows that the bits after that point represent the next column.

A Varchar field, on the other hand, has no fixed length. Instead, SQL Server stores the actual length of the value alongside the data (as a 2-byte length prefix), and the computer knows that only that many bytes belong to the column.

From this, it follows that a Char field that is too long compared to the data it holds will take up excess space in the database because of all the padding. It also follows that a Varchar field takes up slightly more space than the data itself (those 2 bytes of length overhead) in exchange for not having to pad.

Finally, the NChar and NVarchar fields also made sense. These are used to represent characters from many languages, and so they need far more possible combinations of 1s and 0s than can be handled in a single byte. So we represent each character with 2 bytes instead. This gives 256 × 256 = 65536 different combinations, and so we can map these to 65536 characters. Apart from that, they work exactly the same as the Char and Varchar fields, they just take up double the space.
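
You can see some of this directly in SQL Server with the DATALENGTH function, which returns the number of bytes a value takes up. A quick sketch:

DECLARE
      @c char(10) = 'abc'
    , @v varchar(10) = 'abc'
    , @n nchar(10) = N'abc';

SELECT
      DATALENGTH(@c) AS CharBytes    -- 10: padded out to the declared length
    , DATALENGTH(@v) AS VarcharBytes -- 3: just the data; the length is stored separately
    , DATALENGTH(@n) AS NCharBytes;  -- 20: two bytes per character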

None of this was exactly new knowledge: I knew that NChar took up double the space of Char, and that you shouldn’t use Char when the values are likely to vary in length because it wastes storage, but understanding why made a huge difference to how I think about assigning data types.