SQL Basics
Although SQL is an ANSI/ISO standard, there are different versions of the SQL language.
However, to be compliant with the ANSI standard, they all support at least the major commands (such as SELECT
, UPDATE
, DELETE
, INSERT
, WHERE
) in a similar manner.
semicolon
Some database systems require a semicolon at the end of each SQL statement. Semicolon is the standard way to separate each SQL statement in database systems that allow more than one SQL statement to be executed in the same call to the server.
Naming
This is not only relevant to SQL but most data analysis, as we tend to use all SQL, R, python or at least 2 of the 3.
Caps
capitalization/case sensitivity
With SQL, capitalization usually doesn’t matter. You could enter SELECT or select or SeLeCT. They all work! But if you use capitalization as part of a consistent style your queries will look more professional.
To enter SQL queries like a pro, it is always a good idea to use all caps for clause starters (e.g. SELECT, FROM, WHERE, etc.). Functions should also be in all caps (e.g. SUM()). The only time that capitalization does matter is when it is inside quotes (more on quotes below).
Vendors of SQL databases may use slightly different variations of SQL. These variations are called SQL dialects.
- Some SQL dialects are case sensitive. BigQuery is one of them. Vertica is another.
- But most, like MySQL, PostgreSQL, and SQL Server, aren’t case sensitive.
- This means if you searched for country_code = ‘us’, it will return all entries that have ‘us’, ‘uS’, ‘Us’, and ‘US’.
- This isn’t the case with BigQuery. BigQuery is case sensitive, so that same search would only return entries where the country_code is exactly ‘us’. If the country_code is ‘US’, BigQuery wouldn’t return those entries as part of your result.
Quotes
single or double quotes
There are two situations where it does matter what kind of quotes you use:
When you want strings to be identifiable in any SQL dialect. Within each SQL dialect there are rules for what is accepted and what isn’t. But a general rule across almost all SQL dialects is to use single quotes for strings. This helps get rid of a lot of confusion. So if we want to reference the country US in a
WHERE
clause (e.g.country_code = 'US'
), then use single quotes around the string'US'
.When your string contains an apostrophe or quotation marks. The second situation is when your string has quotes inside it. Suppose you have a column favorite_food in a table called FavoriteFoods and the other column corresponds to each friend.
Generally speaking, this should be the only time you would use double quotes instead of single quotes.
friend | favorite_food |
---|---|
Rachel DeSantos | Shepherd’s pie |
Sujin Lee | Tacos |
Najil Okoro | Spanish paella |
Columns
snake_case names
Column names should be all lowercase. This helps keep your queries consistent and easier to read while not impacting the data that will be pulled when you run them. So, spaces are bad in SQL names. Never use spaces.This means that ‘total tickets’, which has a space between the two words, should be entered as total_tickets with an underscore instead of a space.
It is always good to give your columns useful names, especially when using functions. After running your query, you want to be able to quickly understand your results.
Tables
CamelCase names for tables
Table names should be in CamelCase. CamelCase capitalization means that you capitalize the start of each word, like a two-humped (Bactrian) camel. So the table TicketsByOccasion uses CamelCase capitalization.
Please note that the capitalization of the first word in CamelCase is optional; camelCase is also used. Some people differentiate between the two styles by calling CamelCase, PascalCase, and reserving camelCase for when the first word isn’t capitalized, like a one-humped (Dromedary) camel; for example, ticketsByOccasion.
At the end of the day, CamelCase is a style choice. There are other ways you can name your tables, including:
- All lower or upper case, like ticketsbyoccasion or TICKETSBYOCCASION
- With snake_case, like tickets_by_occasion
Keep in mind, the option with all lowercase or uppercase letters can make it difficult to read your table name, so it isn’t recommended for professional use.
The second option, snake_case, is technically okay. With words separated by underscores, your table name is easy to read, but it can get very long because you are adding the underscores. It also takes more time to enter. If you use this table a lot, it can become a chore.
Indent
As a general rule, you want to keep the length of each line in a query <= 100 characters. Now it is much easier to understand what you are trying to do in the SELECT clause.
Sure, both queries will run without a problem because indentation doesn’t matter in SQL. But proper indentation is still important to keep lines short. And it will be valued by anyone reading your query, including yourself!
How To Query
Basics Summary
Here are some of the basic commands we’ll cover and that you’ll use often:
Command | Syntax | Description | Example |
---|---|---|---|
SELECT | SELECT column1, column2, … FROM table_name; | SELECT statement is used to fetch data from a database. | SELECT city FROM placeofinterest; |
WHERE | SELECT column1, column2, …FROM table_name WHERE condition; | WHERE clause is used to extract only those records that fulfill a specified condition. | SELECT * FROM placeofinterest WHERE city == ‘Rome’ ; |
COUNT | SELECT COUNT * FROM table_name ; | COUNT is a function that takes the name of a column as argument and counts the number of rows when the column is not NULL. | SELECT COUNT(country) FROM placeofinterest WHERE country=‘Canada’; |
DISTINCT | SELECT DISTINCT columnname FROM table_name; | DISTINCT function is used to specify that the statement is a query which returns unique values in specified columns. | SELECT DISTINCT country FROM placeofinterest WHERE type=‘historical’; |
LIMIT | SELECT * FROM table_name LIMIT number; | LIMIT is a clause to specify the maximum number of rows the result set must have. | SELECT * FROM placeofinterest WHERE airport=“pearson” LIMIT 5; |
INSERT | INSERT INTO table_name (column1,column2,column3…) VALUES(value1,value2,value3…); | INSERT is used to insert new rows in the table. | INSERT INTO placeofinterest (name,type,city,country,airport) VALUES(‘Niagara Waterfalls’,‘Nature’,‘Toronto’,‘Canada’,‘Pearson’); |
UPDATE | UPDATE table_name SET[[column1]=[VALUES]] WHERE [condition]; | UPDATE used to update the rows in the table. | UPDATE placeofinterest SET name = ‘Niagara Falls’ WHERE name = “Niagara Waterfalls”; |
DELETE | DELETE FROM table_name WHERE [condition]; | DELETE statement is used to remove rows from the table which are specified in the WHERE condition. | DELETE FROM placeofinterest WHERE city IN (‘Rome’,‘Vienna’); |
What is a query?
A query is a request for data or information from a database. When you query databases, you use SQL to communicate your question or request. You and the database can always exchange information as long as you speak the same language.
Every programming language, including SQL, follows a unique set of guidelines known as syntax. Syntax is the predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement. As soon as you enter your search criteria using the correct syntax, the query starts working to pull the data you’ve requested from the target database.
Basic
The syntax of every SQL query is the same:
- Use SELECT to choose the columns you want to return.
- Use FROM to choose the tables where the columns you want are located.
- Use WHERE to filter for certain information.
A SQL query is like filling in a template. You will find that if you are writing a SQL query from scratch, it is helpful to start a query by writing the SELECT, FROM, and WHERE keywords in the following format:
Step 1
SELECT
FROM
WHERE ;
Notice that the SQL statement shown above has a semicolon at the end. The semicolon is a statement terminator and is part of the American National Standards Institute (ANSI) SQL-92 standard, which is a recommended common syntax for adoption by all SQL databases. However, not all SQL databases have adopted or enforce the semicolon, so it’s possible you may come across some SQL statements that aren’t terminated with a semicolon. If a statement works without a semicolon, it’s fine.
Step 2
- Enter the table name after the FROM
- the table columns you want after the SELECT
- finally, the conditions you want to place on your query after the WHERE.
- make sure to add a new line and indent when adding these, as shown below:
SELECT | Specifies the columns from which to retrieve data |
FROM | Specifies the table from which to retrieve data |
WHERE | Specifies criteria that the data must meet |
Following this method each time makes it easier to write SQL queries. It can also help you make fewer syntax errors.
Example of a query
Here is how a simple query would appear in BigQuery, a data warehouse on the Google Cloud Platform.
SELECT first_name
FROM customer_data.customer_name= 'Tony' WHERE first_name
The above query uses three commands to locate customers with the first_name, ‘Tony’:
SELECT
the column named first_nameFROM
a table named customer_name (in a dataset named customer_data) (The dataset name is always followed by a dot, and then the table name.)- But only return the data WHERE the first_name is ‘Tony’
As you can conclude, this query had the correct syntax, but wasn’t very useful after the data was returned because it only showed the first column and a long list of “Tony”
Multiple columns in a query
Of course, as a data professional, you will need to work with more data beyond customers named Tony. Multiple columns that are chosen by the same SELECT command can be indented and grouped together.
If you are requesting multiple data fields from a table, you need to include these columns in your SELECT command. Each column is separated by a comma as shown below:
SELECT
column A,
column B,
column C
FROM
customer_data.customer_name
WHERE= 'Tony'
condition1 2
AND condition 3; AND condition
Notice that unlike the SELECT
command that uses a comma to separate fields / variables / parameters, the WHERE
command uses the AND
statement to connect conditions.
Data Types
I will only cover part of SQL Server data types the rest could be found in the tables below or looked up for the server you are using:
varchar
Is a variable char STRING. SQL does not set a predetermined size it expands with the size. The max is 50 anything above that will truncate
varbinary
Can store any type of data, we can use HASH passwords, strings, integers…
String Data Types
Data type | Description | Max size | Storage |
---|---|---|---|
char(n) | Fixed width character string | 8,000 characters | Defined width |
varchar(n) | Variable width character string | 8,000 characters | 2 bytes + number of chars |
varchar(max) | Variable width character string | 1,073,741,824 characters | 2 bytes + number of chars |
text | Variable width character string | 2GB of text data | 4 bytes + number of chars |
nchar | Fixed width Unicode string | 4,000 characters | Defined width x 2 |
nvarchar | Variable width Unicode string | 4,000 characters | |
nvarchar(max) | Variable width Unicode string | 536,870,912 characters | |
ntext | Variable width Unicode string | 2GB of text data | |
binary(n) | Fixed width binary string | 8,000 bytes | |
varbinary | Variable width binary string | 8,000 bytes | |
varbinary(max) | Variable width binary string | 2GB | |
image | Variable width binary string | 2GB |
Numeric Data Types
Data type | Description | Storage |
---|---|---|
bit | Integer that can be 0, 1, or NULL | |
tinyint | Allows whole numbers from 0 to 255 | 1 byte |
smallint | Allows whole numbers between -32,768 and 32,767 | 2 bytes |
int | Allows whole numbers between -2,147,483,648 and 2,147,483,647 | 4 bytes |
bigint | Allows whole numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 | 8 bytes |
decimal(p,s) | Fixed precision and scale numbers. Allows numbers from -10^38 +1 to 10^38 –1. The p parameter indicates the maximum total number of digits that can be stored (both to the left and to the right of the decimal point). p must be a value from 1 to 38. Default is 18. The s parameter indicates the maximum number of digits stored to the right of the decimal point. s must be a value from 0 to p. Default value is 0 |
5-17 bytes |
numeric(p,s) | Fixed precision and scale numbers. Allows numbers from -10^38 +1 to 10^38 –1. The p parameter indicates the maximum total number of digits that can be stored (both to the left and to the right of the decimal point). p must be a value from 1 to 38. Default is 18. The s parameter indicates the maximum number of digits stored to the right of the decimal point. s must be a value from 0 to p. Default value is 0 |
5-17 bytes |
smallmoney | Monetary data from -214,748.3648 to 214,748.3647 | 4 bytes |
money | Monetary data from -922,337,203,685,477.5808 to 922,337,203,685,477.5807 | 8 bytes |
float(n) | Floating precision number data from -1.79E + 308 to 1.79E + 308. The n parameter indicates whether the field should hold 4 or 8 bytes. float(24) holds a 4-byte field and float(53) holds an 8-byte field. Default value of n is 53. |
4 or 8 bytes |
real | Floating precision number data from -3.40E + 38 to 3.40E + 38 | 4 bytes |
Date and Time Data Types
Data type | Description | Storage |
---|---|---|
datetime | From January 1, 1753 to December 31, 9999 with an accuracy of 3.33 milliseconds | 8 bytes |
datetime2 | From January 1, 0001 to December 31, 9999 with an accuracy of 100 nanoseconds | 6-8 bytes |
smalldatetime | From January 1, 1900 to June 6, 2079 with an accuracy of 1 minute | 4 bytes |
date | Store a date only. From January 1, 0001 to December 31, 9999 | 3 bytes |
time | Store a time only to an accuracy of 100 nanoseconds | 3-5 bytes |
datetimeoffset | The same as datetime2 with the addition of a time zone offset | 8-10 bytes |
timestamp | Stores a unique number that gets updated every time a row gets created or modified. The timestamp value is based upon an internal clock and does not correspond to real time. Each table may have only one timestamp variable |
Other Data Types
Data type | Description |
---|---|
sql_variant | Stores up to 8,000 bytes of data of various data types, except text, ntext, and timestamp |
uniqueidentifier | Stores a globally unique identifier (GUID) |
xml | Stores XML formatted data. Maximum 2GB |
cursor | Stores a reference to a cursor used for database operations |
table | Stores a result-set for later processing |
Comments
/* */
Some tables aren’t designed with descriptive naming conventions.
– –
Comments can also be added outside of a statement as well as within a statement.
#
You can use # in place of the two dashes, –, in the above query but keep in mind