Thursday, 10 April 2014

SQL and BI Practical Interview Question 404


1. What is the difference between DELETE and TRUNCATE?
DELETE removes rows one at a time, is fully logged per row, fires triggers, can take a WHERE clause, and leaves identity seeds alone. TRUNCATE deallocates whole data pages, is minimally logged, cannot be filtered, and resets the identity seed.
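A minimal sketch contrasting the two (the table here is purely illustrative):

```sql
CREATE TABLE dbo.DemoRows (ID int IDENTITY(1,1), Val varchar(10));
INSERT INTO dbo.DemoRows (Val) VALUES ('a'), ('b'), ('c');

-- DELETE: row-by-row, fully logged, can be filtered, fires triggers,
-- does not reset the identity seed
DELETE FROM dbo.DemoRows WHERE Val = 'a';

-- TRUNCATE: deallocates pages, minimally logged, no WHERE clause allowed,
-- resets the identity seed
TRUNCATE TABLE dbo.DemoRows;
INSERT INTO dbo.DemoRows (Val) VALUES ('d');  -- new row starts at ID = 1 again
```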

2. OFFSET-FETCH paging and SEQUENCE objects (both new in SQL Server 2012). OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY; skips the first 10 rows and returns the next 5.
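The OFFSET clause only runs as part of a full query; a complete paging query might look like this (table and column names are assumed for illustration):

```sql
-- Skip the first 10 rows, return the next 5 (an ORDER BY is required)
SELECT ID, FullName
FROM dbo.Person
ORDER BY ID
OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;
```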

CREATE SEQUENCE MySequence
    START WITH 1
    INCREMENT BY 1;

/****** Create a Table Variable ******/
DECLARE @Person TABLE
(
    ID int NOT NULL PRIMARY KEY,
    FullName nvarchar(100) NOT NULL
);

/****** Insert Some Data ******/
INSERT @Person (ID, FullName)
VALUES (NEXT VALUE FOR MySequence, 'Umar Ali'),
       (NEXT VALUE FOR MySequence, 'John Peter'),
       (NEXT VALUE FOR MySequence, 'Mohamed Iqbal');

/****** Show the Data ******/
SELECT * FROM @Person;
The results would look like this:
ID FullName
1 Umar Ali
2 John Peter
3 Mohamed Iqbal


3. Transaction Isolation Levels
·         Read Uncommitted
·         Read Committed (The default)
·         Repeatable Read
·         Serializable
·         Snapshot
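Each level is set per session; a sketch (the Account table is illustrative):

```sql
-- SNAPSHOT additionally requires ALLOW_SNAPSHOT_ISOLATION ON at the database level
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;
    SELECT Balance FROM dbo.Account WHERE ID = 1;  -- shared locks held until commit
    -- other work here; under REPEATABLE READ, re-reading
    -- the same row returns the same values
    SELECT Balance FROM dbo.Account WHERE ID = 1;
COMMIT TRANSACTION;
```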

4. Clustered Index versus Non-Clustered Index
A clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore a table can have only one clustered index, and it is usually created on the primary key. The leaf nodes of a clustered index contain the data pages.

A non-clustered index is an index in which the logical order of the index does not match the physical order of the rows on disk. The leaf nodes of a non-clustered index do not contain the data pages; instead, they contain index rows.

A table can have as many as 999 non-clustered indexes in SQL Server 2008 and later (249 in earlier versions).

Firstly, when you see "clustered index", think "table". In SQL Server, if a table does not contain a clustered index it is a heap. Creating a clustered index on the table actually transforms the table into a B-tree structure. Your clustered index IS your table; it is not separate from the table.

Ever wondered why you can only have one clustered index? Well, if we had two clustered indexes we would need two copies of the table. It contains the data, after all.

I am going to try to explain this by using a simple example.

NOTE: I created the table in this example, filled it with over 3 million random entries, then ran the actual queries and pasted the execution plans here.

What you really need to grasp is big-O notation, or operational efficiency. Let's assume you have the following table.

CREATE TABLE [dbo].[Customer](
    [CustomerID] [int] IDENTITY(1,1) NOT NULL,
    [CustomerName] [varchar](100) NOT NULL,
    [CustomerSurname] [varchar](100) NOT NULL,
    CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED
    (
        [CustomerID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
            IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
            ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
So here we have a basic table with a clustered key on CustomerID (a primary key is clustered by default). Thus the table is arranged/ordered based on the primary key CustomerID. The intermediate levels contain the CustomerID values, and the data pages contain the whole row, so the leaf level is the table row itself.

We will also create a non-clustered index on the CustomerName field. The following code will do it.

CREATE NONCLUSTERED INDEX [ix_Customer_CustomerName] ON [dbo].[Customer]
(
    [CustomerName] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
        SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF,
        DROP_EXISTING = OFF, ONLINE = OFF,
        ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
So in this index, the leaf-level nodes contain a pointer into the clustered index. The index is arranged/ordered around the CustomerName field: the intermediate levels contain the CustomerName values, and the leaf level contains the pointers (these pointer values are actually the primary key values, i.e. the CustomerID column).

Right, so if we execute the following query:

   SELECT * FROM Customer WHERE CustomerID = 1

SQL will probably read the clustered index via a seek operation. A seek is a binary search, which is much more efficient than a scan, which is a sequential search. So in our example above, the index is read and, by using a binary search, SQL can eliminate the data that doesn't match the criteria we are looking for. See the attached screenshot for the query plan.

Do a binary search on the clustered index by comparing the value searched for to the values in the intermediate level.
Return the values that match (remember, since the clustered index has all the data in it, it can return all the columns from the index, as it is the row data).

So it is two operations. However, if we executed the following query:

 SELECT * FROM Customer WHERE CustomerName = 'John'

SQL will now use the non-clustered index on CustomerName to do the search. However, since this is a non-clustered index, it does not contain all of the data in the row.

So SQL will do the search on the intermediate levels to find the records that match, then use the values returned to do a lookup (another search) on the clustered index (a.k.a. the table) to retrieve the actual data. This sounds confusing, I know, but read on and all will become clear.

Since our non-clustered index only contains the CustomerName field (the indexed field values, stored in the intermediate nodes) and the pointer to the data (the CustomerID), the index has no record of the CustomerSurname. The CustomerSurname has to be fetched from the clustered index, i.e. the table.

When running this query I get the following execution plan:

SQL is saying I have a missing index (the text in green). SQL is suggesting I create an index on CustomerName which includes CustomerID and CustomerSurname.
You will also see that 99% of the query's time is spent doing a key lookup on the primary key index / clustered index.
Why is SQL suggesting the index on CustomerName again? Well, since the index contains only the CustomerID and the CustomerName, SQL still has to find the CustomerSurname from the table / clustered index.

If we created the index and included the CustomerSurname column in it, SQL would be able to satisfy the entire query by just reading the non-clustered index. This is why SQL is suggesting I change my non-clustered index.
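The suggested covering index would look something like this (a sketch of the idea, not SQL's exact suggestion text; the index name is made up):

```sql
CREATE NONCLUSTERED INDEX ix_Customer_CustomerName_Covering
ON dbo.Customer (CustomerName)
INCLUDE (CustomerSurname);
-- CustomerID need not be listed: as the clustering key it is
-- carried in the non-clustered index automatically as the row locator
```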

Here you can see the extra operation SQL needs to do to get the CustomerSurname column from the clustered index.

Thus the number of operations are as follows:

Do a binary search on the non-clustered index by comparing the value searched for to the values in the intermediate level.
For nodes that match, read the leaf-level node, which contains the pointer to the data in the clustered index (the leaf-level nodes contain the primary key values, by the way).
For each value returned, do a read on the clustered index (the table) to get the row values out; here we would read the CustomerSurname.
Return the matching rows.

That is 4 operations to get the values out, twice the number of operations needed compared to reading the clustered index. This shows you that your clustered index is your most powerful index, as it contains all the data.

So, just to clarify one last point: why do I say that the pointer in the non-clustered index is the primary key value? To demonstrate that the leaf-level nodes of the non-clustered index contain the primary key value, I change my query to:

 SELECT CustomerID
 FROM Customer
 WHERE CustomerName = 'Jane'

In this query, SQL can read the CustomerID from the non-clustered index. It does not need to do a lookup on the clustered index. You can see this from the execution plan, which looks like this.


Notice the difference between this query and the previous query: there is no lookup. SQL can find all the data in the non-clustered index.

Hopefully you can begin to understand that the clustered index is the table and that non-clustered indexes DON'T contain all the data. Indexing speeds up selects because binary searches can be done, but only the clustered index contains all the data. So a search on a non-clustered index will almost always result in values being loaded from the clustered index, and these extra operations make non-clustered indexes less efficient than the clustered index.

Hope this clears things up. If anything does not make sense, please post a comment and I will try to clarify. It is rather late here and my brain is feeling a wee bit flat. Time for a Red Bull.

SSIS Lookup Cache Modes
·         Full Cache Mode: In this cache mode, SSIS queries the database before the start of the data flow task execution (pre-execute phase) and loads all the data from the lookup / reference table into the SSIS lookup cache. A few considerations when using Lookup in Full Cache Mode:
·         The size of available memory / cache should be taken into consideration before choosing this option.
·         Use this mode when the reference table is small, OR
·         Use this mode when only a subset of the reference table is loaded into the lookup (using filter conditions in the SQL command).
·         Partial Cache Mode: In this cache mode, SSIS queries the database for each new row coming in from the source, and if there is a match, that row is cached in the SSIS lookup cache for subsequent rows in the data flow that might match it. When the cache becomes full, SSIS evicts some rows from the cache based on the usage / match statistics for those rows and loads the new matching rows into the lookup cache. A few considerations when using Lookup in Partial Cache Mode:
·         Use this mode when the source data set to be processed is small, OR
·         Use this mode when the reference table is large and only a few of its records will be used / will match against the source data rows.
·         No Cache Mode: In this cache mode, SSIS does not cache any rows in the lookup cache (except for cases such as two subsequent source rows matching the same lookup values). For every row coming from the source, the database is queried to get the matching data from the reference table. A few considerations when using Lookup in No Cache Mode:
·         Use this mode when a small number of rows is being processed, OR
·         Use this mode when the reference table is modified very frequently (inserts / updates / deletes).






Logical query processing order:

FROM
JOINs (ON)
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
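One practical consequence of this order (the Student table is used elsewhere on this page; the DoubleMarks alias is hypothetical): an alias defined in SELECT can be used in ORDER BY, which runs later, but not in WHERE, which runs earlier:

```sql
SELECT Name, Marks * 2 AS DoubleMarks
FROM Student
WHERE Marks * 2 > 100       -- cannot write "WHERE DoubleMarks > 100": WHERE runs before SELECT
ORDER BY DoubleMarks DESC;  -- fine: ORDER BY runs after SELECT
```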

FileTables are user tables, introduced in SQL Server 2012, for applications that require file and directory storage in the database, with Windows API compatibility and non-transactional access. A FileTable has a pre-defined schema that stores FILESTREAM data, as well as file and directory hierarchy information and file attributes.
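Creating one looks roughly like this (a sketch: it assumes FILESTREAM is enabled on the instance, the database has a FILESTREAM filegroup, and a FileTable directory name is set at the database level; the table name is illustrative):

```sql
CREATE TABLE dbo.DocumentStore AS FILETABLE
WITH
(
    FILETABLE_DIRECTORY = 'DocumentStore',       -- Windows share folder name
    FILETABLE_COLLATE_FILENAME = database_default
);
```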

A query to rank students by marks (a row's rank is one plus the number of students with strictly higher marks):

SELECT Name, Marks,
       (SELECT COUNT(*) + 1 FROM Student B WHERE A.Marks < B.Marks) AS [Rank]
FROM Student A
ORDER BY Marks DESC


Given an array of numbers, write a function to remove duplicates and sort the output.
In an org, given two employees, find their common manager.
SQL: Given an employee table with salaries, find the employees that make more than their manager.
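For the common-manager question, one T-SQL sketch (assuming a hypothetical Employee(Id, ManagerId) table and two made-up employee ids) walks each employee's management chain upward with recursive CTEs and picks the nearest manager present in both chains:

```sql
DECLARE @E1 int = 101, @E2 int = 202;  -- hypothetical employee ids

WITH Chain1 AS (
    SELECT ManagerId, 0 AS Depth FROM Employee WHERE Id = @E1
    UNION ALL
    SELECT e.ManagerId, c.Depth + 1
    FROM Employee e JOIN Chain1 c ON e.Id = c.ManagerId
),
Chain2 AS (
    SELECT ManagerId, 0 AS Depth FROM Employee WHERE Id = @E2
    UNION ALL
    SELECT e.ManagerId, c.Depth + 1
    FROM Employee e JOIN Chain2 c ON e.Id = c.ManagerId
)
SELECT TOP (1) c1.ManagerId   -- nearest manager appearing in both chains
FROM Chain1 c1
JOIN Chain2 c2 ON c1.ManagerId = c2.ManagerId
ORDER BY c1.Depth;
```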


Self-join solution:
select e.*, m.Salary as "Manager Salary" 
from Employee e 
join Employee m on e.ManagerId = m.Id
where e.Salary > m.Salary




Here you are "joining" the Employee table to itself on the FK relation of ManagerId and then filtering on salary.


Sub-query solution:

select * 
from Employee e 
where e.Salary > 
    (select m.Salary from Employee m 
    where e.ManagerId = m.Id)



Dynamic management views (DMVs) and functions return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance; that is, they let you see what is going on inside SQL Server. They were introduced in SQL Server 2005 as an alternative to system tables. One example is viewing operating system wait statistics via this query:
SELECT * FROM sys.dm_os_wait_stats;
Another example is examining current sessions, much like the sp_who2 command:
SELECT * FROM sys.dm_exec_sessions;
The WITH CHECK OPTION on a view prevents data modifications through the view that do not conform to the WHERE clause of the view definition. This allows data to be updated via the view, but only if the resulting rows still belong in the view.
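A minimal sketch (the view and table names here are illustrative):

```sql
CREATE VIEW dbo.ActiveCustomers
AS
SELECT ID, Name, IsActive
FROM dbo.Customer
WHERE IsActive = 1
WITH CHECK OPTION;
GO

-- Succeeds: the updated row still satisfies IsActive = 1
UPDATE dbo.ActiveCustomers SET Name = 'New Name' WHERE ID = 1;

-- Fails: the modified row would no longer satisfy the view's WHERE clause
UPDATE dbo.ActiveCustomers SET IsActive = 0 WHERE ID = 1;
```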



Best Practices for Optimising and Enhancing SSIS Package Performance
  1. If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output. This may save unnecessary SORTs later in the pipeline, which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data is sorted.
  2. Rename all Name and Description properties from the default. This will help when debugging, particularly if the person doing the debugging is not the person that built the package.
  3. Only select columns that you need in the pipeline to reduce buffer size and reduce OnWarning events at execution time
  4. Use Sequence containers to organise package structure into logical units of work. This makes it easier to identify what the package does and also helps to control transactions if they are being implemented
  5. Where possible, use expressions on the SqlStatementSource property of the Execute SQL Task instead of parameterised SQL statements. This removes ambiguity when different OLE DB providers are being used. It is also easier. (UPDATE: There is a caveat here. Results of expressions are limited to 4000 characters, so be wary of this if using expressions.)
  6. Use caching in your LOOKUP components where possible. It makes them quicker. Watch that you are not grabbing too many resources when you do this though.
  7. LOOKUP components will generally work quicker than MERGE JOIN components where the 2 can be used for the same task
  8. Where possible, filter your data in the Source Adapter rather than filtering it with a Conditional Split transform component. This will make your data flow perform more quickly because unnecessary records don't go through the pipeline.
  9. A sort in SQL Server is faster than the sort in SSIS, partly because SSIS does the sort in memory. So it pays to move the sort to a source component (where possible). Note you have to set IsSorted=TRUE on the source adapter output, but setting this value does not perform a sort operation; it only indicates that the data is sorted. After that, set the SortKeyPosition of all output columns that are sorted.
  10. Where possible, join data in the Source Adapter rather than using the Merge Join component. SQL Server does it faster than SSIS. But watch out that you are not writing overly complex queries, because that will hurt readability.
  11. Where possible, aggregate your data in the Source Adapter rather than using the Aggregate component. SQL Server does it faster than SSIS.
  12. Loading high volumes of data into a table with clustered and non-clustered indexes can take a lot of time. The most important thing to verify is whether all indexes are really used. SQL Server 2005 and 2008 provide information about index usage with two views: sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats. Drop all rarely used and unused indexes first. Experience teaches that there are often a lot of unnecessary indexes. If you are absolutely sure that all remaining indexes are necessary, you can drop all indexes before loading the data and recreate them afterwards. The performance profit depends on the number of records: the higher the number of records, the more profit you gain.
  13. If your target database is a local SQL Server database, the SQL Server Destination Adapter will perform much better than the OLE DB Destination Adapter. However, the SQL Server Destination Adapter works only on a local machine and via Windows security. You have to be absolutely sure that your database stays local in the future, otherwise your mapping will not work when the database is moved.
  14. Always use DTExec to perf test your packages. This is not the same as executing without debugging from SSIS Designer
  15. If you want to conditionally execute a task at runtime use expressions on your precedence constraints. Do not use an expression on the "Disable" property of the task.
  16. Don't pull all configurations into a single XML configuration file. Instead, put each configuration into a separate XML configuration file. This is a more modular approach and means that configuration files can be reused by different packages more easily.
  17. If you need a dynamic SQL statement in an OLE DB Source component, set AccessMode="SQL Command from variable" and build the SQL statement in a variable that has EvaluateAsExpression=TRUE. 
  18. When using checkpoints, use an expression to populate the CheckpointFilename property which will allow you to include the value returned from System::PackageName in the checkpoint filename. This will allow you to easily identify which package a checkpoint file is to be used by.
  19. When storing information about an OLE DB Connection Manager in a configuration, don't store the individual properties such as Initial Catalog, Username, Password etc... just store the ConnectionString property.
  20. Your variables should only be scoped to the containers in which they are used. Do not scope all your variables to the package container if they don't need to be.
  21. Use ProtectionLevel=DontSaveSensitive. Other developers will not be restricted from opening your packages and you will be forced to use configurations (which is another recommended best practice)

Best Practices for SSIS Package Development
1.     Rename all default component names and give them explanatory descriptions. This will help other developers that edit your packages. It is also very useful when debugging.
2.     Use annotations. This is very useful if the Control Flow or Data Flow isn't self-describing (for others).
3.     Use Sequence containers to organize package structures into logical units of work. This makes it easier to identify what the package does. It also helps to control transactions if they are being implemented. * Update: SSIS 2012 has a grouping feature *
4.     Flows should basically go top-down. This will make your packages more readable.
5.     Do not use disabled Control Flow tasks in the Quality assurance or Production environment. If you want to conditionally execute a task at runtime use expressions on your precedence constraints. Do not use an expression on the “Disable” property of the task.
6.     You can add more than one project to your Visual Studio solution to spread out a large number of packages. Think about a proper layout, for example a data staging project and a data warehouse project.
7.     Don't use too complex queries. Use a readable lay-out and add comments to explain parts of the query.
8.     Give tasks and transformations a prefix. This makes it easier to read the logging.




-- Error handling of WITH CHECK OPTION on a view
BEGIN TRY
    UPDATE Test_VW_Fz
    SET LastName = 'Ayush'
    WHERE LastName LIKE '%e%';
END TRY
BEGIN CATCH
    INSERT INTO Error_Process
    SELECT @@ERROR, ERROR_MESSAGE();
END CATCH

Reuse of Execution Plans
By default, stored procedures can reuse a previously cached execution plan, thereby saving the resources involved in generating a new execution plan.

Example of plan reuse: the code below creates the get_order_info procedure.

CREATE PROC get_order_info
@odate AS DATETIME
AS
SELECT orderid, custid, empid, orderdate /* KPFJ-4959-91D6-F1EC81F8428F */
FROM tbl_order
WHERE orderdate >= @odate

Turn on the STATISTICS IO option to get back I/O information for your session's activity:

SET STATISTICS IO ON

Run the stored procedure for the first time, providing an input with high selectivity (that is, an input for which a small percentage of rows will be returned):

EXEC get_order_info '20080506'

The first time the stored procedure is invoked, SQL Server generates an execution plan for it based on the selective input value and caches that plan.

Advantage of Reusing Execution Plans
Now that you have a plan stored in cache, subsequent invocations of the stored procedure will reuse it. That's fine if you keep invoking the stored procedure with a highly selective input: you enjoy the fact that the plan is reused, and SQL Server does not waste resources on generating new plans. That's especially important in systems that invoke stored procedures very frequently.
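The flip side is parameter sniffing: if a later call passes a low-selectivity value, the cached plan can be a poor fit for it. When that happens, one option (a sketch, not the only remedy) is to force a fresh plan:

```sql
-- Recompile just this invocation, without replacing the cached plan
EXEC get_order_info '20080506' WITH RECOMPILE;

-- Or mark the procedure for recompilation, so the next call builds a new plan
EXEC sp_recompile 'get_order_info';
```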

If the blk column in the sp_who output is non-zero, blocking is occurring on your system. When processes block each other, the blocked processes can experience longer execution times.

Question: SELECT 1 UNION SELECT NULL UNION SELECT '1'
Answer: two rows, NULL and 1. The string '1' is implicitly converted to int (int has higher type precedence), so UNION de-duplicates it with the first 1.

You receive some data from the UI and need to populate the Customer table if the name received from the UI does not exist in the database. Which of the options below should be used for this purpose?

Option 1:

DECLARE @CustomerName varchar(10)
SET @CustomerName = 'Ayush'
IF NOT EXISTS (SELECT * FROM Customer WHERE Name = @CustomerName)
    INSERT INTO [Customer] ([Name]) VALUES (@CustomerName)

Option 2:

DECLARE @CustomerName varchar(10)
SET @CustomerName = 'Ayushk'
INSERT INTO [Customer] ([Name])
SELECT @CustomerName WHERE NOT EXISTS (SELECT * FROM Customer WHERE Name = @CustomerName)

The second option is generally preferred: the existence check and the insert happen in a single statement, which narrows the window for a race condition between the check and the insert under concurrency.