Sql and BI Practical Interview Questions
1.Difference between Delete and Truncate?
2.OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;
CREATE SEQUENCE MySequence
START WITH 1
INCREMENT BY 1;
/****** Create Temp Table ******/
DECLARE @Person TABLE
(
ID int NOT NULL PRIMARY KEY,
FullName nvarchar(100) NOT
NULL
);
/****** Insert Some Data ******/
INSERT @Person (ID, FullName)
VALUES (NEXT VALUE FOR
MySequence, 'Umar Ali'),
(NEXT VALUE FOR MySequence,
'John Peter'),
(NEXT VALUE FOR MySequence,
'Mohamed Iqbal');
/****** Show the Data ******/
SELECT * FROM @Person;
The results would look like this:
ID FullName
1 Umar Ali
2 John Peter
3 Mohamed Iqbal
START WITH 1
INCREMENT BY 1;
/****** Create Temp Table ******/
DECLARE @Person TABLE
(
ID int NOT NULL PRIMARY KEY,
FullName nvarchar(100) NOT
NULL
);
/****** Insert Some Data ******/
INSERT @Person (ID, FullName)
VALUES (NEXT VALUE FOR
MySequence, 'Umar Ali'),
(NEXT VALUE FOR MySequence,
'John Peter'),
(NEXT VALUE FOR MySequence,
'Mohamed Iqbal');
/****** Show the Data ******/
SELECT * FROM @Person;
The results would look like this:
ID FullName
1 Umar Ali
2 John Peter
3 Mohamed Iqbal
·
Read Uncommitted
·
Read Committed (The
default)
·
Repeatable Read
·
Serializable
·
Snapshot
4.Clustered Index versus Non-Clustered Index
A clustered index is a special type of index
that reorders the way records in the table are physically stored.
Therefore table can have only one clustered index and this is usually made on the primary key. The leaf nodes of a clustered index contain the data pages.
Therefore table can have only one clustered index and this is usually made on the primary key. The leaf nodes of a clustered index contain the data pages.
A non clustered index is a special type of index
in which the logical order of the index does not match the physical stored order of the rows on disk. The leaf node of a non clustered index does not
consist of the data pages. Instead, the leaf nodes contain index rows.
There can be as many as 249 non clustered index
per table
Firstly when you see
clustered index think table. In SQL server if a table does not contain a
clustered index it is a heap. Creating a clustered index on the table actually
transforms the table into a b-tree type structure. Your clustered index IS your
table it is not separate from the table
Ever wondered why you
can only have one clustered index? Well if we had two clustered indexes we
would need two copies of the table. It contains the data after all.
I am going to try and
explain this by using a simple example.
NOTE: I created the
table in this example and filled it with over 3 million random entries. Then
ran the actual queries and pasted the execution plans here.
What you really need
to grasp is O notation or operational efficiency. Lets assume you have the
following table.
CREATE TABLE
[dbo].[Customer](
[CustomerID] [int]
IDENTITY(1,1) NOT NULL,
[CustomerName]
[varchar](100) NOT NULL,
[CustomerSurname]
[varchar](100) NOT NULL,
CONSTRAINT
[PK_Customer] PRIMARY KEY CLUSTERED
(
[CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, IGNORE_DUP_KEY = OFF,ALLOW_ROW_LOCKS = ON
, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
So here we have basic
table with a clustered key on CustomerID (Primary key is clustered by default).
Thus the table is arranged/ordered based on the primary key CustomerID. The
intermediate levels will contain the CustomerID values. The data pages will
contain the whole row thus it is the table row.
We will also create a
non-clustered index on the CustomerName field. The following code will do it.
CREATE NONCLUSTERED
INDEX [ix_Customer_CustomerName] ON [dbo].[Customer]
(
[CustomerName] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF
, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF
, DROP_EXISTING = OFF, ONLINE = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
So in this index you
would find on the data pages/leaf level nodes a pointer to the intermediate
levels in the clustered index. The index is arranged/ordered around the
CustomerName field. Thus the intermediate level contains the CustomerName
values and the leaf level will contain the pointer( these pointer values are
actually the primary key values or CustomerID column).
Right so if we execute
the following query:
SELECT * FROM Customer WHERE CustomerID =1
SQL will probably read
the clustered index via a seek operation. A seek operation is a binary search
which is much more efficient than a scan which is a sequential search. So in
our above example the index is read and by using a binary search SQL can eliminate
the data that dont match the criteria we are looking for. See attached screen
shot for the query plan.
Do binary search on
clustered index by comparing the value searched for to the values in the
intermediate level.
Return the values that
match( remember since the clustered index has all the data in it can return all
the columns from the index as it is the row data)
So it is two
operations. However if we executed the following query:
SELECT * FROM Customer WHERE CustomerName
='John'
SQL will now use the
non-clustered index on the CustomerName to do the search. However since this is
a non-clustered index it does not contain the all of the data in the row.
So SQL will do the
search on the intermediate levels to find the records that match then do a
lookup using the values returned to do another search on the clustered
index(aka the table) to retrieve the actual data. This sounds confusing I know
but read on and all will become clear.
Since our
non-clustered index only contains the Customer Name field(the indexed field
values stored in the intermediate nodes) and the pointer to the data which is
the CustomerID, the index has no record of the Customer Surname. The
CustomerSurname has to be fetched from the clustered index or table.
When running this
query I get the following execution plan:
SQL is saying I have a
missing index(the text in green). SQL is suggesting I create a index on
CustomerName which includes CustomerID and CustomerSurname.
You will also see that
99% of the time of the query is spent on doing a key lookup on the primary key
index/clustered index.
Why is SQL suggesting
the index on CustomerName again? Well since the index contains only the
CustomerID and the CustomerName SQL still has to find the CustomerSurname from
the table/clustered indexes.
If we created the
index and we included the CustomerSurname column in the index SQL would be able
to satisfy the entire query by just reading the non-clustered index. This is
why SQL is suggesting I change my non-clustered index.
Here you can see the
extra operation SQL needs to do to get the CustomerSurname column from the
clustered key
Thus the number of
operations are as follows:
Do binary search on
non-clustered index by comparing the value searched for to the values in the
intermediate level
For nodes that match
read the leaf level node which will contain the pointer for the data in the
clustered index (the leaf level nodes will contain the primary key values by
the way).
For each value
returned do a read on the clustered index(the table) to get the row values out
here we would read the CustomerSurname.
Return matching rows
That is 4 operations
to get the values out. Twice the amount of operations needed compared to
reading the clustered index. The show you that your clustered index is your
most powerful index as it contains all the data.
So just to clarify one
last point. Why do I say that the pointer in the non-clustered index is the
primary key value? Well to demonstrate that the leaf level nodes of the
non-clustered index contains the primary key value I change my query to:
SELECT CustomerID
FROM Customer
WHERE CustomerName='Jane'
In this query SQL can
read the CustomerID from the non-clustered index. It does not need to do a
lookup on the clustered index. This you can see by the execution plan which
looks like this.
enter image
description here
Notice the difference
between this query and the previous query. There is no lookup. SQL can find all
the data in the non-clustered index
Hopefully you can
begin to understand that clustered index is the table and non-clustered indexes
DONT contain all the data. Indexing will speed up selects due to the fact that
binary searches can be done but only clustered indexes contain all the data. So
a search on a non-clustered index will almost always result in values being
loaded from the clustered index. These extra operations make non-clustered
indexes less efficient than a clustered index.
Hope this clears
things up. If anything does not make sense please post a comment and I will try
clarify. It is rather late here and my brain is feeling a wee bit flat. Time
for a red bull.
SSIS Mode of Cache
·
Full Cache Mode: In
this cache mode, SSIS queries the database before the start of the data flow
task execution (pre-execute phase) and loads all the data from lookup /
reference table into SSIS lookup cache. Following are few considerations while
using Lookup in Full Cache Mode:
·
Size of available memory / cache should be taken into
consideration before choosing this option.
·
Use this mode when the reference table is small OR
·
Use this mode when only a subset of reference table is loaded
into lookup (using filter conditions in SQL command).
·
Partial Cache Mode: In
this cache mode, SSIS queries the database against new rows coming in from the
source and if there is a match then that row is cached into SSIS Lookup Cache
for rows coming subsequently in the data flow which might have a match. When
the cache becomes full, then SSIS removes few of the rows from cache based on
the usage / match statistics for those rows and loads the new matching rows
into the Lookup Cache. Following are few considerations while using Lookup in
Partial Cache Mode:
·
Use this mode when the source data set or data to be processed
is small OR
·
Use this mode when the reference table is large and only few of
the records from the reference table will be used / will match against the
source data rows.
·
No Cache Mode: In
this cache mode, SSIS does not cache any rows into Lookup Cache (except for
cases like two subsequent source data rows have a match with same lookup values
etc). For every row coming from the source, the database is queried to get the
matching value / data from the reference table. Here are few considerations
while using Lookup in No Cache Mode:
·
Use this mode when less number of rows are being processed OR
·
Use this mode when the reference table is very frequently
modified (Insertion / Updation / Deletion of records).
FROM
JOINs
WHERE
GROUP By
HAVING
ORDER BY
SELECT
JOINs
WHERE
GROUP By
HAVING
ORDER BY
SELECT
FileTable are user table provided in SQL Server 2012 which is a
special table of files for applications that require file and directory storage
in the database, with Windows API compatibility and non-transactional access
and has a pre-defined schema that stores FILESTREAM data, as well as file and
directory hierarchy information and file attributes?
1.
SELECT Name, Marks,
2.
(SELECT COUNT(*)+1 FROM Student B WHERE A.Marks<B.Marks) ASRank FROM Student A
3.
ORDER BY Marks DESC
Given an array of
numbers, write a function to remove duplicates and sort the output
In an org, given two
employees, find their common manager
SQL: Given an employee
table with salaries, find the employees that make more than their manager
Self-join
solution:
select e.*, m.Salary as "Manager Salary"
from Employee e
join Employee m on e.ManagerId = m.Id
where e.Salary > m.Salary
select e.*, m.Salary as "Manager Salary"
from Employee e
join Employee m on e.ManagerId = m.Id
where e.Salary > m.Salary
Here you are "joining" Employee table to itself on the FK relation of ManagerId and then querying on salary.
Sub-query solution:
select *
from Employee e
where e.Salary >
(select m.Salary from Employee m
where e.ManagerId = m.Id)
Dynamic management
views (DMVs) and functions return server state information that can be used to
monitor the health of a server instance, diagnose problems, and tune
performance; that is, they let you see what is going on inside SQL Server. They
were introduced in SQL Server 2005 as an alternative to system tables. One
example is viewing operating system wait statistics via this query:
SELECT * FROM
sys.dm_os_wait_stats;
Another example is
examining current sessions, much like the sp_who2 command:
SELECT * FROM
sys.dm_exec_sessions;
The WITH CHECK OPTION
for a view prevents data modifications (to the data) that do not confirm to the
WHERE clause of the view definition. This allows data to be updated via the
view, but only if it belongs in the view.
Best Practices of Optimising and performance enhancing SSIS
Package
- If
you know that data in a source is sorted, set IsSorted=TRUE on the source
adapter output. This may save unnecassary SORTs later in the pipeline
which can be expensive. Setting this value does not perform a sort
operation, it only indicates that the data it sorted.
- Rename
all Name and Description properties from the default. This will help when
debugging particularly if the person doing the debugging is not the person
that built the package.
- Only
select columns that you need in the pipeline to reduce buffer size
and reduce OnWarning events at execution time
- Use
Sequence containers to organise package structure into logical units of
work. This makes it easier to identify what the package does and also
helps to control transactions if they are being implemented
- Where
possible, use expressions on the SQLStatementType property of the Execute
SQL Task instead of parameterised SQL statements. This removes ambiguity
when different OLE DB providers are being used. It is also easier.
(UPDATE: There is a caveat here. Results of expressions are limited to
4000 characters so be wary of this if using expressions ).
- Use
caching in your LOOKUP components where possible. It makes them quicker.
Watch that you are not grabbing too many resources when you do this
though.
- LOOKUP components will
generally work quicker than MERGE JOIN components where the 2 can be used
for the same task
- Where possible filter
your data in the Source Adapter rather than filter the data using a Conditional Split transform
component. This will make your data flow perform quicker because the
unnecessary records don't go through the pipeline.
- A sort with SQL
Server is faster than the sort in SSIS, partly because SSIS does the sort
in memory. So it pays to move the sort to a source component (where
possible). Note you have to set IsSorted=TRUE on the source adapter output, but setting this value does not perform a sort
operation; it only indicates that the data it sorted. After that change
the SortKeyPosition of all output columns that are sorted.
- Where possible, join
data in the Source Adapter rather than using the Merge Join component. SQL
Server does it faster than SSIS. But watch out that you are not making to
complex queries because that will worsen the readability.
- Where possible, aggregate your
data in the Source Adapter rather than using the Aggregate component. SQL
Server does it faster than SSIS.
- Loading high volumes
of data on a table with clustered and non-clustered indexes could take
a lot of time.The most important
thing to verify is if all indexes are really used. SQL Server 2005 and
2008 provide information about index usage with to views:
sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats. Drop
all rarely used and unused indexes
first. Experience teaches that there are often a lot of unnecessary
indexes. If you are absolute sure that all remaining indexes are necessary
you can drop all indexes before loading the data and to recreate them
afterwards. The performance profit of that depends on the number of
records. The higher the number of records the more profit you gain.
- If your target
database is a local SQL server database, the SQL Server Destination
Adapter will perform much better than the OLE DB Destination Adapter.
However the SQL Server Destination Adapter works only on a local machine
and via Windows security. You have to be absolute sure that your database
stays local in the future otherwise you mapping will not work when moving
the database.
- Always use DTExec to
perf test your packages. This is not the same as executing without
debugging from SSIS Designer
- If
you want to conditionally execute a task at runtime use expressions on
your precedence constraints. Do not use an expression on the
"Disable" property of the task.
- Don't
pull all configurations into a single XML configuration file. Instead, put
each configuration into a seperate XML configuration file. This is a more
modular approach and means that configuration files can be reused by
different packages more easily.
- If you need a dynamic
SQL statement in an OLE DB Source component, set AccessMode="SQL
Command from variable" and build the SQL statement in a variable that
has EvaluateAsExpression=TRUE.
- When using
checkpoints, use an expression to populate the CheckpointFilename
property which will allow you to include the value returned
from System::PackageName in the checkpoint filename. This will
allow you to easily identify which package a checkpoint file is to be used
by.
- When storing
information about an OLE DB Connection Manager in a configuration, don't
store the individual properties such as Initial Catalog, Username,
Password etc... just store the ConnectionString property.
- Your variables should
only be scoped to the containers in
which they are used. Do not scope all your variables to the package container if they don't need to be.
- Use
ProtectionLevel=DontSaveSensitive. Other developers will not be
restricted from opening your packages and you will be forced to use
configurations (which is another recommended best practice)
Best
Practices of development in SSIS Packages
1.
Rename all default component names and give them
explaining descriptions. This will help other developers that edit your
packages. It is also very useful when debugging.
2.
Use annotations. This is very useful if the Control Flow or Data Flow isn't self describing (for others).
3.
Use Sequence containers to organize package
structures into logical units of work. This makes it easier to identify what
the package does. It also helps to control transactions if they are being implemented. * Update: SSIS 2012 has a grouping feature *
4.
Flows should basically go top-down. This will
make your packages more readable.
5.
Do not use disabled Control Flow tasks in the Quality
assurance or Production environment. If you want to conditionally
execute a task at runtime use expressions on your precedence constraints. Do
not use an expression on the “Disable” property of the task.
6.
You can add more than one projects to your Visual
Studio Solution to spread large number of packages. Think about proper layout. For example a
datastaging project and a datawarehouse project.
7.
Don't use too complex queries. Use a readable
lay-out and add comments to explain parts of the query.
8.
Give tasks and transformations a prefix. This makes it easier to read the logging.
--Error Handling
of With Check option in View
Begin Try
Update
Test_VW_Fz
set Lastname='Ayush'
WHERE
lASTnAME LIKE '%e%'
END Try
begin Catch
INSERT INTO Error_Process
Select @@error,ERROR_MESSAGE()
end Catch
Reuse of
Execution Plans
By Default Stored procedures can reuse a previously cached
execution plan, thereby saving the resources involved
in generating a new execution plan.
Example for plan reuse. below code creates the get_order_info
procedure
CREATE PROC get_order_info
@odate AS DATETIME
AS
SELECT orderid, custid, empid, orderdate /*
KPFJ-4959-91D6-F1EC81F8428F */
FROM tbl_order
WHERE orderdate >= @odate
Turn on the STATISTICS IO option to get back I/O information
for your session’s activity
SET STATISTICS IO ON
EXEC get_order_info '20080506'
Run the stored procedure for the first time, providing an
input with high selectivity (that is, an input for which a small percentage of
rows will be returned):
First time the stored procedure is invoked, SQL Server
generated an execution plan for it based on the selective input value and
cached that plan.
Advantage for Reuse of Execution Plans
Now that you have a plan stored in cache and invocations of
the stored
procedure will reuse it. That’s fine if you keep invoking the
stored procedure with a highly
selective input. You will enjoy the fact that the plan is
reused, and SQL Server will not waste
resources on generating new plans. That’s especially
important with systems that invoke
stored procedures very frequently
If the blk column in the sp_who output is non-zero, this indicates
that blocking is occurring on your system. If processes are blocking each
other, the processes that are being blocked can experience longer execution
times
Quest-SELECT 1 UNION SELECT NULL UNION SELECT '1'
Ans-1,0
You receive some data from UI and need to populate table
Customer if the name received from UI does not exist in database. Which of
options below should be used for this purpose:
declare
@customername varchar(10)
set
@customername='Ayush'
IF NOT EXISTS ( SELECT * FROM Customer WHERE
Name = @CustomerName )
INSERT INTO [Customer] ( [Name]) VALUES (
@CustomerName)
declare
@customername varchar(10)
set
@customername='Ayushk'
INSERT INTO [Customer] ( [Name])
SELECT
@CustomerName WHERE NOT
EXISTS ( SELECT * FROM Customer WHERE Name =
@CustomerName )
No comments:
Post a Comment