I am using HLSL shader model 3, but this article should apply to other languages and shader models.

It is Good to MAD

One of the most basic optimisations is to use the mad operation, which multiplies 2 values and adds a third value to the result.
This is two operations for the price of one instruction. Luckily, the compiler is usually smart enough to snap up this bargain when it senses one.
It is still useful to look at the compiled asm to catch any cases it missed and combine them by hand.
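
As a hedged illustration of the kind of exception worth looking for: an expression written as add then multiply naturally maps to two instructions, while the algebraically equivalent multiply then add form fits a single mad. The function names below are hypothetical, and the compiler may or may not make this rewrite for you depending on its version and flags, so the asm is the final word.

```hlsl
// Hypothetical helpers: remap a value from [-1, 1] to [0, 1].

// Written as add then multiply: naturally an add followed by a mul.
float3 RemapAddThenMul(float3 n)
{
    return (n + 1.0f) * 0.5f;
}

// Same result written as multiply then add: fits a single mad.
float3 RemapMad(float3 n)
{
    return n * 0.5f + 0.5f;
}
```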

The Power of 4

The beauty of the GPU is that most instructions operate on a block of 4 components at a time. As a result, an instruction that uses only 1 component costs as much as the same instruction using all 4.

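Eg pow(NdotH, shininess) costs the same as pow(float4(light1NdotH, light2NdotH, light3NdotH, light4NdotH), shininess), provided the operation really runs on all 4 components at once. Check the asm, as some operations are expanded into one instruction per component.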
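
A minimal sketch of the same packing idea applied to diffuse lighting, assuming four directional lights. The function name, parameters and the intensities vector are hypothetical and only serve the illustration.

```hlsl
// Hypothetical helper: diffuse terms for four directional lights at once.
// N and the light directions L0..L3 are assumed to be normalised.
float4 FourLightDiffuse(float3 N, float3 L0, float3 L1, float3 L2, float3 L3,
                        float4 intensities)
{
    // One dot product per light, packed into a single float4.
    float4 NdotL = float4(dot(N, L0), dot(N, L1), dot(N, L2), dot(N, L3));

    // The clamp and the multiply are each applied to all four components
    // together, instead of once per light.
    return saturate(NdotL) * intensities;
}
```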

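Step operations and comparison operators also work on 4 components at a time
float4 sampleDepths;
float depthToCompare;

// step(y, x) returns 1 where x >= y, so this is 1 where sampleDepths <= depthToCompare
float4 results = step(sampleDepths, depthToCompare);
// the same test written with a comparison operator (true/false converts to 1/0)
float4 results2 = (sampleDepths <= depthToCompare);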

When 4 Become 1

After the comparison, it is often desirable to combine the results to get the total number of samples that passed the test.
This is where the dot product shines.
Remember that the dot product of 2 float4 values is

vec1.x * vec2.x + vec1.y * vec2.y + vec1.z * vec2.z + vec1.w * vec2.w

which would be a mul and 3 mad instructions written by hand, but is a single dp4 instruction

float numberOfResultsTrue = dot(results, 1);
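
Putting the comparison and the dot product together, a 4-tap shadow map filter could be sketched as below. This is only an illustration under stated assumptions: shadowMap, texelSize and ShadowPCF4 are hypothetical names, depth is assumed to be stored in the red channel with smaller values nearer the light, and no depth bias is applied.

```hlsl
// Hypothetical inputs: a shadow map storing depth in the red channel,
// and the size of one texel (1 / shadow map resolution).
sampler2D shadowMap;
float2 texelSize;

// Returns the fraction of the 4 samples that are lit, in [0, 1].
float ShadowPCF4(float2 uv, float depthToCompare)
{
    float4 sampleDepths;
    sampleDepths.x = tex2D(shadowMap, uv).r;
    sampleDepths.y = tex2D(shadowMap, uv + float2(texelSize.x, 0)).r;
    sampleDepths.z = tex2D(shadowMap, uv + float2(0, texelSize.y)).r;
    sampleDepths.w = tex2D(shadowMap, uv + texelSize).r;

    // 1 where sampleDepths <= depthToCompare, i.e. the sample is in shadow.
    float4 results = step(sampleDepths, depthToCompare);

    // Weights of 0.25 sum the four results and average them in one dot product.
    float shadowed = dot(results, float4(0.25f, 0.25f, 0.25f, 0.25f));
    return 1.0f - shadowed;
}
```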

Cheap Matrix Inverse

Remember that the inverse of an orthonormal matrix is equal to its transpose. The Tangent, Binormal (Bitangent) and Normal are usually of unit length and perpendicular to each other, so the matrix they form is orthonormal. Use its transpose to convert any tangent space vector back to object space.

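// rows are T, B, N, so mul(matTan, v) takes v from object space to tangent space
float3x3 matTan = float3x3(T, B, N);
float3x3 matTanInverse = transpose(matTan);
float3 normalFromNormalMap;
// the result is in the same space as T, B and N (object space here)
float3 objectNormal = mul(matTanInverse, normalFromNormalMap);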
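
The same conversion can be wrapped in a small helper. This is a sketch under the assumptions stated in the text (T, B and N unit length and mutually perpendicular); the name TangentToObject is hypothetical. It relies on the fact that multiplying with the vector on the left gives the same result as multiplying the transposed matrix with the vector on the right, so the transpose never has to be written out.

```hlsl
// Hypothetical helper. T, B and N are assumed to be unit length, mutually
// perpendicular, and expressed in object space.
float3 TangentToObject(float3 T, float3 B, float3 N, float3 v)
{
    // Rows are T, B, N, so mul(matTan, x) takes x from object to tangent space.
    float3x3 matTan = float3x3(T, B, N);

    // Treating v as a row vector gives v.x * T + v.y * B + v.z * N, which is
    // the same as mul(transpose(matTan), v).
    return mul(v, matTan);
}
```

If T, B and N have instead been transformed to world space, the same function returns a world space vector.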