Boosting the performance of Cartesian to Spherical co-ordinates conversion using Intel SDLT

ID 659085
Updated 3/28/2016
Version Latest
Public

author-image

By

Introduction

Most of the applications developed these days follow Object-Oriented design. The real world entities are modeled as objects and application logic is defined as interaction between these objects. For instance: for a 2D image captured, each pixel has 3 color components (Red, Green and Blue). So every pixel is modeled as a user defined struct as shown below:

struct rgb{
  unsigned char red,
  unsigned char green,
  unsigned char blue
};

An image being a collection of pixels, we can create an array of pixels to capture the image data. Two possible collections are:

rgb image[NUM_OF_PIXELS];

or 

std::vector<rgb> image(NUM_OF_PIXELS);

In either case, it is noticeable that developer creates a struct first and then creates a collection of the struct thus the final collection being an Array of Structures (AoS). But when it comes to most image processing algorithms like averaging filter, gaussian filter etc, the filters apply to every individual color component of the pixel. Since the red, green and blue pixel values are not in contiguous memory locations in the AoS data structure, it demands a gather instruction to get all the needed red pixels values on to a vector register (same applies for green and blue pixel values). So though the loop vectorizes, the vectorization efficiency is not the best. Another alternative is to change the image representation to something like this:

struct image{
  unsigned int red[NUM_OF_PIXELS];
  unsigned int green[NUM_OF_PIXELS];
  unsigned int blue[NUM_OF_PIXELS];
};

The problem with the above approach is that the developer doesn't use the Object Oriented model anymore (uses Structure of Array (SoA) format) and any changes made at the data structure level demands a change in algorithms which operate on those data structure making this a huge engineering effort. Intel SIMD Data Layout Template (Intel SDLT) library provides a means for developers to preserve the AoS interface while developing the application, but lay the data in a SoA format which is more vector friendly. The article demonstrates the ease of use and performance advantage of using Intel SDLT library on Cartesian to Spherical Co-ordinate system conversion.

Co-ordinate Systems:

In a 3D space, every point can represented in a Cartesian co-ordinate system (x,y,z). Many applications prefer having the points in Spherical co-ordinate systems (useful in analyzing systems that have some degree of symmetry about a point, such as volume integrals inside a sphere, the potential energy field surrounding a concentrated mass or charge, or global weather simulation in a planet's atmosphere. - from Wikipedia). This paper assumes we have a 3D image captured and now translate the individual points in the cartesian space to corresponding Spherical co-ordinate system.

Modeling point in 3D Cartesian space:

Below is the user defined structure for a Cartesian Point:

class CartesianPoint{
public:
	float x, y, z;
	CartesianPoint(){
		x = y = z = 0.0f;
	}
	explicit CartesianPoint(float x1, float y1, float z1){
		x = x1;
		y = y1;
		z = z1;
	}
	CartesianPoint(const CartesianPoint& other)
	: x(other.x)
	, y(other.y)
	, z(other.z)
	{
	}
	CartesianPoint& operator=(CartesianPoint& ref){
		x = ref.x;
		y = ref.y;
		z = ref.z;
		return *this;
	}
	CartesianPoint& operator=(float n){
		x = y = z = n;
		return *this;
	}
};

Modeling a Spherical Co-ordinate point:

Below is the user defined structure for Spherical co-ordinate point:

class SphericalPoint{
public:
	float r, theta, phi;
	SphericalPoint(){
		r = theta = phi = 0.0f;
	}
	explicit SphericalPoint(float r1, float theta1, float phi1){
		r = r1;
		theta = theta1;
		phi = phi1;
	}
	SphericalPoint& operator=(SphericalPoint ref){
		r = ref.r;
		theta = ref.theta;
		phi = ref.phi;
		return *this;
	}
	SphericalPoint(const CartesianPoint &ref){
		float x1 = ref.x;
		float y1 = ref.y;
		float z1 = ref.z;
		float xsq = x1 * x1;
		float ysq = y1 * y1;
		float zsq = z1 * z1;
		r = sqrt(xsq + ysq + zsq);
		theta = atan(ref.y / ref.x);
		phi = atan(sqrt((xsq + ysq) / ref.z));
	}

	SphericalPoint& operator=(CartesianPoint &ref){
		float xsq = ref.x * ref.x;
		float ysq = ref.y * ref.y;
		float zsq = ref.z * ref.z;
		r = sqrt(xsq + ysq + zsq);
		theta = atan(ref.y / ref.x);
		phi = atan(sqrt((xsq + ysq)/ref.z));
		return *this;
	}
	friend ostream& operator<<(ostream& os, const SphericalPoint& obj1){
		os << "r = "<< obj1.r <<", theta = "<< obj1.theta <<", phi = " << obj1.phi << "\n";
		return os;
	}
};

Array of Structure (AoS) Implementation:

Below is the user defined data type which models the collection of 3D cartesian points in AoS format:

class CartesianPointImage{
public:
	vector<CartesianPoint> cartpoints;
	int num_of_points() const { return static_cast<int>(cartpoints.size()); }
	CartesianPointImage(size_t num_of_elements){
		cartpoints.resize(num_of_elements);
		for (unsigned int i = 0; i < num_of_elements; i++)
		{
			cartpoints[i] = static_cast<float>(i);
		}
	}
};

As shown above the instead of standard arrays, STL vector is used since that is most widely used data structure in application development these days. 

Below is the user defined data type which models the collection of Spherical co-ordinate points in AoS format:

class SphericalPointImage{
public:
	vector<SphericalPoint> sphpoints;
	int num_of_points() const { return static_cast<int>(sphpoints.size()); }
	SphericalPointImage(size_t num_of_elements){
		sphpoints.resize(num_of_elements);
	}
	SphericalPointImage& operator=(CartesianPointImage& ref){
		const auto & source = ref.cartpoints;
		int count = static_cast<int>(sphpoints.size());
		#pragma simd
		for (unsigned int i = 0; i < count; i++){
			sphpoints[i] = SphericalPoint(source[i]);
		}
		return *this;
	}
	friend ostream& operator<<(ostream& os, SphericalPointImage& obj){
		for (unsigned int i = 0; i < obj.num_of_points(); i++){
			os << "SphericalPoint[" << i << "] : " << obj.sphpoints[i] << "\n";
		}
		return os;
	}
};

The program can be build using the following command line options:

icl Cartesian_to_Spherical.cpp timer.cpp /QxCORE-AVX2 /Ob2 /Qrestrict /Qansi-alias /Qstd=c++11 /O2

The time taken by the AoS implementation is 29 seconds.

Intel(R) Vector Advisor Memory Access Pattern (MAP) analysis result for the AoS implementation is as shown below:

The MAP report clearly states that the 24% of the memory access are not unit strided. This is because the data is laid in memory in AoS format. The obvious solution is to convert the data layout from AoS to SoA. There are two possible solutions: 1) manually convert the data structures to SoA from AoS and also change the corresponding algorithms 2) Use Intel(R) SDLT to make minimal changes to the code which preserves the AoS coding interface but lays out the data in SoA format.

Manually converting the Data Structure to SoA:

In order to make use of the SoA data layout, the CartesianPoint and SphericalPoint structures can't be used. Rather change the implementation of CartesianPointImage and SphericalPointImage data structures as shown below:

CartesianPointImage:

class CartesianPointImageSOA{
public:
	//vector<CartesianPoint> cartpoints;
	vector<float> x;
	vector<float> y;
	vector<float> z;
	int num_of_points() const { return static_cast<int>(x.size()); }
	CartesianPointImageSOA(size_t num_of_elements){
		x.resize(num_of_elements);
		y.resize(num_of_elements);
		z.resize(num_of_elements);
		for (unsigned int i = 0; i < num_of_elements; i++)
		{
			float temp = static_cast<float>(i);
			x[i] = temp;
			y[i] = temp;
			z[i] = temp;
		}
	}
};

SphericalPointImage:

class SphericalPointImageSOA{
public:
	//vector<SphericalPoint> sphpoints;
	vector<float> r;
	vector<float> theta;
	vector<float> phi;
	int num_of_points() const { return static_cast<int>(r.size()); }
	SphericalPointImageSOA(size_t num_of_elements){
		r.resize(num_of_elements);
		theta.resize(num_of_elements);
		phi.resize(num_of_elements);
	}
	SphericalPointImageSOA& operator=(CartesianPointImageSOA& ref){
		//const auto & source = ref.cartpoints;
		const auto & sourcex = ref.x;
		const auto & sourcey = ref.y;
		const auto & sourcez = ref.z;
		int count = static_cast<int>(r.size());
#pragma simd
		for (unsigned int i = 0; i < count; i++){
			//sphpoints[i] = SphericalPoint(source[i]);
			float x = sourcex[i];
			float y = sourcey[i];
			float z = sourcez[i];
			float xsq = x*x;
			float ysq = y*y;
			float zsq = z*z;
			r[i] = sqrt(xsq + ysq + zsq);
			theta[i] = atan(y / x);
			phi[i] = atan(sqrt((xsq + ysq) / z));
		}
		return *this;
	}
};

The changes needed to switch to SoA implementation is demonstrated by commenting the previous state of the data structure. This implementation takes a non-intuitive and non Object-Orientated programming approach though the performance is better than the AoS implementation.

To build the SoA implementation, use the below command:

icl Cartesian_to_Spherical.cpp timer.cpp /QxCORE-AVX2 /Ob2 /Qrestrict /Qansi-alias /Qstd=c++11 /D__SOA /O2

The time taken by the SoA implementation is 19 seconds. Also the result from MAP analysis is:

By switching to SoA implementation and making sure the data is laid out in contiguous memory locations, all the memory access is converted to unit stride. 

Enable Intel SDLT:

Since C++ doesn't support compile time reflection, Intel SDLT provides a macro SDLT_PRIMITIVE to let the SDLT library aware of the user defined data structures which needs to be stored the SDLT container. Below is an example of how to use SDLT_PRIMITVE macro for CartesianPoint and SphericalPoint data structures:

	SDLT_PRIMITIVE(
		CartesianPoint,
		x,
		y,
		z
	)

	SDLT_PRIMITIVE(
		SphericalPoint,
		r,
		theta,
		phi
	)
	typedef sdlt::soa1d_container<CartesianPoint> CartesianPointContainer;
	typedef sdlt::soa1d_container<SphericalPoint> SphericalPointContainer;

Now that SDLT library knows what CartesianPoint and SphericalPoint is composed off, they can be stored in soa1d_container. soa1d_container is a 1 dimensional container provided by SDLT library and it converts the CartesianPoint and SphericalPoint to be stored in SoA format.

CartesianPointImage enabled using Intel SDLT:

class CartesianPointImage{
public:
	CartesianPointContainer cartpoints;
	int num_of_points() const { return static_cast<int>(cartpoints.size()); }

	CartesianPointImage(size_t num_of_elements){
		cartpoints.resize(num_of_elements);
		auto a = cartpoints.access();
		for (unsigned int i = 0; i < num_of_elements; i++)
		{
			CartesianPoint temp(i, i, i);
			a[i] = temp;
		}
	}
};

Instead of std::vector, above version of CartesianPointImage uses sdlt::soa1d_container<CartesianPoint>.

SphericalPointImage enabled using Intel SDLT:

class SphericalPointImage{
public:
	SphericalPointContainer sphpoints;
	int num_of_points() const { return static_cast<int>(sphpoints.size()); }

	SphericalPointImage(size_t num_of_elements){
		sphpoints.resize(num_of_elements);
	}
	SphericalPointImage& operator=(CartesianPointImage& ref){
		SDLT_INLINE_BLOCK
		{
			auto source = ref.cartpoints.access();
			auto dest = sphpoints.access();
			//auto source = ref.cartpoints.begin();
			//auto dest = sphpoints.begin();
			int count = num_of_points();
			#pragma simd
			for (unsigned int i = 0; i < count; i++){
				dest[i] = SphericalPoint(source[i]);
			}
		}
		return *this;
	}
	friend ostream& operator<<(ostream& os, SphericalPointImage& obj){
		for (unsigned int i = 0; i < obj.num_of_points(); i++){
			os << "SphericalPoint[" << i << "] : " << obj.sphpoints[i] << "\n";
		}
		return os;
	}
};

As shown above, the AoS interface in the code remains the same. Without compromising the Object-Oriented programming, Unit strided memory access is acheived using Intel(R) SDLT. The SDLT enabled version can be build using the following command:

icl Cartesian_to_Spherical.cpp timer.cpp /QxCORE-AVX2 /Qrestrict /Qansi-alias /Qstd=c++11 /D__SDLT_DEFINED /O2 /Ob2

The time taken by SDLT version of the executable is 19 seconds (34% performance improvement) and the corresponding MAP analysis result is:

System Specifications:

Processor: Intel(R) Core(TM) i5-4300M @2.6GHz 2.59GHz
Compiler Version: Intel(R) C++ Compiler 16.0 Update 2
Memory: 8GB