Can C++ code be used in a CUDA C program? When I include the iostream header file, the compiler gets confused by the kernel call.

CUDA does not support the STL, so std::vector simply does not work inside a CUDA kernel. The closest thing is the Thrust library. For example,

    thrust::device_vector<float> vect;

but it does not work inside device kernel code. It is just an STL-like structure with GPU-backed storage, so you can do the math on it through GPU computation with simple code like this:

    thrust::transform(
        vect.begin(),
        vect.end(),
        vect.begin(),
        [=] __device__ (float x) { return x + 1; }
    );
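
For completeness, a minimal self-contained host-side sketch of that approach might look like the following (the vector size, the fill value and the printout are just illustration; the __device__ lambda needs nvcc's --extended-lambda flag):

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/transform.h>
    #include <iostream>

    int main()
    {
        // storage lives on the GPU, but it is managed from host code
        thrust::device_vector<float> vect(8, 1.0f);

        // add 1 to every element, computed on the GPU
        thrust::transform(vect.begin(), vect.end(), vect.begin(),
                          [=] __device__ (float x) { return x + 1.0f; });

        // copy back to the host and print (prints 2 eight times)
        thrust::host_vector<float> result = vect;
        for (int i = 0; i < (int)result.size(); i++)
            std::cout << result[i] << " ";
        std::cout << std::endl;
        return 0;
    }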

So it is still a CUDA C/C++ program, but not explicit device kernel code. If you meant to run such containers in device kernel code (not on the host), then you have to implement one yourself, because there are many use cases and each of them will vary widely in performance depending on hardware and algorithm.

For example, a parallel vector in a kernel could look like this:

    class fake_heap
    {
    public:
        __device__ fake_heap(float * heap)
        {
            h = heap;
            ctr = 0;
        }
        int ctr;
        float * h;
    };

    class vector
    {
    public:
        __device__ vector(fake_heap * heapInput)
        {
            heap = heapInput;
            index = 0;
        }

        __device__ float & operator[](int idx)
        {
            return heap->h[idx];
        }

        __device__ void push_back(float data)
        {
            // atomicAdd on the first heap element acts as a crude shared counter;
            // its old value becomes this thread's slot, then the data is written there
            index = atomicAdd(heap->h, 1);
            heap->h[index] = data;
            __syncthreads();
        }

        int index;
        fake_heap * heap;
    };

    extern "C"
    __global__ void parallel_vector_test(int * __restrict__ data)
    {
        const int i = threadIdx.x + blockIdx.x * blockDim.x;
        __shared__ float memory[256];
        fake_heap heap(memory);
        vector vect(&heap);   // a vector that spans N threads
        vect.push_back(i);    // all threads push into the same vector
        data[i] = vect[i];
    }
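
A minimal host-side launcher for this kernel could be sketched as below (the host code here is only illustrative; the kernel is the one defined above):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 32;
        int data_h[n];
        int * data_d = nullptr;
        cudaMalloc(&data_d, n * sizeof(int));

        // one block of 32 threads, all pushing into the same shared-memory vector
        parallel_vector_test<<<1, n>>>(data_d);

        cudaMemcpy(data_h, data_d, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; i++) printf("%d,", data_h[i]);
        printf("\n");

        cudaFree(data_d);
        return 0;
    }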

The output of the data array is (launched with 32 threads, all pushing into the same 32-element vector):

    0,8,1,9,2,10,3,11,4,12,5,13,6,14,
    7,15,16,24,17,25,18,26,19,27,20,
    28,21,29,22,30,23,31

and not everyone wants it this way. It could be anything. Someone might want the values ordered by thread id, someone might need the vector size not to be limited to a few kilobytes and to live in global memory instead, and quite possibly someone would rather use per-thread vector structures, to have many independent vectors.
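
As one illustration of that last option, a per-thread vector with fixed capacity could be sketched like this (the capacity, the names such as thread_vector, and the test kernel are all hypothetical; no atomics are needed because each thread owns its storage):

    // a tiny per-thread vector: every thread has its own independent copy
    template<int CAPACITY>
    class thread_vector
    {
    public:
        __device__ thread_vector() : size_(0) {}

        __device__ void push_back(float data)
        {
            if (size_ < CAPACITY)
                buffer_[size_++] = data;   // no atomics: storage is private to the thread
        }

        __device__ float & operator[](int idx) { return buffer_[idx]; }
        __device__ int size() const { return size_; }

    private:
        float buffer_[CAPACITY];           // lives in registers or local memory
        int size_;
    };

    __global__ void per_thread_vector_test(float * __restrict__ out)
    {
        const int i = threadIdx.x + blockIdx.x * blockDim.x;
        thread_vector<8> v;                // each thread gets its own vector
        v.push_back(i * 2.0f);
        v.push_back(i * 2.0f + 1.0f);
        out[i * 2] = v[0];
        out[i * 2 + 1] = v[1];
    }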

But, as you can see from the class definitions, you can use C++ (even if somewhat trimmed in features) in kernel code. Look at the overloaded subscript operator; that is certainly not C99. From these C++-like capabilities you can build advanced containers that do the job the way you want. For example, you can do this in a CUDA device function:

    float * mem = new float[256];   // in-kernel allocation, done per thread
    delete[] mem;                   // array form of delete to match new[]

but this allocation is per-thread, comes from the device-side heap in global memory (possibly cached by L1), and is too slow, since there is (at least) a 256-element stride between neighboring threads accessing their own mem buffers at the same index. You may need a parallel allocator/indexer implementation for this to work fast.
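
One possible direction, reusing the atomic-counter idea from fake_heap above, is a per-block pool in shared memory from which threads reserve their own chunks with a much smaller stride between neighbors; the structure below is only a sketch, and the pool size, chunk size and names are assumptions:

    // a crude per-block bump allocator in shared memory (sketch only)
    struct block_pool
    {
        float * storage;   // backing store shared by the whole block
        int * counter;     // how many floats have been handed out so far

        // reserve 'count' consecutive floats and return a pointer to them
        __device__ float * alloc(int count)
        {
            int offset = atomicAdd(counter, count);
            return storage + offset;
        }
    };

    __global__ void pool_test(float * __restrict__ out)
    {
        __shared__ float pool_storage[1024];
        __shared__ int pool_counter;
        if (threadIdx.x == 0) pool_counter = 0;
        __syncthreads();

        block_pool pool { pool_storage, &pool_counter };

        // each thread reserves 4 consecutive floats from the shared pool
        float * mine = pool.alloc(4);
        for (int k = 0; k < 4; k++) mine[k] = threadIdx.x + k * 0.25f;

        __syncthreads();
        out[threadIdx.x] = mine[0];
    }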

When you include a header, the compiler gets confused about where its functions may be called from. The qualifiers

    __device__
    __global__
    __host__

do not exist in the definitions of functions in headers that are unrelated to CUDA. You can try some CUDA compiler options to make the compiler assume every function in such a header to be

    __device__

so that they can be called from device kernel code.
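
Alternatively, if you control the header, the portable fix is to annotate its functions yourself so the same code compiles for both host and device; a minimal sketch (the header name and function are hypothetical, and __CUDACC__ is only defined when nvcc processes the file):

    // my_math.h -- a header meant to be shared between host and device code
    #ifndef MY_MATH_H
    #define MY_MATH_H

    #ifdef __CUDACC__
    #define HD __host__ __device__
    #else
    #define HD
    #endif

    // callable both from ordinary CPU code and from inside a kernel
    HD inline float squared(float x) { return x * x; }

    #endif

With the qualifiers in place, squared can be used from main() and from a __global__ kernel alike.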