Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2
I am working on a project in C where I need to go through tens of millions of masks (of type ulong, 64-bit) and update an array (called target) of 64 short integers (uint16) based on a simple rule:

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++;
    }
}

The problem is that I need to do the above loop on tens of millions of masks and I need to finish in less than a second. I wonder if there is any way to speed it up, like using some sort of special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on Ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code, and it took about 2 seconds. I would love to make it run in under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}
Thanks in advance!
c optimization x86 x86-64 simd
Using uint16 as counters would likely result in overflow. – VTT
"like using some sort of special assembly instruction that represents the above loop." --> a good compiler will see this. Refer to your compiler's optimization settings to activate such. – chux
Suggestion: take down the question, add a test harness around it to report time, and post that code asking for improvements. As it is now, the question is too broad. – chux
You should fill the buffer with realistic data too. The number of set bits and their pattern affects the performance of this code; accidentally having all zeroes would unfairly benefit branchy code, which would perform worse on maximally random bits (50% chance, uncorrelated). – harold
@Gene: no, popcnt is a horizontal popcnt that adds up all the bits in one integer. The OP wants separate counts for every bit position over multiple integers. You want to unpack bits to something wider (e.g. nibbles or bytes), then vertical add (like paddb) as many times as you can without overflow, then expand to wider counters. – Peter Cordes
4 Answers
On my system, a 4-year-old MacBook (2.7 GHz Intel Core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. It runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}
The technique for inflating a byte to a 64-bit long is investigated and explained in this answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
It appears that my CPU doesn't perform as well as yours :-) I tried if ((pLong[j] & m) != 0) and it made no difference in time. Tried target[i] += (pLong[j] >> i) & 1;, and it is actually worse, since the time goes to 2.74 seconds. – pktCoder
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code. – Hadi Brais
@HadiBrais My CPU is i7-2670QM, as mentioned in the post. – pktCoder
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code. – Hadi Brais
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use -fprofile-generate / run the program / -fprofile-use. Also, your gcc 4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better. – Peter Cordes
For starters, there is the problem of unpacking the bits, because you really do not want to test each bit individually.
So follow this strategy for unpacking the bits into the bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can do this for blocks of up to 255 bitmasks at a time and accumulate them all into a single vector register. After that, you have to expect potential overflows, so you need to transfer the counts out.
After each block of 255, unpack again to 32-bit and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of the byte accumulators.)
At 8 instructions per bitmask (4 for each of the lower and upper 32 bits with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second per core on a recent CPU.
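A minimal sketch of this strategy using 128-bit SSE intrinsics (SSSE3 pshufb, which the OP's AVX-capable Sandybridge supports); the function name and layout are my own illustration, not code from this answer. Each 64-bit mask is unpacked into four 16-byte vectors of 0/1 bytes and accumulated for up to 255 masks before widening into the 32-bit target array:

/* Hedged sketch, not the answer's exact code: unpack each bit to a byte with
   pshufb + and + cmpeq, accumulate <= 255 masks in byte lanes, then widen. */
#include <immintrin.h>
#include <stdint.h>

void count_bits_per_position(const uint64_t *masks, size_t n, unsigned int target[64])
{
    /* lane i of vector v selects the mask byte that holds bit 16*v + i */
    const __m128i byte_sel[4] = {
        _mm_setr_epi8(0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1),
        _mm_setr_epi8(2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3),
        _mm_setr_epi8(4,4,4,4,4,4,4,4, 5,5,5,5,5,5,5,5),
        _mm_setr_epi8(6,6,6,6,6,6,6,6, 7,7,7,7,7,7,7,7),
    };
    const __m128i bit_sel = _mm_setr_epi8(1,2,4,8,16,32,64,(char)0x80,
                                          1,2,4,8,16,32,64,(char)0x80);
    size_t j = 0;
    while (j < n) {
        size_t block = (n - j > 255) ? 255 : (n - j);   /* <= 255 so byte lanes can't overflow */
        __m128i acc[4] = { _mm_setzero_si128(), _mm_setzero_si128(),
                           _mm_setzero_si128(), _mm_setzero_si128() };
        for (size_t k = 0; k < block; k++, j++) {
            __m128i m = _mm_loadl_epi64((const __m128i *)&masks[j]); /* mask in low 8 bytes */
            for (int v = 0; v < 4; v++) {
                __m128i bytes = _mm_shuffle_epi8(m, byte_sel[v]);    /* broadcast mask bytes */
                __m128i isset = _mm_cmpeq_epi8(_mm_and_si128(bytes, bit_sel), bit_sel);
                acc[v] = _mm_sub_epi8(acc[v], isset);                /* subtracting -1 adds 1 */
            }
        }
        uint8_t tmp[64];                                  /* widen byte counts into target */
        for (int v = 0; v < 4; v++)
            _mm_storeu_si128((__m128i *)&tmp[16 * v], acc[v]);
        for (int i = 0; i < 64; i++)
            target[i] += tmp[i];
    }
}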
Thanks @Ext3h, I will check it out. – pktCoder
See also "is there an inverse instruction to the movemask instruction in intel avx2?" for a summary of bitmap -> vector unpack methods for different sizes. But yes, unpacking bits to something narrower than the final count gives you more elements per instruction. Bytes are efficient with SIMD, and then you can do up to 255 (not 256) vertical paddb without overflow, before unpacking to 32-bit counters. The inner loop could be unrolled to start with a vector load and unpack different ways. Note that the OP does not have AVX2 or BMI2 on Sandybridge. – Peter Cordes
Doing this in blocks of 256 allows the possibility that a bit that is on in each of them overflows its count beyond the 255 that eight bits can represent. So the blocks need to be 255 or fewer items. – Eric Postpischil
One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ..., 56; the second accumulator counts bits in positions 1, 9, 17, ..., 57; and so on. After we are finished processing a block of data, we transfer the counts from the byte-wise accumulators into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms, is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is a factor of about 74. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

/*
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone. The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#define N          (10000000)
#define BITS       (64)
#define BLOCK_SIZE (255)

/* update the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

int main (void)
{
    double start_ref, stop_ref, start, stop;
    uint64_t *pLong;
    unsigned int target_ref [BITS] = {0};
    unsigned int target [BITS] = {0};
    int i, j;

    pLong = malloc (sizeof(pLong[0]) * N);
    if (!pLong) {
        printf("failed to allocate\n");
        return EXIT_FAILURE;
    }
    printf("p=%p\n", pLong);

    /* init data */
    for (j = 0; j < N; j++) {
        pLong[j] = KISS64;
    }

    /* count bits slowly */
    start_ref = second();
    for (j = 0; j < N; j++) {
        uint64_t m = 1;
        for (i = 0; i < BITS; i++) {
            if ((pLong[j] & m) == m) {
                target_ref[i]++;
            }
            m = (m << 1);
        }
    }
    stop_ref = second();

    /* count bits fast */
    start = second();
    for (j = 0; j < N / BLOCK_SIZE; j++) {
        sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
    }
    sum_block (pLong, target, j * BLOCK_SIZE, N);
    stop = second();

    /* check whether result is correct */
    for (i = 0; i < BITS; i++) {
        if (target[i] != target_ref[i]) {
            printf ("error @ %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
        }
    }

    /* print benchmark results */
    printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
    return EXIT_SUCCESS;
}
E3-1270 v2 is a quad-core IvyBridge, pretty comparable to the OP's Sandybridge in terms of microarchitecture (faster clock vs. 3.1 GHz, and you have 8MB L3 vs. 6MB). Both lack AVX2. I think even with AVX, starting with 2-bit accumulators and widening gradually might be a win here vs. unpacking all the way to bytes to start with. Your SWAR byte add doesn't need to kill the carry, because you need to avoid inputs that could possibly carry anyway. – Peter Cordes
@PeterCordes It occurred to me that I could eliminate the carry-killing logic in the byte-wise add, but there is also something to be said for re-using existing tested building blocks "as is", instead of fiddling with them at the last minute. – njuffa
Ok, but literally all you need is +! It doesn't need a function at all. – Peter Cordes
@PeterCordes I agree 100%. I have been thinking too hard today, my old brain is tired, and so I completely overlooked that. Thanks for pointing it out, I will change it right away. – njuffa
You also want sizeof(*pLong) for malloc, not sizeof(pLong). I accidentally compiled with -m32 from an old command I recalled and edited, and it segfaults in the init loop because you only alloc half as much memory as you need if sizeof(pointer) is less than uint64_t. I would edit but you're probably about to make another edit. Also, style: a block of declarations at the top of a function hasn't been required for a long time: C99 allows for(int i = ...) and so on, and it's generally considered better style to limit declarations to their scope. – Peter Cordes
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize this for you, even if you write it branchlessly to give them a better chance.
See "is there an inverse instruction to the movemask instruction in intel avx2?" for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is the best: unpacking bits to something narrower than the final count array gives you more elements per instruction. Bytes are efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to 32-bit counters.
It only takes 4x 16-byte __m128i vectors to hold all 64 elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy might be to widen gradually, starting with 2-bit accumulators and then masking/shifting to widen those to 4-bit. So in the inner-most loop, most of the operations work with "dense" data.
Using SWAR techniques inside scalar or SIMD registers is easy because we need to avoid the possibility of carry out of the top of an element anyway. With proper SIMD we'd lose those counts; with SWAR we'd corrupt the next element.
uint64_t *endp = input + length - 3*2*something;  //
while(input < endp) {
    ...
    // inner 2 loops that create 4-bit accumulators
    for(int j=0 ; j<2 ; j++){
        uint64_t accum2_lo=0, accum2_hi=0;
        for(int i=0 ; i<3 ; i++) {          // the compiler should fully unroll this
            uint64_t x = *input++;          // load a new bitmask
            uint64_t lo = x & 0b...01010101;       // note: binary not hex constant
            uint64_t hi = (x>>1) & 0b...01010101;
            accum2_lo += lo;
            accum2_hi += hi;                // can do up to 3 iterations of this before overflow
        }
        accum4_0 += accum2_lo & 0b...001100110011;  // same constant 4 times, because we shift *first*
        accum4_1 += (accum2_lo >> 2) & 0b...001100110011;
        accum4_2 += accum2_hi & 0b...001100110011;
        accum4_3 += (accum2_hi >> 2) & 0b...001100110011;
        // 6*2 = 12 < 15 so we can repeat this twice before widening to bytes
    }
    ...  // accum8_0..7 the same way
}
We don't care about order, so accum4_0 has 4-bit accumulators for every 4th bit.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum).
But your portable scalar version can be improved, too:
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m) != 0;, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much, much worse than we can do with SSE2 or AVX, or even scalar using @chqrlie's lookup table, which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1   // extract low bit directly
    unsigned long long tmp = pLong[j];
    for (int i = 0; i < 64; i++) {   // while(tmp) could mispredict, but good for sparse data
        target[i] += tmp & 1;
        tmp >>= 1;
    }
#else   // bool -> int shifting a mask
    unsigned long m = 1;
    for (i = 0; i < 64; i++) {
        target[i] += (pLong[j] & m) != 0;
        m = (m << 1);
    }
#endif
}
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode) or Windows x64, or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uop in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2:                                 # =>This Loop Header: Depth=1
                                         # outer loop loads a uint64 from the src
    mov     rdx, qword ptr [r14 + 8*rbx]
    mov     rsi, -256
.LBB1_3:                                 #   Parent Loop BB1_2 Depth=1
                                         # do {
    mov     edi, edx
    and     edi, 1                       # isolate the low bit
    add     dword ptr [rsi + target+256], edi    # and += into target
    mov     edi, edx
    shr     edi
    and     edi, 1                       # isolate the 2nd bit
    add     dword ptr [rsi + target+260], edi
    shr     rdx, 2                       # tmp >>= 2;
    add     rsi, 8
    jne     .LBB1_3                      # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
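A minimal scalar sketch of that early-exit idea (my own illustration, not code from this answer), assuming the pLong / target names from the question and an N for the element count:

// Hedged sketch: stop the inner loop as soon as no set bits remain in tmp.
// Wins for sparse masks, but the data-dependent exit branch can mispredict.
for (int j = 0; j < N; j++) {
    uint64_t tmp = pLong[j];
    for (int i = 0; tmp != 0; i++, tmp >>= 1)
        target[i] += tmp & 1;
}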
On Sandybridge-family, clang's unrolled loop above is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data; everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later.)
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().
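For example, a scalar xorshift64* fill loop (a hedged sketch of one way to do it, not the SSE2 xorshift+ mentioned above; the shift constants and multiplier are the standard xorshift64* ones) could replace the memset in the benchmark:

// Hedged sketch: fill the test buffer with ~50% random, uncorrelated bits
// instead of memset, so branchy and branchless versions are compared fairly.
#include <stdint.h>
#include <stddef.h>

static uint64_t xorshift64star(uint64_t *state) {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 0x2545F4914F6CDD1DULL;
}

static void fill_random(uint64_t *pLong, size_t n) {
    uint64_t state = 0x123456789ABCDEFULL;   // any nonzero seed
    for (size_t j = 0; j < n; j++)
        pLong[j] = xorshift64star(&state);
}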
Thanks @peter-cordes for the writeup. I will have to study it carefully since I believe it's the way to go. – pktCoder
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55081525%2fcount-each-bit-position-separately-over-many-64-bit-bitmasks-with-avx-but-not-a%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3
, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0)
saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1;
without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocaten");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%pn", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }n");
printf("took %f secsn", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target
array a local variable, as well as the inflate
array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate
array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
It appears that my CPU doesn't perform as well as yours :-) I triedif ((pLong[j] & m) != 0)
and it made no difference in time. Triedtarget[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.
– pktCoder
7 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@HadiBrais My CPU isi7-2670QM
, mentioned in the post.
– pktCoder
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
2
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use-fprofile-generate
/ run the program /-fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.
– Peter Cordes
6 hours ago
|
show 7 more comments
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3
, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0)
saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1;
without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocaten");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%pn", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }n");
printf("took %f secsn", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target
array a local variable, as well as the inflate
array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate
array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
It appears that my CPU doesn't perform as well as yours :-) I triedif ((pLong[j] & m) != 0)
and it made no difference in time. Triedtarget[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.
– pktCoder
7 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@HadiBrais My CPU isi7-2670QM
, mentioned in the post.
– pktCoder
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
2
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use-fprofile-generate
/ run the program /-fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.
– Peter Cordes
6 hours ago
|
show 7 more comments
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3
, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0)
saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1;
without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocaten");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%pn", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }n");
printf("took %f secsn", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target
array a local variable, as well as the inflate
array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate
array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3
, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0)
saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1;
without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocaten");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%pn", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }n");
printf("took %f secsn", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target
array a local variable, as well as the inflate
array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate
array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
edited 1 hour ago
answered 7 hours ago
chqrliechqrlie
61.2k747104
61.2k747104
It appears that my CPU doesn't perform as well as yours :-) I triedif ((pLong[j] & m) != 0)
and it made no difference in time. Triedtarget[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.
– pktCoder
7 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@HadiBrais My CPU isi7-2670QM
, mentioned in the post.
– pktCoder
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
2
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use-fprofile-generate
/ run the program /-fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.
– Peter Cordes
6 hours ago
|
show 7 more comments
It appears that my CPU doesn't perform as well as yours :-) I triedif ((pLong[j] & m) != 0)
and it made no difference in time. Triedtarget[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.
– pktCoder
7 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@HadiBrais My CPU isi7-2670QM
, mentioned in the post.
– pktCoder
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
2
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use-fprofile-generate
/ run the program /-fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.
– Peter Cordes
6 hours ago
It appears that my CPU doesn't perform as well as yours :-) I tried
if ((pLong[j] & m) != 0)
and it made no difference in time. Tried target[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.– pktCoder
7 hours ago
It appears that my CPU doesn't perform as well as yours :-) I tried
if ((pLong[j] & m) != 0)
and it made no difference in time. Tried target[i] += (pLong[j] >> i) & 1;
, it is actually worse, since time goes to 2.74 seconds.– pktCoder
7 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@pktCoder and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@HadiBrais My CPU is
i7-2670QM
, mentioned in the post.– pktCoder
6 hours ago
@HadiBrais My CPU is
i7-2670QM
, mentioned in the post.– pktCoder
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
@pktCoder Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code.
– Hadi Brais
6 hours ago
2
2
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use
-fprofile-generate
/ run the program / -fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.– Peter Cordes
6 hours ago
@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use
-fprofile-generate
/ run the program / -fprofile-use
. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better.– Peter Cordes
6 hours ago
|
show 7 more comments
For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.
After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).
At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.
Thanks @Ext3h, I will check it out.
– pktCoder
10 hours ago
1
See also is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. But yes, unpack bits to something narrower than the final count gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 (not 256) verticalpaddb
without overflow, before unpacking to 32-bit counters. The inner loop could be unrolled to start with a vector load and unpack different ways. Note that the OP does not have AVX2 or BMI2 on Sandybridge.
– Peter Cordes
6 hours ago
1
Doing this in blocks of 256 allows the possibility of a bit that is on in each of them overflow its count beyond the 255 that eight bits can represent. So the blocks need to be 255 or fewer items.
– Eric Postpischil
5 hours ago
One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ..., 56; the second accumulator counts bits in positions 1, 9, 17, ..., 57; and so on. After we are finished processing a block of data, we transfer the counts from the byte-wise accumulators into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            /* byte_wise_sum[k] holds 8 byte counters, for bit positions k, k+8, ..., k+56 */
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms, is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor of 74. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;
    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone. The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            /* byte_wise_sum[k] holds 8 byte counters, for bit positions k, k+8, ..., k+56 */
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}
int main (void)
{
    double start_ref, stop_ref, start, stop;
    uint64_t *pLong;
    unsigned int target_ref [BITS] = {0};
    unsigned int target [BITS] = {0};
    int i, j;

    pLong = malloc (sizeof(pLong[0]) * N);
    if (!pLong) {
        printf("failed to allocate\n");
        return EXIT_FAILURE;
    }
    printf("p=%p\n", pLong);

    /* init data */
    for (j = 0; j < N; j++) {
        pLong[j] = KISS64;
    }

    /* count bits slowly */
    start_ref = second();
    for (j = 0; j < N; j++) {
        uint64_t m = 1;
        for (i = 0; i < BITS; i++) {
            if ((pLong[j] & m) == m) {
                target_ref[i]++;
            }
            m = (m << 1);
        }
    }
    stop_ref = second();

    /* count bits fast */
    start = second();
    for (j = 0; j < N / BLOCK_SIZE; j++) {
        sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
    }
    sum_block (pLong, target, j * BLOCK_SIZE, N);
    stop = second();

    /* check whether result is correct */
    for (i = 0; i < BITS; i++) {
        if (target[i] != target_ref[i]) {
            printf ("error @ %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
        }
    }

    /* print benchmark results */
    printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
    return EXIT_SUCCESS;
}
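For readers on the asker's gcc/Linux setup, the program can be built and timed with something along these lines (the file name is a placeholder):
gcc -O3 -std=c99 count_bits.c -o count_bits
./count_bits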
E3-1270 v2 is a quad-core IvyBridge, pretty comparable to the OP's Sandybridge in terms of microarchitecture. (Faster clock vs. 3.1GHz, and you have 8MB L3 vs. 6MB). Both lack AVX2. I think even with AVX, starting with 2-bit accumulators and widening gradually might be a win here vs. unpacking all the way to bytes to start with. Your SWAR byte add doesn't need to kill the carry, because you need to avoid inputs that could possibly carry anyway.
– Peter Cordes
1 hour ago
@PeterCordes It occurred to me that I could eliminate the carry-killing logic in the byte-wise add, but there is also something to be said for re-using existing tested building blocks "as is", instead of fiddling with them at the last minute.
– njuffa
1 hour ago
Ok, but literally all you need is +! It doesn't need a function at all.
– Peter Cordes
1 hour ago
@PeterCordes I agree 100%. I have been thinking too hard today, my old brain is tired, and so I completely overlooked that. Thanks for pointing it out, I will change it right away.
– njuffa
1 hour ago
You also want sizeof(*pLong) for malloc, not sizeof(pLong). I accidentally compiled with -m32 from an old command I recalled and edited, and it segfaults in the init loop because you only alloc half as much memory as you need if sizeof(pointer) is less than uint64_t. I would edit but you're probably about to make another edit. Also, style: a block of declarations at the top of a function hasn't been required for a long time: C99 allows for(int i = ...) and so on, and it's generally considered better style to limit declarations to their scope.
– Peter Cordes
1 hour ago
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize this for you, even if you write it branchlessly to give them a better chance.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is the best: unpacking bits to something narrower than the final count array gives you more elements per instruction. Unpacking to bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to 32-bit counters.
It only takes 4x 16-byte __m128i vectors to hold all 64 elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy might be to widen gradually, starting with 2-bit accumulators and then mask/shift to widen those to 4-bit, so that in the inner-most loop most of the operations are working with "dense" data.
Using SWAR techniques inside scalar or SIMD registers is easy because we need to avoid the possibility of carry out of the top of an element anyway. With proper SIMD we'd lose those counts; with SWAR we'd corrupt the next element.
uint64_t *endp = input + length - 3*2*something;  //
while(input < endp) {
    ...
    // inner 2 loops that create 4-bit accumulators.
    for(int j=0 ; j<2 ; j++){
        uint64_t accum2_lo=0, accum2_hi=0;
        for(int i=0 ; i<3 ; i++) {               // the compiler should fully unroll this
            uint64_t x = *input++;               // load a new bitmask
            uint64_t lo = x & 0b...01010101;     // note: binary not hex constant
            uint64_t hi = (x>>1) & 0b...01010101;
            accum2_lo += lo;
            accum2_hi += hi;                     // can do up to 3 iterations of this before
        }
        accum4_0 += accum2_lo & 0b...001100110011;  // same constant 4 times, because we shift *first*
        accum4_1 += (accum2_lo >> 2) & 0b...001100110011;
        accum4_2 += accum2_hi & 0b...001100110011;
        accum4_3 += (accum2_hi >> 2) & 0b...001100110011;
        // 6*2 = 12 < 15 so we can repeat this twice before widening to bytes
    }
    ... /// accum8_0..7 the same way.
}
We don't care about order, so accum4_0
has 4-bit accumulators for every 4th bit.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
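As a concrete (scalar) illustration of that widening scheme, here is a small block routine with the elided binary constants filled in as their hex equivalents; the 15-mask block size, the helper name and the final nibble-extraction step are my own choices, not taken from the answer above:
#include <stdint.h>

/* count bit positions for exactly 15 masks into target[64] */
static void count_block_swar (const uint64_t *in, unsigned int *target)
{
    const uint64_t M1 = 0x5555555555555555ULL;  /* 0b...01010101 */
    const uint64_t M2 = 0x3333333333333333ULL;  /* 0b...00110011 */
    uint64_t acc4[4] = {0, 0, 0, 0};            /* 16 x 4-bit counters each */

    for (int j = 0; j < 5; j++) {               /* 5 x 3 = 15 masks; 4-bit fields reach at most 15 */
        uint64_t acc2_lo = 0, acc2_hi = 0;      /* 32 x 2-bit counters each */
        for (int i = 0; i < 3; i++) {           /* at most 3 adds per 2-bit field */
            uint64_t x = *in++;
            acc2_lo += x & M1;                  /* even bit positions */
            acc2_hi += (x >> 1) & M1;           /* odd bit positions */
        }
        acc4[0] += acc2_lo & M2;                /* positions 0, 4, 8, ... */
        acc4[1] += (acc2_lo >> 2) & M2;         /* positions 2, 6, 10, ... */
        acc4[2] += acc2_hi & M2;                /* positions 1, 5, 9, ... */
        acc4[3] += (acc2_hi >> 2) & M2;         /* positions 3, 7, 11, ... */
    }
    /* spread the 16 nibbles of each accumulator into the 32-bit counters */
    static const int base[4] = {0, 2, 1, 3};
    for (int k = 0; k < 4; k++)
        for (int f = 0; f < 16; f++)
            target[base[k] + 4 * f] += (unsigned int)((acc4[k] >> (4 * f)) & 0xF);
}
Calling this N/15 times covers most of the input, and the remaining N % 15 masks can go through the plain scalar loop. A full version would add one more widening step into byte counters, as in the other answers, so that whole blocks of 255 masks are processed before touching target[].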
But your portable scalar version can be improved, too:
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += ((pLong[j] & m) != 0);, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much, much worse than we can do with SSE2 or AVX, or even scalar using @chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1   // extract low bit directly
    unsigned long long tmp = pLong[j];
    for (int i = 0; i < 64; i++) {     // while(tmp) could mispredict, but good for sparse data
        target[i] += tmp & 1;
        tmp >>= 1;
    }
#else   // bool -> int shifting a mask
    unsigned long m = 1;
    for (int i = 0; i < 64; i++) {
        target[i] += (pLong[j] & m) != 0;
        m = (m << 1);
    }
#endif
}
Note that unsigned long
is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
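A minimal sketch of that source-level unroll (variable names follow the earlier loop; this is an illustration, not code from the answer):
unsigned long long tmp = pLong[j];
for (int i = 0; tmp != 0; i += 2) {   // early-out once the remaining high bits are all zero
    target[i]     += tmp & 1;
    target[i + 1] += (tmp >> 1) & 1;
    tmp >>= 2;                        // the loop branch can compile to shr / jnz
}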
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data; everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().
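For reference, a plain scalar xorshift128+ is already enough to produce unpredictable fill data (the answer suggests an SSE2 version for more speed); this is the standard recurrence with the original shift constants, and the seed values below are arbitrary nonzero placeholders:
#include <stdint.h>

static uint64_t xs_state[2] = { 0x123456789abcdef0ULL, 0x0fedcba987654321ULL };  /* any nonzero seed */

static uint64_t xorshift128plus (void)
{
    uint64_t x = xs_state[0];
    const uint64_t y = xs_state[1];
    xs_state[0] = y;
    x ^= x << 23;
    xs_state[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return xs_state[1] + y;
}

/* fill the test array:  for (j = 0; j < N; j++) pLong[j] = xorshift128plus(); */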
Thanks @peter-cordes for the writeup. I will have to study it carefully since I believe it's the way to go.
– pktCoder
2 hours ago
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize this for you, even if you write it branchlessly to give them a better chance.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is the best: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to 32-bit counters.
It only takes 4x 16-byte __m128i
vectors to hold all 64 elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[]
after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb
(_mm_shuffle_epi8
).
An even better strategy might be to widen gradually, starting with 2-bit accumulators and then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data.
Using SWAR techniques inside scalar or SIMD registers is easy because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t *endp = input + length - 3*2*something; //
while(input < endp) {
...
// inner 2 loops that create 4-bit accumulators.
for(int j=0 ; j<2 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
uint64_t lo = x & 0b...01010101; // note: binary not hex constant
uint64_t hi = (x>>1) & 0b...01010101;
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this before
}
accum4_0 += accum2_lo & 0b...001100110011; // same constant 4 times, because we shift *first*
accum4_1 += (accum2_lo >> 2) & 0b...001100110011;
accum4_2 += accum2_hi & 0b...001100110011;
accum4_3 += (accum2_hi >> 2) & 0b...001100110011;
// 6*2 = 12 < 15 so we can repeat this twice before widening to bytes
}
... /// accum8_0..7 the same way.
}
We don't care about order, so accum4_0
has 4-bit accumulators for every 4th bit.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
But your portable scalar version can be improved, too:
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0
for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++
version, they'd have to use a masked store like x86 vmaskmovps
to avoid a non-atomic read / rewrite of unmodified elements of target
. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m != 0);
, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1
. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1
into add same,same
to efficiently left shift, but they still use xor-zero / test
/ setne
to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using @chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t
):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
Note that unsigned long
is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test
/ setnz
. Without unrolling, bt
/ setc
might have been equal, but compilers are bad at using bt
to implement bool (x & (1ULL << n))
, or bts
to implement x |= 1ULL << n
.
If many words have their highest set bit far below bit 63, looping on while(tmp)
could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp
every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2
/ jnz
.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg
with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate
/ test-run / gcc -O3 -march=native -fprofile-use
, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset
with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand()
.
Thanks @peter-cordes for the writeup. I will have to study it carefully since I believe it's the way to go.
– pktCoder
2 hours ago
add a comment |
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize this for you, even if you write it branchlessly to give them a better chance.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is the best: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to 32-bit counters.
It only takes 4x 16-byte __m128i
vectors to hold all 64 elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[]
after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb
(_mm_shuffle_epi8
).
An even better strategy might be to widen gradually, starting with 2-bit accumulators and then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data.
Using SWAR techniques inside scalar or SIMD registers is easy because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t *endp = input + length - 3*2*something; //
while(input < endp) {
...
// inner 2 loops that create 4-bit accumulators.
for(int j=0 ; j<2 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
uint64_t lo = x & 0b...01010101; // note: binary not hex constant
uint64_t hi = (x>>1) & 0b...01010101;
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this before
}
accum4_0 += accum2_lo & 0b...001100110011; // same constant 4 times, because we shift *first*
accum4_1 += (accum2_lo >> 2) & 0b...001100110011;
accum4_2 += accum2_hi & 0b...001100110011;
accum4_3 += (accum2_hi >> 2) & 0b...001100110011;
// 6*2 = 12 < 15 so we can repeat this twice before widening to bytes
}
... /// accum8_0..7 the same way.
}
We don't care about order, so accum4_0
has 4-bit accumulators for every 4th bit.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
But your portable scalar version can be improved, too:
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize this for you, even if you write it branchlessly to give them a better chance.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different element sizes. Ext3h's suggestion in another answer is the best: unpacking bits to something narrower than the final count array gives you more elements per instruction. Bytes are efficient with SIMD, and you can do up to 255 vertical paddb additions without overflow before unpacking to 32-bit counters.
It only takes 4x 16-byte __m128i
vectors to hold all 64 elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[]
after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb
(_mm_shuffle_epi8
).
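To make that concrete, here is a minimal sketch of the byte-counter version (my own code, not from this answer: the names unpack16_to_bytes and count_bits_bytes are invented, and it uses a plain in-order unpack per 16-bit chunk rather than the fancier unrolled pshufb variants). Four __m128i byte accumulators stay in registers, and we spill into 32-bit totals at most every 255 masks so paddb can't overflow:
#include <immintrin.h>   // SSE2/SSSE3 intrinsics; compile with -mavx (or at least -mssse3)
#include <stdint.h>
#include <stddef.h>

// Unpack the low 16 bits of a mask into sixteen 0/1 bytes: pshufb broadcasts the
// two mask bytes, then a per-lane bit-select constant isolates one bit per byte.
static inline __m128i unpack16_to_bytes(unsigned bits16)
{
    __m128i v = _mm_cvtsi32_si128((int)(bits16 & 0xFFFF));
    const __m128i bytesel = _mm_setr_epi8(0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1);
    const __m128i bitsel  = _mm_setr_epi8(1,2,4,8,16,32,64,(char)0x80,
                                          1,2,4,8,16,32,64,(char)0x80);
    __m128i b = _mm_and_si128(_mm_shuffle_epi8(v, bytesel), bitsel);
    return _mm_and_si128(_mm_cmpeq_epi8(b, bitsel), _mm_set1_epi8(1));  // 0 or 1 per byte
}

static void count_bits_bytes(const uint64_t *masks, size_t n, uint32_t counts[64])
{
    size_t i = 0;
    while (i < n) {
        __m128i acc[4] = { _mm_setzero_si128(), _mm_setzero_si128(),
                           _mm_setzero_si128(), _mm_setzero_si128() };
        size_t chunk = (n - i < 255) ? (n - i) : 255;   // at most 255 adds of 0/1 per byte
        for (size_t j = 0; j < chunk; j++, i++) {
            uint64_t m = masks[i];
            acc[0] = _mm_add_epi8(acc[0], unpack16_to_bytes((unsigned)(m      )));
            acc[1] = _mm_add_epi8(acc[1], unpack16_to_bytes((unsigned)(m >> 16)));
            acc[2] = _mm_add_epi8(acc[2], unpack16_to_bytes((unsigned)(m >> 32)));
            acc[3] = _mm_add_epi8(acc[3], unpack16_to_bytes((unsigned)(m >> 48)));
        }
        for (int k = 0; k < 4; k++) {                   // widen bytes into the 32-bit totals
            uint8_t tmp[16];
            _mm_storeu_si128((__m128i *)tmp, acc[k]);
            for (int b = 0; b < 16; b++)
                counts[16 * k + b] += tmp[b];
        }
    }
}
The widening step here is deliberately naive scalar code; it only runs once per 255 masks, so the pshufb + paddb inner loop dominates the run time.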
An even better strategy might be to widen gradually, starting with 2-bit accumulators and then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data.
Using SWAR techniques inside scalar or SIMD registers works here because we have to avoid the possibility of carry-out of the top of an element anyway: with proper SIMD we'd lose those counts, with SWAR we'd corrupt the next element. A scalar sketch (with the widening from 4-bit fields into byte accumulators elided):
uint64_t *endp = input + length - 3*2*something;  // each outer iteration consumes 3*2*something masks ("something" left as a placeholder)
while (input < endp) {
    // ... outer bookkeeping / byte accumulators elided ...
    uint64_t accum4_0 = 0, accum4_1 = 0, accum4_2 = 0, accum4_3 = 0;
    // inner 2 loops that create 4-bit accumulators
    for (int j = 0; j < 2; j++) {
        uint64_t accum2_lo = 0, accum2_hi = 0;
        for (int i = 0; i < 3; i++) {            // the compiler should fully unroll this
            uint64_t x = *input++;               // load a new bitmask
            uint64_t lo = x        & 0x5555555555555555ull;   // ...01010101 in binary
            uint64_t hi = (x >> 1) & 0x5555555555555555ull;
            accum2_lo += lo;
            accum2_hi += hi;      // up to 3 iterations of this before a 2-bit field could overflow
        }
        accum4_0 += accum2_lo        & 0x3333333333333333ull; // ...00110011: same constant 4 times, because we shift *first*
        accum4_1 += (accum2_lo >> 2) & 0x3333333333333333ull;
        accum4_2 += accum2_hi        & 0x3333333333333333ull;
        accum4_3 += (accum2_hi >> 2) & 0x3333333333333333ull;
        // 6*2 = 12 < 15, so we could repeat this block twice before widening to bytes
    }
    // ... widen accum4_0..3 into byte accumulators accum8_0..7 the same way ...
}
We don't care about order, so accum4_0
has 4-bit accumulators for every 4th bit.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
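As a hedged illustration of that last point (my code, not the answer's; the helper name widen2_to_4 is invented): a 2-bit to 4-bit widening step in a SIMD register can use 64-bit element shifts even though the fields are only 2 bits wide, because the mask cleans up whatever the shift drags across field boundaries. The no-carry guarantee between the packed 4-bit fields still comes from the iteration accounting above.
#include <immintrin.h>

// Split packed 2-bit counts into two sets of packed 4-bit counts and accumulate them.
static inline void widen2_to_4(__m128i accum2, __m128i *even4, __m128i *odd4)
{
    const __m128i m33 = _mm_set1_epi8(0x33);                       // ...00110011 repeating
    *even4 = _mm_add_epi64(*even4, _mm_and_si128(accum2, m33));
    *odd4  = _mm_add_epi64(*odd4,
                           _mm_and_si128(_mm_srli_epi64(accum2, 2), m33));
}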
But your portable scalar version can be improved, too:
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0
for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++
version, they'd have to use a masked store like x86 vmaskmovps
to avoid a non-atomic read / rewrite of unmodified elements of target
. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m) != 0;, using bool->int conversion to get a 0 / 1 integer. (Note the parentheses: != binds tighter than &, so they have to go around the & expression.)
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1
. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1
into add same,same
to efficiently left shift, but they still use xor-zero / test
/ setne
to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using @chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t
):
for (int j = 0; j < 10000000; j++) {
#if 1   // extract low bit directly
    unsigned long long tmp = pLong[j];
    for (int i = 0; i < 64; i++) {   // while(tmp) could mispredict, but good for sparse data
        target[i] += tmp & 1;
        tmp >>= 1;
    }
#else   // bool -> int from testing against a shifting mask
    unsigned long m = 1;
    for (int i = 0; i < 64; i++) {
        target[i] += (pLong[j] & m) != 0;
        m <<= 1;
    }
#endif
}
Note that unsigned long is not guaranteed to be a 64-bit type: it's 32-bit in the x86-64 System V x32 ABI (ILP32 in 64-bit mode), on Windows x64, and in 32-bit ABIs like i386 System V.
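As an aside, a sketch of the fix (my code, not this answer's; the name count_masks_portable is invented, and it uses 32-bit counters to dodge the uint16 overflow concern raised in the comments): the fixed-width type sidesteps that pitfall entirely.
#include <stdint.h>
#include <stddef.h>

static void count_masks_portable(const uint64_t *pLong, size_t n, uint32_t target[64])
{
    for (size_t j = 0; j < n; j++) {
        uint64_t m = 1;                       // uint64_t is 64-bit on every ABI
        for (int i = 0; i < 64; i++) {
            target[i] += (pLong[j] & m) != 0;
            m <<= 1;
        }
    }
}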
Compiling the original inner loop on the Godbolt compiler explorer with gcc, clang, and ICC shows gcc's loop is one uop shorter, but all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test
/ setnz
. Without unrolling, bt
/ setc
might have been equal, but compilers are bad at using bt
to implement bool (x & (1ULL << n))
, or bts
to implement x |= 1ULL << n
.
If many words have their highest set bit far below bit 63, looping on while(tmp)
could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp
every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2
/ jnz
.
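A sketch of that source-level transformation (my rendering of the suggestion, untested for codegen; the name count_one_mask_unroll2 is invented): test tmp only once per two bits, so the loop branch can compile to a shift-by-2 and jnz.
#include <stdint.h>

static void count_one_mask_unroll2(uint64_t tmp, uint32_t target[64])
{
    for (int i = 0; tmp != 0; i += 2) {       // exits early once only zero bits remain
        target[i]     += tmp & 1;
        target[i + 1] += (tmp >> 1) & 1;
        tmp >>= 2;
    }
    // bits above the highest set bit would only ever have added 0, so skipping them is fine
}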
Back to the clang loop above: on Sandybridge-family it is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg
with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate
/ test-run / gcc -O3 -march=native -fprofile-use
, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset
with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand()
.
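For example (my stand-in, not the answer's code: a scalar xorshift64* rather than the SSE2 xorshift+ mentioned above, but enough to get uncorrelated, roughly 50%-set bits; the name fill_random_masks is invented):
#include <stdint.h>
#include <stddef.h>

static void fill_random_masks(uint64_t *buf, size_t n, uint64_t seed)
{
    uint64_t s = seed ? seed : 0x9E3779B97F4A7C15ull;   // xorshift state must be non-zero
    for (size_t i = 0; i < n; i++) {
        s ^= s >> 12;
        s ^= s << 25;
        s ^= s >> 27;
        buf[i] = s * 0x2545F4914F6CDD1Dull;             // xorshift64* output scramble
    }
}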
edited 1 hour ago
answered 5 hours ago
Peter Cordes
Thanks @peter-cordes for the writeup. I will have to study it carefully since I believe it's the way to go.
– pktCoder
2 hours ago
using uint16 as counters would likely result in overflow
– VTT
10 hours ago
"like using some sort special assembly instruction that represents the above loop." --> a good compiler will see this. Refer to your compiles optimization settings to activate such.
– chux
10 hours ago
Suggestion: take down the question, add a test harness around it to report time, and post that code asking for improvements. As it is now, the question is too broad.
– chux
10 hours ago
You should fill the buffer with realistic data too. The number of set bits and their pattern affects the performance of this code; accidentally having all zeroes would unfairly benefit branchy code, which would perform worse on maximally random bits (50% chance, uncorrelated).
– harold
8 hours ago
@Gene: no, popcnt is a horizontal popcount that adds up all the bits in one integer. The OP wants separate counts for every bit-position over multiple integers. You want to unpack bits to something wider (e.g. nibbles or bytes), then vertical-add (like paddb) as many times as you can without overflow, then expand to wider counters.
– Peter Cordes
6 hours ago