直接的C代码需要很长时间才能运行，有什么方法可以优化它吗？

Question

我编写了一个基本代码，利用蛙跳积分器（又名踢-漂移-踢）来模拟重力系统。到目前为止，它已经成功地模拟了您输入的任何势能的轨道。

但是，对于适当的测试粒子模拟，我想将粒子数增加到约 100 万，这就是主要问题的体现 - 它非常慢，大约 35 分钟 1 Myr（模拟进行的时间步长为0.001 Myr，因此 for 循环中有 1000 步）。

我的大部分编码经验都是使用 Python 进行的，因此我担心我错过了 C 代码中的一些关键内容，这些关键内容会显着降低代码速度。我测试了不同的优化方法，例如并行化我的时间循环，以便我可以一次集成更多粒子，但这要么略微改善了运行时间，要么根本没有改善。

下面我将粘贴带有所有必要注释的代码，但我减少了这个问题的代码大小，因为我使用的一些电位非常庞大并且占用空间。我将感谢您的指导和建议，我想真正了解 C 的运行方式，而不是将自己限制在半生不熟的代码中，我可能错过了关键的 C 编码实践，并为代码快速运行添加了人为障碍等。

一些紧迫的问题是：

执行轨道积分时，while 循环比 for 循环慢吗？
我用来保存粒子信息的 if 条件是否会减慢我的代码速度？
有没有办法创建一个 buffer 数组来转储粒子坐标/速度以便稍后保存？我担心 20x1,000,000x7 双数组太大而无法提高计算效率。
如果没有粒子-粒子相互作用需要计算，使用 OpenMP 会被认为是多余的还是仍然可以加快计算速度？

编辑：我不知道编译如何影响代码的性能，所以我将在此处添加以下信息：我只是使用 gcc-13 编译它，没有其他选项，除非我正在测试 OpenMP，在这种情况下我添加了 -fopenmp。哪些优化器标志最适合我的代码，还是我只使用 -O3 进行编译？

这是代码

#include <stdlib.h>
#include <math.h>
// #include <omp.h> //previous experiments, touched on at the end

//Model Parameters

//Halo
const double Mh=2e12; //Mass of halo
const double ah=16e3; //Halo paramter
const double ch=0.53980675319058735; //Halo paramter c=15.3, simplified for formula value

//t_force parameters
const double Mt_f=2.5*3.43*1e10; //Maximum mass
const double R_t=(15*2.5)*1000; // Fixed paramater determining the scale of the potential
const double mdt=2000; //Mass gain rate (units = Myr)

// Units
#define pi 3.141592653589793
#define G 0.00449857598201285

// Sim Setup
const double theta=75*pi/180.; //Inclination of disc, explained further down
const double CCD=1; // Period of file saving, explained further down
const double dt=0.001; //Fixed time step (units = Myr)
const double runtime=4e3; //Total runtime (units = Myr)
const int timelim=runtime/dt; //Integer limit in for-loop
char name[999]="SIMNAME"; //Simulation snapshot name format
//Number of particles
#define pno 1000000

//Norms of vectors norm()
double norm(double x[3])
    {
      return sqrt(x[0]*x[0]+x[1]*x[1]+x[2]*x[2]);
    }

// Function that rotates particles around x-axis
void rotate(double x[3],double v[3],double theta)
{
    double yd,zd,vy,vz;
    yd=x[1]*cos(theta)-x[2]*sin(theta);//
    zd=x[1]*sin(theta)+x[2]*cos(theta);//
    x[1]=yd;//
    x[2]=zd;//
    vy=v[1]*cos(theta)-v[2]*sin(theta);
    vz=v[1]*sin(theta)+v[2]*cos(theta);
    v[1]=vy;
    v[2]=vz;
}

//Mass function of t_force - a time dependant potential (only increasing mass, other parameters stay constant)
double Mt(double t)
    {
        double mass;
        if (t<mdt) {mass=Mt_f*t/mdt;}
        else {mass=Mt_f;}
        return mass;
    }

//Calculating force for every particle (only potentials, no particle-particle interactions). h_force is a static potential representing an NFW halo
void calculate_force(double x[3], double a[3], double mt_c)
{
    //Normss
    double md=norm(x);
    //Forces
    double t_force[3]={0.,0.,0.};
    double h_force;
    torus_force_r[0]=((G*mt_c*x[0]*((3*pow(x[2],2))/(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2))-0.5))/pow(R_t,3)-(0.3750*G*mt_c*((480*x[0]*pow(x[2],2))/pow(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2),2)-(35*pow(x[2],4)*(32*pow(x[0],3)+32*x[0]*pow(x[1],2)+32*x[0]*pow(x[2],2)))/pow(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4),2))*pow(pow(x[0],2)+pow(x[1],2)+pow(x[2],2),2))/pow(R_t,5)-(1.5*G*mt_c*x[0]*((35*pow(x[2],4))/(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4))-(30*pow(x[2],2))/(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2))+0.3750)*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/pow(R_t,5)-(6*G*mt_c*x[0]*pow(x[2],2)*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/(pow(R_t,3)*pow(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2),2)));

    torus_force_r[1]=((G*mt_c*x[1]*((3*pow(x[2],2))/(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2))-0.5))/pow(R_t,3)-(0.3750*G*mt_c*((480*x[1]*pow(x[2],2))/pow(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2),2)-(35*pow(x[2],4)*(32*pow(x[0],2)*x[1]+32*pow(x[1],3)+32*x[1]*pow(x[2],2)))/pow(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4),2))*pow(pow(x[0],2)+pow(x[1],2)+pow(x[2],2),2))/pow(R_t,5)-(1.5*G*mt_c*x[1]*((35*pow(x[2],4))/(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4))-(30*pow(x[2],2))/(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2))+0.3750)*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/pow(R_t,5)-(6*G*mt_c*x[1]*pow(x[2],2)*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/(pow(R_t,3)*pow(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2),2)));

    torus_force_r[2]=((G*mt_c*x[2]*((3*pow(x[2],2))/(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2))-0.5))/pow(R_t,3)+(0.3750*G*mt_c*pow(pow(x[0],2)+pow(x[1],2)+pow(x[2],2),2)*((60*x[2])/(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2))-(480*pow(x[2],3))/pow(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2),2)-(140*pow(x[2],3))/(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4))+(35*pow(x[2],4)*(32*pow(x[0],2)*x[2]+32*pow(x[1],2)*x[2]+32*pow(x[2],3)))/pow(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4),2)))/pow(R_t,5)+(0.5*G*mt_c*((6*x[2])/(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2))-(12*pow(x[2],3))/pow(2*pow(x[0],2)+2*pow(x[1],2)+2*pow(x[2],2),2))*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/pow(R_t,3)-(1.5*G*mt_c*x[2]*((35*pow(x[2],4))/(8*pow(x[0],4)+16*pow(x[0],2)*pow(x[1],2)+16*pow(x[0],2)*pow(x[2],2)+8*pow(x[1],4)+16*pow(x[1],2)*pow(x[2],2)+8*pow(x[2],4))-(30*pow(x[2],2))/(8*pow(x[0],2)+8*pow(x[1],2)+8*pow(x[2],2))+0.3750)*(pow(x[0],2)+pow(x[1],2)+pow(x[2],2)))/pow(R_t,5));    

for (int ind=0;ind<3;ind++)
    {
        h_force=...;
        a[ind]=t_force[ind]+h_force;

    }
}


//Calculating force for every particle, for setting up initial conditions
void force(double x[3], double a[3])
{
    //Norms
    double md=norm(x);
    //Forces
    double h_force;
    for (int ind=0;ind<3;ind++)
    {
        h_force=...;
        a[ind]=h_force;
    }
    
}

// Assigning velocities to particles 
void initial_cond(double x[3],double v[3], double a[3])
{
    //Norms
    double md=norm(x);
    double phi=atan2(x[1],x[0]);
    //Forces
    force(x,a);
    v[0]=sqrt(norm(a)*md)*cos(phi+pi/2);
    v[1]=sqrt(norm(a)*md)*sin(phi+pi/2);
    v[2]=0;
}


//'Kick' part of a Leapfrog integrator
void kick(double x[3], double v[3], double a[3], double dt, double t)
{ 
    calculate_force(x, a, t);
    for (int k=0;k<3;k++)
    {
        v[k]+=a[k]*dt/2.;
    }
}
//'Drift' part of Leapfrog integrator
void drift(double x[3], double v[3], double dt)
{
    for (int k=0;k<3;k++)
    {
         x[k]+=v[k]*dt;
    }
}

// Function to save particles positions/velocities/energy/etc to a file
void snapshot( char nom[999], int snapshot,char *n1, char *n2, char *n3, double x[pno][3], double v[pno][3],double mt_c)
{
    FILE *fill;
    char NAM[999];
    sprintf(NAM, "%s_%.12s_%.12s_%.12s.%06d.dat", nom,n1,n2,n3,snapshot);
    fill = fopen(NAM,"w");    
    fprintf(fill,"%.16f\t%.16f\t%.16f\t%.16f\t%.16f\t%.16f\t%.16f\n", 0.0,0.0,0.0,0.0,0.0,0.0,0.0);
    for (int j=0;j<pno;j++)
            {
            fprintf(fill,"%.16f\t%.16f\t%.16f\t%.16f\t%.16f\t%.16f\t%.16f\n", x[j][0],x[j][1],x[j][2],v[j][0],v[j][1],v[j][2],
                calculate_erg(x[j],v[j],mt_c));
            }
    fclose(fill);

}
/////////////////////Finished defining functions/////////////////////

//Core of code
int main(int argc, char **argv)
{
    
    //Setting coordinates and velocities, prior to launch
    double t, mt_c;
    double mt_c;
    int step=1;

    // Coordinates, velocities, and accelerations of toy particles
    static double x[pno][3];
    static double v[pno][3];
    static double a[pno][3];

    // Loading particle coordinates from .IC file, getting the initial velocities and writing them into the aformentioned arrays
    FILE *f1;
    char IC[999];
    sprintf(IC,"NFW.IC");
    char output_H[9];
    char output_TM[9];
    char output_TR[9];
    f1 = fopen(IC,"r");    
    for (int j=0;j<pno;j++)
    {
        fscanf(f1,"%lf\t%lf\t%lf\n", &x[j][0],&x[j][1],&x[j][2]);
        initial_cond(x[j],v[j],a[j]);
        // Rotating the coordiantes and velocities, my test requires a tilted disc of toy particles
        rotate(x[j],v[j],pi/2-theta);
    }

    // Writing down simulation parameters for filenames
    snprintf(output_H,9,"%f",log10(Mh));
    snprintf(output_TM,9,"%f",log10(Mt_f));
    snprintf(output_TR,9,"%f",log10(R_t));

    // Setting initial mass of t_force potential to 0, saving the particles as an "initial conditions" file
    mt_c=0;
    snapshot(name,0,output_H,output_TM,output_TR,x,v,mt_c);

    // Main orbit integration
    // Prior to this it was a while-loop but from previous work in Python
    // I was cautious it would slow computation
    for (int k=0;k<timelim;k+=1)
    {
        // Get current simulation time and t_force mass to use in force calculations
        t=k*dt;
        mt_c=Mt(t);

        // Leapfrog integrator loop
        // Prior to this version, I added a "#pragma omp parallel for" but it only slowed the code down
        for (int j=0;j<pno;j++)
        {
            kick(x[j],v[j],a[j],dt,mt_c);
            drift(x[j],v[j],dt);
            kick(x[j],v[j],a[j],dt,mt_c);
        }

        // When time meets the condition below - save snapshot. 
        // I am cautious that this is a bottleneck, but many runs were inconclusive
        if (t>CCD*step)
        {   
            snapshot(name,step,output_H,output_TM,output_TR,x,v,mt_c);
            step+=1;
        }

    }

}

Answer 1

您会因为担心

if

构造和循环类型之间的差异而为小事而烦恼。 C 不是一种复杂的语言，它是编译语言，因此所有这些控制流结构都由编译器翻译为简单的机器指令。

您的代码主要由影响性能的三个因素主导：

处理的数据量巨大。
您正在处理的许多复杂的数学表达式和数学库函数调用。
事实上，您正在主循环的每迭代中执行文件系统访问 - 没有明显的原因！

最后一个占主导地位，再多的编译器优化也不会让你的文件系统运行得更快，即使在具有缓存等功能的 SSD 上，文件 I/O 也会比 RAM 访问慢几个数量级。我敢打赌，如果您可以在“快照”中分解文件 I/O，您将看到性能的最显着提升。

不仅是文件 I/O，还包括使用格式化 I/O (

e.g. fprintf()

)，即使没有文件 I/O，这也会很慢。无论使用何种语言，任何代码应用程序都是如此。

在没有

snapshot()

的情况下进行测试 - 在任何情况下都不会读取该数据，因此完全没有必要。如果您确实需要该数据，请将其缓冲在内存中并以较低的频率转储（或者如果只是在同一应用程序的其他地方使用它，则根本不转储），或者如果数据需要持久保存（存在），则使用内存映射文件终止后）。

我担心 20 x 1000000 x 7 双数组太大而无法提高计算效率。

不像文件 I/O 那样低效！ 140MB 并不算大，除非您运行的是 30 年前的 PC！ DDR4 RAM 的传输速率>17GB/s。你再次为小事而操心。一般来说，在 PC 应用程序中，在出现问题时使用内存会加快速度。

直接的C代码需要很长时间才能运行，有什么方法可以优化它吗？

问题描述投票：0回答：1

1个回答

最新问题

直接的C代码需要很长时间才能运行，有什么方法可以优化它吗？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1