Why does a graph (roughly 10k to 100k edges and vertices) take so long to delete?


I'm writing a program that works with graphs. Inside a loop I define a graph and add vertices and edges to it (roughly 4k vertices and 40k edges). But when one iteration of the loop ends, it takes a very long time before the next iteration starts.

typedef adjacency_list<listS, vecS, bidirectionalS, property<vertex_name_t, LineLabel>> LineGraph;

for (int i = 0; i < num_comp; i++) {
        if (num_vertices(subgraphs[i]) == 1) {
            continue;
        }
        cout << "Subgraph " << i + 1 << ":" << endl;
        cout << "has " << num_vertices(subgraphs[i]) << " vertices" << endl;
        LineGraph llineg;
        get_line_graph(subgraphs[i], llineg);//vertices and edges are added here
    }

While debugging, it seems that when one loop iteration ends, the program jumps into a file named scoped_ptr.hpp and runs

~scoped_ptr() BOOST_SP_NOEXCEPT
    {
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
        boost::sp_scalar_destructor_hook( px );
#endif
        boost::checked_delete( px );
    }

Then it goes to checked_delete.hpp and runs

template<class T> inline void checked_delete(T * x) BOOST_NOEXCEPT
{
    // intentionally complex - simplification causes regressions
    typedef char type_must_be_complete[ sizeof(T)? 1: -1 ];
    (void) sizeof(type_must_be_complete);
    delete x;
}

I suppose the graph has too many edges, so deleting them from memory takes a long time.

What can I do to make this faster?

I made a small test for this problem. The graph has 10617 vertices and 72172 edges.

#include <boost/graph/adjacency_list.hpp>
#include <boost/timer.hpp>
#include <fstream>
#include <iostream>
using namespace boost;
using namespace std;

int main() {
    typedef adjacency_list<setS, vecS, bidirectionalS> G;
    timer t;
    for (int round = 0; round < 2; round++) {
        cout << t.elapsed() << endl;
        int num_v = 10617;
        int num_e = 72172;
        G g;
        for (int i = 0; i < num_v; i++) {
            add_vertex(g);
        }
        ifstream infile("wordassociation-2011.txt"); // read edges from file
        int a, b;
        char c;
        for (int i = 0; i < num_e; ++i) {
            infile >> a >> c >> b;
            add_edge(a, b, g);
            if ((i + 1) % 5000 == 0) {
                cout << "reading..." << endl;
            }
        }
        cout << t.elapsed() << endl;
    }
}

The result shows that it takes about 2 minutes before the next loop iteration starts:

0
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
reading...
0.981
122.192
reading...
reading...

c++ graph boost-graph
2 Answers

0 votes

So, I couldn't leave well enough alone :)

I downloaded the data set from http://w3.usf.edu/FreeAssociation/AppendixA/index.html (apparently that is your source, since the counts match exactly).

I wrote a parser that reads into the Row data type:

// part of speech
enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };

using Atom = AtomPool::Atom;

struct Row {
    Atom     cue, target;
    bool     normed_;
    unsigned _g;   // norm_group
    unsigned _p;   // partic_response
    double   fsg;  // forward_strength
    double   bsg;  // backward_strength
    double   msg;  // mediated_strength "2-step strength"
    double   osg;  // overlapping_strength
    unsigned _m;   // mediators
    unsigned mmia; // mediators_missing_in_action
    double   _o;   // overlapping_associates shared by cue and target
    unsigned omia; // overlapping_missing_in_action

    struct Attributes {
        unsigned ss;  // set size
        unsigned fr;  // frequency
        double   con; // concreteness
        bool     h;   // is homograph?
        POS      ps;  // part of speech
        double   mc;  // mean connectivity among its associates
        double   pr;  // probablity of a resonant connection
        double   rsg; // resonant strength
        unsigned uc;  // use code
    };
    Attributes qattr, tattr; // cue, target attributes
};

I optimized a little by using string atoms, since strings are likely to be reused a lot. I was right.
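The core idea, as a minimal hedged sketch (simplified, using std::unordered_map instead of the flat_set index the AtomPool class in the full listing below uses; the names here are illustrative only):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Atom = std::uint32_t;

// Map each distinct string to a small integer id, so repeated words cost
// one lookup instead of a fresh heap-allocated copy every time.
Atom atomize(std::string const& s,
             std::unordered_map<std::string, Atom>& index,
             std::vector<std::string>& storage) {
    if (auto it = index.find(s); it != index.end())
        return it->second; // deduplicated: reuse the existing atom
    Atom id = static_cast<Atom>(storage.size());
    storage.push_back(s);
    index.emplace(s, id);
    return id;
}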

Just the parser

Live On Coliru

#include <boost/container/flat_set.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/home/x3.hpp>

#include <iomanip>
#include <iostream>

#include <chrono>
#include <iomanip>
#include <iostream>
#include <utility>
namespace {
    static long elapsed() {
        auto now = std::chrono::high_resolution_clock::now;
        using namespace std::chrono_literals;
        static auto start = now();
        auto        t     = now();

        return (t - std::exchange(start, t)) / 1ms;
    }

    void trace(auto const&... args) {
        ((std::cout << std::setw(10) << elapsed() << "ms ") << ... << args) << std::endl;
    }
} // namespace

struct AtomPool {
    using Atom = std::uint32_t; // Backing::size_type;
    static constexpr Atom max() { return std::numeric_limits<Atom>::max(); }

    size_t size() const { return _index.size(); }
    size_t size_bytes() const { return _backing.size(); }

    AtomPool() {
        _backing.reserve(100'000);
        _index.reserve(12'000);
    }

    ~AtomPool() {
        if (_insertions)
            trace("AtomPool deduplicated ", (_dedup * 100.0 / _insertions), "% of ", _insertions,
                  " insertions");
    }

  private:
    using Backing = std::vector<char>;

    struct Cmp {
        AtomPool const& _pool;
        using is_transparent = void;

        template <typename... T> bool operator()(T const&... v) const { return (view(v) < ...); }

      private:
        std::string_view view(Atom id) const { return _pool.lookup(id); }
        std::string_view view(std::string_view sv) const { return sv; }
    };

    Backing                               _backing;
    boost::container::flat_set<Atom, Cmp> _index{Cmp{*this}};
    size_t _insertions = 0, _dedup = 0;

  public:
    Atom atomize(std::string const& text) {
        if (_backing.size() > max())
            throw std::range_error("AtomPool");

        _insertions += 1;

        if (auto it = _index.find(text); it != _index.end()) {
            _dedup += 1;
            return *it;
        }

        auto pos = _backing.size();
        _backing.insert(_backing.end(), begin(text), end(text));
        _backing.push_back('\0');

        _index.insert(pos);
        return pos;
    }

    std::string_view lookup(Atom id) const {
        assert(id < _backing.size());
        return _backing.data() + id;
    }
};

namespace DataSet {

    /*
     * |--------------+-----------------------------------------------------|
     * | Abbreviation | Equivalence                                         |
     * |==============+=====================================================|
     * | CUE          | Normed Word                                         |
     * | TARGET       | Response to Normed Word                             |
     * | NORMED?      | Is Response Normed?                                 |
     * | #G           | Group size                                          |
     * | #            | P  Number of Participants Producing Response        |
     * | FSG          | Forward Cue-to-Target Strength                      |
     * | BSG          | Backward Target-to-Cue Strength                     |
     * | MSG          | Mediated Strength                                   |
     * | OSG          | Overlapping Associate Strength                      |
     * | #            | M  Number of Mediators                              |
     * | MMIA         | Number of Non-Normed Potential Mediating Associates |             
     * | #            | O  Number of Overlaping Associates                  |
     * | OMIA         | Number of Non-Normed Overlapping Associates         |
     * | QSS          | Cue: Set Size                                       |
     * | QFR          | Cue: Frequency                                      |
     * | QCON         | Cue: Concreteness                                   |
     * | QH           | Cue is a Homograph?                                 |
     * | QPS          | Cue: Part of Speech                                 |
     * | QMC          | Cue: Mean Connectivity Among Its Associates         |
     * | QPR          | Cue: Probability of a Resonant Connection           |
     * | QRSG         | Cue: Resonant Strength                              |
     * | QUC          | Cue: Use Code                                       |
     * | TSS          | Target: Set Size                                    |
     * | TFR          | Target: Frequency                                   |
     * | TCON         | Target: Concreteness                                |
     * | TH           | Target is a Homograph?                              |
     * | TPS          | Target: Part of Speech                              |
     * | TMC          | Target: Mean Connectivity Among Its Assoc.          |
     * | TPR          | Target: Probability of a Resonant Connection        |
     * | TRSG         | Target: Resonant Strength                           |
     * | TUC          | Target: Use Code                                    |
     * |--------------+-----------------------------------------------------|
     */
    enum ColId {
        CUE, TARGET, NORMED_, _G, _P, FSG, BSG, MSG, OSG, _M, MMIAS, _O, OMIAS, QSS, //
        QFR, QCON, QH, QPS, QMC, QPR, QRSG, QUC, TSS, TFR, TCON, TH, TPS, TMC, TPR, //
        TRSG, TUC, //
        NUMCOLUMNS
    };

    // part of speech
    enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };

    using Atom = AtomPool::Atom;

    struct Row {
        Atom     cue, target;
        bool     normed_;
        unsigned _g;   // norm_group
        unsigned _p;   // partic_response
        double   fsg;  // forward_strength
        double   bsg;  // backward_strength
        double   msg;  // mediated_strength "2-step strength"
        double   osg;  // overlapping_strength
        unsigned _m;   // mediators
        unsigned mmia; // mediators_missing_in_action
        double   _o;   // overlapping_associates shared by cue and target
        unsigned omia; // overlapping_missing_in_action

        struct Attributes {
            unsigned ss;  // set size
            unsigned fr;  // frequency
            double   con; // concreteness
            bool     h;   // is homograph?
            POS      ps;  // part of speech
            double   mc;  // mean connectivity among its associates
            double   pr;  // probablity of a resonant connection
            double   rsg; // resonant strength
            unsigned uc;  // use code
        };
        Attributes qattr, tattr; // cue, target attributes
    };
} // namespace DataSet
#include <boost/fusion/adapted.hpp>

BOOST_FUSION_ADAPT_STRUCT(DataSet::Row::Attributes, ss, fr, con, h, ps, mc, pr, rsg, uc)
BOOST_FUSION_ADAPT_STRUCT(DataSet::Row, cue, target, normed_, _g, _p, fsg, bsg, msg, osg, _m, mmia, _o, omia,
                          qattr, tattr)

struct Reader {
    using Atom = AtomPool::Atom;
    using Row  = DataSet::Row;
    using Data = std::vector<Row>;

    Reader(std::string const& fname, AtomPool& pool) : _atoms(pool), _mfs{fname} { parse(); }

  private:
    AtomPool&                            _atoms;
    boost::iostreams::mapped_file_source _mfs;
    Data                                 _rows;

    using It = char const*;

    void parse() try {
        It const f = _mfs.begin(), l = _mfs.end();

        long const total_bytes = l - f;
        _rows.reserve(total_bytes / 150);
        namespace x3  = boost::spirit::x3;
        namespace enc = x3::standard; // iso8859_1;
        using enc::blank;
        using enc::char_;
        using enc::lit;
        using enc::space;
        // using enc::symbols;

        auto atom = [this](auto& ctx) {
            auto& vec = _attr(ctx);
            _val(ctx) = _atoms.atomize(std::string(vec.begin(), vec.end()));
        };

        auto normed = x3::symbols<bool>{}.add("NO", false)("YES", true).sym;
        normed.name("YES|NO");

        auto col = x3::rule<Atom, Atom>{"col"} = x3::lexeme[*~char_(",\r\n")][atom];

        auto header = x3::eps   //
            >> "CUE" >> ','     //
            >> "TARGET" >> ','  //
            >> "NORMED?" >> ',' //
            >> "#G" >> ','      //
            >> "#P" >> ','      //
            >> "FSG" >> ','     //
            >> "BSG" >> ','     //
            >> "MSG" >> ','     //
            >> "OSG" >> ','     //
            >> "#M" >> ','      //
            >> "MMIAS" >> ','   //
            >> "#O" >> ','      //
            >> "OMIAS" >> ','   //
            >> "QSS" >> ','     //
            >> "QFR" >> ','     //
            >> "QCON" >> ','    //
            >> "QH" >> ','      //
            >> "QPS" >> ','     //
            >> "QMC" >> ','     //
            >> "QPR" >> ','     //
            >> "QRSG" >> ','    //
            >> "QUC" >> ','     //
            >> "TSS" >> ','     //
            >> "TFR" >> ','     //
            >> "TCON" >> ','    //
            >> "TH" >> ','      //
            >> "TPS" >> ','     //
            >> "TMC" >> ','     //
            >> "TPR" >> ','     //
            >> "TRSG" >> ','    //
            >> "TUC" >> x3::eol;

        auto yen = lit('\xa5') | x3::iso8859_1::lit("¥");

        auto eoc = &(',' | x3::eol | x3::eoi); // end-of-column predicate

        auto strength            //
            = x3::double_ >> eoc //
            | -yen >> x3::attr(NAN);

        auto count                                    //
            = x3::uint_ >> &(',' | x3::eol | x3::eoi) //
            | -yen >> x3::attr(-1u);

        auto homograph = x3::matches[enc::graph - ','];
        using DataSet::POS;
        auto part_of_speech = //
            x3::symbols<POS>{}
                .add("N", POS::Noun) //
            ("V", POS::Verb)         //
            ("AJ", POS::Adjective)   //
            ("AD", POS::Adverb)      //
            ("P", POS::Pronoun)      //
            ("PP", POS::Preposition) //
            ("I", POS::Interjection) //
            ("C", POS::Conjunction)  //
            // these were undocumented but deduced from context
            ("INT", POS::Interjection) //
            ("A", POS::Adverb)         //
            ("AV", POS::Adverb)        //
            ("ADV", POS::Adverb)       //
            ("ADJ", POS::Adjective)    //
            ("PRP", POS::Pronoun)      // "ACCOUNT" probably mistagged in dataset
                .sym |
            eoc >> x3::attr(POS::Other);

        auto attrs = x3::rule<Row::Attributes, Row::Attributes>{"attrs"} = x3::eps

            > count > ','          // SS
            > count > ','          // FR
            > strength > ','       // CON
            > homograph > ','      // H
            > part_of_speech > ',' // PS
            > strength > ','       // M
            > strength > ','       // PR
            > strength > ','       // RSG
            > count;               // UC

        auto row = x3::rule<Row, Row>{"row"} = (*header >> !x3::eoi)

            > col > ','      // CUE
            > col > ','      // TARGET
            > normed > ','   // NORMED?
            > count > ','    // #G
            > count > ','    // #P
            > strength > ',' // FSG
            > strength > ',' // BSG
            > strength > ',' // MSG
            > strength > ',' // OSG
            > count > ','    // #M
            > count > ','    // MMIAS
            > strength > ',' // #O
            > count > ','    // OMIAS
            > attrs > ','    // qattr
            > attrs;         // tattr

        auto file = x3::rule<Data, Data>{"file"} = *(row >> x3::eol);

        phrase_parse(f, l, x3::eps > file > x3::eoi, blank, _rows);

        using DataSet::NUMCOLUMNS;
        trace("Parsed: ", total_bytes / 1024.0 / 1024, "MiB ", "into ", _rows.size(), " rows");

        trace(_atoms.size(), " atoms totaling ", _atoms.size_bytes(), " bytes of backing storage");
    } catch (boost::spirit::x3::expectation_failure<It> const& ef) {
        It   b = _mfs.begin(), e = _mfs.end();
        auto w = ef.where();
        auto sol = b + std::string_view(b, w).find_last_of("\r\n") + 1;

        auto line = std::string_view(sol, e);
        line      = line.substr(0, line.find_first_of("\r\n"));

        auto l = 1 + std::count(b, w, '\n'), c = 1 + (w - sol);
        trace("Expected ", ef.which(), " at L:", l, ":", c, " at\n", //
              "    ", line, "\n",                                    //
              std::setw(c + 4), '^', "----- here");                  //
        std::exit(1);
    }
};

int main(int argc, char** argv) {
    AtomPool pool;

    trace("Start");
    {
        for (std::string fname : std::vector(argv+1, argv+argc))
            Reader r{fname, pool};
    }
    trace("Done");
}

Prints, for the small demo dataset:

g++ -std=c++20 -O2 -Wall -DNDEBUG -pedantic -pthread main.cpp -lboost_iostreams
./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Start
         0ms Parsed: 0.00769901MiB into 40 rows
         0ms 70 atoms totaling 462 bytes of backing storage
         0ms Done
         0ms AtomPool deduplicated 12.5% of 80 insertions

Locally, parsing the entire dataset takes about 90ms:

/build/sotest dataset/Cue_Target_Pairs.* dataset/full.txt 

     0ms Start
    10ms Parsed: 1.15371MiB into 8476 rows
     0ms 3732 atoms totaling 26643 bytes of backing storage
     8ms Parsed: 1.03647MiB into 7590 rows
     0ms 5340 atoms totaling 39238 bytes of backing storage
    13ms Parsed: 1.42218MiB into 10500 rows
     0ms 6901 atoms totaling 51720 bytes of backing storage
     9ms Parsed: 1.14505MiB into 8440 rows
     0ms 7917 atoms totaling 59821 bytes of backing storage
     9ms Parsed: 1.26198MiB into 9287 rows
     0ms 8709 atoms totaling 66346 bytes of backing storage
    10ms Parsed: 1.35513MiB into 9888 rows
     0ms 9456 atoms totaling 72769 bytes of backing storage
     9ms Parsed: 1.25216MiB into 9214 rows
     0ms 10055 atoms totaling 77640 bytes of backing storage
     9ms Parsed: 1.19052MiB into 8781 rows
     0ms 10617 atoms totaling 82271 bytes of backing storage
    87ms Parsed: 9.8172MiB into 72176 rows
     0ms 10617 atoms totaling 82271 bytes of backing storage
     0ms Done
     0ms AtomPool deduplicated 96.3225% of 288704 insertions

Creating the graph

Let's create all (normed and non-normed) words as vertices and add all the edges. To do that, we need a way to easily map a cue to its corresponding data Row:

using Data = flat_multimap<Atom, Row>;
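
As an aside, a hypothetical sketch (not part of the answer's code, reusing AtomPool::Atom and DataSet::Row from the listing above) of how such a multimap lets us walk all rows for a given cue as one contiguous range:

#include <boost/container/flat_map.hpp>

using Data = boost::container::flat_multimap<AtomPool::Atom, DataSet::Row>;

// All rows sharing the same cue are adjacent, so one equal_range yields the whole group.
void for_each_row_of(Data const& rows, AtomPool::Atom cue) {
    auto [first, last] = rows.equal_range(cue);
    for (auto it = first; it != last; ++it) {
        DataSet::Row const& row = it->second; // one cue -> target association
        (void)row;                            // placeholder for real processing
    }
}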

Now we turn the vector-building into a semantic action that inserts into the map:

auto map_insert = [](auto& ctx) {
    auto& map = _val(ctx);
    auto& row = _attr(ctx);
    map.emplace_hint(map.end(), row.cue, std::move(row));
};

auto file = x3::rule<Data, Data>{"file"} = *(row[map_insert] >> x3::eol);

The indexing adds some overhead; parsing the entire dataset now takes about 120ms.

Now let's create the graph from _rows:

using VertexProp = boost::optional<DataSet::Row>;
using Graph      = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp>;

Using a temporary map from atoms to vertex descriptors, the implementation of buildGraph can be quite simple:

Graph buildGraph() const {
    using V = Graph::vertex_descriptor;

    Graph g;
    g.m_vertices.reserve(_atoms.size());

    std::map<Atom, V> vmap; // not flat-map for stability

    for (auto& [cue, row] : _rows) {
        auto s = vmap.find(cue);
        auto t = vmap.find(row.target);
        if (s == vmap.end()) s = vmap.emplace(cue, add_vertex(row, g)).first;
        if (t == vmap.end()) t = vmap.emplace(row.target, add_vertex(g)).first;

        add_edge(s->second, t->second, g);
    }

    return g;
}

In our main, let's build the graph after parsing:

int main(int argc, char** argv) {
    AtomPool pool;

    trace("Start");
    for (std::string fname : std::vector(argv + 1, argv + argc)) {
        Reader r{fname, pool};
        trace("Parsed ", fname);
        {
            Graph g = r.buildGraph();
            trace("Graph creation done; ", num_vertices(g), " vertices and ", num_edges(g), " edges");
        }
        trace("Graph destruction done");
    }
    trace("Done");
}

Live On Coliru

./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
     0ms Start
    17ms Parsed: 0.00769901MiB into 40 rows
     0ms 70 atoms totaling 462 bytes of backing storage
     0ms Parsed /Archive2/bd/de4002f5d4d3ee/main.cpp
     0ms Graph creation done; 70 vertices and 40 edges
     0ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 12.5% of 80 insertions

For the full dataset, the output becomes:

 11873ms Parsed: 9.8172MiB into 72176 rows
     0ms 10617 atoms totaling 82271 bytes of backing storage
     0ms Parsed dataset/full.txt
    12ms Graph creation done; 10617 vertices and 72176 edges
     1ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 96.3225% of 288704 insertions

Note that this is four more edges than you reported in your question. Perhaps you didn't add the undocumented part-of-speech codes like I did?

As you can see, almost no time is spent on destruction. And that is even while copying all the data out of the Reader, so that the Reader can be thrown away after creating the graph, and also so that VertexProp is mutable.

If you don't need that, you can use this instead:

using VertexProp = DataSet::Row const*;

// ... and later
if (s == vmap.end()) s = vmap.emplace(cue, add_vertex(&row, g)).first;
if (t == vmap.end()) t = vmap.emplace(row.target, add_vertex(nullptr, g)).first;

listS EdgeContainerSelector?
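
Presumably the variant being timed changes only the first selector of the Graph typedef shown earlier (a sketch, not taken from the answer; VertexProp as defined above):

using Graph = boost::adjacency_list<boost::listS, boost::vecS, boost::directedS, VertexProp>;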

I don't see a significant difference, although at least graph destruction now registers on the millisecond scale:

     0ms Start
 11930ms Parsed: 9.8172MiB into 72176 rows
     0ms Parsed dataset/full.txt
    12ms Graph creation done; 10617 vertices and 72176 edges
     1ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 92.6451% of 144352 insertions

bidirectionalS?

This doesn't seem to make much sense in the application domain¹. And it carries significant storage and insertion overhead. However, with some access patterns that may pay off in better query performance. YMMV.
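
For completeness, the variant under discussion would look something like this (a sketch; the choice of out-edge selector here is an assumption, only the directedness change is what the text describes):

using Graph = boost::adjacency_list<boost::vecS, boost::vecS, boost::bidirectionalS, VertexProp>;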

Compare Live or locally against the full data set:

     0ms Start
 12931ms Parsed: 9.8172MiB into 72176 rows
     0ms Parsed dataset/full.txt
    15ms Graph creation done; 10617 vertices and 72176 edges
     6ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 92.6451% of 144352 insertions

As predicted, the added inefficiency does indeed show up in the timings.

¹ The correlation between forward and backward strength for cues whose targets have been normed is positive but not high, r = .29 (n = 63,619), and the chances of correctly guessing backward strength from knowledge of forward strength are low.


0 votes

The problem is likely the use of dynamic allocation in the edge properties. If you can avoid that, and possibly use setS as the EdgeContainerSelector (the first template argument of adjacency_list), you might see a big improvement.
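
For example, starting from the typedef in the question, that would mean something like the following (a sketch; LineLabel is your own type, and keeping it free of heap-owning members is the other half of the suggestion):

// setS as the first selector; vertex property kept, but ideally LineLabel
// holds no dynamically allocated data so destruction stays cheap.
typedef boost::adjacency_list<boost::setS, boost::vecS, boost::bidirectionalS,
                              boost::property<boost::vertex_name_t, LineLabel>> LineGraph;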

UPDATE

I decided to experiment a bit with the dataset, to see what I would write and whether that reproduces the performance problem. Find it here.
