Which commit has this blob?

Question

Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

Answer 1

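Both of the following scripts take the blob's SHA1 as the first argument and, after it, optionally, any arguments that git log will understand. For example, --all to search all branches instead of just the current one, or -g to search the reflog, or whatever else you fancy.

Here it is as a shell script, short and sweet, but slow: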

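#!/bin/sh
obj_name="$1"
shift
# tformat: terminates every line with a newline, so `read` also sees the last commit
# (with format: the final line is unterminated and the loop would skip it)
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    # List the commit's whole tree recursively and look for the blob's hash
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done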

And an optimised version in Perl, still quite short but much faster:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

my $obj_name;

sub check_tree {
    my ( $tree ) = @_;
    my @subtree;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)/
                or die "unexpected git-ls-tree output";
            return 1 if $2 eq $obj_name;
            push @subtree, $2 if $1 eq 'tree';
        }
    }

    check_tree( $_ ) && return 1 for @subtree;

    return;
}

memoize 'check_tree';

die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
    if not @ARGV;

my $obj_short = shift @ARGV;
$obj_name = do {
    local $ENV{'OBJ_NAME'} = $obj_short;
     `git rev-parse --verify \$OBJ_NAME`;
} or die "Couldn't parse $obj_short: $!\n";
chomp $obj_name;

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
    or die "Couldn't open pipe to git-log: $!\n";

while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $subject ) = split " ", $_, 3;
    print "$commit $subject\n" if check_tree( $tree );
}
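
To try either script you need a blob hash to feed it. The helpers below are a minimal sketch of two standard ways to get one; the function names blob_id and blob_id_of_file are made up for illustration, and the commented invocation assumes you saved the Perl script above as an executable called git-find-blob on your PATH (the name comes from its usage message, the install location is my assumption).

```shell
#!/bin/sh
# blob_id REV PATH: print the object name of the blob stored at PATH in REV.
# "REV:PATH" is Git's revision syntax for "the object at this path in that tree".
blob_id() {
    git rev-parse --verify "$1:$2"
}

# blob_id_of_file FILE: hash a file on disk the way Git would store it,
# without writing anything to the object database.
blob_id_of_file() {
    git hash-object -- "$1"
}

# Example, assuming a git-find-blob script on PATH (hypothetical setup):
#   git-find-blob "$(blob_id HEAD README)" --all
```

For an unmodified, committed file both helpers should print the same hash, which is a quick way to check that you are looking for the right object.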

Answer 2

Unfortunately the scripts were a bit slow for me, so I had to optimize a bit. Luckily I had not only the hash but also the path of the file.

git log --all --pretty=format:%H -- <path> | xargs -n1 -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
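
The one-liner above can also be written as a small function, which is easier to reuse and to quote safely. This is only a sketch: the name commits_with_blob_at_path is hypothetical, and it resolves the object at the path with git rev-parse instead of grepping git ls-tree output. Like the one-liner, it only walks commits that changed the path, so it reports the commits where the blob arrived at that path, not every later commit that still contains it.

```shell
#!/bin/sh
# commits_with_blob_at_path BLOB PATH [git-log args...]
# Print the full hash of every commit that touched PATH and left BLOB there.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
commits_with_blob_at_path() (
    blob=$(git rev-parse --verify "$1") || exit 1   # expand an abbreviated hash
    path=$2
    shift 2
    # tformat: terminates the last line too, so `read` does not drop it
    git log "$@" --pretty=tformat:%H -- "$path" |
    while read -r commit; do
        # Object stored at PATH in this commit; skip commits where PATH is absent
        found=$(git rev-parse --verify --quiet "$commit:$path" 2>/dev/null) || continue
        if [ "$found" = "$blob" ]; then
            echo "$commit"
        fi
    done
)

# Example: commits_with_blob_at_path <hash> <path> --all
```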

Answer 3

I thought this would be a generally useful thing to have, so I wrote up a little perl script to do it:

#!/usr/bin/perl -w

use strict;

my @commits;
my %trees;
my $blob;

sub blob_in_tree {
    my $tree = $_[0];
    if (defined $trees{$tree}) {
        return $trees{$tree};
    }
    my $r = 0;
    open(my $f, "git cat-file -p $tree|") or die $!;
    while (<$f>) {
        if (/^\d+ blob (\w+)/ && $1 eq $blob) {
            $r = 1;
        } elsif (/^\d+ tree (\w+)/) {
            $r = blob_in_tree($1);
        }
        last if $r;
    }
    close($f);
    $trees{$tree} = $r;
    return $r;
}

sub handle_commit {
    my $commit = $_[0];
    open(my $f, "git cat-file commit $commit|") or die $!;
    my $tree = <$f>;
    die unless $tree =~ /^tree (\w+)$/;
    if (blob_in_tree($1)) {
        print "$commit\n";
    }
    while (1) {
        my $parent = <$f>;
        last unless $parent =~ /^parent (\w+)$/;
        push @commits, $1;
    }
    close($f);
}

if (!@ARGV) {
    print STDERR "Usage: git-find-blob blob [head ...]\n";
    exit 1;
}

# The first argument is the blob; any remaining arguments are heads to start from.
$blob = shift @ARGV;
if (@ARGV) {
    foreach (@ARGV) {
        handle_commit($_);
    }
} else {
    handle_commit("HEAD");
}
while (@commits) {
    handle_commit(pop @commits);
}

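I'll put this up on github when I get home this evening.

Update: It looks like somebody already did this. That one uses the same general idea, but the details are different and the implementation is much shorter. I don't know which would be faster, but performance is probably not a concern here!

Update 2: For what it's worth, my implementation is orders of magnitude faster, especially for a large repository. That git ls-tree -r really hurts.

Update 3: I should note that my performance comments above apply to the implementation I linked in the first update. Aristotle's implementation performs comparably to mine. More details are in the comments for those who are curious.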

Answer 4

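While the original question does not ask for it, I think it is useful to also check the staging area to see whether a blob is referenced there. I modified the original shell script to do this and found what was referencing a corrupt blob in my repository: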

#!/bin/sh
obj_name="$1"
shift
# Check the index (staging area) first
git ls-files --stage \
| if grep -q "$obj_name"; then
    echo Found in staging area. Run git ls-files --stage to see.
fi

# tformat: terminates every line with a newline, so `read` also sees the last commit
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done
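
One caveat with the grep in the staging-area check: it matches the hash anywhere on the line, including inside a file name, and it only says that something matched. The sketch below compares the object column exactly and prints the matching paths. The function name staged_paths_for_blob is hypothetical, and it expects the full hash, so expand an abbreviation with git rev-parse first.

```shell
#!/bin/sh
# staged_paths_for_blob BLOB
# Reads `git ls-files --stage` output on stdin and prints the paths whose
# object column equals BLOB exactly. Each input line has the form:
#   <mode> <object> <stage><TAB><path>
staged_paths_for_blob() {
    awk -v want="$1" -F '\t' '
        {
            split($1, meta, " ")    # meta[1]=mode, meta[2]=object, meta[3]=stage
            # Appending "" forces a string comparison, never a numeric one
            if ((meta[2] "") == (want "")) print $2
        }'
}

# Typical use inside a repository (the hash is a placeholder):
#   git ls-files --stage | staged_paths_for_blob "$(git rev-parse --verify <hash>)"
```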

Answer 5

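Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

With Git 2.16 (Q1 2018), git describe would be a good solution, since it was taught to dig deeper into trees to find a <commit-ish>:<path> that refers to a given blob object.

See commit 644eb60, commit 4dbc59a, commit cdaed0c, commit c87b653, commit ce5b6f9, commit 91904f5 (16 Nov 2017), and commit 2deda00 (02 Nov 2017) by Stefan Beller (stefanbeller). (Merged by Junio C Hamano -- gitster -- in commit 556de1a, 28 Dec 2017)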

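builtin/describe.c: describe a blob

Sometimes users are given a hash of an object and they want to identify it further (e.g. use verify-pack to find the largest blobs, but what are these? Or this very SO question "Which commit has this blob?")

When describing commits, we try to anchor them to tags or refs, as these are conceptually on a higher level than the commit. And if there is no ref or tag that matches exactly, we're out of luck. So we employ a heuristic to make up a name for the commit. These names are ambiguous: there might be different tags or refs to anchor to, and there might be different paths in the DAG to travel to arrive at the commit precisely.

When describing a blob, we want to describe the blob from a higher layer as well, which is a tuple of (commit, deep/path), as the tree objects involved are rather uninteresting. The same blob can be referenced by multiple commits, so how do we decide which commit to use?

This patch implements a rather naive approach: as there are no back pointers from blobs to the commits in which the blob occurs, we'll start walking from any tips available, listing the blobs in order of the commit, and once we have found the blob, we'll take the first commit that listed the blob.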

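For example:
git describe --tags v0.99:Makefile
conversion-901-g7672db20c2:Makefile
tells us the Makefile as it was in v0.99 was introduced in commit 7672db2.

The walking is performed in reverse order to show the introduction of a blob rather than its last occurrence.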

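That means the git describe man page adds to the purposes of this command:

Instead of simply describing a commit using the most recent tag reachable from it, git describe will actually give an object a human readable name based on an available ref when used as git describe <blob>.

If the given object refers to a blob, it will be described as <commit-ish>:<path>, such that the blob can be found at <path> in the <commit-ish>, which itself describes the first commit in which this blob occurs in a reverse revision walk from HEAD.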

But:

BUGS

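Tree objects, as well as tag objects not pointing at commits, cannot be described. When describing blobs, lightweight tags pointing at blobs are ignored, but the blob is still described as <commit-ish>:<path> despite the lightweight tag being favorable.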
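
To try this with a bare blob hash rather than a rev:path expression, something like the following sketch should work on Git 2.16 or newer. The wrapper name describe_blob is hypothetical. The --always flag is my addition: it lets git describe fall back to an abbreviated commit hash when no tag is reachable, which is often the case in small or untagged repositories.

```shell
#!/bin/sh
# describe_blob BLOB
# Ask git describe (Git >= 2.16) to name BLOB as <commit-ish>:<path>.
# The search is a reverse walk from HEAD, so the result names the first
# commit on the current history in which the blob occurs.
describe_blob() {
    git describe --always "$1"
}

# Example (the hash is a placeholder):
#   describe_blob <hash>
```

Note that this yields a single commit and path, not the full list of commits the question asks for.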

Answer 6

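So... I needed to find all files over a given size limit in a repo over 8GB in size, with over 108,000 revisions. I adapted Aristotle's perl script along with a ruby script I wrote to reach this complete solution.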

First, git gc. Do this to ensure all objects are in packfiles; we don't scan objects that are not in pack files.

Next, run this script to locate all blobs over CUTOFF_SIZE bytes. Capture the output to a file like "large-blobs.log".

#!/usr/bin/env ruby

require 'log4r'

# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
#
#
GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')

# 10MB cutoff
CUTOFF_SIZE=1024*1024*10
#CUTOFF_SIZE=1024

begin

  include Log4r
  log = Logger.new 'git-find-large-objects'
  log.level = INFO
  log.outputters = Outputter.stdout

  git_dir = %x[ git rev-parse --show-toplevel ].chomp

  if git_dir.empty?
    log.fatal "ERROR: must be run in a git repository"
    exit 1
  end

  log.debug "Git Dir: '#{git_dir}'"

  pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
  log.debug "Git Packs: #{pack_files.to_s}"

  # For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby
  #
  # Short version is, git verify-pack flushes buffers only on line endings, so
  # this works, if it didn't, then we could get partial lines and be sad.

  types = {
    :blob => 1,
    :tree => 1,
    :commit => 1,
  }

  total_count = 0
  counted_objects = 0
  large_objects = []

  IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe|
    pipe.each do |line|
      # The output of git verify-pack -v is:
      # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
      data = line.chomp.split(' ')
      # types are blob, tree, or commit
      # we ignore other lines by looking for that
      next unless types[data[1].to_sym] == 1
      log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
      hash = {
        :sha1 => data[0],
        :type => data[1],
        :size => data[2].to_i,
      }
      total_count += hash[:size]
      counted_objects += 1
      if hash[:size] > CUTOFF_SIZE
        large_objects.push hash
      end
    end
  end

  log.info "Input complete"

  log.info "Counted #{counted_objects} totalling #{total_count} bytes."

  log.info "Sorting"

  large_objects.sort! { |a,b| b[:size] <=> a[:size] }

  log.info "Sorting complete"

  large_objects.each do |obj|
    log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
  end

  exit 0
end

Next, edit the file to remove any blobs you don't want to search for, as well as the INPUT_THREAD bits at the top. Once you have only lines for the sha1s you want to find, run the following script like this:

cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log

Where the git-find-blob script is below.

#!/usr/bin/perl

# taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob
# and modified by Carl Myers <[email protected]> to scan multiple blobs at once
# Also, modified to keep the discovered filenames
# vi: ft=perl

use 5.008;
use strict;
use Memoize;
use Data::Dumper;

my $BLOBS = {};

MAIN: {

    memoize 'check_tree';

    die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
        if not @ARGV;

    while ( @ARGV && $ARGV[0] ne '--' ) {
        my $arg = $ARGV[0];
        #print "Processing argument $arg\n";
        open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
        my $obj_name = <$rev_parse>;
        close $rev_parse or die "Couldn't expand passed blob.\n";
        chomp $obj_name;
        #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
        print "($arg expands to $obj_name)\n";
        $BLOBS->{$obj_name} = $arg;
        shift @ARGV;
    }
    shift @ARGV; # drop the -- if present

    #print "BLOBS: " . Dumper($BLOBS) . "\n";

    foreach my $blob ( keys %{$BLOBS} ) {
        #print "Printing results for blob $blob:\n";

        open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
            or die "Couldn't open pipe to git-log: $!\n";

        while ( <$log> ) {
            chomp;
            my ( $tree, $commit, $subject ) = split " ", $_, 3;
            #print "Checking tree $tree\n";
            my $results = check_tree( $tree );

            #print "RESULTS: " . Dumper($results);
            if (%{$results}) {
                print "$commit $subject\n";
                foreach my $blob ( keys %{$results} ) {
                    print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
                }
            }
        }
    }

}

sub check_tree {
    my ( $tree ) = @_;
    #print "Calculating hits for tree $tree\n";

    my @subtree;

    # results = { BLOB => [ FILENAME1 ] }
    my $results = {};
    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        # example git ls-tree output:
        # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424 filaname.txt
        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)\s+(.*)/
                or die "unexpected git-ls-tree output";
            #print "Scanning line '$_' tree $2 file $3\n";
            foreach my $blob ( keys %{$BLOBS} ) {
                if ( $2 eq $blob ) {
                    print "Found $blob in $tree:$3\n";
                    push @{$results->{$blob}}, $3;
                }
            }
            push @subtree, [$2, $3] if $1 eq 'tree';
        }
    }

    foreach my $st ( @subtree ) {
        # $st->[0] is tree, $st->[1] is dirname
        my $st_result = check_tree( $st->[0] );
        foreach my $blob ( keys %{$st_result} ) {
            foreach my $filename ( @{$st_result->{$blob}} ) {
                my $path = $st->[1] . '/' . $filename;
                #print "Generating subdir path $path\n";
                push @{$results->{$blob}}, $path;
            }
        }
    }

    #print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
    return $results;
}

The output will look like this:

<hash prefix> <oneline log message>
    path/to/file.txt
    path/to/file2.txt
    ...
<hash prefix2> <oneline log msg...>

And so on. Every commit that contains a large file in its tree will be listed. If you grep out the lines that start with a tab, and uniq that, you will have a list of all the paths you can filter-branch to remove, or you can do something more complicated.

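Let me reiterate: this process ran successfully on a 10GB repo with 108,000 commits. It took much longer than I predicted when running on a large number of blobs, though: over 10 hours. I will have to see whether the memoize bit is working...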
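
If you would rather not install log4r just for the first step, the size filter can be approximated with awk. This is a sketch and not a drop-in replacement for the Ruby script in this answer: the function name large_blobs is hypothetical, it reports blobs only (the Ruby script also counts trees and commits), and it prints "size sha1" pairs instead of log lines.

```shell
#!/bin/sh
# large_blobs CUTOFF_BYTES
# Reads `git verify-pack -v` output on stdin and prints "<size> <sha1>" for
# every blob larger than CUTOFF_BYTES, largest first. Object lines look like:
#   SHA1 type size size-in-packfile offset-in-packfile [depth base-SHA1]
# Summary lines (for example "non delta: 3 objects") never have "blob" in
# the second column, so they are skipped.
large_blobs() {
    awk -v cutoff="$1" '$2 == "blob" && $3 + 0 > cutoff + 0 { print $3, $1 }' |
    sort -rn
}

# Typical use from the top of a repository, after git gc (10MB cutoff):
#   git verify-pack -v .git/objects/pack/pack-*.idx | large_blobs 10485760
```

The second column of that output can then be passed to git-find-blob in place of the hand-edited log file.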

Answer 7

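In addition to git describe, which I mention in my previous answer, git log and git diff now also benefit from the "--find-object=<object-id>" option, which limits the findings to changes that involve the named object. That is in Git 2.16.x/2.17 (Q1 2018).

See commit 4d8c51a, commit 5e50525, commit 15af58c, commit cf63051, commit c1ddc46, commit 929ed70 (04 Jan 2018) by Stefan Beller (stefanbeller). (Merged by Junio C Hamano -- gitster -- in commit c0d75f0, 23 Jan 2018)

diffcore: add a pickaxe option to find a specific blob

Sometimes users are given a hash of an object and they want to identify it further (e.g. use verify-pack to find the largest blobs, but what are these? Or this Stack Overflow question "Which commit has this blob?")

One might be tempted to extend git-describe to also work with blobs, such that git describe <blob-id> gives a description as '<commit-ish>:<path>'. This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right. The hard part is picking the correct 'commit-ish', as that could be the commit that (re-)introduced the blob or the commit that removed the blob; the blob could exist in different branches.

Junio hinted at a different approach to solving this problem, which this patch implements. Teach the diff machinery another flag for restricting the information to what is shown. For example:

$ ./git log --oneline --find-object=v2.0.0:Makefile
b2feb64 Revert the whole "ask curl-config" topic for now
47fbfde i18n: only extract comments marked with "TRANSLATORS:"

we observe that the Makefile as shipped with 2.0 appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b. The reason why these commits both occur prior to v2.0.0 is evil merges, which are not found using this new mechanism.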
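
For scripting, the same option can be wrapped so that it prints only commit hashes. A minimal sketch, assuming Git 2.16 or newer; the function name commits_touching_blob is hypothetical. Keep in mind what the option reports: commits whose diff adds or removes the blob, not every commit whose tree contains it, and (as the commit message says) not evil merges.

```shell
#!/bin/sh
# commits_touching_blob BLOB [git-log args...]
# Print the full hash of each commit whose diff changes the number of
# occurrences of BLOB, i.e. commits that introduce or remove it.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
commits_touching_blob() (
    blob=$1
    shift
    git log --pretty=tformat:%H --find-object="$blob" "$@"
)

# Example across all branches (the hash is a placeholder):
#   commits_touching_blob <hash> --all
```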
