Reading a binary map file in R

Question

I am trying to read a binary file in R that contains a simple 2D array of 360x180 values. For reference, the binary file can be found here:

http://transcom.project.asu.edu/download/transcom03/smoothmap.fix.2.bin

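Here is the readme for this .bin:

The file 'smoothmap.fix.2.bin' contains a single real, binary array of size 360 x 180. The array contains the numbers 1 through 22, denoting the 22 basis functions in the TransCom 3 experiment. This file was written on an SGI Origin 2000 hosting UNIX.

My code: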

to.read <- file("smoothmap.fix.2.bin", "rb")
raw.transcom <- readBin(to.read, integer(), n = 360*180, size = 4, endian = "big")
transcom <- matrix(raw.transcom, 180, 360, byrow = F)

Now raw.transcom contains only garbage values:

unique(raw.transcom)
 [1]     259200          0 1101004800 1082130432 1092616192 1097859072 1100480512 1102053376 1086324736
[10] 1077936128 1101529088 1095761920 1096810496 1099956224 1091567616 1084227584 1090519040 1094713344
[19] 1099431936 1073741824 1093664768 1088421888 1065353216 1098907648

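Why is that?

I have been staring at this for an hour and I am stumped. I tried varying the endianness setting and the size argument of readBin, but that did not help.

How do I read this file correctly?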

Answer 1

Well, I didn't have time to do this the "R" way, but I did have access to GDL and found this, so I cobbled the following together:

Data  = read_binary('smoothmap.fix.2.bin',DATA_TYPE=4,ENDIAN='big');
Data = Data[1:64800]
Data = reform(Data,[360,180])

openw,unit,'testfile.dat',/get_lun
printf,unit,Data
free_lun,unit

and managed to produce: http://rud.is/dl/testfile.dat.gz

If you grab that and do:

x <- as.numeric(scan("testfile.dat.gz", "numeric"))

length(x)
## [1] 64800

table(x)
##   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22 
## 7951 1643 1189  796  868 1688  864 2345 2487  509  733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950

It does look like it produces the right values for the definition you gave, and you can convert the vector to a matrix.
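As a minimal sketch of that matrix conversion, the following uses a six-value stand-in vector instead of the real 64800 values. It assumes the values are stored with the first dimension varying fastest, which is how matrix() fills by default; the names x_small and m are illustrative only.

```r
# Stand-in for the 64800-value vector: 6 values for a 3 x 2 grid.
x_small <- c(1, 2, 3, 4, 5, 6)

# matrix() fills column by column, so the first `nrow` values form column 1.
m <- matrix(x_small, nrow = 3, ncol = 2)

m[, 1]  # the first three values of x_small
m[, 2]  # the last three values of x_small
dim(m)
```

For the real data the same call would use nrow = 360 and ncol = 180; if the plotted map looks transposed or flipped, the fill-order assumption is the first thing to revisit.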

Coming back to this, since I now need to figure out how to do it in R :-)

UPDATE

Got it!

I'm glad I found the IDL code, so I could validate the R results against it.

x <- readBin("smoothmap.fix.2.bin", "raw", file.size("smoothmap.fix.2.bin"))
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]

table(readBin(rawConnection(x), "numeric", 360*180, 4, endian="big"))
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22 
## 7951 1643 1189  796  868 1688  864 2345 2487  509  733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950

Ideally we would check that the first 4 bytes and the last 4 bytes are equal, but this hack should get you through.

Putting it all together

Added the validation bits...

#' Read in a binary array, likely written with IDL
#' 
#' @param x path to file (auto-expanded & tested for existence)
#' @param n number of `float` elements to read in
#' @param endian endian-ness (default `big`)
#' @return numeric vector of length `n`
read_binary_float <- function(x, n, endian="big") {

  x <- normalizePath(path.expand(x))

  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4] # extract front bits
  last4 <- x[(length(x)-3):length(x)] # extract back bits

  # convert both to long ints      

  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)

  l4c <- rawConnection(last4)      
  on.exit(close(l4c), add=TRUE)      
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)

  # validation

  stopifnot(f4 == l4) # check front/back are equal
  stopifnot(f4 == n*4) # check if `n` matches expected record count

  # strip off front and back bits

  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  # slurp it all in

  rc <- rawConnection(x)      
  on.exit(close(rc), add=TRUE)

  readBin(con = rc, what = "numeric", n = n, size = 4L, endian=endian)

}

Quick example:

library(magrittr)

read_binary_float("smoothmap.fix.2.bin", 360*180) %>% 
  matrix(nrow = 360, ncol = 180) %>% 
  image()

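This file appears to follow the Fortran "unformatted I/O" convention (https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc4/index.html), which confirms the

"# records" | record | record | … | record | "# records"

layout we are seeing (here the marker value, 259200, is the payload size in bytes, 360*180*4). So the function can be generalized to support more than just float conversion: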

read_binary_array <- function(x, type=c("byte", "integer", "float"), endian="big") {

  type <- match.arg(trimws(tolower(type)), c("byte", "integer", "float"))
  type_size <- unname(c("byte"=1, "integer"=4, "float"=4)[type])

  x <- normalizePath(path.expand(x))

  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4]
  last4 <- x[(length(x)-3):length(x)]

  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian=endian)

  l4c <- rawConnection(last4)
  on.exit(close(l4c), add=TRUE)
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian=endian)

  stopifnot(f4 == l4) # check front/back are equal
  stopifnot((f4 %% type_size == 0)) # should have nothing left over

  n_rec <- f4 / type_size
  message(sprintf("Reading in %s records...", scales::comma(n_rec)))

  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  rc <- rawConnection(x)
  on.exit(close(rc), add=TRUE)

  what <- switch(type, byte="raw", integer="integer", float="numeric")
  dat <- readBin(con = rc, what = what, n = n_rec, size = type_size, endian=endian)

  dat

}
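To try the marker / payload / marker layout without downloading the file, one can write a tiny synthetic file and read it back. This is only a sketch in base R: the payload values and the names vals, tmp, head_len, payload and tail_len are made up for illustration, and it assumes a single record with 4-byte big-endian markers, as observed in this file.

```r
# Build a small file shaped like the one above:
# 4-byte length marker, payload, then the same 4-byte marker again.
vals <- c(0, 1, 22, 16)                     # hypothetical payload values
n_bytes <- as.integer(length(vals) * 4L)    # payload size in bytes
tmp <- tempfile(fileext = ".bin")

con <- file(tmp, "wb")
writeBin(n_bytes, con, size = 4L, endian = "big")  # leading marker
writeBin(vals, con, size = 4L, endian = "big")     # payload as 4-byte floats
writeBin(n_bytes, con, size = 4L, endian = "big")  # trailing marker
close(con)

total_size <- file.size(tmp)                # payload plus 8 marker bytes

# Read it back in the same order: marker, payload, marker.
con <- file(tmp, "rb")
head_len <- readBin(con, "integer", n = 1L, size = 4L, endian = "big")
payload  <- readBin(con, "numeric", n = head_len / 4L, size = 4L, endian = "big")
tail_len <- readBin(con, "integer", n = 1L, size = 4L, endian = "big")
close(con)
unlink(tmp)

# The two markers should agree and match the payload size.
stopifnot(head_len == tail_len, head_len == length(payload) * 4L)
```

Reading sequentially from one connection avoids slicing the raw vector, at the cost of not being able to inspect the trailing marker before the payload has been read.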

Answer 2

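This is incomplete; I am posting my progress.

There may be an undocumented "feature" in the data file, since eight bytes are not part of the data. (The file is 259208 bytes, but 360*180*4 == 259200.) However, I did find something interesting: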

d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180, size = 4, endian = "big")

head(d)
# [1] 259200      0      0      0      0      0

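I infer that the first 4-byte integer (259200) gives the size of the data, so I suggest we can discard it. You might think the vector has the appropriate length, but that is only because you forced readBin to stop loading data after n values. From ?readBin: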

   n: integer.  The (maximal) number of records to be read.  You
      can use an over-estimate here, but not too large as storage
      is reserved for 'n' items.

So it should be safe to request more than the expected file size; readBin handles EOF by itself. I'll arbitrarily add 10:

length(d)
# [1] 64800
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180+10, size = 4, endian = "big")
length(d)
# [1] 64802
tail(d)
# [1] 1098907648 1098907648 1098907648 1098907648 1098907648     259200

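(Note that even though I asked for 10 more values, only two were available. So the n argument is just an upper bound used to preallocate memory, nothing more.) That 259200 shows up again, which I take as marking the end of the data, so we should be able to safely discard these two (first/last) numbers.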

d <- d[-c(1, length(d))]
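The EOF behaviour and the trimming step can be checked on a throwaway file. This is a small sketch with made-up values; tmp, got and trimmed are illustrative names, not part of the original session.

```r
# Write exactly five 4-byte big-endian integers to a temporary file.
tmp <- tempfile(fileext = ".bin")
writeBin(1:5, tmp, size = 4L, endian = "big")

# Ask for more values than the file holds; readBin stops at EOF.
got <- readBin(tmp, "integer", n = 15L, size = 4L, endian = "big")
length(got)   # number of values actually present in the file

# Drop the first and last element, as done for the two markers above.
trimmed <- got[-c(1, length(got))]
trimmed

unlink(tmp)
```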

The first non-zero number is:

head(which(d>0))
# [1] 4321 4322 4323 4324 4325 4326
d[4321]
# [1] 1101004800

And looking at the bits:

intToBits(d[4321])
#  [1] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 01
# [26] 00 00 00 00 00 01 00

So if you assume a straight binary interpretation, the value is 2820, which does not match the values described in smoothmap.readme. Also, we would expect to see:

intToBits(22)
#  [1] 00 01 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

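So it looks like your bits are ... not in the right order, or something like that. If you run intToBits on all the unique values, you will notice that bits 1-19 (the least significant bits) are all zero.

From here, I'm at a loss ...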

sapply(unique(d), function(a) packBits(rev(intToBits(a)), type="integer"))
#  [1]    0 1410  258 1154 3714 6530 3458  770  514 5506 2690 1666 2434 2178 1282  130  642 4482    2 3202 1794  508  386
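One possible explanation, consistent with Answer 1 reading the payload as 4-byte floats, is that these integers are IEEE 754 single-precision bit patterns rather than reordered integers. The sketch below reinterprets three of the values from the question's unique() output; the names ints, bytes and vals are illustrative, and the float interpretation is an assumption being tested here, not something the readme states beyond calling the array "real".

```r
# Three of the "garbage" integers from the question's unique() output.
ints <- c(1065353216L, 1101004800L, 1102053376L)

# Serialise each integer back into its 4 big-endian bytes...
bytes <- writeBin(ints, raw(), size = 4L, endian = "big")

# ...and decode the very same bytes as 4-byte floats.
vals <- readBin(bytes, "numeric", n = length(ints), size = 4L, endian = "big")
vals  # 1 20 22
```

All three decode to whole numbers in the documented 1 to 22 range, which also fits the observation that the low-order bits are always zero: small whole numbers leave most of a float's mantissa empty.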
