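Project overview:
I am using the docxtractr library in a document-extraction project to pull Word tables out of multiple files and convert them to data frames. The code below targets one specific table, the 6th.
Because there are multiple files, I use lapply to iterate over each file and do the data frame manipulation. The data frames have extra columns because the Word tables contain hidden values used for calculations. These tables were once tied to large, complex, macro-enabled Excel files.
My code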
library("docxtractr")
sourcesSummary <- lapply(files, function(x){
doc <- read_docx(x)
kingsTbls <- docx_extract_all_tbls(doc)
sources <- docx_extract_tbl(doc, 6, header = FALSE)
sources <- data.frame(sources)
# The below two lines are the issue #
sources[9,3:4] <- sources[9,2:3]
sources[24,3:4] <- sources[24,2:3]
})
Default data frame
V1 V2 V3 V4
1 SOURCES OF FUNDS AMOUNT PER UNIT <NA>
2 Proposed A Loan 87 $7,208,000.00 $82,851
3 Proposed B Loan $0.00 $ 0
4 Investor Equity $1,948,362.00 $22,395
5 Operating Partner Equity $0.00 $ 0
6 Other $0.00 $ 0
7 Other $0.00 $ 0
8 Other $0.00 $ 0
9 TOTAL SOURCE OF FUNDS $9,156,362 $105,246 <NA>
10 <NA> <NA> <NA>
11 USES OF FUNDS AMOUNT PER UNIT <NA>
12 Existing Mortgage(s) $0 $ 0
13 Purchase Price $9,011,000 $103,575
14 Origination Fees $54,060 $ 621
15 FM application Fee $7,208 $ 83
16 Investor Fees $0.00 $ 0
17 Closing Costs $0.00 $ 0
18 Other (Yield Maintenance) $0 $ 0
19 Capital Improvements $39,650.00 $ 456
20 Processing Fee $3,000.00 $ 34
21 Third Party $11,000.00 $ 126
22 Legal $12,500.00 $ 144
23 Repair Escrow (Funded) $0.00 $ 0
24 TOTAL USE OF FUNDS $9,138,418 $105,039 <NA>
25 <NA> <NA> <NA>
26 CASH OUT/(CASH IN) 132 $17,944 $ 207
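Problem:
The problem I am facing is about moving/shifting values within a data frame. I have done it successfully in the console, but when I run the same code inside lapply it does not behave as expected.
I am trying to shift the values in rows 9 and 24 one column to the right. In the console, on a specific test variable, the code below works fine, but it does not when I run lapply over all the files.
Problem code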
sources[9,3:4] <- sources[9,2:3]
sources[24,3:4] <- sources[24,2:3]
I also tried using drop = FALSE inside the brackets, but that did not work either.
Current output with lapply
V2 V3
24 $9,138,418 $105,039
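Desired output
Note that rows 9 and 24 have changed. This is the result I get when I run my code in the console on a specific data frame, so lapply seems to be the problem. Afterwards I will drop the second column and the NA rows.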
V1 V2 V3 V4
1 SOURCES OF FUNDS AMOUNT PER UNIT <NA>
2 Proposed A Loan 87 $7,208,000.00 $82,851
3 Proposed B Loan $0.00 $ 0
4 Investor Equity $1,948,362.00 $22,395
5 Operating Partner Equity $0.00 $ 0
6 Other $0.00 $ 0
7 Other $0.00 $ 0
8 Other $0.00 $ 0
9 TOTAL SOURCE OF FUNDS $9,156,362 $105,246
10 <NA> <NA> <NA>
11 USES OF FUNDS AMOUNT PER UNIT <NA>
12 Existing Mortgage(s) $0 $ 0
13 Purchase Price $9,011,000 $103,575
14 Origination Fees $54,060 $ 621
15 FM application Fee $7,208 $ 83
16 Investor Fees $0.00 $ 0
17 Closing Costs $0.00 $ 0
18 Other (Yield Maintenance) $0 $ 0
19 Capital Improvements $39,650.00 $ 456
20 Processing Fee $3,000.00 $ 34
21 Third Party $11,000.00 $ 126
22 Legal $12,500.00 $ 144
23 Repair Escrow (Funded) $0.00 $ 0
24 TOTAL USE OF FUNDS $9,138,418 $105,039
25 <NA> <NA> <NA>
26 CASH OUT/(CASH IN) 132 $17,944 $ 207
Thanks in advance for your input!
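A function returns the value of the last expression it evaluates. In your case that is sources[24,3:4] <- sources[24,2:3], and the value of an assignment is the value that was assigned, which is why you get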
V2 V3
24 $9,138,418 $105,039
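To see this behaviour without any Word files, here is a minimal sketch on a toy data frame. The data frame `df` and the two helper functions are hypothetical names made up for illustration; only the indexing pattern mirrors the question.

```r
# Toy data frame standing in for an extracted table (hypothetical values).
df <- data.frame(V1 = c("a", "b"),
                 V2 = c("x", "y"),
                 V3 = c(NA_character_, NA_character_),
                 stringsAsFactors = FALSE)

# The last expression is an assignment, so the function returns
# (invisibly) the right-hand side: a one-row slice of the input.
shift_without_return <- function(d) {
  d[2, 2:3] <- d[2, 1:2]
}

# The last expression is the data frame itself, so the whole
# modified data frame is returned.
shift_with_return <- function(d) {
  d[2, 2:3] <- d[2, 1:2]
  d
}

res_bad  <- shift_without_return(df)
res_good <- shift_with_return(df)

print(dim(res_bad))   # dimensions of the one-row slice
print(dim(res_good))  # dimensions of the full data frame
```

The first function modifies its local copy correctly, but that copy is discarded when the function exits; only the value of the final expression survives.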
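The solution is to return sources explicitly, either by adding return(sources) at the end of the function or by simply adding sources as the last line. So your code should look like this: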
library("docxtractr")
sourcesSummary <- lapply(files, function(x){
doc <- read_docx(x)
kingsTbls <- docx_extract_all_tbls(doc)
sources <- docx_extract_tbl(doc, 6, header = FALSE)
sources <- data.frame(sources)
# The below two lines are the issue #
sources[9,3:4] <- sources[9,2:3]
sources[24,3:4] <- sources[24,2:3]
sources  # <- New code: return the modified data frame
})
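If you want to check the pattern without docxtractr or real files, the following is a rough sketch that runs lapply over a list of small made-up data frames. The list `tables`, its contents, and the use of row 2 as the "totals" row are assumptions for illustration; in the question the rows are 9 and 24.

```r
# Hypothetical stand-ins for the table extracted from each file.
tables <- list(
  file_a = data.frame(V1 = c("Loan", "TOTAL"),
                      V2 = c("87", "$100"),
                      V3 = c("$100", "$10"),
                      V4 = c("$10", NA),
                      stringsAsFactors = FALSE),
  file_b = data.frame(V1 = c("Loan", "TOTAL"),
                      V2 = c("12", "$200"),
                      V3 = c("$200", "$20"),
                      V4 = c("$20", NA),
                      stringsAsFactors = FALSE)
)

fixed <- lapply(tables, function(sources) {
  # Shift the totals row one column to the right.
  sources[2, 3:4] <- sources[2, 2:3]
  # Return the whole data frame, not the value of the assignment above.
  sources
})

str(fixed, max.level = 1)  # each element is a full data frame
```

Each element of `fixed` is now a complete data frame, so later clean-up steps (such as dropping a column) can be applied to it directly.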