我正在使用accepted solution here将Excel工作表转换为数据表。如果我有“完美”的数据,这很好用,但是如果我的数据中间有一个空白单元格,则似乎在每一列中都放错了数据。
我认为这是因为在下面的代码中:
row.Descendants<Cell>().Count()
是已填充单元格的数目(不是所有列)AND:
GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
似乎是要查找下一个填充的单元格(不一定是该索引中的单元格,所以如果第一列为空并且我调用ElementAt(0),它将在第二列中返回值。
这里是完整的解析代码。
DataRow tempRow = dt.NewRow();
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
{
Console.Write(tempRow[i].ToString());
}
}
这很有意义,因为Excel将不会为空单元格存储值。如果使用“ Open XML SDK 2.0生产率工具”打开文件并将XML遍历到单元级别,您将看到只有具有数据的单元将位于该文件中。
您的选择是在要遍历的单元格范围内插入空白数据,或者以编程方式找出跳过的单元格并适当地调整索引。
我制作了一个示例excel文档,在单元格引用A1和C1中带有一个字符串。然后,我在Open XML Productivity Tool中打开了excel文档,这是存储的XML:
<x:row r="1" spans="1:3"
xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:c r="A1" t="s">
<x:v>0</x:v>
</x:c>
<x:c r="C1" t="s">
<x:v>1</x:v>
</x:c>
</x:row>
在这里,您将看到数据对应于第一行,并且该行仅保存了两个单元格的数据。保存的数据对应于A1和C1,并且不保存任何具有空值的单元格。
要获得所需的功能,您可以像上面一样遍历单元格,但是您需要检查单元格引用的值,并确定是否已跳过任何单元格。为此,您将需要两个实用程序函数来从单元格引用中获取列名称,然后将该列名称转换为基于零的索引:
private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Create a regular expression to match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index), it will return the zero based column index.
/// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ).
/// A length of three can be implemented when needed.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful; otherwise null</returns>
public static int? GetColumnIndexFromName(string columnName)
{
int? columnIndex = null;
string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();
if (colLetters.Count() <= 2)
{
int index = 0;
foreach (string col in colLetters)
{
List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
int? indexValue = Letters.IndexOf(col1.ElementAt(index));
if (indexValue != -1)
{
// The first letter of a two digit column needs some extra calculations
if (index == 0 && colLetters.Count() == 2)
{
columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
}
else
{
columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
}
}
index++;
}
}
return columnIndex;
}
然后,您可以遍历单元格,并检查将哪些单元格引用与columnIndex进行比较。如果小于,则将空白数据添加到tempRow中,否则只需读取单元格中包含的值即可。 (注意:我没有测试下面的代码,但是总体思路应该有所帮助):
DataRow tempRow = dt.NewRow();
int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
// Gets the column index of the cell with data
int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));
if (columnIndex < cellColumnIndex)
{
do
{
tempRow[columnIndex] = //Insert blank data here;
columnIndex++;
}
while(columnIndex < cellColumnIndex);
}
tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);
if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
{
Console.Write(tempRow[i].ToString());
}
columnIndex++;
}
[我忍不住要从Amurra的答案中优化子例程以消除对Regex的需求。
添加了另一个实现,这次是事先知道列数的方法:
要读取空白单元格,我正在使用在行读取器外部和while循环中分配的名为“ CN”的变量,我正在检查列索引是否大于我的变量,因为在每次读取单元格之后该变量都会递增。如果这不匹配,我将用我想要的值填充列。这是我用来将空白单元格追上我所尊重的列值的技巧。这是代码:
这是我的解决方案。当缺少的字段排在行尾时,我发现上述方法似乎无法正常工作。
这里是IEnumerable
的实现,应执行您想要的,已编译并经过单元测试的内容。
///<summary>returns an empty cell when a blank cell is encountered
///</summary>
public IEnumerator<Cell> GetEnumerator()
{
int currentCount = 0;
// row is a class level variable representing the current
// DocumentFormat.OpenXml.Spreadsheet.Row
foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
{
string columnName = GetColumnName(cell.CellReference);
int currentColumnIndex = ConvertColumnNameToNumber(columnName);
for ( ; currentCount < currentColumnIndex; currentCount++)
{
yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
}
yield return cell;
currentCount++;
}
}
这里是它所依赖的功能:
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index),
/// it will return the zero based column index.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful</returns>
/// <exception cref="ArgumentException">thrown if the given string
/// contains characters other than uppercase letters</exception>
public static int ConvertColumnNameToNumber(string columnName)
{
Regex alpha = new Regex("^[A-Z]+$");
if (!alpha.IsMatch(columnName)) throw new ArgumentException();
char[] colLetters = columnName.ToCharArray();
Array.Reverse(colLetters);
int convertedValue = 0;
for (int i = 0; i < colLetters.Length; i++)
{
char letter = colLetters[i];
int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
convertedValue += current * (int)Math.Pow(26, i);
}
return convertedValue;
}
将其投入课堂,然后尝试一下。
这里是Waylon's answer的略微修改版本,它也依赖于其他答案。它将他的方法封装在一个类中。
我改变了
IEnumerator<Cell> GetEnumerator()
to
IEnumerable<Cell> GetRowCells(Row row)
这是类,您无需实例化它,它仅用作实用程序类:
public class SpreedsheetHelper
{
///<summary>returns an empty cell when a blank cell is encountered
///</summary>
public static IEnumerable<Cell> GetRowCells(Row row)
{
int currentCount = 0;
foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
{
string columnName = GetColumnName(cell.CellReference);
int currentColumnIndex = ConvertColumnNameToNumber(columnName);
for (; currentCount < currentColumnIndex; currentCount++)
{
yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
}
yield return cell;
currentCount++;
}
}
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
var regex = new System.Text.RegularExpressions.Regex("[A-Za-z]+");
var match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index),
/// it will return the zero based column index.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful</returns>
/// <exception cref="ArgumentException">thrown if the given string
/// contains characters other than uppercase letters</exception>
public static int ConvertColumnNameToNumber(string columnName)
{
var alpha = new System.Text.RegularExpressions.Regex("^[A-Z]+$");
if (!alpha.IsMatch(columnName)) throw new ArgumentException();
char[] colLetters = columnName.ToCharArray();
Array.Reverse(colLetters);
int convertedValue = 0;
for (int i = 0; i < colLetters.Length; i++)
{
char letter = colLetters[i];
int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
convertedValue += current * (int)Math.Pow(26, i);
}
return convertedValue;
}
}
现在您可以通过这种方式获取所有行的单元格:
// skip the part that retrieves the worksheet sheetData
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach(Row row in rows)
{
IEnumerable<Cell> cells = SpreedsheetHelper.GetRowCells(row);
foreach (Cell cell in cells)
{
// skip part that reads the text according to the cell-type
}
}
它将包含所有单元格,即使它们为空。
请参阅我的实现:
Row[] rows = worksheet.GetFirstChild<SheetData>()
.Elements<Row>()
.ToArray();
string[] columnNames = rows.First()
.Elements<Cell>()
.Select(cell => GetCellValue(cell, document))
.ToArray();
HeaderLetters = ExcelHeaderHelper.GetHeaderLetters((uint)columnNames.Count());
if (columnNames.Count() != HeaderLetters.Count())
{
throw new ArgumentException("HeaderLetters");
}
IEnumerable<List<string>> cellValues = GetCellValues(rows.Skip(1), columnNames.Count(), document);
//Here you can enumerate through the cell values, based on the cell index the column names can be retrieved.
HeaderLetters使用此类收集:
private static class ExcelHeaderHelper
{
public static string[] GetHeaderLetters(uint max)
{
var result = new List<string>();
int i = 0;
var columnPrefix = new Queue<string>();
string prefix = null;
int prevRoundNo = 0;
uint maxPrefix = max / 26;
while (i < max)
{
int roundNo = i / 26;
if (prevRoundNo < roundNo)
{
prefix = columnPrefix.Dequeue();
prevRoundNo = roundNo;
}
string item = prefix + ((char)(65 + (i % 26))).ToString(CultureInfo.InvariantCulture);
if (i <= maxPrefix)
{
columnPrefix.Enqueue(item);
}
result.Add(item);
i++;
}
return result.ToArray();
}
}
辅助方法是:
private static IEnumerable<List<string>> GetCellValues(IEnumerable<Row> rows, int columnCount, SpreadsheetDocument document)
{
var result = new List<List<string>>();
foreach (var row in rows)
{
List<string> cellValues = new List<string>();
var actualCells = row.Elements<Cell>().ToArray();
int j = 0;
for (int i = 0; i < columnCount; i++)
{
if (actualCells.Count() <= j || !actualCells[j].CellReference.ToString().StartsWith(HeaderLetters[i]))
{
cellValues.Add(null);
}
else
{
cellValues.Add(GetCellValue(actualCells[j], document));
j++;
}
}
result.Add(cellValues);
}
return result;
}
private static string GetCellValue(Cell cell, SpreadsheetDocument document)
{
bool sstIndexedcell = GetCellType(cell);
return sstIndexedcell
? GetSharedStringItemById(document.WorkbookPart, Convert.ToInt32(cell.InnerText))
: cell.InnerText;
}
private static bool GetCellType(Cell cell)
{
return cell.DataType != null && cell.DataType == CellValues.SharedString;
}
private static string GetSharedStringItemById(WorkbookPart workbookPart, int id)
{
return workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(id).InnerText;
}
该解决方案处理共享单元格项(SST索引的单元格。)>
所有好的例子。这是我正在使用的一个,因为我需要跟踪所有行,单元格,值和标题以进行关联和分析。
字母代码是26的基本编码,因此应将其转换为偏移量。
您可以使用此功能从通过标头索引的行中提取单元格:
好吧,我并不是这方面的专家,但是其他答案对我来说似乎太过致命了,所以这是我的解决方案:
很抱歉为这个问题发布另一个答案,这是我使用的代码。