读取Excel Open XML会忽略空白单元格

问题描述 投票:42回答:14

我正在使用accepted solution here将Excel工作表转换为数据表。如果我有“完美”的数据,这很好用,但是如果我的数据中间有一个空白单元格,则似乎在每一列中都放错了数据。

我认为这是因为在下面的代码中:

row.Descendants<Cell>().Count()

是已填充单元格的数目(不是所有列)AND:

GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));

似乎是要查找下一个填充的单元格(不一定是该索引中的单元格,所以如果第一列为空并且我调用ElementAt(0),它将在第二列中返回值。

这里是完整的解析代码。

DataRow tempRow = dt.NewRow();

for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
    tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
        Console.Write(tempRow[i].ToString());
    }
}
excel datatable openxml openxml-sdk
14个回答
42
投票

这很有意义,因为Excel将不会为空单元格存储值。如果使用“ Open XML SDK 2.0生产率工具”打开文件并将XML遍历到单元级别,您将看到只有具有数据的单元将位于该文件中。

您的选择是在要遍历的单元格范围内插入空白数据,或者以编程方式找出跳过的单元格并适当地调整索引。

我制作了一个示例excel文档,在单元格引用A1和C1中带有一个字符串。然后,我在Open XML Productivity Tool中打开了excel文档,这是存储的XML:

<x:row r="1" spans="1:3" 
   xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:c r="A1" t="s">
    <x:v>0</x:v>
  </x:c>
  <x:c r="C1" t="s">
    <x:v>1</x:v>
  </x:c>
</x:row>

在这里,您将看到数据对应于第一行,并且该行仅保存了两个单元格的数据。保存的数据对应于A1和C1,并且不保存任何具有空值的单元格。

要获得所需的功能,您可以像上面一样遍历单元格,但是您需要检查单元格引用的值,并确定是否已跳过任何单元格。为此,您将需要两个实用程序函数来从单元格引用中获取列名称,然后将该列名称转换为基于零的索引:

    private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Create a regular expression to match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index), it will return the zero based column index.
    /// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ). 
    /// A length of three can be implemented when needed.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful; otherwise null</returns>
    public static int? GetColumnIndexFromName(string columnName)
    {
        int? columnIndex = null;

        string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
        colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();

        if (colLetters.Count() <= 2)
        {
            int index = 0;
            foreach (string col in colLetters)
            {
                List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
                int? indexValue = Letters.IndexOf(col1.ElementAt(index));

                if (indexValue != -1)
                {
                    // The first letter of a two digit column needs some extra calculations
                    if (index == 0 && colLetters.Count() == 2)
                    {
                        columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
                    }
                    else
                    {
                        columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
                    }
                }

                index++;
            }
        }

        return columnIndex;
    }

然后,您可以遍历单元格,并检查将哪些单元格引用与columnIndex进行比较。如果小于,则将空白数据添加到tempRow中,否则只需读取单元格中包含的值即可。 (注意:我没有测试下面的代码,但是总体思路应该有所帮助):

DataRow tempRow = dt.NewRow();

int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
   // Gets the column index of the cell with data
   int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));

   if (columnIndex < cellColumnIndex)
   {
      do
      {
         tempRow[columnIndex] = //Insert blank data here;
         columnIndex++;
      }
      while(columnIndex < cellColumnIndex);
    }
    tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);

    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
       Console.Write(tempRow[i].ToString());
    }
    columnIndex++;
}

0
投票

[我忍不住要从Amurra的答案中优化子例程以消除对Regex的需求。


0
投票

添加了另一个实现,这次是事先知道列数的方法:


0
投票

要读取空白单元格,我正在使用在行读取器外部和while循环中分配的名为“ CN”的变量,我正在检查列索引是否大于我的变量,因为在每次读取单元格之后该变量都会递增。如果这不匹配,我将用我想要的值填充列。这是我用来将空白单元格追上我所尊重的列值的技巧。这是代码:


0
投票

这是我的解决方案。当缺少的字段排在行尾时,我发现上述方法似乎无法正常工作。


0
投票

19
投票

这里是IEnumerable的实现,应执行您想要的,已编译并经过单元测试的内容。

    ///<summary>returns an empty cell when a blank cell is encountered
    ///</summary>
    public IEnumerator<Cell> GetEnumerator()
    {
        int currentCount = 0;

        // row is a class level variable representing the current
        // DocumentFormat.OpenXml.Spreadsheet.Row
        foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
            row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
        {
            string columnName = GetColumnName(cell.CellReference);

            int currentColumnIndex = ConvertColumnNameToNumber(columnName);

            for ( ; currentCount < currentColumnIndex; currentCount++)
            {
                yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
            }

            yield return cell;
            currentCount++;
        }
    }

这里是它所依赖的功能:

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index),
    /// it will return the zero based column index.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful</returns>
    /// <exception cref="ArgumentException">thrown if the given string
    /// contains characters other than uppercase letters</exception>
    public static int ConvertColumnNameToNumber(string columnName)
    {
        Regex alpha = new Regex("^[A-Z]+$");
        if (!alpha.IsMatch(columnName)) throw new ArgumentException();

        char[] colLetters = columnName.ToCharArray();
        Array.Reverse(colLetters);

        int convertedValue = 0;
        for (int i = 0; i < colLetters.Length; i++)
        {
            char letter = colLetters[i];
            int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
            convertedValue += current * (int)Math.Pow(26, i);
        }

        return convertedValue;
    }

将其投入课堂,然后尝试一下。


15
投票

这里是Waylon's answer的略微修改版本,它也依赖于其他答案。它将他的方法封装在一个类中。

我改变了

IEnumerator<Cell> GetEnumerator()

to

IEnumerable<Cell> GetRowCells(Row row)

这是类,您无需实例化它,它仅用作实用程序类:

public class SpreedsheetHelper
{
    ///<summary>returns an empty cell when a blank cell is encountered
    ///</summary>
    public static IEnumerable<Cell> GetRowCells(Row row)
    {
        int currentCount = 0;

        foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
            row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
        {
            string columnName = GetColumnName(cell.CellReference);

            int currentColumnIndex = ConvertColumnNameToNumber(columnName);

            for (; currentCount < currentColumnIndex; currentCount++)
            {
                yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
            }

            yield return cell;
            currentCount++;
        }
    }

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Match the column name portion of the cell name.
        var regex = new System.Text.RegularExpressions.Regex("[A-Za-z]+");
        var match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index),
    /// it will return the zero based column index.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful</returns>
    /// <exception cref="ArgumentException">thrown if the given string
    /// contains characters other than uppercase letters</exception>
    public static int ConvertColumnNameToNumber(string columnName)
    {
        var alpha = new System.Text.RegularExpressions.Regex("^[A-Z]+$");
        if (!alpha.IsMatch(columnName)) throw new ArgumentException();

        char[] colLetters = columnName.ToCharArray();
        Array.Reverse(colLetters);

        int convertedValue = 0;
        for (int i = 0; i < colLetters.Length; i++)
        {
            char letter = colLetters[i];
            int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
            convertedValue += current * (int)Math.Pow(26, i);
        }

        return convertedValue;
    }
}

现在您可以通过这种方式获取所有行的单元格:

// skip the part that retrieves the worksheet sheetData
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach(Row row in rows)
{
    IEnumerable<Cell> cells = SpreedsheetHelper.GetRowCells(row);
    foreach (Cell cell in cells)
    {
         // skip part that reads the text according to the cell-type
    }
}

它将包含所有单元格,即使它们为空。


5
投票

请参阅我的实现:

  Row[] rows = worksheet.GetFirstChild<SheetData>()
                .Elements<Row>()
                .ToArray();

  string[] columnNames = rows.First()
                .Elements<Cell>()
                .Select(cell => GetCellValue(cell, document))
                .ToArray();

  HeaderLetters = ExcelHeaderHelper.GetHeaderLetters((uint)columnNames.Count());

  if (columnNames.Count() != HeaderLetters.Count())
  {
       throw new ArgumentException("HeaderLetters");
  }

  IEnumerable<List<string>> cellValues = GetCellValues(rows.Skip(1), columnNames.Count(), document);

//Here you can enumerate through the cell values, based on the cell index the column names can be retrieved.

HeaderLetters使用此类收集:

    private static class ExcelHeaderHelper
    {
        public static string[] GetHeaderLetters(uint max)
        {
            var result = new List<string>();
            int i = 0;
            var columnPrefix = new Queue<string>();
            string prefix = null;
            int prevRoundNo = 0;
            uint maxPrefix = max / 26;

            while (i < max)
            {
                int roundNo = i / 26;
                if (prevRoundNo < roundNo)
                {
                    prefix = columnPrefix.Dequeue();
                    prevRoundNo = roundNo;
                }
                string item = prefix + ((char)(65 + (i % 26))).ToString(CultureInfo.InvariantCulture);
                if (i <= maxPrefix)
                {
                    columnPrefix.Enqueue(item);
                }
                result.Add(item);
                i++;
            }
            return result.ToArray();
        }
    }

辅助方法是:

    private static IEnumerable<List<string>> GetCellValues(IEnumerable<Row> rows, int columnCount, SpreadsheetDocument document)
    {
        var result = new List<List<string>>();
        foreach (var row in rows)
        {
            List<string> cellValues = new List<string>();
            var actualCells = row.Elements<Cell>().ToArray();

            int j = 0;
            for (int i = 0; i < columnCount; i++)
            {
                if (actualCells.Count() <= j || !actualCells[j].CellReference.ToString().StartsWith(HeaderLetters[i]))
                {
                    cellValues.Add(null);
                }
                else
                {
                    cellValues.Add(GetCellValue(actualCells[j], document));
                    j++;
                }
            }
            result.Add(cellValues);
        }
        return result;
    }


private static string GetCellValue(Cell cell, SpreadsheetDocument document)
{
    bool sstIndexedcell = GetCellType(cell);
    return sstIndexedcell
        ? GetSharedStringItemById(document.WorkbookPart, Convert.ToInt32(cell.InnerText))
        : cell.InnerText;
}

private static bool GetCellType(Cell cell)
{
    return cell.DataType != null && cell.DataType == CellValues.SharedString;
}

private static string GetSharedStringItemById(WorkbookPart workbookPart, int id)
{
    return workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(id).InnerText;
}

该解决方案处理共享单元格项(SST索引的单元格。)>


2
投票

所有好的例子。这是我正在使用的一个,因为我需要跟踪所有行,单元格,值和标题以进行关联和分析。


2
投票

字母代码是26的基本编码,因此应将其转换为偏移量。


1
投票

您可以使用此功能从通过标头索引的行中提取单元格:


1
投票

好吧,我并不是这方面的专家,但是其他答案对我来说似乎太过致命了,所以这是我的解决方案:


1
投票

很抱歉为这个问题发布另一个答案,这是我使用的代码。

© www.soinside.com 2019 - 2024. All rights reserved.