How do I loop over multiple lines, tokenize them, and return an array containing all the tokens?


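I previously posted a (poorly formatted and lackluster) question asking how to pass an array as an input parameter and return the modified array. After some tinkering, I found that the function works fine for a single line of input, but breaks when the file contains multiple lines separated by newlines.

I am new to both C and StackOverflow, so I would appreciate any advice on how to improve my code (and my post). Thank you.

The code I currently have: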

#include <stdio.h>
#include <string.h>

#define MAX_LINE_LEN 1000

const char delimiter[] = " \t\r\n\v\f";

void tokenize(char *string, char *ret[MAX_LINE_LEN]) {
    char *ptr;
    ptr = strtok(string, delimiter);
    int i = 0;
    while (ptr != NULL) {
        ret[i] = ptr;
        i++;
        ptr = strtok(NULL, delimiter);
    }
}

int main(void) {
    char line[MAX_LINE_LEN];
    static char *temparr[MAX_LINE_LEN] = {0};

    while (fgets(line, sizeof(line), stdin)) {
        tokenize(line, temparr);
    }

    int i = 0;
    while (temparr[i]) {
        printf("%s\n", temparr[i]);
        i++;
    }
}

Input:

There was nothing else to do. 
The deed had already been done and there was no going back. 
It now had been become a question of how they were going to be able to get out of this situation and escape.

The output looks correct:

There
was
nothing
else
to
do.
The
deed
had
already
been
done
and
there
was
no
going
back.
It
now
had
been
become
a
question
of
how
they
were
going
to
be
able
to
get
out
of
this
situation
and
escape.

But when the lines are separated by newlines:

There was nothing else to do. 
The deed had already been done and there was no going back. 
It now had been become a question of how they were going to be able to get out of this situation and escape.

it returns only the tokens of the last line:

It
now
had
been
become
a
question
of
how
they
were
going
to
be
able
to
get
out
of
this
situation
and
escape.

When the last line is short:

There was nothing else to do. 
The deed had already been done and there was no going back. 
It now had been become a question of how they were going to be able to 
get out of this situation and escape.

the returned array is:

get
out
of
this
situation
and
escape.
pe.

they
were
going
to
be
able
to

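I assume I am making a mistake in how I loop over fgets(), but I am not sure why, or how to get the first output again. I tried including "\n" as one of the delimiters, but it did not seem to change anything.

I have also been told that strtok() is unsafe (not thread-safe, modifies the original string, ...). I am not sure how that comes into play here, but what are the alternatives?

(Test paragraphs taken from https://randomwordgenerator.com/paragraph.php)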

c tokenize
1 Answer

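You must allocate separate storage for each token; otherwise every stored pointer points into the input buffer line, which each call to fgets() overwrites. That is why only the tokens of the last line survive, and why stale entries such as "pe." appear when the last line is shorter than an earlier one: those array slots still hold pointers saved while tokenizing a previous line.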
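The aliasing can be seen in isolation with a small sketch. The function name stale_token_demo is hypothetical, and strcpy() stands in for fgets() refilling the buffer; it is an illustration of the effect, not part of the original program.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical demo: returns 1 if a token pointer saved from the first
   line ends up showing text from the second line. */
int stale_token_demo(void) {
    char line[64];                        /* stands in for the fgets buffer */
    const char delims[] = " \t\r\n\v\f";

    strcpy(line, "There was nothing\n");  /* first "line read" */
    char *saved = strtok(line, delims);   /* points at line[0]: "There" */
    printf("before: %s\n", saved);        /* prints "before: There" */

    strcpy(line, "The deed\n");           /* second read overwrites the buffer */
    strtok(line, delims);                 /* terminates "The" in place */
    printf("after:  %s\n", saved);        /* same address, prints "after:  The" */

    return strcmp(saved, "The") == 0;
}
```

The pointer saved from the first line never changed; only the bytes it points at did.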

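In addition, you should pass the index at which to start adding tokens, so that each call appends to the array instead of restarting at index 0, and the array length, so that the function never writes past the end of the array.

Here is a modified version: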

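#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strtok; strdup is POSIX (standard C as of C23) */

#define MAX_LINE_LEN 1000
#define MAX_TOKEN 1000

const char delimiter[] = " \t\r\n\v\f";

/* Append copies of the tokens in string to ret, starting at index i and
   never storing more than n pointers. Returns the updated token count. */
size_t tokenize(char *string, char *ret[], size_t i, size_t n) {
    char *ptr;
    while (i < n && (ptr = strtok(string, delimiter)) != NULL) {
        ret[i] = strdup(ptr);   /* own copy, survives the next fgets() */
        if (ret[i] == NULL) {
            break;              /* out of memory: stop early */
        }
        i++;
        string = NULL;          /* keep scanning the same line */
    }
    return i;
}

int main(void) {
    char line[MAX_LINE_LEN];
    char *temparr[MAX_TOKEN];
    size_t n = 0;

    /* n carries over between calls, so tokens accumulate across lines */
    while (fgets(line, sizeof(line), stdin)) {
        n = tokenize(line, temparr, n, MAX_TOKEN);
    }

    for (size_t i = 0; i < n; i++) {
        printf("%s\n", temparr[i]);
    }
    for (size_t i = 0; i < n; i++) {
        free(temparr[i]);
    }
    return 0;
}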
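
As for alternatives to strtok(): POSIX provides the re-entrant strtok_r(), and the standard strspn() and strcspn() functions can find token boundaries without modifying the input at all. Below is a minimal sketch of the second approach. The name tokenize_copy is hypothetical, and malloc() plus memcpy() is used here in place of strdup(); treat it as one possible shape rather than a definitive implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends copies of the tokens in s to ret, starting
   at index i and never storing more than n pointers in total.
   Returns the new token count. s itself is not modified. */
size_t tokenize_copy(const char *s, char *ret[], size_t i, size_t n) {
    static const char delims[] = " \t\r\n\v\f";

    while (i < n) {
        s += strspn(s, delims);           /* skip leading delimiters */
        if (*s == '\0') {
            break;                        /* no more tokens on this line */
        }
        size_t len = strcspn(s, delims);  /* length of the next token */
        char *copy = malloc(len + 1);
        if (copy == NULL) {
            break;                        /* out of memory: stop early */
        }
        memcpy(copy, s, len);
        copy[len] = '\0';
        ret[i++] = copy;                  /* caller frees each entry */
        s += len;
    }
    return i;
}
```

It takes the same index and length arguments as tokenize() above, so the loop would call n = tokenize_copy(line, temparr, n, MAX_TOKEN); and free each entry at the end as before. Because it keeps no hidden state between calls, it is also safe to use from several threads on different buffers.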