Apache Sqoop incremental saved job settings

Question

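For saved Sqoop jobs, the documentation states that only newer records are imported on each run. That's fine.

However, what if we use a saved job and want to compare with >= the last saved value, for example in a database with concurrent usage, where timestamp-related errors can occur with this data type? Can we influence the saved value or the comparison, i.e. use >= instead of >?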

Answer 1

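We also tried incremental loads with a Sqoop job based on the last timestamp. The challenge you will face is that if the Sqoop service restarts, it loses track of the timestamp or whatever control variable you are using.

I suggest adjusting the import query instead. For example, to import table_1 incrementally from an RDBMS into Hive, use something like the following in a bash shell: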

    # Get the current max id from the Hive table
    maxCount=$(hive -S -e "SELECT MAX(id) FROM hivedb.table_1" | head -1 | cut -d ' ' -f1)

    # If maxCount is not a number (e.g. NULL for an empty table), make it zero.
    re='^[0-9]+$'
    if ! [[ $maxCount =~ $re ]]; then
        maxCount=0
    fi

    # Build the SQL query
    sql_query="SELECT col1, col2, ..., coln FROM table_1 (NOLOCK) WHERE id > ${maxCount}"

    # Run the Sqoop import.
    # A free-form --query import also requires --target-dir; the path below is a placeholder.
    sqoop import \
        --connect 'jdbc:jdbcUrl;UserName=usrname;password=password;database=dbName' \
        --query "$sql_query AND \$CONDITIONS" \
        -m 4 --split-by id \
        --target-dir /tmp/sqoop/table_1 \
        --hive-table hivedb.table_1 --hive-import
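
Since the query is built by hand, the comparison operator is under your control, which is what the question asks for. The following is only a minimal sketch of that idea: the function name `build_incremental_query` and the `last_modified` column are assumptions for illustration and are not part of Sqoop or of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a free-form query for an incremental import.
# table_1 comes from the example above; last_modified is an assumed column name.
build_incremental_query() {
    local last_value="$1"      # boundary value taken from the previous run
    local operator="${2:->=}"  # comparison operator; defaults to the inclusive >=

    # Only allow the two comparisons discussed in the question.
    case "$operator" in
        '>'|'>=') ;;
        *) echo "unsupported operator: $operator" >&2; return 1 ;;
    esac

    # \$CONDITIONS must reach Sqoop literally, so the dollar sign is escaped.
    printf "SELECT * FROM table_1 WHERE last_modified %s '%s' AND \$CONDITIONS\n" \
        "$operator" "$last_value"
}
```

The result could then be passed to the import as `--query "$(build_incremental_query "$last_ts")"`, where `last_ts` is whatever boundary value you read back from Hive. Note that with >= the rows sitting exactly on the boundary are imported again on the next run, so the target table would need some form of de-duplication, for example on the primary key.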
