OraclePropertyGraphDataLoader loadData from HDFS

Question

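I'm using Spark + Hive to build graphs and relationships and to export flat OPV/OPE files to HDFS, one OPV/OPE CSV per reducer. Our whole graph database can then be loaded into OPG/PGX for analytics, which works like a charm.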

Now we want to load these vertices/edges into Oracle Property Graph.

I dumped the file names from HDFS this way:

$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.opv/*.csv' | xargs -I{}  echo 'hdfs://'{} > opvs.lst
$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.ope/*.csv' | xargs -I{}  echo 'hdfs://'{} > opes.lst

I'm experimenting in the Groovy shell and have run into some problems and doubts:

opvs = new File('opvs.lst') as String[]
opes = new File('opes.lst') as String[]

opgdl.loadData(opg, opvs, opes, 72)

This doesn't work out of the box; I get errors like

java.lang.IllegalArgumentException: loadData: part-00000-f97f1abf-5f69-479a-baee-ce0a7bcaa86c-c000.csv flat file does not exist

I'll try to work around this with the InputStream variant available in the loadData interface, and I hope that solves it, but I have some questions/suggestions:

Does loadData support a VFS, so that I can load 'hdfs://...' files directly?
Wouldn't it be nice to have glob syntax in file names, so that we could do something like:

opgdl.loadData(opg, 'hdfs:///user/felipeferreira/opvs/**/*.csv' ...

Thanks in advance!

Answer 1

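You can use an alternate API in OraclePropertyGraphDataLoader that lets you specify InputStream objects for the opv/ope files to load. That way, you can use FSDataInputStream objects to read the files from your HDFS environment.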

A small sample follows:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ====== Init HDFS File System Object
// hdfsuri is a String holding your HDFS URI; define it before running this
Configuration conf = new Configuration();
// Set FileSystem URI
conf.set("fs.defaultFS", hdfsuri);
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Set HADOOP user
System.setProperty("HADOOP_USER_NAME", "hdfs");
System.setProperty("hadoop.home.dir", "/");

// Get the filesystem - HDFS
FileSystem fs = FileSystem.get(URI.create(hdfsuri), conf);

// Read files into InputStreams using the HDFS FSDataInputStream Java API
Path pathOPV = new Path("/path/to/file.opv");
FSDataInputStream inOPV = fs.open(pathOPV);
Path pathOPE = new Path("/path/to/file.ope");
FSDataInputStream inOPE = fs.open(pathOPE);

cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()

opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance();
opgdl.loadData(opg, inOPV, inOPE, 100);

Let us know if this works for you.
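The sample above opens a single vertex file and a single edge file, while the question has one CSV per reducer. One possible way to bridge that, sketched here in plain Java, is to chain the part streams with java.io.SequenceInputStream so that each side reads as one stream. This assumes the part files have no header row and that each ends with a newline, neither of which is confirmed in this thread. The class name PartStreams and its helpers are hypothetical, and the in-memory byte arrays only stand in for the FSDataInputStream objects returned by fs.open(...).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the Oracle or Hadoop APIs.
public class PartStreams {

    /** Chains the given streams so they read as one continuous stream, in list order. */
    public static InputStream concat(List<? extends InputStream> parts) {
        // SequenceInputStream drains each stream, closes it, then moves to the next one.
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Reads a stream to its end as UTF-8 text and closes it (used only for the demo). */
    public static String readAll(InputStream in) {
        try (InputStream src = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-ins for HDFS part files; the row contents are made up.
        List<InputStream> parts = new ArrayList<>();
        parts.add(new ByteArrayInputStream("row-a\n".getBytes(StandardCharsets.UTF_8)));
        parts.add(new ByteArrayInputStream("row-b\n".getBytes(StandardCharsets.UTF_8)));

        // Prints the rows of both parts, in list order.
        System.out.print(readAll(concat(parts)));
    }
}
```

If those assumptions hold, the two chained streams could be passed where inOPV and inOPE appear in the loadData call above. Whether one chained stream loads as fast as many separate files has not been measured here.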

Answer 2

For the record, here is the solution we adopted:

Mount HDFS through the NFS gateway on a folder below the Groovy shell.

Export the file names to the OPV/OPE file lists:

$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".ope" > opes.lst
$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".opv" > opvs.lst
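The find/grep commands above could also be reproduced inside the JVM, which gets close to the glob syntax the question asked for. The following is a minimal sketch using java.nio.file's glob PathMatcher against the NFS-mounted folder. The class name FlatFileLister and the glob patterns are assumptions for illustration; note that a pattern such as `**/*.opv/*.csv` only matches when the .opv directory sits at least one level below the root, as it does in the question's layout.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the Oracle or Hadoop APIs.
public class FlatFileLister {

    /**
     * Walks root recursively and returns the regular files whose path,
     * relative to root, matches the glob. Result is sorted absolute paths.
     */
    public static String[] list(Path root, String glob) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> walk = Files.walk(base)) {
            List<String> hits = walk
                .filter(Files::isRegularFile)
                .filter(p -> matcher.matches(base.relativize(p)))
                .map(Path::toString)
                .sorted()
                .collect(Collectors.toList());
            return hits.toArray(new String[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        // Default mount point taken from the find commands above; override with args[0].
        Path root = Paths.get(args.length > 0 ? args[0] : "../hadoop/user/felipeferreira/dadossinapse");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        String[] opvs = list(root, "**/*.opv/*.csv");
        String[] opes = list(root, "**/*.ope/*.csv");
        System.out.println(opvs.length + " vertex files, " + opes.length + " edge files");
    }
}
```

The resulting String arrays have the same shape as the ones read from opvs.lst and opes.lst, so they could stand in for those lists without the intermediate files.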

Then loading the data onto OPG/HBase is as simple as this:

cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()

opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance()

opvs = new File("opvs.lst") as String[]
opes = new File("opes.lst") as String[]

opgdl.loadData(opg, opvs, opes, 100)

This seems to be bottlenecked by the NFS gateway, but we will evaluate that next week.

The graph data loading has been running just fine so far. If anyone can suggest a better approach, please let me know!
