Inceptor Data Import and Export with INSERT OVERWRITE

2020-09-02 Other Common Questions

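Overview

For important data, many scenarios require copying the data warehouse. The copy can cover an entire database or something smaller, such as a single table or partition. This example demonstrates import and export by running INSERT OVERWRITE DIRECTORY and then creating an external table with SQL.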

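Details

Inceptor data import and export methods fall roughly into the following categories:

Export/Import
INSERT OVERWRITE DIRECTORY, followed by creating an external table
HDFS get and put, combined with Inceptor LOAD or LOCATION

This example covers importing and exporting Inceptor data with INSERT OVERWRITE DIRECTORY, in roughly four steps:

1. INSERT into the source cluster's HDFS, then get the files to the local file system
2. scp the files from the source cluster to the target cluster, then put them into the target cluster's HDFS
3. On the target cluster, create an external table with LOCATION specified, or LOAD the data into an external table
4. On the target cluster, create an ordinary non-partitioned ORC table, then INSERT INTO ... SELECT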

1. INSERT into the source cluster's HDFS, then get the files to the local file system

a. Write to the source HDFS file system. Syntax:

INSERT OVERWRITE DIRECTORY '/crmdev/orc_unpart/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM crmdev.orc_unpart;

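Notes:

This statement writes the query result to a directory rather than a file; the result may consist of multiple files.
To write to the local file system, add the LOCAL keyword; the path is then inside the pod where the server runs.
ROW FORMAT specifies the row format of the files; if omitted, the default is used.
STORED AS specifies the file format; if omitted, the default is used.

Caveats:

For this procedure, the target directory must be on the HDFS file system.
A delimiter must be specified; the default delimiter is SOH, which is not recognized on import.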
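
SOH is the control character with octal code 001 (Ctrl-A), so it is invisible in most editors and easy to miss. The following is a minimal sketch for inspecting which delimiter an exported file actually uses; the sample row and the helper names show_delimiters and count_fields are made up for illustration and are not part of Inceptor.

```shell
# Build a one-row sample delimited by SOH (octal 001). The values are invented.
soh="$(printf '\001')"
sample_file="$(mktemp)"
printf '1\001600000\001demo\n' > "$sample_file"

# cat -v renders control characters visibly; SOH shows up as ^A.
show_delimiters() {
  cat -v "$1"
}

# Count the fields on the first line for a given single-character delimiter:
# number of delimiter characters plus one.
count_fields() {
  n="$(head -n 1 "$1" | tr -cd "$2" | wc -c)"
  echo $((n + 1))
}

show_delimiters "$sample_file"      # prints 1^A600000^Ademo
count_fields "$sample_file" ','     # prints 1: a comma-based reader sees one column
count_fields "$sample_file" "$soh"  # prints 3: the row really has three fields
```

Running the same check on a file fetched from the export directory should show commas rather than ^A when FIELDS TERMINATED BY ',' was specified.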

b. Get the files to the source cluster's local file system

$ hadoop fs -get /crmdev/orc_unpart/
2020-04-07 11:01:00,360 INFO util.KerberosUtil: Using principal pattern: HTTP/_HOST
$ ls -l
total 4
drwxr-xr-x 2 root root 4096 Apr  7 11:01 orc_unpart

2. scp the files from the source cluster to the target cluster, then put them into the target cluster's HDFS

$ scp -r orc_unpart/ 172.22.22.24:/mnt/disk1/crmdev/
root@172.22.22.24's password: 
000000_0                            100%   17MB  17.3MB/s   00:00
000001_0                            100%   17MB  17.2MB/s   00:00
000002_0                            100%   17MB  17.1MB/s   00:00
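
Before uploading to HDFS, it can be worth confirming that the copied files arrived intact. This optional step is not part of the original procedure. The sketch below assumes GNU coreutils sha256sum is available on both hosts; the function names and the manifest file are hypothetical. The manifest is kept outside the exported directory so that it is not uploaded into the table location later.

```shell
# Write a checksum manifest for every file in a directory (run on the source host).
# $1: exported directory, $2: manifest path outside that directory
make_manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + ) | sort -k 2 > "$2"
}

# Re-check the files against the manifest (run on the target host after scp).
# $1: copied directory, $2: manifest path
# Returns non-zero if any file is missing or differs.
verify_manifest() {
  # Resolve the manifest to an absolute path before changing directory.
  manifest="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"
  ( cd "$1" && sha256sum --check --quiet "$manifest" )
}

# Example usage (paths taken from this article, manifest name is an assumption):
#   make_manifest orc_unpart orc_unpart.sha256          # on the source host
#   scp orc_unpart.sha256 172.22.22.24:/mnt/disk1/crmdev/
#   verify_manifest orc_unpart orc_unpart.sha256        # on the target host
```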

On the target cluster, upload the data to the target cluster's HDFS:

$ hadoop fs -put orc_unpart/ /crmdev/
2020-04-07 11:34:08,536 INFO util.KerberosUtil: Using principal pattern: HTTP/_HOST

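3. On the target cluster, create an external table with LOCATION specified, or LOAD the data into an external table

The ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' clause in the CREATE EXTERNAL TABLE statement must match the one used in the export statement, INSERT OVERWRITE DIRECTORY '/crmdev/orc_unpart/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM crmdev.orc_unpart;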

CREATE EXTERNAL TABLE crmdev.csv_unpart(
  group_id int DEFAULT NULL, 
  code string DEFAULT NULL, 
  name string DEFAULT NULL, 
  new_price decimal(8,2) DEFAULT NULL, 
  main_percent decimal(8,2) DEFAULT NULL, 
  today_ranking decimal(8,2) DEFAULT NULL, 
  rise_percent decimal(6,2) DEFAULT NULL, 
  fiveday_percent decimal(8,2) DEFAULT NULL, 
  fiveday_ranking decimal(6,2) DEFAULT NULL, 
  fiveday_rise_percent decimal(5,2) DEFAULT NULL, 
  teneday_percent decimal(8,2) DEFAULT NULL, 
  tenday_ranking decimal(6,2) DEFAULT NULL, 
  tenday_rise_percent decimal(5,2) DEFAULT NULL, 
  guild string DEFAULT NULL, 
  code_id string DEFAULT NULL, 
  data_dt date DEFAULT NULL
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS CSVFILE;

Alternatively, point LOCATION at the HDFS directory when creating the table, which avoids a separate LOAD step:

CREATE EXTERNAL TABLE crmdev.csv_unpart(
  group_id int DEFAULT NULL, 
  code string DEFAULT NULL, 
  name string DEFAULT NULL, 
  new_price decimal(8,2) DEFAULT NULL, 
  main_percent decimal(8,2) DEFAULT NULL, 
  today_ranking decimal(8,2) DEFAULT NULL, 
  rise_percent decimal(6,2) DEFAULT NULL, 
  fiveday_percent decimal(8,2) DEFAULT NULL, 
  fiveday_ranking decimal(6,2) DEFAULT NULL, 
  fiveday_rise_percent decimal(5,2) DEFAULT NULL, 
  teneday_percent decimal(8,2) DEFAULT NULL, 
  tenday_ranking decimal(6,2) DEFAULT NULL, 
  tenday_rise_percent decimal(5,2) DEFAULT NULL, 
  guild string DEFAULT NULL, 
  code_id string DEFAULT NULL, 
  data_dt date DEFAULT NULL
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS CSVFILE
LOCATION '/crmdev/orc_unpart/';

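Do not create an ORC table directly over these HDFS files, and do not LOAD them into an ORC table. Neither the LOAD nor the CREATE TABLE with LOCATION reports an error, but queries against the table then throw a format exception: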

java.io.IOException: Malformed ORC file
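
The exported files are delimited text, not ORC, which is why an ORC reader rejects them. ORC files begin with the three-byte magic string ORC, so a quick local check can tell the two apart before a table is defined over them. This is a minimal sketch; the helper name is_orc_file is hypothetical, and it assumes the file has already been fetched to the local file system, for example with hadoop fs -get.

```shell
# Return success if the file starts with the ORC magic bytes, failure otherwise.
is_orc_file() {
  [ "$(head -c 3 "$1")" = "ORC" ]
}

# Example usage against one of the exported files from this article:
#   if is_orc_file orc_unpart/000000_0; then
#     echo "ORC file: can back an ORC table"
#   else
#     echo "not ORC: use a text or CSV external table instead"
#   fi
```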

4. On the target cluster, create an ordinary non-partitioned ORC table, then INSERT INTO ... SELECT

Create the ORC target table on the target cluster:

CREATE TABLE crmdev.orc_unpart(
  group_id int DEFAULT NULL, 
  code string DEFAULT NULL, 
  name string DEFAULT NULL, 
  new_price decimal(8,2) DEFAULT NULL, 
  main_percent decimal(8,2) DEFAULT NULL, 
  today_ranking decimal(8,2) DEFAULT NULL, 
  rise_percent decimal(6,2) DEFAULT NULL, 
  fiveday_percent decimal(8,2) DEFAULT NULL, 
  fiveday_ranking decimal(6,2) DEFAULT NULL, 
  fiveday_rise_percent decimal(5,2) DEFAULT NULL, 
  teneday_percent decimal(8,2) DEFAULT NULL, 
  tenday_ranking decimal(6,2) DEFAULT NULL, 
  tenday_rise_percent decimal(5,2) DEFAULT NULL, 
  guild string DEFAULT NULL, 
  code_id string DEFAULT NULL, 
  data_dt date DEFAULT NULL
)
STORED AS ORC;

Then write the data from the migrated external table into the internal table:

INSERT INTO TABLE crmdev.orc_unpart SELECT * FROM crmdev.csv_unpart;
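
As a sanity check after the insert, the number of rows in the exported files can be compared with the result of SELECT COUNT(*) on the target table. The sketch below counts lines in the local copy of the export directory. It assumes one row per line, which holds only if no field contains an embedded newline; the helper name count_exported_rows is hypothetical.

```shell
# Count the total number of lines across all files in an export directory.
# $1: local directory holding the exported files (for example orc_unpart)
count_exported_rows() {
  find "$1" -type f -exec cat {} + | wc -l | tr -d ' '
}

# Example usage with the directory from this article:
#   count_exported_rows orc_unpart
# Compare the printed number with: SELECT COUNT(*) FROM crmdev.orc_unpart;
```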

This completes the SQL-based migration of a non-partitioned ORC table. In principle, this method supports migrating any table.