Transcription

Tableau Spark SQL Setup Instructions

1. Prerequisites
2. Configuring Hive
3. Configuring Spark & Hive
4. Starting the Spark Service and the Spark Thrift Server
5. Connecting Tableau to Spark SQL
    5A. Install Tableau DevBuild 8.3.1
    5B. Install the Spark SQL ODBC
    5C. Opening a Spark SQL ODBC Connection
6. Appendix: SparkSQL 1.1 Patch Installation Steps
    6A. Pre-Requisites
    6B. Apache Hadoop Install
        Install Java
        Install Hadoop
        Edit Config Files
        Start Hadoop and format namenode
        Create HDFS directories
        Install PostgreSQL
        Install and configure Hive
        Configure PostgreSQL as Hive metastore
        Start metastore and hiveserver2 services
        To Shutdown Hadoop
        To Start Hadoop
        Loading TestV1 v2 data via Beeline
        Spark SQL 1.1.patch Install

1. Prerequisites

There are a number of prerequisites for running Tableau with Spark SQL. The main requirements are:

Server Side:
    Spark V1.2 (please use the 1.2 branch at https://github.com/apache/spark/tree/branch-1.2)
    Hadoop V2.4 or higher
    Hive V0.12 or V0.13
Client Side:
    Tableau 8.3.1

    Simba Spark ODBC Driver V1.0.4: http://databricks.com/spark-odbc-driver-download

2. Configuring Hive

No special Hive configuration is needed for use with Spark SQL. If installing from scratch, you can follow the Appendix 6B steps for our sample Spark cluster configuration.

3. Configuring Spark & Hive

No special Spark configuration is needed; the defaults will get you up and running. See Appendix 6B for our sample cluster configuration.

4. Starting the Spark Service and the Spark Thrift Server

Verify that HiveServer2 is running and that you are using PostgreSQL or MySQL as the metastore, then run the following:

    $SPARK_HOME/sbin/start-master.sh
    $SPARK_HOME/sbin/start-slaves.sh
    $SPARK_HOME/sbin/start-thriftserver.sh --master spark://localhost:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host=localhost --hiveconf hive.server2.thrift.port=10001

Note: port 10001 was chosen arbitrarily.

5. Connecting Tableau to Spark SQL

5A. Install Tableau DevBuild 8.3.1

The first thing you must do is install the latest version of Tableau; anything 8.3.1 or later should work. The Spark SQL connection will be hidden in the product unless you install a special license key. Please e-mail Jackie Clough if you do not have the special license key. To install a new key, go to Help > Manage Product Keys.

5B. Install the Spark SQL ODBC

To install the Spark SQL ODBC driver, simply open the appropriate version of the driver for your system and follow the instructions:

    Windows 64-bit: SimbaSparkODBC64.msi
    Windows 32-bit: SimbaSparkODBC32.msi
    Mac OS X: SimbaSparkODBC.dmg

When installing the Mac driver, you may get a message that says “SimbaSparkODBC.dmg can’t be opened because it is from an unidentified developer.” To allow the driver to be installed, go to Applications > System Preferences > Security & Privacy > General tab and click Open Anyway.

5C. Opening a Spark SQL ODBC Connection

If you have properly installed Tableau and the special license key, you should see Spark SQL (Beta) as one of the connection options after clicking Connect to Data. Select Spark SQL (Beta) to open the connection dialog box.

The parameters you need to enter include:

    Server: server name or IP address of your Spark server
    Port: the port your Spark Thrift server listens on (10001 in the Section 4 example)
    Type: Spark ThriftServer (Spark 1.1 and later)
    Authentication: User Name
    User Name: leave blank
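If Tableau cannot connect, it can help to first confirm that the Thrift server accepts JDBC connections on the same server and port. The sketch below is not part of the original instructions: it assumes the Beeline client found in Spark's bin directory, that SPARK_HOME is set, and the localhost:10001 values from Section 4. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; host and port default to the Section 4 example values.

# Build the HiveServer2-style JDBC URL used to reach the Spark Thrift server.
thrift_jdbc_url() {
    host="${1:-localhost}"
    port="${2:-10001}"
    printf 'jdbc:hive2://%s:%s\n' "$host" "$port"
}

# Run a trivial statement through Beeline. A table listing (possibly empty)
# suggests the server is reachable. Assumes SPARK_HOME is set.
check_thrift_server() {
    "$SPARK_HOME/bin/beeline" -u "$(thrift_jdbc_url "${1:-}" "${2:-}")" -e 'SHOW TABLES;'
}

# Example (not run automatically):
#   check_thrift_server localhost 10001
```

If the check fails from the Spark host itself, the problem is more likely on the server side than in the Tableau or ODBC driver configuration.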

6. Appendix: Spark SQL 1.1.x Installation Steps

6A. Pre-Requisites

Sample:
    OS: CentOS 6.5
    CPU: 2 dual core
    RAM: 16GB

6B. Apache Hadoop Install:

Install Java:

Copy/scp the Java rpm file jdk-7u25-linux-x64.rpm to /tmp and extract/install with rpm:

    rpm -Uvh jdk-7u25-linux-x64.rpm

Set environment variables:

    export JAVA_HOME=/usr/java/jdk1.7.0_25/

Verify java version:

    java -version

Install Hadoop:

    ssh-keygen -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ssh localhost
    ssh <actual server name>
    useradd hadoop
    cd /
    wget rent/hadoop2.4.1.tar.gz
    tar xzvf hadoop-2.4.1.tar.gz
    chown -R hadoop:hadoop /hadoop-2.4.1

Edit Config Files

Edit the following config files located in /hadoop-2.4.1/etc/hadoop. These will vary depending on your environment but are provided here as a sample:

core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:8020</value>
        <final>true</final>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop_data</value>
        <description>A base for other temporary directories.</description>
      </property>
    </configuration>

mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>ass</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <!-- To increase number of apps that can run in YARN -->
      <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
      </property>
      <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
      </property>
    </configuration>

In addition, add the following environment variables to ~/.bashrc:

    export JAVA_HOME=/usr/java/jdk1.7.0_25
    export HADOOP_PREFIX=/hadoop-2.4.1
    export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_CONF_DIR
    export PATH=$PATH:$HADOOP_PREFIX/bin
    export HADOOP_INSTALL=/hadoop-2.4.1
    export HADOOP_HOME=/hadoop-2.4.1
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/"
    export HIVE_HOME=/usr/local/hive-0.12.0/
    export PATH=$PATH:$HIVE_HOME/bin
    export SPARK_MASTER_PORT=7077

Start Hadoop and format namenode

    /hadoop-2.4.1/bin/hdfs namenode
    ry-daemon.sh start historyserver

Create HDFS directories

    kdir
    -p /user/root
    -p /user/hive
    -p /user/hive/metastore
    /user/anonymous

Install PostgreSQL

Edit /etc/yum.repos.d/CentOS-Base.repo by adding "exclude=postgresql*" to the "[base]" and "[updates]" sections, then run:

    wget http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-centos93-9.3-1.noarch.rpm
    rpm -Uvh pgdg-centos93-9.3-1.noarch.rpm
    yum install postgresql93-server
    service postgresql-9.3 initdb
    chkconfig postgresql-9.3 on
    service postgresql-9.3 start

Install and configure Hive

    cd /usr/local
    wget hive-0.12.0.tar.gz
    tar xvf hive-0.12.0.tar.gz

Configure /usr/local/hive-0.12.0/conf/hive-site.xml to look something like this:

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost/metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mypassword</value>
      </property>
      <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
      </property>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://localhost:9083</value>
        <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
      </property>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/metastore</value>
      </property>
    </configuration>

Configure PostgreSQL as Hive metastore

Set "standard_conforming_strings" to off in /var/lib/pgsql/9.3/data/postgresql.conf:

    standard_conforming_strings = off
    listen_addresses = '*'

Allow remote access by adding the following to /var/lib/pgsql/9.3/data/pg_hba.conf under the IPv6 section:

    host    all    all    0.0.0.0    0.0.0.0    password

Restart service

    service postgresql-9.3 restart

Install PostgreSQL JDBC Driver

    yum install postgresql-jdbc
    ln -s /usr/share/java/postgresql-jdbc.jar /usr/local/hive/lib/postgresql-jdbc.jar

Create metastore database and user account

    su - postgres
    psql
    CREATE USER hiveuser WITH PASSWORD 'password';
    CREATE DATABASE metastore;
    \c metastore;
    \i ostgres/hiveschema-0.12.0.postgres.sql
    \o /tmp/grant-privs
    SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "' || schemaname || '"."' || tablename || '" TO hiveuser;'
    FROM pg_tables
    WHERE tableowner = CURRENT_USER AND schemaname = 'public';
    \o
    \i /tmp/grant-privs

The password given in CREATE USER must match the javax.jdo.option.ConnectionPassword value in hive-site.xml.

Verify connection with hive user

    psql -h myhost -U hiveuser -d metastore

Create softlink to hive-site.xml file

    ln -s /hadoop-2.4.1/etc/hadoop/hive-site.xml /usr/local/hive-0.12.0/conf/hive-site.xml

To Shutdown Hadoop:

    /hadoop-2.4.1/sbin/mr-jobhistory-daemon.sh stop historyserver
    /hadoop-2.4.1/sbin/stop-yarn.sh
    /hadoop-2.4.1/sbin/stop-dfs.sh

To Start Hadoop:

    /hadoop-2.4.1/sbin/start-dfs.sh
    /hadoop-2.4.1/sbin/start-yarn.sh
    /hadoop-2.4.1/sbin/mr-jobhistory-daemon.sh start historyserver

Start metastore and hiveserver2 services

    mkdir /var/log/hive
    nohup hive --service metastore > /var/log/hive/metastore.log &
    nohup hive --service hiveserver2 > /var/log/hive/hiveserver2.log &

Spark SQL 1.1.x Install

Build Spark from source with Maven:

    cd /opt
    wget 3/binaries/apache-maven-3.2.3-bin.tar.gz
    tar xvf apache-maven-3.2.3-bin.tar.gz
    mv apache-maven-3.2.3 /opt/maven
    ln -s /opt/maven/bin/mvn /usr/bin/mvn
    vim /etc/profile.d/maven.sh

Add the following contents:

    #!/bin/bash
    MAVEN_HOME=/opt/maven
    PATH=$MAVEN_HOME/bin:$PATH
    export PATH MAVEN_HOME
    export CLASSPATH=.

Save and close the file. Make it executable using the following command:

    chmod +x /etc/profile.d/maven.sh

Then load the environment variables into the current shell by running the following command (later login shells pick them up from /etc/profile.d automatically):

    source /etc/profile.d/maven.sh

Get Spark source code

    mkdir /usr/local/spark-1.1.x-bin-hadoop2.4

    cd /usr/local/spark-1.1.x-bin-hadoop2.4
    wget https://github.com/apache/spark/archive/master.zip
    unzip master.zip
    mv spark-master/* /usr/local/spark-1.1.x-bin-hadoop2.4/
    cd /usr/local/spark-1.1.x-bin-hadoop2.4
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

Wait for the compiler to finish.

Configure Spark:

The following is optional, as the default settings work just fine. Edit v.sh and add the following, which will vary depending on your environment. Add the following under the Yarn configurations:

    SPARK_EXECUTOR_CORES=4     # Number of cores for the workers (Default: 1)
    SPARK_EXECUTOR_MEMORY=4G   # Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
    SPARK_DRIVER_MEMORY=4G     # Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)

Starting Spark:

To start the Spark master/worker and the hive-thriftserver connector, run the following:

    ster.sh
    aves.sh
    riftserver.sh --master spark://localhost:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host=localhost --hiveconf hive.server2.thrift.port=10001
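The start sequence from Section 4 can be wrapped in a small function so that a failing script stops the sequence instead of starting later services against a missing master. This is only a sketch: the function name is hypothetical, and it assumes SPARK_HOME points at the Spark build and CLASSPATH is already set, with the master at spark://localhost:7077 as in the sample configuration.

```shell
#!/bin/sh
# Hypothetical wrapper around the three start scripts used in this guide.
# Assumes SPARK_HOME is set and CLASSPATH holds any extra driver jars.

start_spark_stack() {
    bind_host="${1:-localhost}"   # Thrift server bind host
    port="${2:-10001}"            # Thrift server port (arbitrary choice)

    # Stop at the first script that fails.
    "$SPARK_HOME/sbin/start-master.sh" || return 1
    "$SPARK_HOME/sbin/start-slaves.sh" || return 1
    "$SPARK_HOME/sbin/start-thriftserver.sh" \
        --master spark://localhost:7077 \
        --driver-class-path "${CLASSPATH:-}" \
        --hiveconf "hive.server2.thrift.bind.host=$bind_host" \
        --hiveconf "hive.server2.thrift.port=$port"
}

# Example (not run automatically):
#   start_spark_stack localhost 10001
```

Keeping the bind host and port as arguments makes it easier to use the same values here and in the Tableau connection dialog.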
