Hadoop Metadata (Fsimage) Merge Exception and How to Fix It

  These past few days I have been watching the logs on the Standby NN, and noticed that every time an Fsimage merge (checkpoint) finishes, the step in which the Standby NN tells the Active NN to download the newly merged Fsimage fails with the following exception:

2014-04-23 14:42:54,964 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:268)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:162)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:174)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)

  Out of habit I Googled the error, but after a long search I could not find anything similar, so I had to work it out myself. Going through the logs revealed something even stranger: the time of the last checkpoint never advanced; it was always stuck at the very first checkpoint taken when the Standby NN started, as shown below:

2014-04-23 14:50:54,429 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 70164 seconds since the last checkpoint, which exceeds the configured interval 600
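
For reference, the "configured interval 600" in that message is the checkpointer's period in seconds, which would normally be set through dfs.namenode.checkpoint.period (with dfs.namenode.checkpoint.txns as the companion transaction-count trigger). The snippet below is only a sketch of how such a setting might look in hdfs-site.xml, not the author's actual configuration:

<property>
         <name>dfs.namenode.checkpoint.period</name>
         <!-- seconds between checkpoints; matches the "interval 600" in the log above -->
         <value>600</value>
</property>

<property>
         <name>dfs.namenode.checkpoint.txns</name>
         <!-- or checkpoint once this many uncheckpointed transactions have accumulated -->
         <value>1000000</value>
</property>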

Could this be a Hadoop bug? I went into the source code following the error messages above, and after careful analysis found that both messages are produced by the StandbyCheckpointer class:

private void doWork() {
      // Reset checkpoint time so that we don't always checkpoint
      // on startup.
      lastCheckpointTime = now();
      while (shouldRun) {
        try {
          Thread.sleep(1000 * checkpointConf.getCheckPeriod());
        } catch (InterruptedException ie) {
        }
        if (!shouldRun) {
          break;
        }
        try {
          // We may have lost our ticket since last checkpoint, log in again, 
          // just in case
          if (UserGroupInformation.isSecurityEnabled()) {
            UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
          }
          
          long now = now();
          long uncheckpointed = countUncheckpointedTxns();
          long secsSinceLast = (now - lastCheckpointTime)/1000;
          
          boolean needCheckpoint = false;
          if (uncheckpointed >= checkpointConf.getTxnCount()) {
            LOG.info("Triggering checkpoint because there have been " + 
                uncheckpointed + " txns since the last checkpoint, which " +
                "exceeds the configured threshold " +
                checkpointConf.getTxnCount());
            needCheckpoint = true;
          } else if (secsSinceLast >= checkpointConf.getPeriod()) {
            LOG.info("Triggering checkpoint because it has been " +
                secsSinceLast + " seconds since the last checkpoint, which " +
                "exceeds the configured interval " + checkpointConf.getPeriod());
            needCheckpoint = true;
          }
          
          synchronized (cancelLock) {
            if (now < preventCheckpointsUntil) {
              LOG.info("But skipping this checkpoint since we are about"+
              " to failover!");
              canceledCount++;
              continue;
            }
            assert canceler == null;
            canceler = new Canceler();
          }
          
          if (needCheckpoint) {
            doCheckpoint();
            lastCheckpointTime = now;
          }
        } catch (SaveNamespaceCancelledException ce) {
          LOG.info("Checkpoint was cancelled: " + ce.getMessage());
          canceledCount++;
        } catch (InterruptedException ie) {
          // Probably requested shutdown.
          continue;
        } catch (Throwable t) {
          LOG.error("Exception in doCheckpoint", t);
        } finally {
          synchronized (cancelLock) {
            canceler = null;
          }
        }
      }
    }

  The exception above is thrown while doCheckpoint() is executing and is then caught by the catch (Throwable t) branch, which is what logs "Exception in doCheckpoint". Because the exception interrupts the try block, the statement lastCheckpointTime = now; right after the doCheckpoint() call is never reached, so secsSinceLast keeps growing and the "Triggering checkpoint" message always reports the time of the very first checkpoint. So why does doCheckpoint() fail in the first place? Following the stack trace, the failure comes from this statement in the doGetUrl method of the TransferFsImage class:

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {

  The connection times out because it never receives a response code from the other side. My first thought was that my cluster's socket timeout settings were wrong, but after a lot of checking that turned out not to be the case. So I went back to the code and noticed the following setup just before the statement above:

 if (timeout <= 0) {
      Configuration conf = new HdfsConfiguration();
      timeout = conf.getInt(DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
          DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
}

if (timeout > 0) {
      connection.setConnectTimeout(timeout);
      connection.setReadTimeout(timeout);
}

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
      throw new HttpGetFailedException(
          "Image transfer servlet at " + url +
          " failed with status code " + connection.getResponseCode() +
          "\nResponse message:\n" + connection.getResponseMessage(),
          connection);
}

DFS_IMAGE_TRANSFER_TIMEOUT_KEY corresponds to the dfs.image.transfer.timeout parameter, whose default value is 10 * 60 * 1000 milliseconds. The property's documentation reads:

Timeout for image transfer in milliseconds. This timeout and the related dfs.image.transfer.bandwidthPerSec parameter should be configured such that normal image transfer can complete within the timeout. This timeout prevents client hangs when the sender fails during image transfer, which is particularly important during checkpointing. Note that this timeout applies to the entirety of image transfer, and is not a socket timeout.

That is where the problem lies: this parameter is tightly coupled with dfs.image.transfer.bandwidthPerSec. The Active NN must be able to finish downloading the merged Fsimage from the Standby NN within dfs.image.transfer.timeout, otherwise the exception above is thrown. I then checked my own configuration:

<property>
         <name>dfs.image.transfer.timeout</name>
         <value>60000</value>
</property>

<property>
         <name>dfs.image.transfer.bandwidthPerSec</name>
         <value>1048576</value>
</property>

A 60-second timeout with a transfer rate capped at 1 MB per second: the metadata on my cluster is over 800 MB, so there is clearly no way the copy can finish within 60 seconds. After I raised dfs.image.transfer.timeout and watched the cluster for a while, the exception never appeared again, and some other long-standing errors were resolved by this change as well.
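
The post does not say what value dfs.image.transfer.timeout was finally raised to. As a rough check of the numbers: at the configured 1048576 bytes per second (1 MB/s), an 800+ MB fsimage needs more than 800 seconds to transfer, far beyond a 60-second timeout, so the timeout (in milliseconds) must end up comfortably larger than the image size divided by bandwidthPerSec, or the bandwidth cap must be raised. The following is only a sketch of one workable combination, with illustrative values rather than the author's actual settings:

<property>
         <name>dfs.image.transfer.timeout</name>
         <!-- 30 minutes: 800+ MB at 1 MB/s needs 800+ seconds, plus headroom -->
         <value>1800000</value>
</property>

<property>
         <name>dfs.image.transfer.bandwidthPerSec</name>
         <!-- optionally also allow 10 MB/s so the image transfers in well under 100 seconds -->
         <value>10485760</value>
</property>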

2 comments
  1. Hi, I'm a newbie. How do you actually look at the metadata you mention? You don't just mean the size of the files you uploaded, do you?

    胡洋洋 2015-01-05 21:24

    • The location of the metadata (edit logs) is configured with the dfs.namenode.edits.dir parameter, but the files cannot be read directly. Starting with Hadoop 2.3.0 there is an Edits Viewer that can be used to inspect them.

      w397090770 2015-01-06 09:47