Those of us who have troubleshot a VMware environment (or any enterprise infrastructure) know the importance of logs. When your IaaS platform hits a problem, and life somehow likes to make it happen just before you go on holiday with your loved ones, the ability to analyse the logs is essential. In fact, logs are the first thing that VMware Support asks you for. Many other vendors’ support teams will also ask you for logs.
An enterprise-wide Log Management platform is essential if you are operating a DC infrastructure. The larger and more complex your environment, the more critical it becomes. Let’s list the benefits and requirements, starting with the benefits.
No missing logs
- Logs can be rotated out, or your ESXi host can hit a PSOD. If the logs were kept only on the host, you may be left with no log. That’s the end of troubleshooting, as there is nothing to analyse.
- Here is a real-life case where the log was gone.
No need to upload logs
- This is a big time saving, as uploading gigabytes of data is not easy. Your vendor Support Engineer can join a WebEx or other remote session instead. If you are busy attending to other matters, the remote session can be delegated to a colleague, as all that is needed is to facilitate the screen sharing. You can even record it as learning material for your broader team.
- If you are a VMware Mission Critical Support (MCS) customer, I’d encourage you to have regular sessions with your MCS Engineer and go through the key Log Insight dashboards proactively. You should do the same with your key vendors.
- Log Insight comes with hundreds of built-in queries for vSphere. It also has many built-in queries, fields and dashboards for other VMware products, such as SRM, NSX, and VSAN. This speeds up analysis. If you set up alerts, you can be informed before the situation degenerates.
- Non-VMware products are not forgotten. The list of Content Packs keeps growing and improving, and they are released independently of Log Insight.
A unique insight into your environment
- How well do you know your environment? Sure, it is healthy, and the vSphere admin client or vRealize Operations tells you that it is. But what do the logs say? Can you tell your customers with confidence that there is no error or warning lurking in the millions of log entries?
- One reason why VMware Support asks for the log files is that a lot of information is not available, or not readily available, in vCenter.
- Some information is in the UI, but is not easy to query. Take, for example, the vCenter Tasks and Events. They are not easy to analyse across time. If you have multiple vCenter Servers across the globe, that’s even harder. Can you tell your auditor “who did what to what object, and when”? That’s not possible without powerful analysis. Given a VM, can you prove which ESXi hosts it has ever run on since it was provisioned?
- Some information is simply not in the UI. It is only available in the logs. Take, for example, the information on vMotion. Do you know how long the pre-copy phase takes just before the stun time?
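To make the auditor question above concrete, here is a minimal Python sketch of the kind of analysis a log platform performs for you. Everything in it is an assumption for illustration: the event records, the field names, and the user, VM and host names are made up, and this is neither the vCenter API nor Log Insight's query language.

```python
from datetime import datetime

# Hypothetical, simplified event records collected from several vCenter Servers.
# Real task and event entries carry many more fields.
events = [
    {"time": "2015-03-01T09:00:00", "vcenter": "vc-sg", "user": "alice",
     "action": "PowerOn", "vm": "web01", "host": "esx-01"},
    {"time": "2015-03-05T08:15:00", "vcenter": "vc-sg", "user": "alice",
     "action": "Migrate", "vm": "web01", "host": "esx-01"},
    {"time": "2015-03-02T10:30:00", "vcenter": "vc-sg", "user": "bob",
     "action": "Migrate", "vm": "web01", "host": "esx-02"},
    {"time": "2015-03-02T11:00:00", "vcenter": "vc-uk", "user": "carol",
     "action": "PowerOff", "vm": "db01", "host": "esx-11"},
]

def audit_trail(events, vm):
    """Return (time, user, action, host) tuples for one VM, oldest first."""
    rows = [e for e in events if e["vm"] == vm]
    rows.sort(key=lambda e: datetime.fromisoformat(e["time"]))
    return [(e["time"], e["user"], e["action"], e["host"]) for e in rows]

def hosts_ever_used(events, vm):
    """Return the distinct hosts a VM has been seen on, in first-seen order."""
    seen = []
    for _, _, _, host in audit_trail(events, vm):
        if host not in seen:
            seen.append(host)
    return seen

# "Who did what to web01, and when?"
for row in audit_trail(events, "web01"):
    print(row)

# "Which hosts has web01 ever run on?"
print(hosts_ever_used(events, "web01"))
```

The point is not the code itself but that the answer requires every event, from every vCenter Server, to sit in one place and be queryable across time.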
Helps in mastering the products you are in charge of
- You know VMware products well. You can design and you can troubleshoot. How well do you know the log? If there is a problem, do you know what log entries to search for? Do you know what those codes in the log mean? VMware products generate log entries, and they provide valuable information. Learning the log helps you deepen your skills.
- During the WebEx session with the vendor support engineer, you can see how they troubleshoot, and learn from the joint troubleshooting session.
Protect against future incidents
- The first time a problem happens, your manager will forgive you for taking time to figure out the root cause. She will, however, expect you to put countermeasures in place as a result of the incident. You implement an alert, so when the same problem happens again in the future, you will know about it within minutes, and before the customers do.
- Once you know the root cause, Log Insight enables you to set up an alert that will be triggered when it receives the same log entry.
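The idea of alerting on a known log signature can be sketched in a few lines of Python. This is only an illustration of the concept, not how Log Insight implements or configures alerts; the rule name, the pattern and the sample log lines are all invented and are not real ESXi messages.

```python
import re

class AlertRule:
    """A named regular expression that marks a log line as alert-worthy."""

    def __init__(self, name, pattern):
        self.name = name
        self.pattern = re.compile(pattern)

    def matches(self, line):
        return self.pattern.search(line) is not None

def check(lines, rules):
    """Return (rule name, line) pairs for every line that triggers a rule."""
    hits = []
    for line in lines:
        for rule in rules:
            if rule.matches(line):
                hits.append((rule.name, line))
    return hits

# Hypothetical signature captured after a root cause analysis.
rules = [AlertRule("datastore-latency", r"latency .* exceeded")]

# Hypothetical incoming log lines.
sample = [
    "2015-03-01T09:00:00 esx-01 storage: latency on datastore1 exceeded threshold",
    "2015-03-01T09:00:05 esx-01 vmkernel: heartbeat ok",
]

for name, line in check(sample, rules):
    print(f"ALERT [{name}]: {line}")
```

In practice you would save the query that found the root cause and attach a notification to it, so the rule lives in the log platform rather than in a script.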
We have covered some of the benefits. I hope you are convinced that you need an enterprise-wide log platform.
Let’s cover the capabilities that make up a great log analysis platform for the SDDC:
Deep understanding of VMware
- Not just ESXi and vCenter. It has to know all the VMware products that you run. Some of you run many VMware products, and you want a tool that understands all of them. Log Insight comes with a wide variety of Content Packs.
- This is important if you are running a global operation. You want a single pane of glass. ESXi and Windows can generate a lot of logs. I have seen how a single ESXi saturated a WAN link. Due to a bug in LDAP configuration, it generated excessive logs.
- Syslog as a protocol is not compressed. Log Insight’s proprietary protocol provides compression. Steven Flanders has shown a 30x compression in his blog.
- There are situations where you only want to forward selected entries. For example, you have limited WAN bandwidth at your remote sites. You have narrowed your queries down to certain errors, warnings and events, and those are all you need to track. Log Insight allows you to create filters, and forward the filtered entries to multiple destinations.
- This is important as you eventually want to capture all logs, not just VMware. You want a cross analysis of the event logs from Windows, RedHat, physical network and storage devices. The end result is a lot of log entries, especially when you have thousands of VMs and hundreds of ESXi hosts. Scalability matters.
- Scalability refers to the speed of log ingestion, the amount of data it can store, and the speed of query. Querying the last 1 hour may be fast. Try querying the last 3 months 🙂
- Log Insight scales horizontally, and it comes with built-in load balancing.
- What if the remote syslog server is not available? This could be due to an unreliable WAN link, or the remote syslog server undergoing maintenance. In the case of plain syslog, your source (e.g. ESXi, SRM, vCenter) will simply drop the log. Yes, you lose the logs, as they are never delivered to your central syslog server. Log Insight prevents this by caching the entries. It keeps them, and resends them when the destination is reachable again.
- That’s the source. What about the destination? If you have a central log platform, you want it to be available all the time. But what if you are doing maintenance that requires a reboot? Log Insight can ensure that ingestion is not affected by running as a cluster. You can cluster your central Log Insight instance for higher availability.
- By default, syslog entries are not sent over a secure channel. Log Insight enables you to send them over a secured channel.
- The enterprise-wide log management platform will have a diverse set of users. You want to be able to control who can see what data.
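Two of the capabilities in the list above, forwarding only selected entries and caching entries while the destination is unreachable, can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Log Insight's implementation: the `BufferingForwarder` and `FakeDestination` classes, the filter and the sample log lines are all hypothetical.

```python
from collections import deque

class BufferingForwarder:
    """Forward selected log lines; hold them while the destination is down."""

    def __init__(self, send, keep=lambda line: True, max_buffered=10000):
        self.send = send    # callable that raises ConnectionError on failure
        self.keep = keep    # filter: return True for lines worth forwarding
        # Bounded buffer: once full, the oldest entries are discarded first.
        self.buffer = deque(maxlen=max_buffered)

    def submit(self, line):
        if not self.keep(line):
            return          # filtered out, never crosses the WAN
        self.buffer.append(line)
        self.flush()

    def flush(self):
        """Send buffered lines in order; stop at the first failure."""
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return      # destination unreachable; keep entries for later
            self.buffer.popleft()

class FakeDestination:
    """Stand-in for a central log server that can be switched off."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, line):
        if not self.up:
            raise ConnectionError("destination unreachable")
        self.received.append(line)

dest = FakeDestination()
fwd = BufferingForwarder(
    dest.send, keep=lambda line: "error" in line or "warning" in line
)

dest.up = False                        # WAN link goes down
fwd.submit("warning: disk nearly full")
fwd.submit("info: heartbeat")          # dropped by the filter
fwd.submit("error: path down")

dest.up = True                         # link is back
fwd.flush()
print(dest.received)
```

The buffer is bounded on purpose: a source that caches forever during a long outage would eventually exhaust its own disk or memory, so a real design has to decide what to discard when the cache fills up.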
Long-term archival
- Most log entries lose their value after a few months. Audit logs, however, should be kept for years. For some customers, this is 7 years. You need the ability to filter these security logs and send them to a separate system so they remain available. Log Insight achieves this via its Event Forwarding feature. You create a dedicated Log Insight VM for this purpose, and have it receive all the audit logs, and only the audit logs.
- If you have a log management platform that covers everything, it becomes a critical component of your infrastructure. You need to have DR for it. Log Insight, via its Event Forwarding, can be architected as Active/Active instances.
I hope the above is useful. For additional info on why Log Insight is a great fit, review these 12 reasons from Steven Flanders.
I recommend that you join the Log Insight community and provide feedback on the next version.
Have fun in the weird and wonderful world of logs! 🙂