Open Source Observability


Open at Intel host Katherine Druckman spoke with Dotan Horovits, a CNCF ambassador and AWS senior developer advocate, about the importance of observability in CI/CD pipelines and his work with OpenTelemetry, as well as the larger cloud native open source communities. They dug deep into the whys, hows, and importance of observability, especially as it impacts developers. Enjoy this transcript of their conversation. 
 

“Because of the growing complexity of the architectures today, on top of the containerized applications and microservices, you now have serverless and WASM. And each one of these requires a different approach to observability.”

— Dotan Horovits, CNCF Ambassador and AWS Senior Developer Advocate 

 

Katherine Druckman: Hey, Dotan, thank you for joining me. I'm talking to Dotan Horovits, who is, among other things, a CNCF ambassador involved in many open source communities. You've probably seen him speak at an event if you go to open source conferences, but I'm not going to say too much more. First, thank you. Thank you for joining me, but would you introduce yourself a little bit? 

Dotan Horovits: Yeah, thanks so much for having me. A pleasure finally being here. I've been talking about this for so long. 

Katherine Druckman: I know. Yeah. We have. 

Dotan's Background and Experience

Dotan Horovits: Yeah. Yeah. I'm a CNCF ambassador, which CNCF is in the Cloud Native Computing Foundation. I assume all the listeners here, since they're open source, know that. And I've pretty much been in the cloud native and open source sphere for a long time. Also generally on the DevOps, the broader DevOps sphere. I've been working with vendors, projects, and communities in the space of DevOps tools and observability more specifically in the recent years. That's my passion. It's also my podcast. My podcast, Open Observability talks. Your listeners are podcast fans, so maybe they'll fancy... 

Katherine Druckman: Yes, there's another one. 

Dotan Horovits: That one as well. Yeah, it's already in its fifth year. Pretty cool. And it's a niche thing, but open source, DevOps, observability, platform engineering, and the like are where my passion lies. 

Katherine Druckman: That's the great thing about being involved in open source, or maybe it's a blessing and a curse, I don't know. A lot of us have experienced this in the last couple of years: even when you're between gigs, you still have a lot to do, right? You're still very plugged in and socializing with peers and whatnot. It's a nice thing, I think, in a way. 

Dotan Horovits: Yeah, yeah. Because I have my own ecosystem, my own peers, and my own communities that carry on with me regardless of who my employer is. When I go to a conference to speak, or when I attend a meeting or a track, I don't do that on behalf of an employer. Even when I'm moving on to the next gig or in between gigs, it's still there, and people still reach out to me. It's a whole realm, a whole set of assets that goes with you. And when you join a new employer, you bring it with you. I think it's an amazing sense of value. It obviously also helps in between gigs, because there's a lack of certainty at those points in time, but this helps you remember all the value that you carry with you. And the backing of the community has been truly inspiring. 

Current Projects and Passions

Katherine Druckman: Oh, that's awesome. Tell me, what are you most excited about right now? Given that today, right, nobody's telling you what to prioritize. You get to pick whatever you want to prioritize. What is that? 

Dotan Horovits: My latest passion within the CNCF realm, and particularly in the OpenTelemetry sphere, is CI/CD observability. We all love CI/CD, but apparently when it comes to observability, people tend to think about what happens after deployment: the monitoring and understanding of the system in production. We tend to think less about the actual release pipelines and what goes on there, although they're very much critical for us. The lead time for change DORA metric is essentially how fast we can go from the commit all the way down to getting it into production. In many cases, you suffer flaky tests, slowness, failed pipeline runs, and so on. This is something I've been experiencing for a long time, and a pain that I've seen echoing within the community for a long time. 

One of the problems I've seen is the lack of standardization. You have point solutions for Jenkins, or Argo, or whatnot, but because it's a point solution for a specific tool, it's not comprehensive...definitely not as much as what we have in production. With OpenTelemetry being my passion project in other realms, for me it was the natural thing to try to extend OpenTelemetry to address this. A few years ago, I raised it as an OTEP, an OpenTelemetry Enhancement Proposal, and that has evolved. Currently we actually have a SIG, a special interest group, within OpenTelemetry designated for CI/CD observability. OTel's semantic conventions already include aspects relating to CI/CD. That's actually hot news; it's part of the latest release of OpenTelemetry's semantic conventions. I'm now working on a CNCF blog post to lay out this whole journey, the milestones, and what's coming next. Stay tuned for the CNCF blog that will hopefully come out before KubeCon. Yeah. Really excited about that, and really exciting to see more people joining and seeing the value. If any listeners are interested, feel free to reach out to me; I'm happy to have more people involved. 

CI/CD Observability and OpenTelemetry

Katherine Druckman: You said a word we all love to throw out there, value. How do you communicate the value of the work that you're doing, like the projects that you're working on, right? The value of this type of OpenTelemetry and observability into the CI/CD pipeline. 

Dotan Horovits: I think the top-level metric, as I said, is the lead time for change. That's become pretty predominant in the DevOps space, or the software development life cycle in general: engineering organizations measure their effectiveness based on these four DORA metrics, sometimes extended to five or more. One of the basic four is, as I said, lead time for change, which is the time from commit all the way to production. When organizations feel that this is slow on the one hand, and on the other hand find themselves flying in the dark without any visibility into why that happens or where it fails within the pipeline, these are the classic symptoms of lacking observability. We've been educated for a good few years to realize that in production, but people don't make the same connection so intuitively for the release pipeline. 
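The lead time for change metric Horovits describes can be sketched as a simple computation over commit and deploy timestamps. This is a minimal illustration, not part of any DORA tooling; the function name and data shape are assumptions for the example.

```python
from datetime import datetime, timedelta
from statistics import median

def lead_time_for_change(changes):
    """Median time from commit to production deploy, in the spirit of the
    DORA lead-time-for-change metric.

    `changes` is a list of (commit_time, deploy_time) datetime pairs.
    """
    deltas = [deploy - commit for commit, deploy in changes]
    return median(deltas)

# Three hypothetical changes: 4 hours, 24 hours, and 12 hours commit-to-deploy.
changes = [
    (datetime(2024, 9, 1, 9, 0), datetime(2024, 9, 1, 13, 0)),
    (datetime(2024, 9, 2, 10, 0), datetime(2024, 9, 3, 10, 0)),
    (datetime(2024, 9, 4, 8, 0), datetime(2024, 9, 4, 20, 0)),
]
print(lead_time_for_change(changes))  # -> 12:00:00 (the median)
```

The point of the metric is exactly what Horovits says next: if this number is slow and you have no per-stage visibility into the pipeline, you are lacking observability there.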

First of all, raising awareness. I've been raising awareness for a good few years so that people connect observability also to the pre-deployment stages of the release pipelines. Secondly, the best practices around that, even the same tooling: you already have tools in place, whether open source or proprietary, that monitor your production environment. Just use the same suite of tools and the same practices on your release pipelines. That's the best practice. And then there are semantics that are specific to release pipelines: testing, the attributes that you need to deliver, environment-related things, and so on. This is where we need to standardize and create this language. Essentially, you could monitor today as well; the only problem is that there's no standard way of reporting that from your pipeline to the monitoring side. Once we have agreed-upon semantic conventions, a language if you like, then we'll be able to do that uniformly, and you can choose the best tools for the job. That's the idea. And, as I said, the first verbs and nouns of this language, if you wish, are now officially part of the OpenTelemetry semantic conventions. 
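The "shared language" idea can be sketched as attaching agreed-upon attributes to a pipeline-run span. The attribute names below are modeled on OpenTelemetry's published `cicd.*` semantic conventions, but check the current spec for the exact set; the `PipelineSpan` class is a stand-in for a real tracer's span, used here only to keep the sketch self-contained.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineSpan:
    """Stand-in for an OpenTelemetry span representing one pipeline run."""
    name: str
    attributes: dict = field(default_factory=dict)

def report_pipeline_run(pipeline_name, run_id, result):
    # Any CI system (Jenkins, Argo, ...) reporting with the same attribute
    # names lets any monitoring backend consume the data uniformly.
    span = PipelineSpan(name=f"{pipeline_name} run")
    span.attributes.update({
        "cicd.pipeline.name": pipeline_name,  # e.g. "build-and-test"
        "cicd.pipeline.run.id": run_id,       # the CI system's run identifier
        "cicd.pipeline.result": result,       # e.g. "success" or "failure"
    })
    return span

span = report_pipeline_run("build-and-test", "1234", "failure")
print(span.attributes["cicd.pipeline.result"])  # -> failure
```

The value is in the convention, not the code: once every tool emits the same nouns, the backend is interchangeable, which is exactly the standardization gap Horovits describes.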

Developer Experience and Productivity

Katherine Druckman: Here's something that I'm curious about. We love to talk about business value and speeding up processes and eliminating hurdles on one end of the conversation, but all of these things can also tie together to improve developer experience. I wondered if you could talk a little bit about that. And here's where I'm going with this in my thinking. In a previous life, back when I used to do this kind of work, I was once tasked with improving a pipeline, like we're talking about, specifically for developer experience, right? That was my specific mandate. But obviously it's valuable because when developers get stuck in slow feedback loops, it's frustrating and it's inefficient, and there are lots of reasons why we want to fix that. I'm just wondering, again, I think we tend to measure value in a certain way, but I'd like to flip it around and talk about it, the actual people doing the work. What are your thoughts on that? 

Dotan Horovits: Well, definitely, a major incentive for that is developer experience and developer productivity. This is why, by the way, I find very natural partnering with platform engineering teams within larger organizations: their goal and focus is on how to improve developer experience and developer productivity, and to make sure that the product developers, the application developers, spend less time on these infrastructure-related things and more on the differentiated code, the actual business logic that they're tasked to do. 

Katherine Druckman: And what they want to do, right? The more fun work. 

Dotan Horovits: Yeah, exactly. Fun. It obviously helps reduce churn, because good engineers don't want to mess around with these things; if you have a more attractive offering, you get retention, satisfaction, and productivity. As I said, there are very impressive stats on these practices in general, and luckily, with the rise of platform engineering, you get a bit more data on the impact: you really see the number of pull requests, the amount of time it takes to ship code, and so on, all getting boosted. Definitely no one likes struggling with flaky tests and figuring out what the heck is not working. This is a significant thing. Yeah, you're right, developer productivity and developer experience are the most basic thing. And it still manifests at the top level: when VPs of engineering, directors, and whatnot look at their teams' productivity, they also see the business impact of the team, but it makes everyone's lives much, much better. 

Katherine Druckman: With the type of work you do, the involvement you have in all of these various communities, how do you balance your time and decide where to contribute what? You have a lot going on, right? We do similar work: you're out there doing podcasts, giving talks, and being involved in communities. How do you find the balance between still getting your hands dirty in the technical work and figuring out what to write in your next CFP? 

Dotan Horovits: Well, that's a million-dollar question, I'd say. 

Katherine Druckman: Yeah, can you please fix this for me? Can you figure it out for me and then tell me? And then I'll... 

Dotan Horovits: And I'll tell you even more than the fact that it's very tricky: it's also not the same answer. It really changes over time, with the periods and the focus of the company that you work at, or between jobs. For example, when I started the SIG, it took a lot of my time and focused effort, and obviously it came at the expense of other places where I needed to take a step back. Sometimes it's project related, a certain period of time when you delve into something. Sometimes it's a longer-term, ongoing thing, and then you really need to carve out and time-box how much time a week or a month you're going to dedicate to it. Some things are recurrent, like the weekly or biweekly calls for a specific working group. 

You can really plan around these and see which working groups, TAGs, or SIGs I'm going to attend. The async work is trickier because it keeps on coming regardless of time zone, regardless of weekends, and then you need to carve out the time that you dedicate to going over, I don't know, a certain GitHub issue or whatnot. Then there's the speaking agenda. First of all, there's the seasonality of the events: there are months in the year when I travel more and there are more events, and then there are times, like the summer period, that are less eventful in that respect. Of course, you also need to balance it with the top-level goals of your employer, and how those derive into the developer relations, the DevRel strategy. Is the focus now more on the community side? On the advocacy side? On open source? On the commercial offering, or solution, or integration? AI, now everyone wants to talk about AI, and people are being asked to shift more toward it. It's really something that changes and evolves over time depending on all these factors. 

The Impact of AI on Observability

Katherine Druckman: Okay. Well you brought it up. Everyone does want to talk about AI, don't they? This is something I think we're all talking about right now, right? From a lot of angles. And I'm wondering the work you do specifically around telemetry and deployment systems, I wonder what your thinking is... How is this conversation going to change over the next year even? We're talking about the ways that AI can help us solve our problems and make our work more efficient and all of that, but we're also talking about how to build and deploy AI applications and solutions, right? All these things are happening concurrently, and I just wonder how you think things are going to shake out. If we have this conversation one year from now, what's going to be different, and how is AI going to play a role? 

Dotan Horovits: I think in the realm of observability, there's been a lot of weight put on the visualization layer: having these pretty graphs and intelligent alerting, like setting the right thresholds and avoiding alert fatigue on the one hand, while not missing out on the important things on the other, and how to strike that balance. But it was very human-centric, in the sense that someone needed to look at graphs to understand, hey, this is going up and this is going down at the same time, so it's probably related, and the correlation and all that. I think with AI, when implemented correctly, we'll be able to shift some of that away from the visualization-heavy side into something more like an assistant. That's why I don't look at it as a fully automated cycle at this point, definitely not, as you said, a year from now, but a clever assistant can already flag, “hey, did you check out this and that?” 

And grouping: you get, I don't know, 20 alerts, but they're all mapped to one root cause. Maybe it can point to, “hey, this cluster.” Instead of getting lost in a wall of alerts, you get one consolidated view saying this is a cluster that all boils down to probably A, B, or C. I think this type of conversation will be much easier.  
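The alert-grouping idea Horovits sketches can be illustrated with a toy example: alerts that share a suspected root-cause label collapse into one consolidated cluster. The alert shape and the `root_cause` grouping key are hypothetical; a real assistant would infer the grouping from topology and correlation rather than from a ready-made label.

```python
from collections import defaultdict

def group_alerts(alerts, key="root_cause"):
    """Collapse a flood of alerts into clusters by a shared suspected cause."""
    clusters = defaultdict(list)
    for alert in alerts:
        clusters[alert.get(key, "unknown")].append(alert["name"])
    return dict(clusters)

alerts = [
    {"name": "checkout latency high", "root_cause": "db-primary"},
    {"name": "payment errors spiking", "root_cause": "db-primary"},
    {"name": "disk almost full", "root_cause": "log-volume"},
]
# Three alerts collapse into two consolidated clusters.
print(group_alerts(alerts))
```

The operator then reasons about two candidate causes instead of triaging three (or twenty) separate pages, which is the conversation shift he describes.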

Also, in the aspect of generative AI, there's the ability to express things more naturally. We've been heavy on DSLs, domain-specific languages for querying: time-series languages for metrics and document-based ones for logs...Lucene, PromQL, LogQL, and many, many other languages. This is a barrier to entry for many engineers. Imagine the ability to express yourself in natural language, “hey, can you tell me all the services that return HTTP error codes?” Imagine asking that as if you were asking me, in a conversational manner, with an assistant that translates it behind the scenes into the official DSL queries, with the right labeling and enrichment with relevant metadata to make the query very accurate. 
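The natural-language-to-DSL translation could be sketched as a tiny template-based translator. A real assistant would use a language model plus metadata enrichment; the PromQL produced here is only a plausible example, and the metric and label names (`http_requests_total`, `service`, `code`) are assumptions about how the data is labeled.

```python
def question_to_promql(service=None, status_class="5.."):
    """Translate a toy question like 'which services return HTTP errors?'
    into a PromQL query string over an assumed http_requests_total metric."""
    selector = f'code=~"{status_class}"'
    if service:
        selector += f', service="{service}"'
    return f"sum by (service) (rate(http_requests_total{{{selector}}}[5m]))"

# "Which services return HTTP 5xx errors?"
print(question_to_promql())
# "Is checkout returning 5xx errors?"
print(question_to_promql(service="checkout"))
```

The hard part the assistant hides is exactly what the hand-written version exposes: knowing the metric name, label keys, regex syntax, and rate windows of the underlying DSL.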

This is also very much something that I look forward to seeing. In general, I've been saying for a good few years, before the AI hype, that, for me at least, observability is a data analytics problem. I've given talks on it; I think even last year, when you and I met at SCaLE 21x, that was my talk there. People need to move away from this low-level view of observability as logging, tracing, and metrics. This is all raw data, and what we're looking for is not raw data. We are looking for insights. 

Katherine Druckman: Mm-hmm. 

Dotan Horovits: Looking at observability as a data analytics problem: I have data that is coming from different sources in different formats, different signal types, whether it's the front end, the back end, or infrastructure-related, from Kafka, from Redis, from, I don't know, HEPD, or whatnot. And I need to figure out, across these signals, sources, and formats, what goes on in my system. That's the core. That's something I've been pushing for a long time, even moving away from the traditional definition of observability to change this mindset. And now, with the introduction of AI, I think it fits in really nicely, because AI will build on top of this data observability type of vision and add intelligence on top of that. 
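The "data analytics problem" framing can be illustrated with a toy normalization step: heterogeneous signals (a log line, a metric sample, a span) are mapped into one common record shape so they can be queried together. All field names here are hypothetical, chosen only for the sketch.

```python
def normalize(signal_type, raw):
    """Map a raw log/metric/span record into one common shape so a single
    query can reason across signal types. Field names are illustrative."""
    common = {
        "signal": signal_type,
        "service": raw.get("service", "unknown"),
        "timestamp": raw["ts"],
    }
    if signal_type == "log":
        common["body"] = raw["message"]
    elif signal_type == "metric":
        common["value"] = raw["value"]
    elif signal_type == "trace":
        common["duration_ms"] = raw["duration_ms"]
    return common

records = [
    normalize("log", {"service": "checkout", "ts": 1, "message": "timeout"}),
    normalize("metric", {"service": "checkout", "ts": 1, "value": 0.97}),
    normalize("trace", {"service": "checkout", "ts": 1, "duration_ms": 5400}),
]
# One query now spans all three signal types for the same service.
checkout_evidence = [r for r in records if r["service"] == "checkout"]
print(len(checkout_evidence))  # -> 3
```

Once the signals live in one queryable shape, the "insight" layer (human or AI) can correlate a log line, an error-rate sample, and a slow span without caring which pipeline each arrived through.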

Katherine Druckman: Yeah. It's funny being in our field; things can move very quickly. They don't always, but they can. Has the conversation changed since the presentation you just mentioned, about the data analytics problem, that you gave at SCaLE? Do you feel like the things you were talking about earlier in the year have been heard and acted upon in some way? 

Dotan Horovits: I get amazing responses when I give these talks. It really resonates with many in the audience, and with people who listen to the recording online afterwards and reach out to me. It sounds like many find this very, very appealing. And, as I said, I can't separate it from AI, because the AI trend definitely amplified it: suddenly there are tools and means and algorithms to actually make that happen. People know there's a conversation about AI, and they're thinking about how AI can help with observability. Suddenly, they come back to what I said a few years ago about observability being a data analytics problem, saying, okay, now we can apply AI to a data analytics problem, because it's just a data lake of all sorts of data of different sources, types, and formats, and we need to ask intelligent questions and get intelligent responses back. 

I think the conversation has now picked up. I see lots of startups, and initiatives within large vendors, going down this path. And there are hurdles, at least among the incumbents, who have been making their money out of the gigabytes of telemetry that people store. Obviously, this creates an incentive to carry on the conversation about raw data: yes, collect all the logs, collect all the traces, collect all the metrics, because then you'll have everything you need to ask questions when something breaks. That's held us back from elevating the conversation, but I see that even they are now moving, or being pushed by their customers to move, further up the stack into the insights, and to decouple the storage and raw data from the insights. 

Future of Observability and Industry Trends

Katherine Druckman: It's funny. Again, we met at SCaLE when you were giving your talk, and right after that was KubeCon in Paris. Backing up a little bit, I feel like I walk away from most conferences with a theme, right? There's a central topic everyone is buzzing about, some overarching theme, even with multiple tracks and whatnot. In Chicago in the fall, it was AI: everything was AI, sprinkle some AI on it. But I felt like the theme of this spring's KubeCon EU was very much observability. Most of my conversations came back on some level to observability, and that felt like the theme of the whole event to me. I wonder if that's going to continue in North America at the event coming up in the fall. Are the conversations in the industry that focused on observability right now? What do you expect to happen in the next year? Are we going to continue to see that theme? 

Dotan Horovits: Well, I live and breathe observability, so I'm biased. 

Katherine Druckman: Yeah, you're a little bit biased, aren't you? 

Dotan Horovits: A lot of my surroundings and ecosystem come from observability, are passionate about observability, and approach me from the angle of observability. I'm a true believer. I think these problems are nothing fluffy. This is something hardcore that impacts, as we talked about before, the top-line business KPIs as well as developer experience and developer productivity. I don't think it's going anywhere. And you see that. It's a very solid thing, maybe even growing in need because of the growing complexity of the architectures today. On top of the containerized applications and microservices, you now have serverless and WASM, and each one of these requires a different approach to observability. A note about KubeCon, for those who don't know: KubeCon has co-located events about different topics. You have ArgoCon, and you have this and that. And most of these co-located events are half-day events. 

Katherine Druckman: Mm-hmm. 

Dotan Horovits: At least in past years, observability has been one of the few that actually had a full day, and the agenda and the demand were such that it could have easily turned into more than one day, in terms of both the demand and the offering of topics and CFPs. It's very interesting to see this urge by people to attend, to contribute content, to exchange knowledge. It's called Observability Day, for those who are looking for it at next month's KubeCon: a very fascinating and highly recommended co-located event the day before the main KubeCon starts. Check it out. I think the evidence is that it's going to stay. Whether something more hyped will take over as the main theme is a good question. But observability is here to stay, because the problem is here to stay, and it's far from being resolved. That's, I think, the main thing we should all remember. 

Katherine Druckman: Yeah, I agree. Well, I mean, there are a lot of things that are here and always present. Security, right? Security applies everywhere, but we go through cycles where one subject matter gets to be the dominant theme of anything in spite of everything else that exists. But yeah, we'll see. I'm anxious for security to come back to the top, but we'll see. It'll be AI security and observability. That's the next one. 

Dotan Horovits: Yeah, definitely. 

Observability News and Announcements

Katherine Druckman: I don't want to take up the rest of your day, though I think I could. Is there anything else you really were excited to talk about that we haven't gotten to? 

Dotan Horovits: No, I’m happy to catch up with everyone who's passionate about these topics and open source. If anyone wants to be a part of either the SIG about CI/CD observability, or more broadly about OpenTelemetry, and doesn't know exactly how to start in the broader ecosystem of the CNCF and the cloud native ecosystem, I’m happy to help as a CNCF ambassador and as someone who's passionate about these topics.  

I’m looking forward to KubeCon and the news that’s going to come out of that show. By the way, there's lots of interesting news in the observability space, so stay tuned: Jaeger V2, Prometheus V3. Actually, Jaeger V2 was the topic of my livestream just yesterday on Open Observability Talks with Yuri Shkuro, the creator of Jaeger back at Uber. Very, very exciting to see. It's being re-architected very thoroughly based on OpenTelemetry. You also see the synergies between the different projects; that's amazing to see. 

And for Prometheus, back at PromCon last month, the big announcement was Prometheus 3, the next major release after seven years. Finally. And again, the theme is OpenTelemetry being a central part, with Prometheus even aiming to be the de facto backend for OpenTelemetry metrics. These are major statements that show where the community is heading, where open standardization and open specification are kicking in and really converging us as a community, as an industry. Lots of exciting stuff.  

Hopefully, we’ll also see OpenTelemetry itself graduating in the CNCF; there's an ongoing evaluation now. OpenTelemetry is an incubating project, for those who don't know, and the next stage in the CNCF maturity lifecycle is graduation, like Kubernetes, Prometheus, and Jaeger. It's now under evaluation, and it has been the second most active project in the CNCF for the last few years. I don't see a way that it won't be acknowledged. Hopefully, we’ll have a bit more exciting news on the open observability side very soon. 

Katherine Druckman: Well, thanks. I think cross-pollination in our larger ecosystem is so important and, frankly, there can never be enough of it, right? We need a lot more of it; projects need to talk to each other, and hey, that's what we're doing right here. Thank you. Thank you for talking to me and thank you for talking to everyone listening. I appreciate it. 

Dotan Horovits: Thank you so much, Katherine, for having me. 

Katherine Druckman: You've been listening to Open at Intel. Be sure to check out more about Intel’s work in the open source community at Open.Intel, on X, or on LinkedIn. We hope you join us again next time to geek out about open source.  

About the Guest

Dotan Horovits, CNCF Ambassador, AWS Senior Developer Advocate 

Dotan Horovits lives at the intersection of technology, product, and innovation. With over 20 years in the tech industry as a software developer, a solutions architect, and a product manager, he brings a wealth of knowledge in cloud and cloud native solutions, DevOps practices, and more. Horovits is an international speaker and thought leader, as well as an ambassador of the Cloud Native Computing Foundation (CNCF). He runs the successful OpenObservability Talks podcast, where he evangelizes on observability in IT systems using popular open source projects such as Prometheus, OpenSearch, Jaeger, and OpenTelemetry.

About the Host

Katherine Druckman, Open Source Security Evangelist, Intel  

Katherine Druckman, an Intel open source security evangelist, hosts the podcasts Open at Intel, Reality 2.0, and FLOSS Weekly. A security and privacy advocate, software engineer, and former digital director of Linux Journal, she's a long-time champion of open source and open standards. She is a software engineer and content creator with over a decade of experience in engineering, content strategy, product management, user experience, and technology evangelism. Find her on LinkedIn.