POPULARITY
In our latest episode of Data Bytes, we were honored to host Shinji Kim, a multifaceted leader in the technology industry and the Founder & CEO of Select Star. With a wealth of experience that spans multiple roles, from startup advising to product management, and academic backgrounds from renowned institutions like the University of Waterloo and Stanford University, Shinji is a wellspring of knowledge on data discovery and management. --- Support this podcast: https://podcasters.spotify.com/pod/show/women-in-data/support
Go to TeamSimmer and use the coupon code "DEVIATE" for 10% off on individual course purchases.>> Articles and content mentioned in the episode
Johan van de Werken thrives best at the sweet spot between data, business & technology. Graduating with a philosophy degree from the University of Utrect, my guest started his career as a journalist for several Dutch publications, writing about everything from events and pop culture to media, politics and economics. Around 2014 he switched from letters to numbers, working in CRO for several european ecommerce businesses. That led him to building dashboards and leveraging cloud platforms to turn raw data into usable marketing insights. Working at an analytics firm that exposed him to BigQuery, he thought about sharing what he was learning. Seeing that the domain GA4BigQuery.com was available, he registered it and started posting there as a side gig. It got noticed by Simo Ahava, the founder of Simmer. That led Johan the release of the GA4 and BigQuery course on their training platform. As we fast forward to 2023, GA4BigQuery is now a well-known resource for marketers. And its creator is now consulting full-time on data analytics under his own brand, Select Star. Except for when he's having fun playing in a punk rock cover band. See Shownotes for all people and links mentioned in the episode.
Mark is joined by Shinji Kim, Founder and CEO of Select Star to talk about their mission to re-invent data catalogs, data discovery and data lineage for the modern data stackData Discovery vs. Data Observability: Understanding the Differences for Better DataOpsSelect Star and dbt Labs Partner for Better Data Discovery on dbtSelect Star : Free Trial
Are you old enough to remember when "Data Entry" represented a whole category of white collar jobs? We've come a long way, baby! These days, almost anything can be automated, from the marketing that drives sales to the systems that deliver products and services. Learn more on this episode of DM Radio as Host @eric_kavanagh interviews several guests, including: Analyst Mark Palmer, plus Shinji Kim of Select Star, Dheeraj Pandey of DevRev, and Nick Durkin of Harness.io.
Highlights from this week's conversation include:Shinji's background and career journey (3:35)Defining “data discovery” (6:03)The best conditions to use Select Star (8:45)Where Select Star fits on the data spectrum (13:38)Why Select Star is needed (17:35)How Select Star uses metadata (21:02)Exposing data queries (27:04)Composing queries into metadata (33:27)Automating BI tools (37:28)Limits to data governance (41:39)Maintaining economies of scale (48:56)The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
In this bonus episode, Eric and Kostas preview their upcoming conversation with Shinji Kim of Select Star.
We have the pleasure of hosting Shinji Kim: Founder and CEO of Select Star. One of Select Star's specializations is Data Discovery: giving the end user the knowledge to bridge the gap between periodic clean-up of dashboards and documentation. Make sure to follow Shinji on Linkedin for more.What's New In Data is a data thought leadership series hosted by John Kutay who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss latest trends, common patterns for real world data patterns, and analytics success stories.
About ShinjiShinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you to understand & manage your data. Previously, she was the Founder & CEO of Concord Systems, a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the strategy and execution of Akamai IoT Edge Connect, an IoT data platform for real-time communication and data processing of connected devices. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB.Links Referenced: Select Star: https://www.selectstar.com/ LinkedIn: https://www.linkedin.com/company/selectstarhq/ Twitter: https://twitter.com/selectstarhq TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.Corey: I come bearing ill tidings. Developers are responsible for more than ever these days. Not just the code that they write, but also the containers and the cloud infrastructure that their apps run on. Because serverless means it's still somebody's problem. And a big part of that responsibility is app security from code to cloud. And that's where our friend Snyk comes in. Snyk is a frictionless security platform that meets developers where they are - Finding and fixing vulnerabilities right from the CLI, IDEs, Repos, and Pipelines. Snyk integrates seamlessly with AWS offerings like code pipeline, EKS, ECR, and more! As well as things you're actually likely to be using. Deploy on AWS, secure with Snyk. Learn more at Snyk.co/scream That's S-N-Y-K.co/screamCorey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I encounter a company that resonates with something that I've been doing on some level. In this particular case, that is what's happened here, but the story is slightly different. My guest today is Shinji Kim, who's the CEO and founder at Select Star.And the joke that I was making a few months ago was that Select Stars should have been the name of the Oracle ACE program instead. Shinji, thank you for joining me and suffering my ridiculous, basically amateurish and sophomore database-level jokes because I am bad at databases. Thanks for taking the time to chat with me.Shinji: Thanks for having me here, Corey. Good to meet you.Corey: So, Select Star despite being the only query pattern that I've ever effectively been able to execute from memory, what you do as a company is described as an automated data discovery platform. So, I'm going to start at the beginning with that baseline definition. I think most folks can wrap their heads around what the idea of automated means, but the rest of the words feel like it might mean different things to different people. What is data discovery from your point of view?Shinji: Sure. The way that we define data discovery is finding and understanding data. In other words, think about how discoverable your data is in your company today. How easy is it for you to find datasets, fields, KPIs of your organization data? And when you are looking at a table, column, dashboard, report, how easy is it for you to understand that data underneath? Encompassing on that is how we define data discovery.Corey: When you talk about data lurking around the company in various places, that can mean a lot of different things to different folks. For the more structured data folks—which I tend to think of as the organized folks who are nothing like me—that tends to mean things that live inside of, for example, traditional relational databases or things that closely resemble that. I come from a grumpy old sysadmin perspective, so I'm thinking, oh, yeah, we have a Jira server in the closet and that thing's logging to its own disk, so that's going to be some information somewhere. Confluence is another source of data in an organization; it's usually where insight and a knowledge of what's going on goes to die. It's one of those write once, read never type of things.And when I start thinking about what data means, it feels like even that is something of a squishy term. From the perspective of where Select Start starts and stops, is it bounded to data that lives within relational databases? Does it go beyond that? Where does it start? Where does it stop?Shinji: So, we started the company with an intention of increasing the discoverability of data and hence providing automated data discovery capability to organizations. And the part where we see this as the most effective is where the data is currently being consumed today. So, this is, like, where the data consumption happens. So, this can be a data warehouse or data lake, but this is where your data analysts, data scientists are querying data, they are building dashboards, reports on top of, and this is where your main data mart lives.So, for us, that is primarily a cloud data warehouse today, usually has a relational data structure. On top of that, we also do a lot of deep integrations with BI tools. So, that includes tools like Tableau, Power BI, Looker, Mode. Wherever these queries from the business stakeholders, BI engineers, data analysts, data scientists run, this is a point of reference where we use to auto-generate documentation, data models, lineage, and usage information, to give it back to the data team and everyone else so that they can learn more about the dataset they're about to use.Corey: So, given that I am seeing an increased number of companies out there talking about data discovery, what is it the Select Star does that differentiates you folks from other folks using similar verbiage in how they describe what they do?Shinji: Yeah, great question. There are many players that popping up, and also, traditional data catalog's definitely starting to offer more features in this area. The main differentiator that we have in the market today, we call it fast time-to-value. Any customer that is starting with Select Star, they get to set up their instance within 24 hours, and they'll be able to get all the analytics and data models, including column-level lineage, popularity, ER diagrams, and how other people are—top users and how other people are utilizing that data, like, literally in few hours, max to, like, 24 hours. And I would say that is the main differentiator.And most of the customers I have pointed out that setup and getting started has been super easy, which is primarily backed by a lot of automation that we've created underneath the platform. On top of that, just making it super easy and simple to use. It becomes very clear to the users that it's not just for the technical data engineers and DBAs to use; this is also designed for business stakeholders, product managers, and ops folks to start using as they are learning more about how to use data.Corey: Mapping this a little bit toward the use cases that I'm the most familiar with, this big source of data that I tend to stumble over is customer AWS bills. And that's not exactly a big data problem, given that it can fit in memory if you have a sufficiently exciting computer, but using Tableau don't wind up slicing and dicing that because at some point, Excel falls down. From my perspective, problem with Excel is that it doesn't tend to work on huge datasets very well, and from the position of Salesforce, the problem with Excel is that it doesn't cost a giant pile of money every month. So, those two things combined, Tableau is the answer for what we do. But that's sort of the end-all for us of, that's where it stops.At that point, we have dashboards that we build and queries that we run that spit out the thing we're looking at, and then that goes back to inform our analysis. We don't inherently feed that back into anything else that would then inform the rest of what we do. Now, for our use case, that probably makes an awful lot of sense because we're here to help our customers with their billing challenges, not take advantage of their data to wind up informing some giant model and mispurposing that data for other things. But if we were generating that data ourselves as a part of our operation, I can absolutely see the value of tying that back into something else. You wind up almost forming a reinforcing cycle that improves the quality of data over time and lets you understand what's going on there. What are some of the outcomes that you find that customers get to by going down this particular path?Shinji: Yeah, so just to double-click on what you just talked about, the way that we see this is how we analyze the metadata and the activity logs—system logs, user logs—of how that data has been used. So, part of our auto-generated documentation for each table, each column, each dashboard, you're going to be able to see the full data lineage: where it came from, how it was transformed in the past, and where it's going to. You will also see what we call popularity score: how many unique users are utilizing this data inside the organization today, how often. And utilizing these two core models and analysis that we create, you can start looking at first mapping out the data flow, and then determining whether or not this dataset is something that you would want to continue keeping or running the data pipelines for. Because once you start mapping these usage models of tables versus dashboards, you may find that there are recurring jobs that creates all these materialized views and tables that are feeding dashboards that are not being looked at anymore.So, with this mechanism by looking initially data lineage as a concept, a lot of companies use data lineage in order to find dependencies: what is going to break if I make this change in the column or table, as well as just debugging any of issues that is currently happening in their pipeline. So, especially when you will have to debug a SQL query or pipeline that you didn't build yourself but you need to find out how to fix it, this is a really easy way to instantly find out, like, where the data is coming from. But on top of that, if you start adding this usage information, you can trace through where the main compute is happening, which largest route table is still being queried, instead of the more summarized tables that should be used, versus which are the tables and datasets that is continuing to get created, feeding the dashboards and is those dashboards actually being used on the business side. So, with that, we have customers that have saved thousands of dollars every month just by being able to deprecate dashboards and pipelines that they were afraid of deprecating in the past because they weren't sure if anyone's actually using this or not. But adopting Select Star was a great way to kind of do a full spring clean of their data warehouse as well as their BI tool. And this is an additional benefit to just having to declutter so many old, duplicated, and outdated dashboards and datasets in their data warehouse.Corey: That is, I guess, a recurring problem that I see in many different pockets of the industry as a whole. You see it in the user visibility space, you see it in the cost control space—I even made a joke about Confluence that alludes to it—this idea that you build a whole bunch of dashboards and use it to inform all kinds of charts and other systems, but then people are busy. It feels like there's no ‘and then.' Like, one of the most depressing things in the universe that you can see after having spent a fair bit of effort to build up those dashboards is the analytics for who internally has looked at any of those dashboards since the demo you gave showing it off to everyone else. It feels like in many cases, we put all these projects and amount of effort into building these things out that then don't get used.People don't want to be informed by data they want to shoot from their gut. Now, sometimes that's helpful when we're talking about observability tools that you use to trace down outages, and, “Well, our site's really stable. We don't have to look at that.” Very awesome, great, awesome use case. The business insight level of dashboard just feels like that's something you should really be checking a lot more than you are. How do you see that?Shinji: Yeah, for sure. I mean, this is why we also update these usage metrics and lineage every 24 hours for all of our customers automatically, so it's just up-to-date. And the part that more customers are asking for where we are heading to—earlier, I mentioned that our main focus has been on analyzing data consumption and understanding the consumption behavior to drive better usage of your data, or making data usage much easier. The part that we are starting to now see is more customers wanting to extend those feature capabilities to their staff of where the data is being generated. So, connecting the similar amount of analysis and metadata collection for production databases, Kafka Queues, and where the data is first being generated is one of our longer-term goals. And then, then you'll really have more of that, up to the source level, of whether the data should be even collected or whether it should even enter the data warehouse phase or not.Corey: One of the challenges I see across the board in the data space is that so many products tend to have a very specific point of the customer lifecycle, where bringing them in makes sense. Too early and it's, “Data? What do you mean data? All I have are these logs, and their purpose is basically to inflate my AWS bill because I'm bad at removing them.” And on the other side, it's, “Great. We pioneered some of these things and have built our own internal enormous system that does exactly what we need to do.” It's like, “Yes, Google, you're very smart. Good job.” And most people are somewhere between those two extremes. Where are customers on that lifecycle or timeline when using Select Star makes sense for them?Shinji: Yeah, I think that's a great question. Also the time, the best place where customers would use Select Star for is that after they have their cloud data warehouse set up. Either they have finished their migration, they're starting to utilize it with their BI tools, and they're starting to notice that it's not just, like, you know, ten to fifty tables that they're starting with; most of them have more than hundreds of tables. And they're feeling that this is starting to go out of control because we have all these data, but we are not a hundred percent sure what exactly is in our database. And this usually just happens more in larger companies, companies at thousand-plus employees, and they usually find a lot of value out of Select Star right away because, like, we will start pointing out many different things.But we also see a lot of, like, forward-thinking, fast-growing startups that are at the size of a few hundred employees, you know, they now have between five to ten-person data team, and they are really creating the right single source of truth of their data knowledge through a Select Star. So, I think you can start anywhere from when your data team size is, like, beyond five and you're continuing to grow because every time you're trying to onboard a data analyst, data scientist, you will have to go through, like, basically the same type of training of your data model, and it might actually look different because the data models and the new features, new apps that you're integrating this changes so quickly. So, I would say it's important to have that base early on and then continue to grow. But we do also see a lot of companies coming to us after having thousands of datasets or tens of thousands of datasets that it's really, like, very hard to operate and onboard anyone. And this is a place where we really shine to help their needs, as well.Corey: Sort of the, “I need a database,” to the, “Help, I have too many databases,” pipeline, where [laugh] at some point people start to—wanting to bring organization to the chaos. One thing I like about your model is that you don't seem to be making the play that every other vendor in the data space tends to, which is, “Oh, we want you to move your data onto our systems. The end.” You operate on data that is in place, which makes an awful lot of sense for the kinds of things that we're talking about. Customers are flat out not going to move their data warehouse over to your environment, just because the data gravity is ludicrous. Just the sheer amount of money it would take to egress that data from a cloud provider, for example, is monstrous.Shinji: Exactly. [laugh]. And security concerns. We don't want to be liable for any of the data—and this is, like, a very specific decision we've made very early on the company—to not access data, to not egress any of the real data, and to provide as much value as possible just utilizing the metadata and logs. And depending on the types of data warehouses, it also can be really efficient because the query history or the metadata systems tables are indexed separately. Usually, it's much lighter load on the compute side. And that definitely has, like, worked well for our advantage, especially being a SaaS tool.Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig secures your cloud from source to run. They believe, as do I, that DevOps and security are inextricably linked. If you wanna learn more about how they view this, check out their blog, it's definitely worth the read. To learn more about how they are absolutely getting it right from where I sit, visit Sysdig.com and tell them that I sent you. That's S Y S D I G.com. And my thanks to them for their continued support of this ridiculous nonsense.Corey: What I like is just how straightforward the integrations are. It's clear you're extraordinarily agnostic as far as where the data itself lives. You integrate with Google's BigQuery, with Amazon Redshift, with Snowflake, and then on the other side of the world with Looker, and Tableau, and other things as well. And one of the example use cases you give is find the upstream table in BigQuery that a Looker dashboard depends on. That's one of those areas where I see something like that, and, oh, I can absolutely see the value of that.I have two or three DynamoDB tables that drive my newsletter publication system that I built—because I have deep-seated emotional problems and I take it out and everyone else via code—but as a small, contained system that I can still fit in my head. Mostly. And I still forget which table is which in some cases. Down the road, especially at scale, “Okay, where is the actual data source that's informing this because it doesn't necessarily match what I'm expecting,” is one of those incredibly valuable bits of insight. It seems like that is something that often gets lost; the provenance of data doesn't seem to work.And ideally, you know, you're staffing a company with reasonably intelligent people who are going to look at the results of something and say, “That does not align with my expectations. I'm going to dig.” As opposed to the, “Oh, yeah, that seems plausible. I'll just go with whatever the computer says.” There's an ocean of nuance between those two, but it's nice to be able to establish the validity of the path that you've gone down in order to set some of these things up.Shinji: Yeah, and this is also super helpful if you're tasked to debug a dashboard or pipeline that you did not build yourself. Maybe the person has left the company, or maybe they're out-of-office, but this dashboard has been broken and you're quote-unquote, “On call,” for data. What are you going to do? You're going to—without a tool that can show you a full lineage, you will have to start digging through somebody else's SQL code and try to map out, like, where the data is coming from, if this is calculating correctly. Usually takes, you know, few hours to just get to the bottom of the issue. And this is one of the main use cases that our customers bring up every single time, as more of, like, this is now the go-to place every time there is any data questions or data issues.Corey: The first and golden rule of cloud economics is step one, turn that shit off.Shinji: [laugh].Corey: When people are using something, you can optimize the hell out of it however you want, but nothing's going to beat turning it off. One challenge is when we're looking at various accounts and we see a Redshift cluster, and it's, “Okay. That thing's costing a few million bucks a year and no one seems to know anything about it.” They keep pointing to other teams, and it turns into this giant, like, finger-pointing exercise where no one seems to have responsibility for it. And very often, our clients will choose not to turn that thing off because on the one hand, if you don't turn it off, you're going to spend a few million bucks a year that you otherwise would not have had to.On the other, if you delete the data warehouse, and it turns out, oh, yeah, that was actually kind of important, now we don't have a company anymore. It's a question of which is the side you want to be wrong on. And in some levels, leaving something as it is and doing something else is always a more defensible answer, just because the first time your cost-saving exercises take out production, you're generally not allowed to save money anymore. This feels like it helps get to that source of truth a heck of a lot more effectively than tracing individual calls and turning into basically data center archaeologists.Shinji: [laugh]. Yeah, for sure. I mean, this is why from the get go, we try to give you all your tables, all of your database, just ordered by popularity. So, you can also see overall, like, from all the tables, whether that's thousands or tens of thousands, you're seeing the most used, has the most number of dependencies on the top, and you can also filter it by all the database tables that hasn't been touched in the last 90 days. And just having this, like, high-level view gives a lot of ideas to the data platform team about how they can optimize usage of their data warehouse.Corey: From where I tend to sit, an awful lot of customers are still relatively early in their data journey. An awful lot of the marketing that I receive from various AWS mailing lists that I found myself on because I've had the temerity to open accounts has been along the lines of oh, data discovery is super important, but first, they presuppose that I've already bought into this idea that oh, every company must be a completely data-driven company. The end. Full stop.And yeah, we're a small bespoke services consultancy. I don't necessarily know that that's the right answer here. But then it takes it one step further and starts to define the idea of data discovery as, ah, you will use it to find a PII or otherwise sensitive or restricted data inside of your datasets so you know exactly where it lives. And sure, okay, that's valuable, but it also feels like a very narrow definition compared to how you view these things.Shinji: Yeah. Basically, the way that we see data discovery is it's starting to become more of an essential capability in order for you to monitor and understand how your data is actually being used internally. It basically gives you the insights around sure, like, what are the duplicated datasets, what are the datasets that have that descriptions or not, what are something that may contain sensitive data, so on and so forth, but that's still around the characteristics of the physical datasets. Whereas I think the part that's really important around data discovery that is not being talked about as much is how the data can actually be used better. So, have it as more of a forward-thinking mechanism and in order for you to actually encourage more people to utilize data or use the data correctly, instead of trying to contain this within just one team is really where I feel like data discovery can help.And in regards to this, the other big part around data discovery is really opening up and having that transparency just within the data team. So, just within the data team, they always feel like they do have that access to the SQL queries and you can just go to GitHub and just look at the database itself, but it's so easy to get lost in the sea of metadata that is just laid out as just the list; there isn't much context around the data itself. And that context and with along with the analytics of the metadata is what we're really trying to provide automatically. So eventually, like, this can be also seen as almost like a way to, like, monitor the datasets, like, how you're currently monitoring your applications through Datadog or your website with your Google Analytics, this is something that can be also used as more of a go-to source of truth around what your state of the data is, how that's defined, and how that's being mapped to different business processes, so that there isn't much confusion around data. Everything can be called the same, but underneath it actually can mean very different things. Does that make sense?Corey: No, it absolutely does. I think that this is part of the challenge in trying to articulate value that is, I guess, specific to this niche across an entire industry. The context that drives data is going to be incredibly important, and it feels like so much of the marketing in the space is aimed at one or two pre-imagined customer profiles. And that has the side effect of making customers for whom that model doesn't align, look and feel like either doing something wrong, or makes it look like the vendor who's pitching this is somewhat out of touch. I know that I work in a relatively bounded problem space, but I still learn new things about AWS billing on virtually every engagement that I go on, just because you always get to learn more about how customers view things and how they view not just their industry, but also the specificities of their own business and their own niche.I think that is one of the challenges historically, with the idea of letting software do everything. Do you find the problems that you're solving tend to be global in nature or are you discovering strange depths of nuance on a customer-by-customer basis at this point?Shinji: Overall, a lot of the problems that we solve and the customers that we work with is very industry agnostic. As long as you are having many different datasets that you need to manage, there are common problems that arises, regardless of the industry that you're in. We do observe some industry-specific issues because your data is either, it's an unstructured data, or your data is primarily events, or you know, depending on how the data looks like, but primarily because of most of the BI solutions and data warehouses are operating as a relational databases, this is a part where we really try to build a lot of best practices, and the common analytics that we can apply to every customer that's using Select Star.Corey: I really want to thank you for taking so much time to go through the ins and outs of what it is you're doing these days. If people want to learn more, where's the best place to find you?Shinji: Yeah, I mean, it's been fun [laugh] talking here. So, we are at selectstar.com. That's our website. You can sign up for a free trial. It's completely self-service, so you don't need to get on a demo but, like, we'll also help you onboard and happy to give a free demo to whoever that is interested.We are also on LinkedIn and Twitter under selectstarhq. Yeah, I mean, we're happy to help for any companies that have these issues around wanting to increase their discoverability of data, and want to help their data team and the rest of the company to be able to utilize data better.Corey: And we will, of course, put links to all of that in the [show notes 00:28:58]. Thank you so much for your time today. I really appreciate it.Shinji: Great. Thanks for having me, Corey.Corey: Shinji Kim, CEO and founder at Select Star. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that I won't be able to discover because there are far too many podcast platforms out there, and I have no means of discovering where you've said that thing unless you send it to me.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.
OverviewShinji Kim, Founder and CEO of Select Star, joins us to discuss what it means to govern and manage data in modern organizations, and how companies can set themselves up for success when getting started.This episode was recorded live at the 2022 DataConnect Conference. Special thank you to WIA Advisory Board Member Dave Cherry for assistance in preparation.About Shinji KimShinji is the Founder & CEO of Select Star, an automated data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems (concord.io), a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led building Akamai's new IoT data platform for real-time messaging, log processing, and edge computing. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB. She advises early stage startups on product strategy, customer development, and company building.Relevant Links- Select Star (Shinji's company)- Designing Data-Intensive Applications (Book by Martin Kleppmann)- Concord Systems, Inc. (Shinji's previous company that was acquired by Akamai)- Apache Kafka (Software)- Apache Samza (Software)- Shinji's Articles on MediumFollow Select Star- Select Star on Medium- Select Star on LinkedIn- Select Star on TwitterFollow Shinji- LinkedIn- TwitterFollow Lauren- LinkedIn- Twitter- WebsiteTranscriptClick Here
Should data governance be called "data governance" or is there a better name for it to ensure its adoption and support? What about the "data steward" role? How should we use communication and change management to help the organization better embrace it? What are those best practices to best improve data quality? How do we stop a data lake from turning into a data swamp? We address all these questions and more with Shinji Kim, the Founder and CEO of Select Star.
This bonus episode features conversations from season 2 of the Open||Source||Data podcast. In this episode, you'll hear from Barr Moses, Co-founder and CEO at Monte Carlo; Einat Orr, Co-founder and CEO at Treeverse; and Shinji Kim, Founder and CEO at Select Star.Sam sat down with each guest to discuss data observability. You can listen to the full episodes from Barr Moses, Einat Orr, and Shinji Kim by clicking the links below.-------------------Episode Timestamps:(00:35): Barr Moses(01:21): Einat Orr(02:07): Shinji Kim-------------------Links:Listen to Barr's episodeListen to Einat's episodeListen to Shinji's episode
Modern organizations eventually face data governance challenges. Keeping track of where data came from, what systems update it, in what ways updates can be made are just some of the issues to be tackled. Large organizations face additional challenges like training, onboarding, and capturing the institutional knowledge that leaves with the departure of key team The post Select Star with Shinji Kim appeared first on Software Engineering Daily.
Modern organizations eventually face data governance challenges. Keeping track of where data came from, what systems update it, in what ways updates can be made are just some of the issues to be tackled. Large organizations face additional challenges like training, onboarding, and capturing the institutional knowledge that leaves with the departure of key team The post Select Star with Shinji Kim appeared first on Software Engineering Daily.
Why do so many data governance programs fail? What are those common pitfalls that we can avoid and what are those approaches that work better? Shinji Kim is the Founder and CEO of Select Star and is joining us on this Lights On Data Show episode to share her thoughts and best practices on this topic.
Show Notes(02:00) Shinji reflected on her academic experience studying Software Engineering at the University of Waterloo in the late 2000s.(04:19) Shinji shared valuable lessons learned from her undergraduate co-op experience with statistical analysis at Sun Microsystems, software engineering at Barclays Capital, and growth marketing at Facebook.(08:52) Shinji shared lessons learned from being a Management Consultant at Deloitte.(14:01) Shinji revisited her decision to quit the job at Deloitte and create a social puzzle game called Shufflepix.(17:42) Shinji went over her time working as a Product Manager at the mobile ad exchange network YieldMo.(22:25) Shinji discussed the problem of stream processing at YieldMo, which sparked the creation of Concord.(26:17) Shinji unpacked the pain points with existing stream processing frameworks and the competitive advantage of using Concord.(33:19) Shinji recalled her time at Akamai — initially as a data engineer in the Platform Engineering unit and later as a product manager for the IoT Edge Connect platform.(37:26) Shinji explained why sharing context knowledge around data remains a largely unsolved problem.(42:07) Shinji unpacked the three capabilities of an ideal data discovery platform: (1) exposing up-to-date operational metadata along with the documentation, (2) tracking the provenance of data back to its source, and (3) guiding data usage.(46:59) Shinji unpacked the benefits of plugging BI tools into data discovery platforms and collecting metadata, which facilitates better visibility and understanding.(52:36) Shinji discussed the role of a data discovery platform within the modern data stack.(53:59) Shinji shared the hurdles that her team has to go through while finding early adopters of Select Star.(55:48) Shinji shared valuable hiring lessons learned at Select Star.(01:00:00) Shinji shared fundraising advice for founders currently seeking the right investors for their startups.(01:04:41) Closing segment.Shinji's Contact InfoLinkedInTwitterMediumSelect Star's ResourcesWebsiteBlogLinkedIn | Twitter | MediumMentioned ContentArticles“The Next Evolution of Data Catalogs: Data Discovery Platforms” (Feb 2021)“Data Discovery for Business Intelligence” (May 2021)PeopleMartin Kleppmann (Author of Designing Data-Intensive Applications)Emily Riederer (Senior Analytics Manager at Capital One)Anya Prosvetova (Tableau DataDev Ambassador)Book“Managing Oneself” (by Peter Drucker)NotesMy conversation with Shinji was recorded back in July 2021. Since then, many things have happened at Select Star:General Availability launch on Product Hunt: https://www.producthunt.com/posts/selectstarSnowflake partnership on data governance: https://blog.selectstar.com/selectstar-and-snowflake-partner-to-take-data-governance-to-a-new-level-a9d274e1d4c6Case studies with Pitney Bowes and HandshakeAbout the showDatacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:Listen on SpotifyListen on Apple PodcastsListen on Google PodcastsIf you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
Shinji Kim was born in Korea. Her family moved to Canada when she was 13 years old. She likes to hang out with her friends, watch movies and likes to be outdoors, skiing, hiking or doing yoga. She recently tore her ACL, though, which has put a damper on her physical activity. While she studied at the University of Waterloo, she was able to work with many different well known companies - Facebook, Sun Microsystems and Barclays. In 2014, she started her first company called Concord Systems, focusing on distributed stream processing. This was eventually acquired by Akamai, the largest CDN network in the world. We joked about the day it went down, cause it fell like the digital world went dark.At Akamai, she worked with large enterprises and saw that there was a problem around data discovery, and that it was growing in the middle market, as more companies migrated to the cloud. She decided to build an automated way for users to discovery and understand their data.This is the creation story of Select Star.SponsorsCourierImg.lyRoutableCTO.aiCloudways offers peace of mind and flexibility so you can focus on growing your business instead of dealing with server management. With Cloudways, you get an optimized stack, managed servers, backups, staging environment, integrated Git, pre-configured, Composer, 24/7 support, and a choice of five cloud providers: AWS, DigitalOcean, Linode, Google Cloud, and Vultr. Get up to 2 Month Free Hosting by using code "CODE30" and get $30 free hosting credit.LinksWebsite: https://selectstar.com/LinkedIn: https://www.linkedin.com/in/shinjikim/Support this podcast at — https://redcircle.com/code-story/donationsAdvertising Inquiries: https://redcircle.com/brandsPrivacy & Opt-Out: https://redcircle.com/privacy
Shinji Kim is Founder and CEO of Select Star. In this episode we discuss data discovery and more. This interview was also recorded as a video podcast. Check out the video on the Software Daily YouTube channel. Sponsorship inquiries: sponsor@softwareengineeringdaily.com The post Data Discovery with Shinji Kim appeared first on Software Engineering Daily.
Shinji Kim is Founder and CEO of Select Star. In this episode we discuss data discovery and more. This interview was also recorded as a video podcast. Check out the video on the Software Daily YouTube channel. Sponsorship inquiries: sponsor@softwareengineeringdaily.com
Shinji Kim is Founder and CEO of Select Star. In this episode we discuss data discovery and more. This interview was also recorded as a video podcast. Check out the video on the Software Daily YouTube channel. Sponsorship inquiries: sponsor@softwareengineeringdaily.com The post Data Discovery with Shinji Kim appeared first on Software Engineering Daily.
Shinji Kim is Founder and CEO of Select Star. In this episode we discuss data discovery and more. This interview was also recorded as a video podcast. Check out the video on the Software Daily YouTube channel. Sponsorship inquiries: sponsor@softwareengineeringdaily.com The post Data Discovery with Shinji Kim appeared first on Software Engineering Daily.